[Enhancement](pyudf) Enhance Python process poll init and repair#64039
[Enhancement](pyudf) Enhance Python process poll init and repair#64039linrrzqqq wants to merge 3 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found two lifecycle/concurrency issues that should be fixed before merge.
Critical checkpoint conclusions:
- Goal: The PR bounds Python UDF process-pool initialization and allows partial availability. The main flow and new tests cover stuck startup and repair behavior, but shutdown races are not covered.
- Scope: The implementation is focused on Python UDF process management, but the detached worker lifecycle adds shutdown coordination requirements.
- Concurrency: Blocking issue. Detached init/repair workers and health-check startup can race with shutdown, leaving unmanaged Python processes and racing on
_health_check_thread. - Lifecycle: Blocking issue. Pools created during shutdown are not marked stopped and detached workers can publish processes after the manager snapshot has already been drained.
- Configuration/compatibility: No new config or storage/protocol compatibility concern found.
- Parallel paths: Init, foreground repair, and health-check repair are mostly aligned, but shutdown is not synchronized with pool creation/start.
- Tests: Positive and timeout cases are covered, but concurrent shutdown during initialization/health-check start is missing.
- Observability/performance: Existing logs are sufficient for the reviewed paths; no additional blocking observability issue found.
- User focus: No additional user-provided review focus was specified.
03dc422 to
4f8e5ae
Compare
|
/review |
There was a problem hiding this comment.
I found one additional blocking lifecycle issue in the new repair path. The existing review threads already cover the health-check thread race and the pool-creation/shutdown race, so I did not duplicate those.
Critical checkpoint conclusions:
- Goal/test: The PR aims to bound Python UDF pool initialization and allow partial availability; the added tests cover timeout and partial-success cases, but not shutdown racing with repair wait.
- Scope/focus: The change is focused on Python UDF process lifecycle, but the new asynchronous repair/init model introduces extra lifecycle edges.
- Concurrency: Still problematic;
_get_process()can return a process aftershutdown()marks the pool stopped. Existing comments cover two other shutdown races. - Lifecycle/static init: No new static initialization issue found; process ownership/lifecycle has the issue noted inline.
- Config/compatibility/protocol: No new config or FE/BE protocol compatibility issue found.
- Parallel paths: Health-check repair and foreground repair are mostly unified by
repairing; the stopped-pool return path is the uncovered parallel edge. - Tests: Missing a concurrent shutdown-vs-repair wait regression for the inline issue.
- Observability/performance: No additional blocking observability or performance issue found beyond lifecycle correctness.
User focus: no additional user-provided focus was supplied.
4f8e5ae to
1be429f
Compare
|
/review |
There was a problem hiding this comment.
Code review completed for PR 64039.
No additional inline findings from my pass. I treated the existing inline threads as known context and did not duplicate them; the current head appears to include changes addressing the previously raised shutdown/pool-creation/health-check synchronization concerns.
Critical checkpoint conclusions:
- Goal and tests: The PR aims to make Python UDF process pool initialization/repair bounded and more resilient to stuck process startup. The code generally implements that through detached per-slot initialization, bounded socket startup/termination waits, pool repair, and unit coverage for stuck/partial initialization and repair paths.
- Scope and focus: Changes are concentrated in the Python UDF process manager/runtime and related BE unit tests.
- Concurrency: The modified code is concurrency-heavy. Current head uses per-pool mutex/CV state,
_pools_mutex,_health_check_mutex, and atomics to coordinate initialization, repair, health check, and shutdown. I did not find an additional distinct race beyond the already-known review context. - Lifecycle: Process and health-check lifecycle paths were reviewed, including shutdown and detached init/repair workers. No new lifecycle blocker found in this pass.
- Configuration: No new config item is added. Existing
max_python_process_numand Python memory limit are used consistently with the pre-existing behavior. - Compatibility: No FE/BE protocol, storage format, function symbol, or persisted metadata compatibility issue identified.
- Parallel paths: UDF, UDAF, UDTF client acquisition all share
PythonServerManager::get_client, so the changed pool behavior applies consistently. - Tests: BE unit tests were added/updated for bounded stuck startup, partial pool initialization, repair, and shutdown-related behavior. I did not run tests in this review runner.
- Observability: Existing/new logs include version and process counts/errors for initialization and repair; sufficient for this change.
- Transaction/persistence/data correctness: Not applicable; no transaction, storage visibility, or persistence path changed.
- Performance: The design avoids waiting for every process during initial pool creation and keeps slow forks outside per-pool locks. No additional obvious hot-path performance issue found.
- User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 29032 ms |
TPC-DS: Total hot run time: 169510 ms |
1be429f to
09765bf
Compare
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Automated review summary for PR 64039 at head 09765bf4c122f438a1ea22939475f612989600e7.
I reviewed the current diff and the existing inline review context. The previously raised shutdown/init races around pool creation after shutdown, starting the health-check thread during shutdown, and returning a repaired process after stopped is set are already-known review items and appear to be addressed in the current head, so I am not resubmitting duplicate inline comments.
Critical checkpoint conclusions:
- Goal/test proof: The PR changes Python UDF process pool initialization to return once one usable process is available, bounds startup/termination waits, and relies on repair/health-check to fill missing slots. The added BE unit tests cover stuck socket creation, partial pool availability, invalid versions, repair, and multi-version initialization.
- Scope/minimality: The changes are focused on Python UDF process lifecycle and its unit tests, though the lifecycle code is inherently concurrency-heavy.
- Concurrency: The current head serializes pool-map creation with shutdown through
_pools_mutexand_shutdown_flag, protects health-check thread ownership with_health_check_mutex, uses per-pool locks for pool state, and rechecksstoppedafter repair waits. I did not find an additional distinct race that should block this PR. - Lifecycle/static initialization: No new cross-translation-unit static initialization dependency was identified. Detached init/repair workers capture the per-version pool and use static fork/repair helpers, which avoids direct manager lifetime access after shutdown.
- Configuration: No new configuration item is added. Existing
max_python_process_numis still read dynamically for init/repair sizing. - Compatibility: No FE-BE protocol, storage format, or function symbol compatibility issue was identified.
- Parallel paths: Foreground repair and health-check repair now share the
repairingguard; current head applies the main lifecycle checks across init, foreground get, health check, repair, and shutdown paths. - Conditional checks: The new
stopped,initialized,has_available_process, andrepairingchecks have local rationale and follow the lifecycle being introduced. - Test coverage/results: Unit tests were added/updated for the changed behavior. I did not run the test suite in this review environment.
- Observability: Existing and new logs include Python version, process counts, pid, and failure status in the relevant init/repair/shutdown paths; no additional required metric was identified.
- Transaction/persistence/data writes: Not applicable.
- FE-BE variable passing: Not applicable.
- Performance: The change avoids blocking request startup on a fully populated pool and keeps fork work out of pool locks. I did not identify an obvious new hot-path performance blocker.
User focus response: No additional user-provided review focus was specified.
TPC-H: Total hot run time: 29768 ms |
TPC-DS: Total hot run time: 169113 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
There was a problem hiding this comment.
I found one blocking process-lifecycle issue in the new bounded shutdown path. I did not repeat the existing review threads; the current head appears to address the previously raised pool-creation/shutdown races and the stopped recheck after repair wait.
Critical checkpoint conclusions:
- Goal/test coverage: the change targets bounded Python UDF pool initialization and partial availability, with focused BE unit tests for timeout, partial success, repair, and shutdown paths.
- Scope/clarity: the implementation is mostly focused on Python UDF process lifecycle, but the unreaped-detach fallback below leaves a correctness/resource leak gap in the failure path.
- Concurrency/lifecycle: pool mutex, manager mutex, health-check mutex, and stopped flags now cover the main pool races already discussed; detached init/repair workers are guarded before publishing. The remaining issue is child-process ownership after timed-out SIGKILL waits.
- Configuration: no new config items. Existing max_python_process_num remains used.
- Compatibility/protocol/storage/transactions: no protocol, storage format, transaction, or FE-BE compatibility changes found.
- Parallel paths: init, foreground repair, health-check repair, and shutdown paths were reviewed together; the child-reaping concern applies to shutdown and failed-fork cleanup semantics.
- Observability: timeout and kill paths log useful pids, but logging alone does not clean up late-exiting children.
- User focus: no additional user-provided review focus was specified.
|
|
||
| if (int exit_code = _child.exit_code(); exit_code > 128 && exit_code <= 255) { | ||
| int signal = exit_code - 128; | ||
| LOG(INFO) << "Python process was killed by signal " << signal; |
There was a problem hiding this comment.
Detaching here after both SIGTERM and SIGKILL waits time out leaves no owner that will ever reap this pid. A concrete path is a Python process stuck in uninterruptible I/O longer than terminate_timeout: shutdown() sends SIGKILL, wait_child_exit() returns TIMEOUT, _child.detach() drops boost's handle, and _is_shutdown is set. If the process exits later, it remains as a zombie under Doris because neither PythonServerManager nor a background reaper tracks the pid. Please keep a reaping owner, for example by enqueueing the pid to a bounded background reaper, or avoid detaching/marking shutdown until the child has actually been reaped.
|
run buildall |
TPC-H: Total hot run time: 28610 ms |
TPC-DS: Total hot run time: 169868 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
87f2215 to
c85d072
Compare
|
/review |
There was a problem hiding this comment.
I found two blocking issues in the new asynchronous Python UDF process lifecycle code. The main functional risk is that initialization now waits the full pool timeout even when all fork workers have already failed, which turns quick configuration/runtime failures into a 20s stall. The new background reaper also does not actually enforce its queue bound once pids have moved into the worker-local pending list.
Critical checkpoint conclusions:
- Goal/test: The PR aims to bound Python UDF startup/shutdown and avoid blocking/zombie processes. Tests cover several timeout and reaper paths, but they currently accept the full timeout on all-failed init and do not cover sustained background-reaper backlog.
- Scope: The change is focused on Python UDF process management, but the async init/reaper changes introduce new lifecycle state that needs tighter accounting.
- Concurrency: The pool creation/shutdown and health-check thread races previously raised appear addressed in this head, but detached init/repair workers rely on pool state and condition variables; the all-workers-failed state is not represented.
- Lifecycle: The background reaper keeps pids after foreground detach, but its pending ownership is local to the thread and invisible to enqueue-side capacity checks.
- Config/compatibility: No new config or storage/protocol compatibility concern found.
- Parallel paths: Startup failure in
fork()andshutdown()both use the new reaper path; the same bounded-ownership issue applies to both. - Tests: Added BE unit tests are useful, but missing a case proving initialization returns promptly after all fork attempts fail and a case proving the reaper bound includes in-flight pending pids.
- Observability: Logs are generally adequate for the reviewed paths.
- Transaction/data correctness: Not applicable.
- Performance: The 20s wait on already-failed init is a user-visible latency regression; the reaper backlog can grow beyond the intended cap under repeated stuck children.
User focus: no additional user-provided review focus was supplied.
Problem Summary:
Python UDF process pool initialization previously required the whole pool to finish initialization before BE could continue serving the query.
In abnormal environments, Python process startup may hang or take a very long time in paths such as:
fork/ child process creationWhen one process slot gets stuck, the whole process pool initialization can be blocked. As a result, FE may hit the send fragments RPC timeout before BE returns a meaningful Python UDF error:
RpcException, msg: timeout when waiting for send fragments rpc, query timeout:900, left timeout for this operation:30.be.log:
Solution
Change Python process pool initialization from "wait until all processes are created" to "return once at least one usable process is available".
The pool no longer treats full-size initialization as a prerequisite for serving queries. Once one Python process is alive, the current query can proceed. Missing or failed process slots are repaired asynchronously by the existing health check / repair path.
SERVICE_UNAVAILABLEbefore FE send fragments RPC timeout.wait.