[SPARK-47565][PYTHON] PySpark worker pool crash resilience#45635
[SPARK-47565][PYTHON] PySpark worker pool crash resilience#45635sebastianhillig-db wants to merge 1 commit into
Conversation
utkarsh39
left a comment
There was a problem hiding this comment.
Flushing an initial round of comments for my understanding
There was a problem hiding this comment.
If the interestOps call succeed, will both of these checks be automatically true?
There was a problem hiding this comment.
It seems that this isn't always the case, i.e. the workerHandle may already see the process being dead and selectionKey update will happily pass. I also check isValid for the off-chance, that we got cancelled after interestOps was set.
|
LGTM. Let's get a review from others? |
|
@ueshin @HyukjinKwon can you take a look here? |
|
Let's file a JIRA, see https://spark.apache.org/contributing.html |
|
Apache Spark uses the GitHub Actions in your forked repository so the builds have to be found in https://github.com/sebastianhillig-db/spark/actions . The GitHub Actions would have to be enabled at https://github.com/sebastianhillig-db/spark/settings/actions , and rebase this PR |
HyukjinKwon
left a comment
There was a problem hiding this comment.
The fix itself seems pretty good.
There was a problem hiding this comment.
It seems that there is a chance to introduce an infinite loop to Apache Spark. Maybe, limit the number of retry? WDYT, @sebastianhillig-db ?
There was a problem hiding this comment.
On each iteration, a worker is pulled from idleWorkers, this will end up "emptying" the pool. The synchronization around this will ensure that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)
There was a problem hiding this comment.
On each iteration, a worker is pulled from
idleWorkers, this will end up "emptying" the pool. The synchronization around this will ensure that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)
The link seems to be broken.
There was a problem hiding this comment.
Ugh, sorry - the force push broke that link. I'm referring to "releaseWorker" using the same synchronization, so we should not be adding new workers while this code runs.
There was a problem hiding this comment.
Promise not to force push again: https://github.com/apache/spark/pull/45635/files#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411
ba3c6f6 to
0f59a6a
Compare
|
Merged to master. |
What changes were proposed in this pull request?
PySpark worker processes may die while they are idling. Here we aim to provide some resilience, by validating process and selectionkey aliveness prior to returning the process from idle pool.
Why are the changes needed?
To not fail queries when a python process crashed while idling.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added appropriate testcase.
Was this patch authored or co-authored using generative AI tooling?
No