[SPARK-47565][PYTHON] PySpark worker pool crash resilience by sebastianhillig-db · Pull Request #45635 · apache/spark

sebastianhillig-db · 2024-03-21T09:16:02Z

What changes were proposed in this pull request?

PySpark worker processes may die while they are idling. This PR adds some resilience by validating that the worker process is alive and that its selection key is still valid before returning the worker from the idle pool.
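The idea can be sketched as follows. This is a minimal Python illustration, not the Scala code changed in this PR: `spawn_idle_worker`, `take_idle_worker`, and the `deque` of `subprocess.Popen` objects are hypothetical stand-ins for the idle pool, and `Popen.poll()` stands in for the process liveness check.

```python
import subprocess
import sys
from collections import deque
from typing import Deque, Optional


def spawn_idle_worker() -> subprocess.Popen:
    """Start a child process that only sleeps, standing in for an idle worker."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def take_idle_worker(idle: Deque[subprocess.Popen]) -> Optional[subprocess.Popen]:
    """Return the first idle worker that is still alive, discarding dead ones.

    Returns None when the pool is exhausted; the caller would then start
    a fresh worker instead of failing.
    """
    while idle:
        worker = idle.popleft()
        # poll() returns None while the process is still running.
        if worker.poll() is None:
            return worker
        # The worker died while idling: drop it and try the next one.
    return None


if __name__ == "__main__":
    pool = deque(spawn_idle_worker() for _ in range(2))
    crashed, healthy = pool[0], pool[1]
    crashed.kill()  # simulate a crash while idling
    crashed.wait()  # reap the child so poll() reports the exit
    print(take_idle_worker(pool) is healthy)
    healthy.kill()
    healthy.wait()
```

Without the liveness check, the first (dead) worker would be handed out and the failure would only surface when the task tried to use it.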

Why are the changes needed?

To avoid failing queries when a Python process has crashed while idling.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added an appropriate test case.

Was this patch authored or co-authored using generative AI tooling?

No

utkarsh39

Flushing an initial round of comments for my understanding

utkarsh39 · 2024-03-22T15:32:32Z

If the interestOps call succeeds, will both of these checks automatically be true?

It seems that this isn't always the case: the workerHandle may already see the process as dead while the selectionKey update still passes without error. I also check isValid on the off chance that we got cancelled after interestOps was set.
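Why both checks are needed can be sketched as follows. `FakeKey`, `FakeWorker`, `CancelledKey`, and `is_usable` are hypothetical Python stand-ins for the `selectionKey` and `workerHandle` mentioned above, modeled only loosely on `java.nio.channels.SelectionKey`. The point is that key state and process state are tracked independently, so a successful key update says nothing about the process.

```python
from dataclasses import dataclass, field

# Illustrative bit flags for the operations the key is interested in.
OP_READ, OP_WRITE = 1, 4


class CancelledKey(Exception):
    """Raised when updating a key that has already been cancelled."""


@dataclass
class FakeKey:
    valid: bool = True
    interest_ops: int = 0

    def set_interest_ops(self, ops: int) -> None:
        if not self.valid:
            raise CancelledKey()
        self.interest_ops = ops

    def cancel(self) -> None:
        self.valid = False


@dataclass
class FakeWorker:
    process_alive: bool = True
    key: FakeKey = field(default_factory=FakeKey)


def is_usable(worker: FakeWorker) -> bool:
    """Re-arm the key and report whether the worker can be handed out."""
    try:
        worker.key.set_interest_ops(OP_READ | OP_WRITE)
    except CancelledKey:
        return False
    # The update above can succeed even though the process is already dead,
    # so liveness is checked separately. The key is checked again in case it
    # was cancelled after the update.
    return worker.process_alive and worker.key.valid
```

In this model a worker with a dead process still accepts the key update, which is the case the extra liveness check guards against.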

utkarsh39 · 2024-03-22T17:21:05Z

LGTM. Let's get a review from others?

sebastianhillig-db · 2024-03-25T15:38:57Z

@ueshin @HyukjinKwon can you take a look here?

HyukjinKwon · 2024-03-26T01:04:04Z

Let's file a JIRA, see https://spark.apache.org/contributing.html

HyukjinKwon · 2024-03-26T01:06:07Z

Apache Spark uses GitHub Actions in your forked repository, so the builds should be found at https://github.com/sebastianhillig-db/spark/actions . GitHub Actions has to be enabled at https://github.com/sebastianhillig-db/spark/settings/actions ; once that is done, please rebase this PR.

HyukjinKwon

The fix itself seems pretty good.

dongjoon-hyun · 2024-03-26T16:58:49Z

It seems that this could introduce an infinite loop into Apache Spark. Maybe limit the number of retries? WDYT, @sebastianhillig-db?

On each iteration, a worker is pulled from idleWorkers, so the loop will end up "emptying" the pool. The synchronization around this ensures that no other workers are added while this happens. (see https://github.com/apache/spark/pull/45635/files/ba3c6f6ee19762278004594735f25ab4f6fafb3e#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411-L413)

The link seems to be broken.

Ugh, sorry - the force push broke that link. I'm referring to "releaseWorker" using the same synchronization, so we should not be adding new workers while this code runs.

Promise not to force push again: https://github.com/apache/spark/pull/45635/files#diff-1bd846874b06327e6abd0803aa74eed890352dfa974d5c1da1a12dc7477e20d0L411
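The termination argument can be sketched as follows. `IdlePool`, `acquire`, and `release` are hypothetical names in a Python model, not the Scala code under review. The only property illustrated is that the loop removes one worker per iteration while the shared lock keeps `release` from adding any, so the number of iterations is bounded by the pool size at entry.

```python
import threading
from collections import deque
from typing import Callable, Deque, Generic, Optional, Tuple, TypeVar

W = TypeVar("W")


class IdlePool(Generic[W]):
    """Hypothetical idle pool in which acquire and release share one lock."""

    def __init__(self, is_alive: Callable[[W], bool]) -> None:
        self._lock = threading.Lock()
        self._idle: Deque[W] = deque()
        self._is_alive = is_alive

    def release(self, worker: W) -> None:
        # Blocks while acquire() holds the lock, so the pool cannot grow
        # in the middle of the loop below.
        with self._lock:
            self._idle.append(worker)

    def acquire(self) -> Tuple[Optional[W], int]:
        """Return (live worker or None, number of loop iterations)."""
        iterations = 0
        with self._lock:
            while self._idle:
                iterations += 1
                worker = self._idle.popleft()  # the pool shrinks by one each time
                if self._is_alive(worker):
                    return worker, iterations
        # Pool exhausted: the caller would create a new worker here.
        return None, iterations
```

With three dead workers in the pool, `acquire` in this model gives up after exactly three iterations, so no explicit retry limit is needed for the loop to terminate.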

HyukjinKwon · 2024-04-04T11:15:39Z

Merged to master.

github-actions Bot added CORE PYTHON labels Mar 21, 2024

sebastianhillig-db changed the title ~~[WIP] First stab at dealing with worker crashes~~ [WIP] First stab at dealing with worker crashes while idling Mar 21, 2024

utkarsh39 reviewed Mar 22, 2024

Comment thread core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala Outdated

sebastianhillig-db changed the title ~~[WIP] First stab at dealing with worker crashes while idling~~ PySpark worker pool crash resilience Mar 25, 2024

HyukjinKwon reviewed Mar 26, 2024

Comment thread python/pyspark/tests/test_worker.py Outdated

HyukjinKwon approved these changes Mar 26, 2024

sebastianhillig-db changed the title ~~PySpark worker pool crash resilience~~ [SPARK-47565] PySpark worker pool crash resilience Mar 26, 2024

dongjoon-hyun reviewed Mar 26, 2024

HyukjinKwon changed the title ~~[SPARK-47565] PySpark worker pool crash resilience~~ [SPARK-47565][PYTHON] PySpark worker pool crash resilience Mar 27, 2024

sebastianhillig-db requested a review from dongjoon-hyun March 27, 2024 08:41

squash

0f59a6a

sebastianhillig-db force-pushed the python-worker-factory-crash branch from ba3c6f6 to 0f59a6a Compare March 27, 2024 09:42

ueshin reviewed Mar 27, 2024

Comment thread core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala

HyukjinKwon closed this in bffb02d Apr 4, 2024

sebastianhillig-db deleted the python-worker-factory-crash branch April 4, 2024 11:25

5 participants
