[SPARK-29619][PYTHON][CORE] Add retry times when reading the daemon port. #26282
Conversation
Test build #112767 has finished for PR 26282 at commit
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    throw new SparkException(s"No port number in $daemonModule's stdout")
Why don't you just fix the exception message or add an error-level log? A confusing error message doesn't seem to justify retrying.
Adding a retry sometimes makes the task successful.
Port 0 in daemon.py will automatically allocate an available socket. In which case does it fail?
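For context, a minimal standalone sketch (plain Scala, not Spark's code) of what binding to port 0 means: the OS picks any free ephemeral port, so there is no fixed port to collide on.

```scala
import java.net.ServerSocket

// Binding to port 0 asks the OS to assign a free ephemeral port.
val socket = new ServerSocket(0)
try println(s"OS-assigned port: ${socket.getLocalPort}")
finally socket.close()
```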
In our production environment, the Python subprocess sometimes exits with code 139.
When I debugged it, I found the Python process was dead.
So in.readInt() throws an EOFException that has nothing to do with the port.
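To make that failure mode concrete, here is a minimal standalone sketch (not the Spark code path): readInt() on a stream that ends before an int is written raises EOFException, which mirrors what happens when the daemon dies before printing its port.

```scala
import java.io.{ByteArrayInputStream, DataInputStream, EOFException}

// An empty stream stands in for a daemon that exited before writing anything.
val in = new DataInputStream(new ByteArrayInputStream(Array.empty[Byte]))
try {
  in.readInt()
} catch {
  case _: EOFException => println("EOFException: stream ended before a port was written")
}
```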
When I debugged it, I found the Python process was dead.
Do you know why it died? We might be better off making the daemon more robust rather than just retrying.
Can you tell me in which cases it can fail because of user code?
We run a lot of PySpark jobs, and only a small number of them hit this issue.
A user tries to catch some exceptions in their Python code, and we will test it.
@HyukjinKwon If the Python process exits during startup because of an unstable factor in the system, should we make the effort to retry?
Yes, I also don't see a reason to retry, insofar as I don't see evidence that it would make subsequent attempts succeed.
@srowen Thanks for your review. I will close this PR.
Test build #112768 has finished for PR 26282 at commit
retest this please
Test build #112954 has finished for PR 26282 at commit
retest this please
Test build #112976 has finished for PR 26282 at commit
What changes were proposed in this pull request?
This PR is related to #26510 and adds a retry mechanism when starting the Python daemon process.
We run a lot of PySpark jobs, and only a small number of them hit this issue. Retrying a few times reduces it.
I think the root cause of the exit is a robustness problem in the user code.
If the Python process exits during startup because of an unstable factor in the system, should we make the effort to retry?
Why are the changes needed?
To clarify the exception and retry three times by default.
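A minimal sketch of the proposed retry loop (names such as startDaemon and maxRetries are hypothetical stand-ins, not the actual PythonWorkerFactory code): if reading the port from the daemon's stdout hits EOF because the daemon died, restart it and try again, giving up after three attempts by default.

```scala
import java.io.{DataInputStream, EOFException, InputStream}

// Sketch only: `startDaemon` is a placeholder for launching daemon.py and
// returning its stdout, from which the daemon's port is read as the first int.
def readDaemonPortWithRetry(startDaemon: () => InputStream, maxRetries: Int = 3): Int = {
  var attempt = 0
  var port: Option[Int] = None
  while (port.isEmpty) {
    attempt += 1
    val in = new DataInputStream(startDaemon())
    try {
      port = Some(in.readInt())
    } catch {
      case e: EOFException =>
        // The daemon exited (e.g. crashed with signal 139) before writing its port.
        if (attempt >= maxRetries) {
          throw new RuntimeException(
            s"No port number from the Python daemon after $attempt attempts", e)
        }
        System.err.println(s"Daemon died before reporting its port (attempt $attempt), retrying")
    }
  }
  port.get
}
```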
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs.