Conversation

@beliefer (Contributor) commented Oct 28, 2019

What changes were proposed in this pull request?

This PR is related to #26510 and adds a retry mechanism to the process startup.
We run many PySpark jobs, and only a small number of them hit this issue; retrying a few times reduces it.
I think the root cause of the exit is a robustness problem in the user code.
If the Python process exits during startup because of an unstable system factor, should we make the effort to retry?

Why are the changes needed?

To clarify the exception message and retry three times by default.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

@SparkQA commented Oct 28, 2019

Test build #112767 has finished for PR 26282 at commit 8888eb5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

try {
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    throw new SparkException(s"No port number in $daemonModule's stdout")
}
Member

Why don't you just fix the exception message or add an error-level log? A confusing error doesn't seem to justify retrying.

Contributor Author

Adding a retry sometimes makes the task successful.
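The retry idea under discussion can be sketched as follows. This is a minimal illustration in Python, not the PR's actual change (which is on the Scala side); `start_with_retry` and `flaky_start` are hypothetical names, and the assumption is that a dead daemon surfaces as `EOFError` when its port is read.

```python
import time

def start_with_retry(start, retries=3, delay=1.0):
    """Call `start` up to `retries` times, retrying when the daemon
    dies before reporting its port (modeled here as EOFError)."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return start()
        except EOFError as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay)  # brief pause before the next attempt
    raise RuntimeError(
        f"daemon failed to start after {retries} attempts") from last_error

# A flaky starter that fails twice, then succeeds:
attempts = []
def flaky_start():
    attempts.append(1)
    if len(attempts) < 3:
        raise EOFError("no port number in daemon stdout")
    return 45007

print(start_with_retry(flaky_start, retries=3, delay=0))  # 45007
```

With a default of three attempts, transient startup failures succeed on a later try, while a persistently dead daemon still fails with a clear error.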

Member

Port 0 in daemon.py will automatically allocate an available socket. In which case does it fail?

Contributor Author

In our production environment, the Python subprocess sometimes exits with code 139.
When I debugged it, I found the Python process was dead.
So the call in.readInt() throws EOFException; it is not related to the port.
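To illustrate this failure mode: the JVM side reads a 4-byte integer (the port) from the daemon's stdout, and if the daemon died before writing it, the stream is simply at EOF. The sketch below is a simplified model, not Spark's actual code; `read_port` and the byte layout are assumptions for illustration.

```python
import io
import struct

def read_port(stream):
    """Read a 4-byte big-endian int, as DataInputStream.readInt would."""
    data = stream.read(4)
    if len(data) < 4:
        # The daemon died before writing its port: stdout hit EOF.
        raise EOFError("no port number in daemon stdout")
    return struct.unpack(">i", data)[0]

# A healthy daemon writes its port as the first four bytes.
healthy = io.BytesIO(struct.pack(">i", 45007))
print(read_port(healthy))  # 45007

# A daemon killed by a signal (e.g. exit code 139, SIGSEGV) leaves
# stdout empty, so the read sees EOF, not a bad port number.
dead = io.BytesIO(b"")
try:
    read_port(dead)
except EOFError as e:
    print(e)
```

This is why the exception message about "no port number" is misleading: the port was never the problem, the process death was.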

Member

When I debugged it, I found the Python process was dead.

Do you know why it was dead? We might be better off making the daemon more robust rather than just retrying.

Member

Can you tell me in which case user code can cause this failure?

Contributor Author

We run many PySpark jobs, and only a small number of them hit this issue.
One user tries to catch some exceptions in Python code; we will test it.

Contributor Author

@HyukjinKwon If the Python process exits during startup because of an unstable system factor, do we need to make the effort to retry?

Member

Yes, I also don't see a reason to retry, insofar as I don't see evidence that it would make subsequent attempts succeed.

Contributor Author

@srowen Thanks for your review. I will close this PR.

@SparkQA commented Oct 28, 2019

Test build #112768 has finished for PR 26282 at commit 36e94af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor)

retest this please

@SparkQA commented Oct 30, 2019

Test build #112954 has finished for PR 26282 at commit 36e94af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Oct 31, 2019

retest this please

@SparkQA commented Oct 31, 2019

Test build #112976 has finished for PR 26282 at commit 36e94af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer beliefer changed the title [SPARK-29619] Add meaningful log and retry for start python worker process [SPARK-29619] [CORE] Improve the exception message when reading the daemon port Nov 14, 2019
@beliefer beliefer changed the title [SPARK-29619] [CORE] Improve the exception message when reading the daemon port [SPARK-29619] [CORE] Add retry times when reading the daemon port. Nov 14, 2019
@beliefer beliefer changed the title [SPARK-29619] [CORE] Add retry times when reading the daemon port. [SPARK-29619][PYTHON][CORE] Add retry times when reading the daemon port. Nov 14, 2019
@beliefer beliefer closed this Nov 15, 2019

7 participants