[SPARK-29619][PYTHON][CORE] Add retry times when reading the daemon port. #26282
Conversation
Test build #112767 has finished for PR 26282 at commit
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    throw new SparkException(s"No port number in $daemonModule's stdout")
Why don't you just fix the exception message or add an error-level log? A confusing error message doesn't seem to justify retrying.
Adding a retry sometimes makes the task successful.
Port 0 in daemon.py will automatically allocate an available socket. In which case does it fail?
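For context, a minimal standalone sketch (plain Scala, not Spark's code) of what binding to port 0 means: the OS picks any free ephemeral port, so there is no fixed port to collide on.

```scala
import java.net.ServerSocket

// Binding to port 0 asks the OS to assign a free ephemeral port.
val socket = new ServerSocket(0)
try println(s"OS-assigned port: ${socket.getLocalPort}")
finally socket.close()
```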
In our production environment, the Python subprocess sometimes exits with code 139.
When I debugged it, I found the Python process was dead.
So in.readInt() throws an EOFException that has nothing to do with the port.
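To make that failure mode concrete, here is a minimal standalone sketch (not the Spark code path): readInt() on a stream that ends before an int is written raises EOFException, which mirrors what happens when the daemon dies before printing its port.

```scala
import java.io.{ByteArrayInputStream, DataInputStream, EOFException}

// An empty stream stands in for a daemon that exited before writing anything.
val in = new DataInputStream(new ByteArrayInputStream(Array.empty[Byte]))
try {
  in.readInt()
} catch {
  case _: EOFException => println("EOFException: stream ended before a port was written")
}
```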
When I debugged it, I found the Python process was dead.
Do you know why it died? We might be better off making the daemon more robust rather than just retrying.
Can you tell me in which cases it can fail because of user code?
We run a lot of PySpark jobs, and only a small number of them hit this issue.
A user tries to catch some exceptions in their Python code, and we will test it.
@HyukjinKwon If the Python process exits during startup because of an unstable factor in the system, should we make the effort to retry?
Yes, I also don't see a reason to retry, insofar as I don't see evidence that it would make subsequent attempts succeed.
@srowen Thanks for your review. I will close this PR.
Test build #112768 has finished for PR 26282 at commit
retest this please
Test build #112954 has finished for PR 26282 at commit
retest this please
Test build #112976 has finished for PR 26282 at commit
What changes were proposed in this pull request?
This PR is related to #26510 and adds a retry mechanism when starting the Python daemon process.
We run a lot of PySpark jobs, and only a small number of them hit this issue. Retrying a few times reduces it.
I think the root cause of the exit is a robustness problem in the user code.
If the Python process exits during startup because of an unstable factor in the system, should we make the effort to retry?
Why are the changes needed?
To clarify the exception and retry three times by default.
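A minimal sketch of the proposed retry loop (names such as startDaemon and maxRetries are hypothetical stand-ins, not the actual PythonWorkerFactory code): if reading the port from the daemon's stdout hits EOF because the daemon died, restart it and try again, giving up after three attempts by default.

```scala
import java.io.{DataInputStream, EOFException, InputStream}

// Sketch only: `startDaemon` is a placeholder for launching daemon.py and
// returning its stdout, from which the daemon's port is read as the first int.
def readDaemonPortWithRetry(startDaemon: () => InputStream, maxRetries: Int = 3): Int = {
  var attempt = 0
  var port: Option[Int] = None
  while (port.isEmpty) {
    attempt += 1
    val in = new DataInputStream(startDaemon())
    try {
      port = Some(in.readInt())
    } catch {
      case e: EOFException =>
        // The daemon exited (e.g. crashed with signal 139) before writing its port.
        if (attempt >= maxRetries) {
          throw new RuntimeException(
            s"No port number from the Python daemon after $attempt attempts", e)
        }
        System.err.println(s"Daemon died before reporting its port (attempt $attempt), retrying")
    }
  }
  port.get
}
```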
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs.