[SPARK-5697] Configurable registration retry interval and max attempts #4481
Conversation
Before, the Spark driver would attempt to register with the master 3 times, with 20 seconds between attempts, and give up if the master did not respond. In practice, however, users may have a long queue of jobs or a busy network that makes giving up this early unreasonable. Users should be able to let the driver wait longer so that its registration is eventually processed by the master. With a higher retry interval and retry count, users need to manually resubmit jobs less often, as long as they accept that a job may still fail later if the master truly crashes or becomes unreachable.
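For illustration, here is a minimal sketch of how the two settings could be read from SparkConf, with the previous hard-coded values (3 attempts, 20 seconds) as defaults. The config key names below are hypothetical placeholders, not necessarily the ones this PR introduces.

```scala
import org.apache.spark.SparkConf

// Sketch only: these key names are hypothetical placeholders, not the
// actual configuration keys added by the patch.
object RegistrationRetrySettings {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Previously hard-coded: 3 attempts, 20 seconds apart.
    val maxRetries = conf.getInt("spark.driver.registration.maxRetries", 3)
    val retryIntervalSeconds =
      conf.getLong("spark.driver.registration.retryIntervalSeconds", 20L)
    println(s"Registration: up to $maxRetries attempts, $retryIntervalSeconds s apart")
  }
}
```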

Let me know if I should document these configurations somewhere.

Test build #27126 has finished for PR 4481 at commit

Could you describe in more detail the circumstances where you see that the master is overloaded and unable to accept connections? In general the master is a very lightweight component in Spark; I'm really surprised there was ever a case where this didn't just succeed the first time (outside of an all-out failure of the master).

An unreliable network can cause the driver's messages to be lost, so the driver retries even when the master is alive. Also, we have a decent number of Spark contexts being created and stopped frequently, which can make the master's Akka message queue become a bottleneck. We have had a few cases where the driver gave up but the master launched executors anyway, meaning that if the driver had just waited a little longer, it could have proceeded with the job.

@mccheah for our use case, what values would you set these to? Just bump the retry count up to something like 10?

Just increase it depending on the environment. Of course the user should exercise discretion - if the retry count needs to be very high, there's likely something very wrong in their system.
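As a rough example, and still assuming the hypothetical key names from the sketch above, bumping the values for a busy cluster could look like this (e.g. in spark-shell or an application's setup code):

```scala
import org.apache.spark.SparkConf

// Hypothetical key names, mirroring the sketch above; tune per environment.
// If the retry count has to be very high, something else is likely wrong.
val conf = new SparkConf()
  .setAppName("patient-driver")
  .set("spark.driver.registration.maxRetries", "10")
  .set("spark.driver.registration.retryIntervalSeconds", "30")
```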

Hey Matt, sorry, I'm still a bit confused. Basically my concern is that we're trying to ignore an underlying bug by adding a configuration option, which is something we really try not to do.

Is it the case that you regularly have unreliable network connectivity inside your cluster? Spark overall assumes that nodes can establish reliable TCP connections to one another. Do you actually have TCP flows that are terminated from within the network as a regular occurrence? It's very hard for me to imagine a modern hardware cluster where this is the case.

The second explanation you gave was the Akka message queue. Akka in general should be able to process thousands of messages per second, which is way more than anyone would reasonably submit to the standalone cluster manager. It's possible that we are in some way blocking inside our actors in a way that severely limits throughput. If that is the case, then we should identify and fix the bug. Are you seeing specific Akka timeouts or some type of error message that could help pin down what is happening?

My guess is that there is just something buggy about job submission, and ideally we should fix that instead of adding more knobs to turn to work around it. If you have a reproduction of this behavior, that would actually be the best, i.e. a stress test or something that could identify what is going on.
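A reproduction along those lines might look roughly like the following: repeatedly create and stop SparkContexts against a standalone master and watch for registration timeouts. This is only a sketch of one possible stress test, not something from the patch; the master URL and iteration count are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Rough stress-test sketch: churn SparkContexts against a standalone master
// and watch whether driver registrations start timing out under load.
// The master URL and iteration count are placeholders.
object RegistrationStress {
  def main(args: Array[String]): Unit = {
    val iterations = if (args.nonEmpty) args(0).toInt else 100
    for (i <- 1 to iterations) {
      val conf = new SparkConf()
        .setAppName(s"registration-stress-$i")
        .setMaster("spark://master-host:7077") // placeholder
      val sc = new SparkContext(conf)
      sc.parallelize(1 to 10).count() // force the cluster to actually run a job
      sc.stop()
    }
  }
}
```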

If there isn't going to be follow-up along these lines, would you mind closing the PR?