[SPARK-5697] Configurable registration retry interval and max attempts #4481
Conversation
Before, the Spark driver would attempt to register with the master 3 times, with 20 seconds between attempts, and give up if the master did not respond. In practice, however, users may have a long queue of jobs or a busy network that makes giving up this early unreasonable. Users should be able to let the driver wait longer so that its registration is eventually processed by the master. With a higher retry interval and retry count, users need to manually resubmit jobs less often, as long as they accept that a job may still fail later if the master truly crashes or becomes unreachable.
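For illustration, here is a minimal sketch of how the two settings could be read from SparkConf, with the previous hard-coded values (3 attempts, 20 seconds) as defaults. The config key names below are hypothetical placeholders, not necessarily the ones this PR introduces.

```scala
import org.apache.spark.SparkConf

// Sketch only: these key names are hypothetical placeholders, not the
// actual configuration keys added by the patch.
object RegistrationRetrySettings {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Previously hard-coded: 3 attempts, 20 seconds apart.
    val maxRetries = conf.getInt("spark.driver.registration.maxRetries", 3)
    val retryIntervalSeconds =
      conf.getLong("spark.driver.registration.retryIntervalSeconds", 20L)
    println(s"Registration: up to $maxRetries attempts, $retryIntervalSeconds s apart")
  }
}
```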

Let me know if I should document these configurations somewhere.

Test build #27126 has finished for PR 4481 at commit

Could you describe in more detail the circumstances where you see that the master is overloaded and unable to accept connections? In general the master is a very lightweight component in Spark; I'm really surprised there was ever a case where this didn't just succeed the first time (outside of an all-out failure of the master).

An unreliable network can cause the driver's messages to be lost, so the driver retries even when the master is alive. Also, we have a decent number of Spark contexts being created and stopped frequently, which can make the master's Akka message queue become a bottleneck. We have had a few cases where the driver gave up but the master launched executors anyway, meaning that if the driver had just waited a little longer, it could have proceeded with the job.

@mccheah for our use case, what values would you set these to? Just bump the retry count up to something like 10?

Just increase it depending on the environment. Of course the user should exercise discretion - if the retry count needs to be very high, there's likely something very wrong in their system.
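As a rough example, and still assuming the hypothetical key names from the sketch above, bumping the values for a busy cluster could look like this (e.g. in spark-shell or an application's setup code):

```scala
import org.apache.spark.SparkConf

// Hypothetical key names, mirroring the sketch above; tune per environment.
// If the retry count has to be very high, something else is likely wrong.
val conf = new SparkConf()
  .setAppName("patient-driver")
  .set("spark.driver.registration.maxRetries", "10")
  .set("spark.driver.registration.retryIntervalSeconds", "30")
```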

Hey Matt, sorry, I'm still a bit confused. Basically my concern is that we're trying to ignore an underlying bug by adding a configuration option, which is something we really try not to do.

Is it the case that you regularly have unreliable network connectivity inside your cluster? Spark overall assumes that nodes can establish reliable TCP connections to one another. Do you actually have TCP flows that are terminated from within the network as a regular occurrence? It's very hard for me to imagine a modern hardware cluster where this is the case.

The second explanation you gave was the Akka message queue. Akka in general should be able to process thousands of messages per second, which is way more than anyone would reasonably submit to the standalone cluster manager. It's possible that we are in some way blocking inside our actors in a way that severely limits throughput. If that is the case, then we should identify and fix the bug. Are you seeing specific Akka timeouts or some type of error message that could help pin down what is happening?

My guess is that there is just something buggy about job submission, and ideally we should fix that instead of adding more knobs to turn to work around it. If you have a reproduction of this behavior, that would actually be the best, i.e. a stress test or something that could identify what is going on.
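A reproduction along those lines might look roughly like the following: repeatedly create and stop SparkContexts against a standalone master and watch for registration timeouts. This is only a sketch of one possible stress test, not something from the patch; the master URL and iteration count are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Rough stress-test sketch: churn SparkContexts against a standalone master
// and watch whether driver registrations start timing out under load.
// The master URL and iteration count are placeholders.
object RegistrationStress {
  def main(args: Array[String]): Unit = {
    val iterations = if (args.nonEmpty) args(0).toInt else 100
    for (i <- 1 to iterations) {
      val conf = new SparkConf()
        .setAppName(s"registration-stress-$i")
        .setMaster("spark://master-host:7077") // placeholder
      val sc = new SparkContext(conf)
      sc.parallelize(1 to 10).count() // force the cluster to actually run a job
      sc.stop()
    }
  }
}
```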

If there isn't going to be follow-up along these lines, would you mind closing the PR?