Conversation

@mccheah (Contributor) commented Feb 9, 2015

Before, the Spark driver would attempt to connect to the master 3 times, with 20 seconds between attempts, and if the master did not respond, the driver would give up.

In practice, however, users may have a long queue of jobs or a busy network that makes giving up this early unreasonable. Users should be able to let the driver wait longer for the master to eventually process its registration. If they set the timeout and number of retries to higher values, they do not need to manually resubmit jobs as often, as long as they are willing to accept that a job may fail later if the master truly does crash or become unreachable.
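
A minimal sketch of how such settings might be supplied from application code, assuming hypothetical property names (the exact keys this PR proposes are not quoted in the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// NOTE: the keys below are hypothetical placeholders, not the exact properties
// this PR introduces; the point is only that the number of registration
// attempts and the wait between them would become user-tunable.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("patient-driver")
  .set("spark.driver.registrationRetries", "10")          // previously fixed at 3 attempts
  .set("spark.driver.registrationRetryWaitSeconds", "20") // previously fixed at 20 seconds

val sc = new SparkContext(conf)
```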

@mccheah (Contributor, Author) commented Feb 9, 2015

Let me know if I should document these configurations somewhere.

@SparkQA commented Feb 9, 2015

Test build #27126 has finished for PR 4481 at commit 4cb5bf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell (Contributor) commented:

Could you describe in more detail the circumstances where you see the master overloaded and unable to accept connections? In general the master is a very lightweight component in Spark, so I'm really surprised there was ever a case where this didn't just succeed on the first attempt (outside of an all-out failure of the master).

@mccheah (Contributor, Author) commented Feb 10, 2015

An unreliable network can cause the driver's registration messages to be lost, so the driver retries even when the master is alive. We also have a decent number of Spark contexts being created and stopped frequently, which can make the master's Akka message queue a bottleneck.

We have had a few cases where the driver gave up but the master launched executors anyway, meaning that if the driver had just waited a little longer, it could have proceeded with the job.

@ash211 (Contributor) commented Feb 13, 2015

@mccheah for our use case what values would you set these to? Just bump the retry count up to like 10?

@mccheah (Contributor, Author) commented Feb 15, 2015

Just increase it depending on the environment. Of course, the user should exercise discretion: if the retry count needs to be very high, there's likely something very wrong in their system.

@pwendell (Contributor) commented:

Hey Matt, sorry, I'm still a bit confused. Basically my concern is that we're trying to paper over an underlying bug by adding a configuration option, which is something we really try not to do.

Is it the case that you regularly have unreliable network connectivity inside of your cluster? Spark overall assumes that nodes can establish reliable TCP connections to one another. Do you actually have TCP flows that are terminated from within the network as a regular occurrence? It's very hard for me to imagine a modern hardware cluster where this is the case.

The second explanation you gave was the Akka message queue. Akka should generally be able to process thousands of messages per second, which is far more than anyone would reasonably submit to the standalone cluster manager. It's possible that we are blocking inside our actors in a way that severely limits throughput. If that is the case, then we should identify and fix the bug.

Are you seeing specific Akka timeouts or some type of error message that could help pin down what is happening? My guess is that there is just something buggy about job submission, and ideally we should fix that instead of adding more knobs to turn to work around it.

If you have a reproduction of this behavior, that would actually be best, i.e. a stress test or something that could identify what is going on.

@srowen (Member) commented Apr 15, 2015

If there's not going to be follow-up along these lines, would you mind closing the PR?
(Aside: we now have a new syntax for specifying time properties, FWIW.)
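
For context on that aside: the newer syntax lets time properties carry an explicit unit suffix (ms, s, m, h) rather than a bare number. A small sketch using spark.network.timeout, a real but unrelated property, purely to illustrate the format:

```scala
import org.apache.spark.SparkConf

// A duration written with an explicit unit is unambiguous, whereas a bare
// number forces the reader to guess whether it means milliseconds or seconds.
val conf = new SparkConf()
  .set("spark.network.timeout", "120s") // 120 seconds, instead of an ambiguous "120"
```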

@mccheah closed this Apr 15, 2015