[SPARK-3923] Increase Akka heartbeat pause above heartbeat interval #2784

Closed
wants to merge 2 commits into from


aarondav
Contributor

Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). This post suggests that heartbeat pause should be greater than heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!

I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
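
For readers who want to see the final values side by side, here is a minimal sketch. The `spark.akka.*` property names are quoted from memory of the Spark 1.x configuration docs and should be checked against your version; the helper names `akka_settings` and `to_submit_args` are hypothetical and exist only for this illustration.

```python
# Values discussed in this PR. Property names are assumptions recalled from
# the Spark 1.x docs; verify them before relying on this.
HEARTBEAT_INTERVAL_S = 1000   # spark.akka.heartbeat.interval (left unchanged)
HEARTBEAT_PAUSE_S = 6000      # spark.akka.heartbeat.pauses (raised from 600)
FAILURE_THRESHOLD = 300.0     # spark.akka.failure-detector.threshold


def akka_settings(interval_s=HEARTBEAT_INTERVAL_S,
                  pause_s=HEARTBEAT_PAUSE_S,
                  threshold=FAILURE_THRESHOLD):
    """Build the settings, refusing a pause that does not exceed the interval."""
    if pause_s <= interval_s:
        raise ValueError("heartbeat pause must be greater than heartbeat interval")
    return {
        "spark.akka.heartbeat.interval": interval_s,
        "spark.akka.heartbeat.pauses": pause_s,
        "spark.akka.failure-detector.threshold": threshold,
    }


def to_submit_args(settings):
    """Render the settings as spark-submit style `--conf key=value` arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(akka_settings())))
```

Calling `akka_settings(pause_s=600)` raises, which mirrors the pre-patch defaults (pause 600s, interval 1000s) that this PR replaces.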

Something about the 2.3.4 upgrade seems to have made the issue manifest where
all the services disconnect from each other after exactly 1000 seconds (which
is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs)
suggests that heartbeat pause should be greater than heartbeat interval, and decreasing
the interval from 1000s to below the 600s of the heartbeat pause seems to have
rectified the issue. My current cluster has now exceeded 1400s of uptime without
failure!

I do not know why this fixed it, because the threshold we have set for the
failure detector is the exponent of a timeout, and 300 is extremely large.
Perhaps the default failure detector changed in 2.3.4 and now ignores
threshold.
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21694/
Test FAILed.

@aarondav
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21702/
Test FAILed.

@aarondav
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21706/
Test FAILed.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have started for PR 2784 at commit 3639220.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2784 at commit 3639220.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aarondav
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have started for PR 2784 at commit 3639220.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have started for PR 2784 at commit 9cb0372.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2784 at commit 3639220.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21711/
Test FAILed.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2784 at commit 9cb0372.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21712/
Test FAILed.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have started for PR 2784 at commit 9cb0372.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2784 at commit 9cb0372.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo
Contributor

witgo commented Oct 14, 2014

This configuration seems to be the value in milliseconds.
DeadlineFailureDetector.scala
PhiAccrualFailureDetector.scala

@witgo
Contributor

witgo commented Oct 14, 2014

Sorry, I made a mistake.

@ScrapCodes
Member

Hey Aaron,
I increased the interval because it is just "noise" anyway. We don't intend to use Akka's failure detector, because we have our own heartbeat tracking mechanism in place. If you reduce the interval, the number of system messages exchanged will rise. The effect may not show up in performance or in perf benchmarks, but those messages are unnecessary.

You can increase the pause instead, until Akka provides a property to turn this off completely. (I think we should log an issue?)
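
The message-volume point can be made concrete with a little arithmetic. This is only a sketch: it assumes one heartbeat per interval per connection, and the 100s interval and the connection count are hypothetical values chosen for comparison, not numbers from this PR.

```python
def heartbeats_per_hour(interval_s, connections=1):
    """Heartbeat messages sent per hour, assuming one per interval per connection."""
    return connections * 3600 / interval_s


if __name__ == "__main__":
    # Current 1000s interval versus a hypothetical 100s interval.
    print(heartbeats_per_hour(1000))
    print(heartbeats_per_hour(100))
```

Under these assumptions, shrinking the interval tenfold sends ten times as many heartbeats, while raising the pause leaves the message count unchanged.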

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have started for PR 2784 at commit bd1151a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2784 at commit bd1151a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21733/
Test PASSed.

@aarondav
Contributor Author

@ScrapCodes increased pause by order of magnitude and reverted change to interval

@ScrapCodes
Member

Thanks, LGTM.

@ScrapCodes
Member

Minor: your PR title looks misleading! :)

@aarondav aarondav changed the title [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause [SPARK-3923] Increase Akka heartbeat pause above below heartbeat interval Oct 15, 2014
@aarondav
Contributor Author

Updated, but I think we should always give PRs a name opposite to what they actually do. Keeps things interesting.

@vanzin
Contributor

vanzin commented Oct 15, 2014

"above below"?

@aarondav aarondav changed the title [SPARK-3923] Increase Akka heartbeat pause above below heartbeat interval [SPARK-3923] Increase Akka heartbeat pause above heartbeat interval Oct 15, 2014
@andrewor14
Contributor

I see, if a heartbeat is lost there is no way to recover if the wait time is less than the interval. With these changes the default pause is 6 times the default interval. This LGTM. I'm merging this.
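
The recovery argument can be sketched with a simplified deadline rule. The rule below (a peer stays alive while its silence is shorter than interval plus pause) is an assumption made for illustration and is not claimed to be Akka's exact behavior; the function names are hypothetical.

```python
def peer_alive(interval_s, pause_s, lost_heartbeats):
    """Assumed rule: alive while the silence is shorter than interval + pause."""
    # Silence seen by the receiver when this many consecutive heartbeats are lost.
    silence_s = (lost_heartbeats + 1) * interval_s
    return silence_s < interval_s + pause_s


def max_tolerated_losses(interval_s, pause_s, limit=1000):
    """Largest run of consecutive lost heartbeats that keeps the peer alive."""
    lost = 0
    while lost < limit and peer_alive(interval_s, pause_s, lost + 1):
        lost += 1
    return lost


if __name__ == "__main__":
    print(max_tolerated_losses(1000, 600))    # old defaults: pause < interval
    print(max_tolerated_losses(1000, 6000))   # new defaults: pause = 6 x interval
```

Under this assumed rule, the old defaults tolerate no lost heartbeat at all, while the new ones absorb several consecutive losses before the peer is declared dead.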

@asfgit asfgit closed this in 7f7b50e Oct 17, 2014