[SPARK-4006] Block Manager - Double Register Crash #2854

tsliwowicz · 2014-10-20T10:12:30Z

This issue affects all versions from 0.7 up to and including 1.1.

In long-running contexts, we encountered a double register of the same executor's block manager without a remove in between. The cause is unknown; we assume it was a temporary network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix: simply remove the old id from both maps during register when this happens. This mimics the behavior of expireDeadHosts() by cleaning up the maps locally before trying to add the new entries.

Also added some logging for register and unregister.

https://issues.apache.org/jira/browse/SPARK-4006
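
A minimal sketch of the fix described above, assuming a stripped-down registry rather than the real BlockManagerMasterActor. `BlockManagerRegistry`, the simplified `BlockManagerId` case class, and the `Long` registration timestamp stored as the map value are hypothetical stand-ins for illustration; only the two map names and the remove-before-add idea come from the description.

```scala
import scala.collection.mutable

// Simplified stand-in for Spark's BlockManagerId (hypothetical, illustration only).
case class BlockManagerId(executorId: String, host: String, port: Int)

// Hypothetical registry; the actual logic lives in BlockManagerMasterActor.
class BlockManagerRegistry {
  // The two maps named in the description. The Long value is an assumed
  // placeholder for the per-manager info object.
  val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, Long]
  val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  // Local cleanup of both maps, in the spirit of expireDeadHosts().
  def removeBlockManager(id: BlockManagerId): Unit = {
    blockManagerInfo.remove(id)
    blockManagerIdByExecutor.remove(id.executorId)
  }

  def register(id: BlockManagerId, registeredAtMs: Long): Unit = {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(oldId) =>
          // Same executor, different id (for example a new port):
          // assume the old manager is dead and remove it instead of exiting.
          removeBlockManager(oldId)
        case None => // first registration for this executor
      }
      blockManagerIdByExecutor(id.executorId) = id
      blockManagerInfo(id) = registeredAtMs
    }
  }
}
```

Registering the same executor id twice with different ports leaves exactly one entry in each map, pointing at the newer id, instead of terminating the process.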

AmplabJenkins · 2014-10-20T10:17:11Z

Can one of the admins verify this patch?

andrewor14 · 2014-10-20T18:45:52Z

Hey @tsliwowicz thanks for fixing this inconsistency. Since this is an issue affecting the most recent version of Spark as well, would you mind opening a PR against the master branch rather than against 0.9? It will allow us to merge this more easily into branches 1.0, 1.1, and master.

andrewor14 · 2014-10-20T18:48:08Z

core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

-          logError("Got two different block manager registrations on " + id.executorId)
-          System.exit(1)
+          // A block manager of the same executor already exists so remove it (assumed dead).
+          logError("Got two different block manager registrations on same executor - will remove, new Id " + id+", orig id - "+manager)


We have a 100 character limit per line for Spark PRs.
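
One way to keep the message under the limit is to split the concatenation across lines. This is only a sketch: `LogLineWrap` and `doubleRegisterMessage` are hypothetical names introduced here, and the helper just rebuilds the same string as the one-line version in the diff.

```scala
object LogLineWrap {
  // Hypothetical helper: same text as the single-line concatenation,
  // with every source line kept under 100 characters.
  def doubleRegisterMessage(newId: Any, oldId: Any): String =
    "Got two different block manager registrations on same executor - " +
      "will remove, new Id " + newId + ", orig id - " + oldId
}
```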

tsliwowicz · 2014-10-21T12:54:23Z

@andrewor14 - thanks, and sure - I will fix your comments and do a PR against master.
However, regarding your logging comments, it really isn't that much. It adds a few log lines per run, which is insignificant, and it makes it much easier to track registration and removal of block managers, which really helps when diagnosing issues in production. If you still think it's too much, I will leave them out.

AmplabJenkins · 2014-10-21T23:13:51Z

Can one of the admins verify this patch?

tsliwowicz · 2014-10-21T23:54:19Z

Created another pull request - #2886 - this time against master, and also addressed the comments above.

andrewor14 · 2014-10-23T18:01:47Z

Jenkins, add to whitelist

andrewor14 · 2014-10-23T18:02:10Z

Hey @tsliwowicz can you make the same changes I suggested in #2886 here?

SparkQA · 2014-10-23T18:05:18Z

QA tests have started for PR 2854 at commit 81d69f0.

This patch merges cleanly.

SparkQA · 2014-10-23T18:32:08Z

QA tests have finished for PR 2854 at commit 81d69f0.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-23T18:32:11Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22076/
Test FAILed.

…f double registe... ...r without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. This is just like #2854 except it's on master Author: Tal Sliwowicz <tal.s@taboola.com> Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits: 094d508 [Tal Sliwowicz] some more white space change undone 41a2217 [Tal Sliwowicz] some more whitspaces change undone 7bcfc3d [Tal Sliwowicz] whitspaces fix df9d98f [Tal Sliwowicz] Code review comments fixed f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

…f double registe... ...r without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. This is just like apache#2854 except it's on master Author: Tal Sliwowicz <tal.s@taboola.com> Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits: 094d508 [Tal Sliwowicz] some more white space change undone 41a2217 [Tal Sliwowicz] some more whitspaces change undone 7bcfc3d [Tal Sliwowicz] whitspaces fix df9d98f [Tal Sliwowicz] Code review comments fixed f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue. (cherry picked from commit 6b48522) Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

…f double registe... ...r without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. This is just like apache#2854 except it's on master Author: Tal Sliwowicz <tal.s@taboola.com> Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits: 094d508 [Tal Sliwowicz] some more white space change undone 41a2217 [Tal Sliwowicz] some more whitspaces change undone 7bcfc3d [Tal Sliwowicz] whitspaces fix df9d98f [Tal Sliwowicz] Code review comments fixed f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue. (cherry picked from commit 6b48522) Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala (cherry picked from commit d122236) Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

SparkQA · 2014-10-23T20:50:15Z

QA tests have started for PR 2854 at commit 95ae4db.

This patch merges cleanly.

SparkQA · 2014-10-23T21:18:29Z

QA tests have finished for PR 2854 at commit 95ae4db.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-23T21:18:32Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22083/
Test FAILed.

tsliwowicz · 2014-10-24T13:56:41Z

There seems to be a technical issue with the build (not a real failure in the pull request itself).

JoshRosen · 2014-12-04T23:36:16Z

Jenkins, retest this please.

SparkQA · 2014-12-04T23:38:00Z

Test build #24148 has started for PR 2854 at commit 95ae4db.

This patch merges cleanly.

SparkQA · 2014-12-05T00:03:52Z

Test build #24148 has finished for PR 2854 at commit 95ae4db.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-05T00:03:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24148/
Test FAILed.

SparkQA · 2014-12-22T20:44:36Z

Test build #24712 has finished for PR 2854 at commit 95ae4db.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-22T20:44:39Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24712/
Test FAILed.

andrewor14 · 2014-12-22T21:08:08Z

retest this please...

SparkQA · 2014-12-22T21:12:33Z

Test build #24714 has started for PR 2854 at commit 95ae4db.

This patch merges cleanly.

SparkQA · 2014-12-22T21:39:10Z

Test build #24714 has finished for PR 2854 at commit 95ae4db.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-22T21:39:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24714/
Test FAILed.

JoshRosen · 2014-12-24T03:12:15Z

Jenkins, retest this please.

JoshRosen · 2014-12-24T03:13:19Z

Hmm, it's not obvious to me what failed on that last run. I guess I'll just have Jenkins retest it. Pretty sure that it's not an issue in this PR, but it doesn't cost anything to just try again.

SparkQA · 2014-12-24T03:17:36Z

Test build #24756 has started for PR 2854 at commit 95ae4db.

This patch merges cleanly.

SparkQA · 2014-12-24T03:43:34Z

Test build #24756 has finished for PR 2854 at commit 95ae4db.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-24T03:43:38Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24756/
Test FAILed.

JoshRosen · 2014-12-24T21:05:36Z

Since it's not obvious what's failing, I guess I'll have to log into Jenkins and look at the logs.

andrewor14 · 2015-01-07T23:23:52Z

retest this please

SparkQA · 2015-01-07T23:27:29Z

Test build #25180 has started for PR 2854 at commit 95ae4db.

This patch merges cleanly.

SparkQA · 2015-01-07T23:53:25Z

Test build #25180 has finished for PR 2854 at commit 95ae4db.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-07T23:53:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25180/
Test FAILed.

andrewor14 · 2015-01-08T20:09:20Z

branch-0.9 in general seems to be failing tests because of port contention. I will open a PR to disable the SparkUI during tests to fix this.

andrewor14 · 2015-01-09T18:25:04Z

OK, I just fixed the port contention and Python issues, so tests should pass now. Let's retest this please.

SparkQA · 2015-01-09T18:27:38Z

Test build #25331 has started for PR 2854 at commit 95ae4db.

This patch merges cleanly.

SparkQA · 2015-01-09T18:53:56Z

Test build #25331 has finished for PR 2854 at commit 95ae4db.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-09T18:53:59Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25331/
Test PASSed.

andrewor14 · 2015-01-09T19:58:32Z

Finally. I'm merging this. Thanks everyone, and @davies, who fixed the Python tests. :)

This issue affects all versions since 0.7 up to (including) 1.1 In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. https://issues.apache.org/jira/browse/SPARK-4006 Author: Tal Sliwowicz <tal.s@taboola.com> Closes #2854 from tsliwowicz/branch-0.9.2-block-mgr-removal and squashes the following commits: 95ae4db [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe... 81d69f0 [Tal Sliwowicz] fixed comment efd93f2 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

andrewor14 · 2015-01-09T20:07:42Z

Hi @tsliwowicz can you close this PR now that it's merged? Thanks.

srowen · 2015-02-13T00:24:02Z

Mind closing this @tsliwowicz ? It won't auto-close since it was not opened against master.

srowen · 2015-02-23T15:29:28Z

Mind closing this PR?

This commit exists to close a pull request on github.

pwendell · 2015-06-04T06:40:12Z

@tsliwowicz can you please close this pull request?

tsliwowicz · 2015-06-04T09:24:52Z

@pwendell Done

tsliwowicz added 2 commits October 12, 2014 11:35

fixed comment

81d69f0

tsliwowicz mentioned this pull request Oct 20, 2014

mesos executor ids now consist of the slave id and a counter to fix dupl... #1358

Closed

tsliwowicz changed the title ~~Block Manager - Double Register Crash~~ [SPARK-4006] Block Manager - Double Register Crash Oct 20, 2014

andrewor14 reviewed Oct 20, 2014

tsliwowicz mentioned this pull request Oct 21, 2014

[SPARK-4006] In long running contexts, we encountered the situation of double registe... #2886

Closed

asfgit pushed a commit that referenced this pull request Jun 4, 2015

[MAINTENANCE] Closes #2854

e63783a

This commit exists to close a pull request on github.

tsliwowicz closed this Jun 4, 2015
