Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4006] Block Manager - Double Register Crash #2854

Closed

Conversation

tsliwowicz
Copy link
Contributor

This issue affects all versions since 0.7 up to (including) 1.1

In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

https://issues.apache.org/jira/browse/SPARK-4006

…ster without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@tsliwowicz tsliwowicz changed the title Block Manager - Double Register Crash [SPARK-4006] Block Manager - Double Register Crash Oct 20, 2014
@andrewor14
Copy link
Contributor

Hey @tsliwowicz thanks for fixing this inconsistency. Since this is an issue affecting the most recent version of Spark as well, would you mind opening a PR against the master branch rather than against 0.9? It will allow us to merge this more easily into branches 1.0, 1.1, and master.

logError("Got two different block manager registrations on " + id.executorId)
System.exit(1)
// A block manager of the same executor already exists so remove it (assumed dead).
logError("Got two different block manager registrations on same executor - will remove, new Id " + id+", orig id - "+manager)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a 100 character limit per line for Spark PRs.

@tsliwowicz
Copy link
Contributor Author

@andrewor14 - thanks, and sure - I will fix your comments and do a PR against master.
However, re your logging comments, it really isn't that much. It adds a few lines of logging per run, which is insignificant, and it helps greatly to track registration and removal of block managers, which is really helpful in production to track issues. If you still think it's too much I will leave them out.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@tsliwowicz
Copy link
Contributor Author

Created another pull request - #2886 - this time on master and also fixed the comments above.

@andrewor14
Copy link
Contributor

Jenkins, add to whitelist

@andrewor14
Copy link
Contributor

Hey @tsliwowicz can you make the same changes I suggested in #2886 here?

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2854 at commit 81d69f0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2854 at commit 81d69f0.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22076/
Test FAILed.

asfgit pushed a commit that referenced this pull request Oct 23, 2014
…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
tsliwowicz added a commit to taboola/spark that referenced this pull request Oct 23, 2014
…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
tsliwowicz added a commit to taboola/spark that referenced this pull request Oct 23, 2014
…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

(cherry picked from commit d122236)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2854 at commit 95ae4db.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2854 at commit 95ae4db.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22083/
Test FAILed.

@tsliwowicz
Copy link
Contributor Author

there seems to be some technical issue with the build. (not a real failure with the pull request itself)

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Dec 4, 2014

Test build #24148 has started for PR 2854 at commit 95ae4db.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 5, 2014

Test build #24148 has finished for PR 2854 at commit 95ae4db.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24148/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 22, 2014

Test build #24712 has finished for PR 2854 at commit 95ae4db.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24712/
Test FAILed.

@andrewor14
Copy link
Contributor

retest this please...

@SparkQA
Copy link

SparkQA commented Dec 22, 2014

Test build #24714 has started for PR 2854 at commit 95ae4db.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 22, 2014

Test build #24714 has finished for PR 2854 at commit 95ae4db.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24714/
Test FAILed.

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@JoshRosen
Copy link
Contributor

Hmm, it's not obvious to me what failed on that last run. I guess I'll just have Jenkins retest it. Pretty sure that it's not an issue in this PR, but it doesn't cost anything to just try again.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24756 has started for PR 2854 at commit 95ae4db.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 24, 2014

Test build #24756 has finished for PR 2854 at commit 95ae4db.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24756/
Test FAILed.

@JoshRosen
Copy link
Contributor

Since it's not obvious what's failing, I guess I'll have to log into Jenkins and look at the logs.

@andrewor14
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jan 7, 2015

Test build #25180 has started for PR 2854 at commit 95ae4db.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 7, 2015

Test build #25180 has finished for PR 2854 at commit 95ae4db.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25180/
Test FAILed.

@andrewor14
Copy link
Contributor

branch-0.9 in general seems to be failing tests because of port contention. I will open a PR to disable the SparkUI during tests to fix this.

@andrewor14
Copy link
Contributor

Ok I just fixed the port contention and python issues so tests should pass now. Let's retest this please.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25331 has started for PR 2854 at commit 95ae4db.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 9, 2015

Test build #25331 has finished for PR 2854 at commit 95ae4db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25331/
Test PASSed.

@andrewor14
Copy link
Contributor

Finally. I'm merging this thanks everyone and @davies who fixed the python tests. :)

asfgit pushed a commit that referenced this pull request Jan 9, 2015
This issue affects all versions since 0.7 up to (including) 1.1

In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

https://issues.apache.org/jira/browse/SPARK-4006

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2854 from tsliwowicz/branch-0.9.2-block-mgr-removal and squashes the following commits:

95ae4db [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
81d69f0 [Tal Sliwowicz] fixed comment
efd93f2 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
@andrewor14
Copy link
Contributor

Hi @tsliwowicz can you close this PR now that it's merged? Thanks.

@srowen
Copy link
Member

srowen commented Feb 13, 2015

Mind closing this @tsliwowicz ? It won't auto-close since it was not opened against master.

@srowen
Copy link
Member

srowen commented Feb 23, 2015

Mind closing this PR?

asfgit pushed a commit that referenced this pull request Jun 4, 2015
This commit exists to close a pull request on github.
@pwendell
Copy link
Contributor

pwendell commented Jun 4, 2015

@tsliwowicz can you please close this pull request?

@tsliwowicz tsliwowicz closed this Jun 4, 2015
@tsliwowicz
Copy link
Contributor Author

@pwendell Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants