Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4006] In long running contexts, we encountered the situation of double registe... #2886

Closed
wants to merge 5 commits into from

Conversation

tsliwowicz
Copy link
Contributor

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2854 except it's on master

…ster without a remove in between. The cause for that is unknown, and assumed a temp network issue.

    However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

    The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

    Also - added some logging for register and unregister.
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@andrewor14
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Oct 22, 2014

QA tests have started for PR 2886 at commit f48bce9.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 22, 2014

QA tests have finished for PR 2886 at commit f48bce9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22016/
Test PASSed.

// This should never happen. Let's just quit.
logError("Got two different block manager registrations on " + id.executorId)
System.exit(1)
// A block manager of the same executor already exists so remove it (assumed dead).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already exists, so remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually what I meant was to add a comma between "exists" and "so"... It's ok I can fix this myself when I merge it

@andrewor14
Copy link
Contributor

Hey @tsliwowicz there are a few style and wording issues that I'd like to see fixed. Also, I would prefer to have the whitespace changes reverted. However, I think the fix is correct and this LGTM logically.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2886 at commit df9d98f.

  • This patch merges cleanly.

@tsliwowicz
Copy link
Contributor Author

@andrewor14 - thanks for the comments. I believe I fixed them all. Let me know!

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2886 at commit 094d508.

  • This patch merges cleanly.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22071/
Test FAILed.

@tsliwowicz
Copy link
Contributor Author

the failure seems technical (not related to my fix), I think. Local maven build works fine for me.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2886 at commit df9d98f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22070/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2886 at commit 094d508.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22072/
Test PASSed.

@andrewor14
Copy link
Contributor

Ok I have merged this into master. It doesn't merge cleanly into 1.1, so @tsliwowicz can you create a new PR against that branch? Thanks.

@andrewor14
Copy link
Contributor

(also branch 1.0)

@asfgit asfgit closed this in 6b48522 Oct 23, 2014
tsliwowicz added a commit to taboola/spark that referenced this pull request Oct 23, 2014
…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
tsliwowicz added a commit to taboola/spark that referenced this pull request Oct 23, 2014
…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

(cherry picked from commit d122236)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
tsliwowicz added a commit to taboola/spark that referenced this pull request Oct 23, 2014
…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes apache#2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
@tsliwowicz
Copy link
Contributor Author

will do. Can you also merge into the 0.9 branch? I will update the PR I already have for it. #2854

@tsliwowicz
Copy link
Contributor Author

@andrewor14 I created PR for 1.0, 1.1 and updated the 0.9 PR - can you please review and merge if ok? Non of the merges were clean, so I decided to do it for each branch.

@andrewor14
Copy link
Contributor

Thanks a lot @tsliwowicz I'll do that shortly.

asfgit pushed a commit that referenced this pull request Oct 24, 2014
…f d...

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.1

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2915 from tsliwowicz/branch-1.1-block-mgr-removal and squashes the following commits:

d122236 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
asfgit pushed a commit that referenced this pull request Dec 18, 2014
…f d...

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.0

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2914 from tsliwowicz/branch-1.0-block-mgr-removal and squashes the following commits:

1014493 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants