Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3495][SPARK-3496] Backporting block replication fixes made in master to branch 1.1 #3191

Closed

Conversation

tdas
Copy link
Contributor

@tdas tdas commented Nov 10, 2014

The original PR was #2366

This backport was non-trivial because Spark 1.1 uses ConnectionManager instead of NioBlockTransferService, which required slight modification to unit tests. Other than that the code is exactly same as in the original PR. Please refer to discussion in the original PR if you have any thoughts.

…n target node is dead AND [SPARK-3496] Block replication by mistake chooses driver as target

If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, this continues to fail indefinitely. Even if the driver learns about the demise of the B, A continues to try replicating to B and failing miserably.

The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B is dead. This affects Spark Streaming as its receiver uses block replication.

The solution in this patch adds the following.
- Changed BlockManagerMaster to return all the peers of a block manager, rather than the requested number. It also filters out driver BlockManager.
- Refactored BlockManager's replication code to handle peer caching correctly.
    + The peer for replication is randomly selected. This is different from past behavior where for a node A, a node B was deterministically chosen for the lifetime of the application.
    + If replication fails to one node, the peers are refetched.
    + The peer cached has a TTL of 1 second to enable discovery of new peers and using them for replication.
- Refactored use of \<driver\> in BlockManager into a new method `BlockManagerId.isDriver`
- Added replication unit tests (replication was not tested till now, duh!)

This should not make a difference in performance of Spark workloads where replication is not used.

@andrewor14 @JoshRosen

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes apache#2366 from tdas/replication-fix and squashes the following commits:

9690f57 [Tathagata Das] Moved replication tests to a new BlockManagerReplicationSuite.
0661773 [Tathagata Das] Minor changes based on PR comments.
a55a65c [Tathagata Das] Added a unit test to test replication behavior.
012afa3 [Tathagata Das] Bug fix
89f91a0 [Tathagata Das] Minor change.
68e2c72 [Tathagata Das] Made replication peer selection logic more efficient.
08afaa9 [Tathagata Das] Made peer selection for replication deterministic to block id
3821ab9 [Tathagata Das] Fixes based on PR comments.
08e5646 [Tathagata Das] More minor changes.
d402506 [Tathagata Das] Fixed imports.
4a20531 [Tathagata Das] Filtered driver block manager from peer list, and also consolidated the use of <driver> in BlockManager.
7598f91 [Tathagata Das] Minor changes.
03de02d [Tathagata Das] Change replication logic to correctly refetch peers from master on failure and on new worker addition.
d081bf6 [Tathagata Das] Fixed bug in get peers and unit tests to test get-peers and replication under executor churn.
9f0ac9f [Tathagata Das] Modified replication tests to fail on replication bug.
af0c1da [Tathagata Das] Added replication unit tests to BlockManagerSuite

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManager.scala
	core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala
	core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
@SparkQA
Copy link

SparkQA commented Nov 10, 2014

QA tests have started for PR 3191 at commit 2ed927f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 10, 2014

QA tests have finished for PR 3191 at commit 2ed927f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23163/
Test FAILed.

@tdas
Copy link
Contributor Author

tdas commented Nov 10, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Nov 10, 2014

QA tests have started for PR 3191 at commit 2ed927f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 10, 2014

QA tests have finished for PR 3191 at commit 2ed927f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23165/
Test FAILed.

@tdas tdas changed the title [SPARK-3495][SPARK-3496] Backporting fix made in master to branch 1.1 [SPARK-3495][SPARK-3496] Backporting block replication fixes made in master to branch 1.1 Nov 11, 2014
@SparkQA
Copy link

SparkQA commented Nov 11, 2014

QA tests have started for PR 3191 at commit 593214a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 11, 2014

QA tests have finished for PR 3191 at commit 593214a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23180/
Test PASSed.

asfgit pushed a commit that referenced this pull request Nov 11, 2014
…master to branch 1.1

The original PR was #2366

This backport was non-trivial because Spark 1.1 uses ConnectionManager instead of NioBlockTransferService, which required slight modification to unit tests. Other than that the code is exactly same as in the original PR. Please refer to discussion in the original PR if you have any thoughts.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3191 from tdas/replication-fix-branch-1.1-backport and squashes the following commits:

593214a [Tathagata Das] Merge remote-tracking branch 'apache-github/branch-1.1' into branch-1.1
2ed927f [Tathagata Das] Fixed error in unit test.
de4ff73 [Tathagata Das] [SPARK-3495] Block replication fails continuously when the replication target node is dead AND [SPARK-3496] Block replication by mistake chooses driver as target
@andrewor14
Copy link
Contributor

This is already merged. Can you close this?

@tdas
Copy link
Contributor Author

tdas commented Nov 11, 2014

Closing. Github does not automatically merge PRs that are not to the main branch.

@tdas tdas closed this Nov 11, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants