[SPARK-47702][CORE] Remove Shuffle service endpoint from the locations list when RDD block is removed form a node.#45836
Closed
maheshk114 wants to merge 2 commits intoapache:masterfrom
Closed
[SPARK-47702][CORE] Remove Shuffle service endpoint from the locations list when RDD block is removed form a node.#45836maheshk114 wants to merge 2 commits intoapache:masterfrom
maheshk114 wants to merge 2 commits intoapache:masterfrom
Conversation
…s list when RDD block is removed form a node
Author
|
@attilapiros Can you please take a look at this PR. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Contributor
|
@maheshk114 I am sorry I missed this PR earlier. Let me try to reopen it. |
Contributor
|
retest |
attilapiros
reviewed
Aug 1, 2024
Comment on lines
+2115
to
+2116
| // memory size and disk size are being kept for calculating delta. Reset the replica | ||
| // count 0 in storage level to notify that its a remove operation. |
Contributor
There was a problem hiding this comment.
Suggested change
| // memory size and disk size are being kept for calculating delta. Reset the replica | |
| // count 0 in storage level to notify that its a remove operation. | |
| // memory size and disk size are being kept for calculating delta but reset the replication | |
| // factor to 0 to indicate its a remove operation. |
Contributor
|
@maheshk114 could you please open a new PR with the same content? |
Contributor
|
I think it was the stale tag which caused it to get closed again @attilapiros. Removed it, let us see if it stays open :-) |
attilapiros
reviewed
Aug 3, 2024
Comment on lines
+312
to
+313
| val store2 = makeBlockManager(10000, "host-2") | ||
| assert(master.getPeers(store1.blockManagerId).toSet === Set(store2.blockManagerId)) |
| val blockId = RDDBlockId(1, 2) | ||
| val message = new Array[Byte](1000) | ||
|
|
||
| // if SHUFFLE_SERVICE_FETCH_RDD_ENABLED is enabled, then shuffle port should be present. |
Contributor
There was a problem hiding this comment.
Suggested change
| // if SHUFFLE_SERVICE_FETCH_RDD_ENABLED is enabled, then shuffle port should be present. | |
| // As SHUFFLE_SERVICE_FETCH_RDD_ENABLED is enabled the external shuffle service should be listed |
| assert(master.getLocations(blockId).contains( | ||
| BlockManagerId("host-1", "localhost", shuffleServicePort, None))) | ||
|
|
||
| // after block is removed, shuffle port should be removed. |
Contributor
There was a problem hiding this comment.
Suggested change
| // after block is removed, shuffle port should be removed. | |
| // After block is removed the external shuffle service should be removed too from the locations |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This PR fix a bug where the shuffle service end point is not removed when a RDD block is removed from a node. This is done by changing the call to reportBlockStatus in removeBlockInternal method. Currently StorageLevel.NONE is passed as storage level. This causes the driver to ignore the RDD in method updateBlockInfo as storageLevel.useDisk is set as false in StorageLevel.NONE.
Why are the changes needed?
If the block location is not updated properly, then tasks fails with fetch failed exception. The tasks will try to read the RDD blocks from a node using external shuffle service. The read will fail, if the node is already decommissioned.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a new UT.
Was this patch authored or co-authored using generative AI tooling?
No