[SPARK-33154][CORE][K8S] Handle cleaned shuffles during migration #30046
holdenk wants to merge 15 commits into apache:master
Test build #129796 has finished for PR 30046 at commit
core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionUnitSuite.scala
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala
      decommissioningTest = true)
  }

  test("Test basic decommissioning with shuffle cleanup", k8sTestTag) {
Thank you for adding this.
...ion-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
Kubernetes integration test starting
Kubernetes integration test status failure
Gentle ping, @holdenk.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #129851 has finished for PR 30046 at commit
dongjoon-hyun
left a comment
BTW, the newly added K8s IT fails. Could you check it?
- Test decommissioning with dynamic allocation & shuffle cleanups *** FAILED ***
core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionUnitSuite.scala
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #129863 has finished for PR 30046 at commit
Kubernetes integration test starting
Kubernetes integration test status success
R failure is unrelated. cc @dongjoon-hyun
Test build #129871 has finished for PR 30046 at commit
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @holdenk .
Merged to master.
Closes apache#30046 from holdenk/handle-cleaned-shuffles-during0migration.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
If a block is removed between its discovery and its transfer, we short-circuit that block: it is removed from the list of blocks to transfer and the count of transferred blocks is incremented. This is complicated because both RPC errors and local read errors may be reported with the same exception class.
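The short-circuit described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `ShuffleBlock`, `MigrationSketch`, `blockStillExists`, and `transfer` are hypothetical names introduced only for illustration, and the sketch assumes that a local existence check is enough to tell a deleted block apart from a genuine RPC failure when both surface as the same exception class.

```scala
import java.io.IOException
import scala.collection.mutable

// Hypothetical stand-ins; none of these names come from the Spark code base.
final case class ShuffleBlock(shuffleId: Int, mapId: Long)

sealed trait MigrationOutcome
case object Migrated extends MigrationOutcome
case object SkippedDeleted extends MigrationOutcome
case object RetryLater extends MigrationOutcome

class MigrationSketch(
    blockStillExists: ShuffleBlock => Boolean, // assumed local lookup
    transfer: ShuffleBlock => Unit) {          // assumed to throw IOException on failure

  val pending = mutable.Queue.empty[ShuffleBlock]
  var migratedCount = 0

  // Tries to migrate one pending block; returns None when nothing is pending.
  def migrateNext(): Option[MigrationOutcome] = {
    if (pending.isEmpty) return None
    val block = pending.dequeue()
    val outcome: MigrationOutcome =
      try {
        transfer(block)
        Migrated
      } catch {
        // Same exception class for both failure modes, so check locally:
        // if the block no longer exists it was cleaned up, not an RPC problem.
        case _: IOException if !blockStillExists(block) =>
          SkippedDeleted
        // The block is still present, so treat this as a real transfer failure.
        case _: IOException =>
          pending.enqueue(block)
          RetryLater
      }
    // A deleted block counts as handled so that progress tracking can complete.
    if (outcome != RetryLater) migratedCount += 1
    Some(outcome)
  }
}
```

In this sketch a cleaned-up block is counted as done rather than retried, so the pending queue can drain and the failure is not attributed to the destination host.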
Why are the changes needed?
Slow shuffle refreshes could waste time when decommissioning has already finished. In addition, if a deleted block fails to transfer to a host, that otherwise live host may be marked as "full", and decommissioning might then avoid transferring some blocks to it.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit and integration tests.