[SPARK-33154][CORE][K8S] Handle cleaned shuffles during migration#30046

Closed
holdenk wants to merge 15 commits into apache:master from holdenk:handle-cleaned-shuffles-during0migration

@holdenk holdenk commented Oct 15, 2020

What changes were proposed in this pull request?

If a block is removed between its discovery and its transfer, we short-circuit that block: we remove it from the list of blocks to transfer and increment the count of transferred blocks. This is complicated because both RPC errors and local read errors may be reported with the same exception class.
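The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `MigrationSketch`, `ShuffleBlock`, `existsLocally`, and `upload` are all hypothetical names, and re-checking local existence after a failure is only one assumed way to tell a deleted block apart from an RPC error when both surface as the same exception class.

```scala
import java.io.IOException

// Hypothetical names throughout; this is not Spark's actual decommissioning code.
object MigrationSketch {
  // Stand-in for a shuffle block identifier.
  final case class ShuffleBlock(shuffleId: Int, mapId: Long)

  sealed trait Outcome
  case object Migrated extends Outcome
  case object SkippedDeleted extends Outcome
  case object RetryLater extends Outcome

  // `upload` may throw IOException for a remote (RPC) failure or for a local
  // read failure. Assumption: on failure we re-check whether the block still
  // exists locally, and treat a missing block as cleaned up rather than failed.
  def migrateOne(
      block: ShuffleBlock,
      existsLocally: ShuffleBlock => Boolean,
      upload: ShuffleBlock => Unit): Outcome = {
    try {
      upload(block)
      Migrated
    } catch {
      case _: IOException if !existsLocally(block) => SkippedDeleted
      case _: IOException => RetryLater
    }
  }

  // Returns (number of blocks counted as transferred, blocks still to retry).
  // Deleted blocks count as transferred so the totals can still converge.
  def migrateAll(
      pending: List[ShuffleBlock],
      existsLocally: ShuffleBlock => Boolean,
      upload: ShuffleBlock => Unit): (Int, List[ShuffleBlock]) = {
    val outcomes = pending.map(b => b -> migrateOne(b, existsLocally, upload))
    val retry = outcomes.collect { case (b, RetryLater) => b }
    (pending.size - retry.size, retry)
  }
}
```

Counting a deleted block as transferred, rather than leaving it pending, is what lets the migration finish instead of retrying a block that can never be read again.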

Why are the changes needed?

Slow shuffle refreshes could waste time when decommissioning has already finished. Decommissioning might avoid transferring some blocks to an otherwise live host that is marked as "full" if a deleted block fails to transfer to that host.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit and integration tests.

SparkQA commented Oct 15, 2020

Test build #129796 has finished for PR 30046 at commit fa19ca4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

decommissioningTest = true)
}

test("Test basic decommissioning with shuffle cleanup", k8sTestTag) {

Thank you for adding this.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33154] Handle cleaned shuffles during migration [SPARK-33154][CORE][K8S] Handle cleaned shuffles during migration Oct 15, 2020

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34404/

SparkQA commented Oct 15, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34404/

@dongjoon-hyun
Member

Gentle ping, @holdenk .

SparkQA commented Oct 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34457/

SparkQA commented Oct 15, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34457/

SparkQA commented Oct 15, 2020

Test build #129851 has finished for PR 30046 at commit cc926b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun left a comment

BTW, the newly added K8s IT fails. Could you check it?

- Test decommissioning with dynamic allocation & shuffle cleanups *** FAILED ***

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34469/

SparkQA commented Oct 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34469/

SparkQA commented Oct 16, 2020

Test build #129863 has finished for PR 30046 at commit 32c4580.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34476/

SparkQA commented Oct 16, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34476/

holdenk commented Oct 16, 2020

R failure is unrelated. cc @dongjoon-hyun

SparkQA commented Oct 16, 2020

Test build #129871 has finished for PR 30046 at commit b50eea8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @holdenk .
Merged to master.

holdenk added a commit to holdenk/spark that referenced this pull request Oct 27, 2020
Closes apache#30046 from holdenk/handle-cleaned-shuffles-during0migration.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>