Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle #33034

Closed
wants to merge 28 commits into from

Conversation

venkata91
Copy link
Contributor

@venkata91 venkata91 commented Jun 23, 2021

What changes were proposed in this pull request?

[SPARK-23243] and [SPARK-25341] addressed cases of stage retries for indeterminate stage involving operations like repartition. This PR addresses the same issues in the context of push-based shuffle. Currently there is no way to distinguish the current execution of a stage for a shuffle ID. Therefore the changes explained below are necessary.

Core changes are summarized as follows:

  1. Introduce a new variable shuffleMergeId in ShuffleDependency which is monotonically increasing value tracking the temporal ordering of execution of <stage-id, stage-attempt-id> for a shuffle ID.
  2. Correspondingly make changes in the push-based shuffle protocol layer in MergedShuffleFileManager, BlockStoreClient passing the shuffleMergeId in order to keep track of the shuffle output in separate files on the shuffle service side.
  3. DAGScheduler increments the shuffleMergeId tracked in ShuffleDependency in the cases of a indeterministic stage execution
  4. Deterministic stage will have shuffleMergeId set to 0 as no special handling is needed in this case and indeterminate stage will have shuffleMergeId starting from 1.

Why are the changes needed?

New protocol changes are needed due to the reasons explained above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new unit tests in RemoteBlockPushResolverSuite, DAGSchedulerSuite, BlockIdSuite, ErrorHandlerSuite

@venkata91
Copy link
Contributor Author

cc @mridulm @Victsm @otterc @zhouyejoe @Ngone51 Please take a look. Currently it is in work in progress as tests are being added. Raised this PR now since we are making protocol changes, it would be better if it can be done before branch-3.2 cut that way at least protocol changes can be merged if reviews on implementation details takes more time. Thanks :)

@venkata91 venkata91 force-pushed the SPARK-32923 branch 2 times, most recently from de08b6e to 39e5df6 Compare June 23, 2021 02:35
Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to handle the case of how shuffleSequenceId needs to be configured.

Copy link
Contributor

@otterc otterc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to fetch protocols seem unnecessary because fetch will never request shuffle data of an older shuffleSequenceId.

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments

@venkata91
Copy link
Contributor Author

@mridulm @otterc @Ngone51 Currently I have updated the PR with the changes of SPARK-35546, will remove it once the PR gets merged. Please review.

@venkata91 venkata91 force-pushed the SPARK-32923 branch 2 times, most recently from 6e21a27 to 92ada56 Compare July 19, 2021 20:04
@venkata91
Copy link
Contributor Author

@otterc feels this PR is quite big, I agree. I will break this PR in to 2 client and server and keep this for reference purposes.

@mridulm
Copy link
Contributor

mridulm commented Jul 21, 2021

Is this still WIP ?

@venkata91
Copy link
Contributor Author

Is this still WIP ?

Yeah I am in the process of breaking this into 2 PRs. Will update here once that is done.

@venkata91
Copy link
Contributor Author

Is this still WIP ?

Yeah I am in the process of breaking this into 2 PRs. Will update here once that is done.

After having offline discussions with @mridulm , we decided not to break this PR in to 2. Will fix one of the pending change and remove the WIP tag.

@otterc
Copy link
Contributor

otterc commented Jul 21, 2021

After having offline discussions with @mridulm , we decided not to break this PR in to 2. Will fix one of the pending change and remove the WIP tag.

@venkata91 Could you please provide a reason for not breaking this PR in multiple parts. This PR is touching a lot files both on the client and server side. It changes the protocols. On the client, this again touches the driver side, the push side, and the fetch side.
We have broken changes in the past for similar sized changes to make the review easy and reduces the introduction of bugs. So why for this one we are making the exception?
cc. @mridulm @Victsm @Ngone51

@venkata91
Copy link
Contributor Author

After having offline discussions with @mridulm , we decided not to break this PR in to 2. Will fix one of the pending change and remove the WIP tag.

@venkata91 Could you please provide a reason for not breaking this PR in multiple parts. This PR is touching a lot files both on the client and server side. It changes the protocols. On the client, this again touches the driver side, the push side, and the fetch side.
We have broken changes in the past for similar sized changes to make the review easy and reduces the introduction of bugs. So why for this one we are making the exception?
cc. @mridulm @Victsm @Ngone51
There are couple of reasons:

  1. Splitting the PR into 2 is not easy as well as clean - there are few classes which are dependent on both client and server side for eg: MergeStatuses, FinalizeShuffleMerge and if I start adding one of them to either client side changes or server side changes then the dependencies slowly expand and becomes hard to separate them into 2 clean units given now we have implementation also in place.
  2. Given the timeline of RC, @mridulm feels if we break this in to 2 and 1 of them gets in but the other don't then that would be a problem. May be we can time it out so that both of them are close to completion and try to merge them both together.
    Any suggestions?

@venkata91 venkata91 changed the title WIP: [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle Jul 21, 2021
@venkata91
Copy link
Contributor Author

venkata91 commented Jul 21, 2021

@mridulm Removed the WIP tag now it is good to review. cc @Victsm @otterc @Ngone51

@Ngone51
Copy link
Member

Ngone51 commented Jul 22, 2021

I tend to agree with not break because of the RC timeline. Usually, multiple PRs take more time to get all merged in than one PR.
I'll take a look today.

@venkata91
Copy link
Contributor Author

venkata91 commented Aug 1, 2021

Can you fix the conflict @venkata91 ?

@mridulm Fixed the conflict, should be good now. Will wait for the tests to run completely.

@SparkQA
Copy link

SparkQA commented Aug 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46442/

@SparkQA
Copy link

SparkQA commented Aug 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46442/

@SparkQA
Copy link

SparkQA commented Aug 1, 2021

Test build #141932 has finished for PR 33034 at commit 4a43d1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Copy link
Contributor

mridulm commented Aug 1, 2021

Thanks @venkata91, the tests are passing with the update.
The PR has addressed all pending comment, but will wait for @Ngone51, @otterc or @zhouyejoe to also take a look before merging.

@zhouyejoe
Copy link
Contributor

@venkata91 Thanks for working on this. LGTM.

@asfgit asfgit closed this in c039d99 Aug 2, 2021
asfgit pushed a commit that referenced this pull request Aug 2, 2021
…ush-based shuffle

[[SPARK-23243](https://issues.apache.org/jira/browse/SPARK-23243)] and [[SPARK-25341](https://issues.apache.org/jira/browse/SPARK-25341)] addressed cases of stage retries for indeterminate stage involving operations like repartition. This PR addresses the same issues in the context of push-based shuffle. Currently there is no way to distinguish the current execution of a stage for a shuffle ID. Therefore the changes explained below are necessary.

Core changes are summarized as follows:

1. Introduce a new variable `shuffleMergeId` in `ShuffleDependency` which is monotonically increasing value tracking the temporal ordering of execution of <stage-id, stage-attempt-id> for a shuffle ID.
2. Correspondingly make changes in the push-based shuffle protocol layer in `MergedShuffleFileManager`, `BlockStoreClient` passing the `shuffleMergeId` in order to keep track of the shuffle output in separate files on the shuffle service side.
3. `DAGScheduler` increments the `shuffleMergeId` tracked in `ShuffleDependency` in the cases of a indeterministic stage execution
4. Deterministic stage will have `shuffleMergeId` set to 0 as no special handling is needed in this case and indeterminate stage will have `shuffleMergeId` starting from 1.

New protocol changes are needed due to the reasons explained above.

No

Added new unit tests in `RemoteBlockPushResolverSuite, DAGSchedulerSuite, BlockIdSuite, ErrorHandlerSuite`

Closes #33034 from venkata91/SPARK-32923.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit c039d99)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
@mridulm
Copy link
Contributor

mridulm commented Aug 2, 2021

Merged to master and branch-3.2
There was some conflict cherry picking to branch-3.2 which I manually fixed.

+CC @gengliangwang

Thanks for fixing this @venkata91
Thanks for all the reviews @Ngone51, @otterc, @zhouyejoe and @Victsm !

This was the last patch for push based shuffle SPIP - the only pending task is documentation.
Thanks for all the PR's and reviews everyone !!

@gengliangwang
Copy link
Member

@venkata91 Thanks for the work!
@mridulm Thanks for the ping :)

@venkata91
Copy link
Contributor Author

Thanks for the thorough reviews @mridulm @Ngone51 @otterc @zhouyejoe . Learned quite a lot :)

zhouyejoe pushed a commit to linkedin/spark that referenced this pull request Aug 3, 2021
…ush-based shuffle

[[SPARK-23243](https://issues.apache.org/jira/browse/SPARK-23243)] and [[SPARK-25341](https://issues.apache.org/jira/browse/SPARK-25341)] addressed cases of stage retries for indeterminate stage involving operations like repartition. This PR addresses the same issues in the context of push-based shuffle. Currently there is no way to distinguish the current execution of a stage for a shuffle ID. Therefore the changes explained below are necessary.

Core changes are summarized as follows:

1. Introduce a new variable `shuffleMergeId` in `ShuffleDependency` which is monotonically increasing value tracking the temporal ordering of execution of <stage-id, stage-attempt-id> for a shuffle ID.
2. Correspondingly make changes in the push-based shuffle protocol layer in `MergedShuffleFileManager`, `BlockStoreClient` passing the `shuffleMergeId` in order to keep track of the shuffle output in separate files on the shuffle service side.
3. `DAGScheduler` increments the `shuffleMergeId` tracked in `ShuffleDependency` in the cases of a indeterministic stage execution
4. Deterministic stage will have `shuffleMergeId` set to 0 as no special handling is needed in this case and indeterminate stage will have `shuffleMergeId` starting from 1.

New protocol changes are needed due to the reasons explained above.

No

Added new unit tests in `RemoteBlockPushResolverSuite, DAGSchedulerSuite, BlockIdSuite, ErrorHandlerSuite`

Closes apache#33034 from venkata91/SPARK-32923.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit c039d99)
asfgit pushed a commit that referenced this pull request Aug 4, 2021
…when finalize request for higher shuffleMergeId is received

### What changes were proposed in this pull request?

Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received when no blocks pushed for the corresponding shuffleMergeId. This is identified as part of #33034 (comment).

### Why are the changes needed?

Without this change, older shuffleMergeId files won't be cleaned up properly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added changes to existing unit test to address this case.

Closes #33605 from venkata91/SPARK-32923-follow-on.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
asfgit pushed a commit that referenced this pull request Aug 4, 2021
…when finalize request for higher shuffleMergeId is received

### What changes were proposed in this pull request?

Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received when no blocks pushed for the corresponding shuffleMergeId. This is identified as part of #33034 (comment).

### Why are the changes needed?

Without this change, older shuffleMergeId files won't be cleaned up properly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added changes to existing unit test to address this case.

Closes #33605 from venkata91/SPARK-32923-follow-on.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit d816949)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
Ngone51 pushed a commit that referenced this pull request Aug 10, 2021
### What changes were proposed in this pull request?
Cleanup `RemoteBlockPushResolver` log messages by using `AppShufflePartitionInfo#toString()` to avoid duplications. Currently this is based off of #33034 will remove those changes once it is merged and remove the WIP at that time.

### Why are the changes needed?
Minor cleanup to make code more readable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No tests, just changing log messages

Closes #33561 from venkata91/SPARK-36332.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
Ngone51 pushed a commit that referenced this pull request Aug 10, 2021
### What changes were proposed in this pull request?
Cleanup `RemoteBlockPushResolver` log messages by using `AppShufflePartitionInfo#toString()` to avoid duplications. Currently this is based off of #33034 will remove those changes once it is merged and remove the WIP at that time.

### Why are the changes needed?
Minor cleanup to make code more readable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No tests, just changing log messages

Closes #33561 from venkata91/SPARK-36332.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit ab89710)
Signed-off-by: yi.wu <yi.wu@databricks.com>
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
…ush-based shuffle

[[SPARK-23243](https://issues.apache.org/jira/browse/SPARK-23243)] and [[SPARK-25341](https://issues.apache.org/jira/browse/SPARK-25341)] addressed cases of stage retries for indeterminate stage involving operations like repartition. This PR addresses the same issues in the context of push-based shuffle. Currently there is no way to distinguish the current execution of a stage for a shuffle ID. Therefore the changes explained below are necessary.

Core changes are summarized as follows:

1. Introduce a new variable `shuffleMergeId` in `ShuffleDependency` which is monotonically increasing value tracking the temporal ordering of execution of <stage-id, stage-attempt-id> for a shuffle ID.
2. Correspondingly make changes in the push-based shuffle protocol layer in `MergedShuffleFileManager`, `BlockStoreClient` passing the `shuffleMergeId` in order to keep track of the shuffle output in separate files on the shuffle service side.
3. `DAGScheduler` increments the `shuffleMergeId` tracked in `ShuffleDependency` in the cases of a indeterministic stage execution
4. Deterministic stage will have `shuffleMergeId` set to 0 as no special handling is needed in this case and indeterminate stage will have `shuffleMergeId` starting from 1.

New protocol changes are needed due to the reasons explained above.

No

Added new unit tests in `RemoteBlockPushResolverSuite, DAGSchedulerSuite, BlockIdSuite, ErrorHandlerSuite`

Closes apache#33034 from venkata91/SPARK-32923.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit c039d99)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022
…when finalize request for higher shuffleMergeId is received

### What changes were proposed in this pull request?

Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received when no blocks pushed for the corresponding shuffleMergeId. This is identified as part of apache#33034 (comment).

### Why are the changes needed?

Without this change, older shuffleMergeId files won't be cleaned up properly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added changes to existing unit test to address this case.

Closes apache#33605 from venkata91/SPARK-32923-follow-on.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit d816949)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
wangyum pushed a commit that referenced this pull request May 26, 2023
…ush-based shuffle

[[SPARK-23243](https://issues.apache.org/jira/browse/SPARK-23243)] and [[SPARK-25341](https://issues.apache.org/jira/browse/SPARK-25341)] addressed cases of stage retries for indeterminate stage involving operations like repartition. This PR addresses the same issues in the context of push-based shuffle. Currently there is no way to distinguish the current execution of a stage for a shuffle ID. Therefore the changes explained below are necessary.

Core changes are summarized as follows:

1. Introduce a new variable `shuffleMergeId` in `ShuffleDependency` which is monotonically increasing value tracking the temporal ordering of execution of <stage-id, stage-attempt-id> for a shuffle ID.
2. Correspondingly make changes in the push-based shuffle protocol layer in `MergedShuffleFileManager`, `BlockStoreClient` passing the `shuffleMergeId` in order to keep track of the shuffle output in separate files on the shuffle service side.
3. `DAGScheduler` increments the `shuffleMergeId` tracked in `ShuffleDependency` in the cases of a indeterministic stage execution
4. Deterministic stage will have `shuffleMergeId` set to 0 as no special handling is needed in this case and indeterminate stage will have `shuffleMergeId` starting from 1.

New protocol changes are needed due to the reasons explained above.

No

Added new unit tests in `RemoteBlockPushResolverSuite, DAGSchedulerSuite, BlockIdSuite, ErrorHandlerSuite`

Closes #33034 from venkata91/SPARK-32923.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
wangyum pushed a commit that referenced this pull request May 26, 2023
…when finalize request for higher shuffleMergeId is received

### What changes were proposed in this pull request?

Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received when no blocks pushed for the corresponding shuffleMergeId. This is identified as part of #33034 (comment).

### Why are the changes needed?

Without this change, older shuffleMergeId files won't be cleaned up properly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added changes to existing unit test to address this case.

Closes #33605 from venkata91/SPARK-32923-follow-on.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
8 participants