[SPARK-36772] FinalizeShuffleMerge fails with an exception due to attempt id not matching #34018
…o attempt id not matching.
@zhouyejoe Thanks for the fix.
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java
Tested the patch on our own cluster with YARN set up. Push-based shuffle works fine in YARN cluster mode, where YARN assigns the default application attempt 1 to the job. Added a unit test that leverages the CSMockExternalBlockManager: it starts a SparkContext and checks whether the application attemptId is set in the BlockStoreClient.
@zhouyejoe could you update the PR title and description?
Updated the PR title and description.
...network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java
core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala
Discussed with @Ngone51.
Thoughts?
SGTM. Will update accordingly.
I'm thinking of the potential compatibility issue in the future by using the |
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
…nt from Int to String, but add comparableAppAttemptId in ExternalBlockStoreClient
I am also thinking about the same.
Update:
BTW, I tested in three modes:
Just a couple of minor comments - looks good to me.
+CC @Ngone51
...network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
...network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java
...network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java
Updated. The last PR test run failed on an unrelated unit test.
Ok to test
Kubernetes integration test starting
Kubernetes integration test status failure
Merging to master/3.2
…empt id not matching

### What changes were proposed in this pull request?
Remove the appAttemptId from TransportConf, and parsing through SparkEnv.

### Why are the changes needed?
Push based shuffle will fail if there are any attemptId set in the SparkConf, as the attemptId is not set correctly in Driver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested within our Yarn cluster. Without this PR, the Driver will fail to finalize the shuffle merge on all the mergers. After the patch, Driver can successfully finalize the shuffle merge and the push based shuffle can work fine. Also with unit test to verify the attemptId is being set in the BlockStoreClient in Driver.

Closes #34018 from zhouyejoe/SPARK-36772.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit cabc36b)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Test build #143433 has finished for PR 34018 at commit
Late lgtm. Thanks @zhouyejoe
Thanks for reviewing and merging it @gengliangwang!
Thanks for the review @mridulm @Ngone51 @venkata91 @gengliangwang
What changes were proposed in this pull request?
Remove the appAttemptId from TransportConf and pass it through SparkEnv instead.
Why are the changes needed?
Push-based shuffle fails if an attemptId is set in the SparkConf, because the attemptId is not set correctly in the driver.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested on our YARN cluster. Without this PR, the driver fails to finalize the shuffle merge on all the mergers. With the patch, the driver finalizes the shuffle merge successfully and push-based shuffle works.
A unit test also verifies that the attemptId is set in the BlockStoreClient in the driver.
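For readers who want to see the failure mode in isolation, here is a minimal sketch of the mismatch described above. It is not Spark's code: `Merger`, `Client`, `tryFinalize`, `setAppAttemptId` and the `-1` sentinel are hypothetical stand-ins chosen for illustration, and only the name `comparableAppAttemptId` is borrowed from the commit title in this thread. The sketch assumes the merger rejects a finalize request whose attempt id differs from the one it registered, and that the driver-side client keeps an unset default until someone passes the id in explicitly.

```java
public class AttemptIdSketch {
    /** Assumption of this sketch: -1 means "no attempt id was ever set". */
    static final int UNSET_ATTEMPT_ID = -1;

    /** Hypothetical stand-in for a merger that remembers the attempt id it registered. */
    static final class Merger {
        private final int registeredAttemptId;

        Merger(int registeredAttemptId) {
            this.registeredAttemptId = registeredAttemptId;
        }

        /** Rejects a finalize request whose attempt id differs from the registered one. */
        void finalizeShuffleMerge(int shuffleId, int requestAttemptId) {
            if (requestAttemptId != registeredAttemptId) {
                throw new IllegalArgumentException(
                    "attempt id " + requestAttemptId + " does not match "
                        + registeredAttemptId + " for shuffle " + shuffleId);
            }
        }
    }

    /** Hypothetical stand-in for the driver-side block store client. */
    static final class Client {
        private int comparableAppAttemptId = UNSET_ATTEMPT_ID;

        /** Keeps the id as a String at the API boundary, but stores an int for comparison. */
        void setAppAttemptId(String appAttemptId) {
            try {
                comparableAppAttemptId = Integer.parseInt(appAttemptId);
            } catch (NumberFormatException e) {
                // Non-numeric or null ids leave the previous value in place in this sketch.
            }
        }

        /** Returns true if the merger accepted the finalize request. */
        boolean tryFinalize(Merger merger, int shuffleId) {
            try {
                merger.finalizeShuffleMerge(shuffleId, comparableAppAttemptId);
                return true;
            } catch (IllegalArgumentException e) {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Merger merger = new Merger(1);      // the job runs as application attempt 1
        Client driverClient = new Client(); // attempt id never set on the driver side

        System.out.println("before setting attempt id: " + driverClient.tryFinalize(merger, 0));
        driverClient.setAppAttemptId("1");  // the id is now passed in explicitly
        System.out.println("after setting attempt id: " + driverClient.tryFinalize(merger, 0));
    }
}
```

Running `main` prints `false` for the first call and `true` for the second, which mirrors the before and after behaviour reported in the testing notes.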