[CELEBORN-1233] Treat unfound PartitionLocation as failed in Controller#commitFiles #2235
Closed
waitinfuture wants to merge 8 commits into apache:main from
Conversation
Contributor
Author
Codecov Report: all modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##             main    #2235   +/-   ##
==========================================
+ Coverage   47.61%   47.62%   +0.01%
==========================================
  Files         192      192
  Lines       11865    11865
  Branches     1050     1050
==========================================
+ Hits         5648     5649       +1
  Misses       5843     5843
+ Partials      374      373       -1
☔ View full report in Codecov by Sentry.
RexXiong reviewed Jan 17, 2024
.booleanConf
.createWithDefault(false)

val TEST_WORKER_UNDER_TEST: ConfigEntry[Boolean] =
Contributor
Why is this configuration needed?
Contributor
Author
The reason is that WithShuffleClientSuite#registerAndFinishPartition called mapPartitionMapperEnd twice. I reverted the config and removed the duplicate invocation. PTAL @RexXiong
Contributor
Author
Thanks, merging to main(v0.5.0)/branch-0.4(v0.4.0)/branch-0.3(v0.3.3)
waitinfuture added a commit that referenced this pull request Jan 18, 2024
…er#commitFiles
### What changes were proposed in this pull request?
I tested 1T TPCDS with the following Celeborn 8-worker cluster setup:
1. Workers have fixed ports for rpc/push/replicate
2. `spark.celeborn.client.spark.fetch.throwsFetchFailure` is enabled
3. graceful shutdown is enabled
I randomly killed some workers with kill -9 (non-graceful shutdown) or stopped them with ./sbin/stop-worker.sh (graceful shutdown) and restarted them immediately. I then encountered result incorrectness with low probability (1 out of 99 queries).
After digging into it, I found the reason is as follows:
1. At time T1, all workers are serving shuffle 602.
2. At time T2, I run stop-worker.sh for worker2, then kill -9 and restart worker1. Since the workers are configured with fixed ports, clients think they are still OK, and the Master lets them re-register, which also succeeds. And worker2's in-memory state is clean.
3. At time T3, push requests to worker2 fail and revive on worker1, so worker1 holds reservations for shuffle 602. Then I start worker2.
4. At time T4, LifecycleManager sends CommitFiles to all workers; worker1 just logs that some PartitionLocations don't exist and ignores them.
5. CommitFiles succeeds, but worker1 lost some data before restarting, and no error is reported.
The following snapshot shows the process.

This PR fixes this by treating unfound PartitionLocations as failed when handling CommitFiles.
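A minimal sketch of the before/after behavior, assuming a hypothetical `commitFiles` helper (the class and method names here are illustrative, not Celeborn's actual Controller API): a requested partition id that is missing from the worker's in-memory state is added to the failed list instead of being silently skipped, so the client can detect the loss and trigger recomputation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only -- not Celeborn's real Controller implementation.
public class CommitFilesSketch {

    // Commit the requested partitions that this worker knows about.
    // A partition id missing from knownLocations means the worker lost its
    // in-memory state (e.g. it was killed and restarted). Before the fix this
    // case was only logged and ignored; after the fix it is counted as failed.
    static void commitFiles(Map<String, String> knownLocations,
                            List<String> requestedIds,
                            List<String> committedIds,
                            List<String> failedIds) {
        for (String id : requestedIds) {
            if (knownLocations.containsKey(id)) {
                committedIds.add(id);   // normal path: flush and commit
            } else {
                failedIds.add(id);      // the fix: surface the missing location
            }
        }
    }

    public static void main(String[] args) {
        // After its restart, worker1 only knows one of the two partitions.
        Map<String, String> known = new HashMap<>();
        known.put("602-0-0", "/data/shuffle/602-0-0");

        List<String> committed = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        commitFiles(known, Arrays.asList("602-0-0", "602-1-0"), committed, failed);

        // prints: committed=[602-0-0] failed=[602-1-0]
        System.out.println("committed=" + committed + " failed=" + failed);
    }
}
```

With the old behavior the failed list would stay empty, the commit would look successful, and the lost partition would surface later as incorrect query results rather than an error.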
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test
Closes #2235 from waitinfuture/1233.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 749a0fa)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>