
[CELEBORN-1233] Treat unfound PartitionLocation as failed in Controller#commitFiles #2235

Closed

waitinfuture wants to merge 8 commits into apache:main from waitinfuture:1233

Conversation

waitinfuture (Contributor) commented Jan 17, 2024

What changes were proposed in this pull request?

I tested 1T TPCDS with the following Celeborn 8-worker cluster setup:

  1. Workers have fixed ports for rpc/push/replicate
  2. spark.celeborn.client.spark.fetch.throwsFetchFailure is enabled
  3. graceful shutdown is enabled

I randomly killed some workers with kill -9 or stopped them with ./sbin/stop-worker.sh (covering both non-graceful and graceful shutdown) and restarted them immediately. I then hit result incorrectness with low probability (1 out of 99 queries).

After digging into it, I found the reason is as follows:

  1. At time T1, all workers are serving shuffle 602.
  2. At time T2, I run stop-worker.sh for worker2, then kill -9 and restart worker1. Since the workers are configured with fixed ports, clients think they are OK, and the Master lets them re-register, which also succeeds. Meanwhile worker2's in-memory state is clean.
  3. At time T3, push requests to worker2 fail and are revived on worker1, so worker1 holds reservations for shuffle 602. Then I start worker2.
  4. At time T4, LifecycleManager sends CommitFiles to all workers; worker1 just logs that some PartitionLocations don't exist and ignores them.
  5. CommitFiles succeeds, but worker1 lost some data when it restarted, and no error is reported.

The following snapshot shows the process.

![image](https://github.com/apache/incubator-celeborn/assets/948245/9ef1a1ff-bb26-420a-929c-70c9476ec700)
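The subtle step is T2: because the ports are fixed, a restarted worker re-registers under the same endpoint identity, so neither the clients nor the Master notices that its in-memory state is gone. A minimal sketch of that blind spot (class and key names are illustrative, not Celeborn's actual code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: worker identity is the host:port endpoint, so a process
// restart is invisible to anyone who keys state by that identity.
public class FixedPortRestart {
    public static void main(String[] args) {
        String worker1 = "host1:9097"; // fixed rpc port

        // Client-side view: shuffle 602 has reservations on worker1.
        Map<String, Set<Integer>> clientView = new HashMap<>();
        clientView.put(worker1, new HashSet<>(Set.of(602)));

        // Worker-side view before kill -9: the same reservation exists.
        Map<String, Set<Integer>> workerState = new HashMap<>();
        workerState.put(worker1, new HashSet<>(Set.of(602)));

        // kill -9 + restart: the worker's memory is wiped...
        workerState.put(worker1, new HashSet<>());

        // ...but the endpoint is unchanged, so the client still believes
        // worker1 holds shuffle 602's data.
        boolean clientThinksDataExists = clientView.get(worker1).contains(602);
        boolean workerActuallyHasData = workerState.get(worker1).contains(602);
        System.out.println(clientThinksDataExists + " " + workerActuallyHasData);
    }
}
```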

This PR fixes this by treating unfound PartitionLocations as failed when handling CommitFiles.
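The fix can be sketched as follows. Method and field names here are hypothetical simplifications; the real Controller#commitFiles works with file writers and commit metadata rather than plain maps:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: a worker committing shuffle files for the requested
// partition locations. Names are illustrative, not Celeborn's actual API.
public class CommitFilesSketch {
    static void commitFiles(List<String> requestedIds,
                            Map<String, String> openFiles,
                            List<String> committedIds,
                            List<String> failedIds) {
        for (String uniqueId : requestedIds) {
            if (!openFiles.containsKey(uniqueId)) {
                // Before this PR: a missing location was only logged and
                // skipped, so a worker that lost its state in a restart
                // still reported a fully successful commit.
                failedIds.add(uniqueId);
            } else {
                committedIds.add(uniqueId);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> openFiles = new HashMap<>();
        openFiles.put("602-0-0", "/data/shuffle/602-0-0");
        // "602-1-0" was reserved before worker1's restart and is now gone.
        List<String> committed = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        commitFiles(List.of("602-0-0", "602-1-0"), openFiles, committed, failed);
        System.out.println(committed + " " + failed); // [602-0-0] [602-1-0]
    }
}
```

With the failed list non-empty, the LifecycleManager can surface the loss (e.g. as a fetch failure) so the engine recomputes the missing map output instead of silently reading an incomplete shuffle.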

Why are the changes needed?

ditto

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test

waitinfuture (Contributor, Author) commented:

cc @RexXiong @FMX


codecov bot commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (30608ea) 47.61% compared to head (39796a4) 47.62%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2235      +/-   ##
==========================================
+ Coverage   47.61%   47.62%   +0.01%     
==========================================
  Files         192      192              
  Lines       11865    11865              
  Branches     1050     1050              
==========================================
+ Hits         5648     5649       +1     
  Misses       5843     5843              
+ Partials      374      373       -1     


```scala
.booleanConf
.createWithDefault(false)

val TEST_WORKER_UNDER_TEST: ConfigEntry[Boolean] =
```
A reviewer (Contributor) asked:

Why is this configuration needed?

waitinfuture (Contributor, Author) replied on Jan 18, 2024:

The reason is that WithShuffleClientSuite#registerAndFinishPartition called mapPartitionMapperEnd twice. I reverted the config and removed the duplicate invocation. PTAL @RexXiong

RexXiong (Contributor) left a comment:

LGTM, thanks!

waitinfuture (Contributor, Author) commented:

Thanks, merging to main (v0.5.0) / branch-0.4 (v0.4.0) / branch-0.3 (v0.3.3)

waitinfuture added a commit that referenced this pull request Jan 18, 2024
…er#commitFiles

Closes #2235 from waitinfuture/1233.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 749a0fa)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
waitinfuture added a commit that referenced this pull request Jan 18, 2024
…er#commitFiles

Closes #2235 from waitinfuture/1233.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 749a0fa)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>