
[CELEBORN-1233] Treat unfound PartitionLocation as failed in Controller#commitFiles #2235

Closed

waitinfuture wants to merge 8 commits into apache:main from waitinfuture:1233

Conversation

waitinfuture (Contributor) commented Jan 17, 2024

What changes were proposed in this pull request?

I tested 1T TPCDS with the following Celeborn 8-worker cluster setup:

  1. Workers have fixed ports for rpc/push/replicate
  2. spark.celeborn.client.spark.fetch.throwsFetchFailure is enabled
  3. graceful shutdown is enabled

I randomly killed some workers with kill -9 or stopped them with ./sbin/stop-worker.sh (covering both non-graceful and graceful shutdown) and restarted them immediately. I then hit result incorrectness with low probability (1 out of 99 queries).

After digging into it, I found the reason is as follows:

  1. At time T1, all workers are serving shuffle 602.
  2. At time T2, I run stop-worker.sh for worker2, then kill -9 and restart worker1. Since the workers are configured with fixed ports, clients think they are OK, and the Master lets them re-register, which also succeeds. Meanwhile worker2's in-memory state is clean.
  3. At time T3, push requests to worker2 fail and are revived on worker1, so worker1 holds reservations for shuffle 602. Then I start worker2.
  4. At time T4, LifecycleManager sends CommitFiles to all workers; worker1 just logs that some PartitionLocations don't exist and ignores them.
  5. CommitFiles succeeds, but worker1 lost some data when it restarted, and no error is reported.

The following snapshot shows the process.

![image](https://github.com/apache/incubator-celeborn/assets/948245/9ef1a1ff-bb26-420a-929c-70c9476ec700)
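The subtle step is T2: because the ports are fixed, a restarted worker re-registers under the same endpoint identity, so neither the clients nor the Master notices that its in-memory state is gone. A minimal sketch of that blind spot (class and key names are illustrative, not Celeborn's actual code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: worker identity is the host:port endpoint, so a process
// restart is invisible to anyone who keys state by that identity.
public class FixedPortRestart {
    public static void main(String[] args) {
        String worker1 = "host1:9097"; // fixed rpc port

        // Client-side view: shuffle 602 has reservations on worker1.
        Map<String, Set<Integer>> clientView = new HashMap<>();
        clientView.put(worker1, new HashSet<>(Set.of(602)));

        // Worker-side view before kill -9: the same reservation exists.
        Map<String, Set<Integer>> workerState = new HashMap<>();
        workerState.put(worker1, new HashSet<>(Set.of(602)));

        // kill -9 + restart: the worker's memory is wiped...
        workerState.put(worker1, new HashSet<>());

        // ...but the endpoint is unchanged, so the client still believes
        // worker1 holds shuffle 602's data.
        boolean clientThinksDataExists = clientView.get(worker1).contains(602);
        boolean workerActuallyHasData = workerState.get(worker1).contains(602);
        System.out.println(clientThinksDataExists + " " + workerActuallyHasData);
    }
}
```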

This PR fixes this by treating unfound PartitionLocations as failed when handling CommitFiles.
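The fix can be sketched as follows. Method and field names here are hypothetical simplifications; the real Controller#commitFiles works with file writers and commit metadata rather than plain maps:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: a worker committing shuffle files for the requested
// partition locations. Names are illustrative, not Celeborn's actual API.
public class CommitFilesSketch {
    static void commitFiles(List<String> requestedIds,
                            Map<String, String> openFiles,
                            List<String> committedIds,
                            List<String> failedIds) {
        for (String uniqueId : requestedIds) {
            if (!openFiles.containsKey(uniqueId)) {
                // Before this PR: a missing location was only logged and
                // skipped, so a worker that lost its state in a restart
                // still reported a fully successful commit.
                failedIds.add(uniqueId);
            } else {
                committedIds.add(uniqueId);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> openFiles = new HashMap<>();
        openFiles.put("602-0-0", "/data/shuffle/602-0-0");
        // "602-1-0" was reserved before worker1's restart and is now gone.
        List<String> committed = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        commitFiles(List.of("602-0-0", "602-1-0"), openFiles, committed, failed);
        System.out.println(committed + " " + failed); // [602-0-0] [602-1-0]
    }
}
```

With the failed list non-empty, the LifecycleManager can surface the loss (e.g. as a fetch failure) so the engine recomputes the missing map output instead of silently reading an incomplete shuffle.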

Why are the changes needed?

ditto

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test

waitinfuture (Contributor, Author) commented:

cc @RexXiong @FMX


codecov bot commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (30608ea) 47.61% compared to head (39796a4) 47.62%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2235      +/-   ##
==========================================
+ Coverage   47.61%   47.62%   +0.01%     
==========================================
  Files         192      192              
  Lines       11865    11865              
  Branches     1050     1050              
==========================================
+ Hits         5648     5649       +1     
  Misses       5843     5843              
+ Partials      374      373       -1     


```scala
.booleanConf
.createWithDefault(false)

val TEST_WORKER_UNDER_TEST: ConfigEntry[Boolean] =
```
A reviewer (Contributor) asked:

Why is this configuration needed?

waitinfuture (Contributor, Author) replied on Jan 18, 2024:

The reason is that WithShuffleClientSuite#registerAndFinishPartition called mapPartitionMapperEnd twice. I reverted the config and removed the duplicate invocation. PTAL @RexXiong

RexXiong (Contributor) left a comment:

LGTM, thanks!

waitinfuture (Contributor, Author) commented:

Thanks, merging to main (v0.5.0) / branch-0.4 (v0.4.0) / branch-0.3 (v0.3.3)

waitinfuture added a commit that referenced this pull request Jan 18, 2024
…er#commitFiles

Closes #2235 from waitinfuture/1233.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 749a0fa)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
waitinfuture added a commit that referenced this pull request Jan 18, 2024
…er#commitFiles

Closes #2235 from waitinfuture/1233.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 749a0fa)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>