[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry #1584
base: master
Conversation
cc @dingshun3016 @yl09099 PTAL
After rethinking this, I think the
Force-pushed from b192095 to 4d3a892 (compare)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             master    #1584      +/-   ##
============================================
- Coverage     54.86%   53.42%     -1.45%
- Complexity     2358     2943       +585
============================================
  Files           368      435        +67
  Lines         16379    23768      +7389
  Branches       1504     2208       +704
============================================
+ Hits           8986    12697      +3711
- Misses         6862    10290      +3428
- Partials        531      781       +250

☔ View full report in Codecov by Sentry.
It's dangerous to delete the failed stage's data when we retry. It's hard to reach the condition under which the data can be deleted safely. We should rely on data skipping to avoid reading the failed data.
Could you describe this in more detail?
Some tasks may still write legacy data to the shuffle server after you delete the shuffle data. Although we resubmit the stage, some tasks from the last attempt may still write data; Spark doesn't guarantee that all tasks from the last attempt have ended once the newest attempt has started.
@EnricoMi If the stage is retried, the taskId may not be unique, because we don't have a stage attemptId to distinguish task 1 attempt 0 in stage attempt 0 from task 1 attempt 0 in stage attempt 1. This may cause us to read the wrong data.
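A tiny hypothetical example of the ambiguity described above; the key format below is invented purely for illustration and is not how Uniffle encodes task or block ids:

```java
// Hypothetical sketch, not Uniffle code: it only shows why taskId + taskAttempt alone
// cannot tell apart data written by different stage attempts.
public class TaskIdAmbiguity {

  // Key without the stage attempt: collides across stage attempts.
  static String key(int shuffleId, int taskId, int taskAttempt) {
    return shuffleId + "-" + taskId + "-" + taskAttempt;
  }

  // Key with the stage attempt: the reader can tell old data from new data.
  static String keyWithStageAttempt(int shuffleId, int stageAttempt, int taskId, int taskAttempt) {
    return shuffleId + "-" + stageAttempt + "-" + taskId + "-" + taskAttempt;
  }

  public static void main(String[] args) {
    // Task 1, attempt 0: once in stage attempt 0 and once in stage attempt 1.
    System.out.println(key(0, 1, 0).equals(key(0, 1, 0)));               // true  -> ambiguous
    System.out.println(keyWithStageAttempt(0, 0, 1, 0)
        .equals(keyWithStageAttempt(0, 1, 1, 0)));                       // false -> distinguishable
  }
}
```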
If so, we'd better reject the shuffle data of the older version. This could be implemented by maintaining the latest stage
OK, maybe rejecting the legacy data will be the better choice.
Ignore this. Maybe rejecting legacy data will be a better choice.
@@ -158,6 +158,30 @@ public void registerShuffle(
    String remoteStoragePath = req.getRemoteStorage().getPath();
    String user = req.getUser();

    if (req.getIsStageRetry()) {
If removeShuffleDataSync is always being called, we can avoid adding the isStageRetry plumbing in here. When isStageRetry == false, this is a NOOP. Method removeShuffleDataSync might return true if it found data to delete, so we can conditionally log the message below.
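A minimal sketch of the suggested pattern, using illustrative names rather than the real Uniffle server API:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the suggestion above; method and class names are illustrative only.
public class UnconditionalCleanupSketch {
  private static final Logger LOG = LoggerFactory.getLogger(UnconditionalCleanupSketch.class);

  // Stand-in for the real synchronous cleanup: returns true only if data was found and deleted.
  boolean removeShuffleDataSync(String appId, int shuffleId) {
    return false; // NOOP when there is nothing left over from a previous attempt
  }

  void registerShuffle(String appId, int shuffleId) {
    // Always call the cleanup: no isStageRetry plumbing is needed, because the call
    // is a NOOP when no previous-attempt data exists, and the return value drives the log.
    if (removeShuffleDataSync(appId, shuffleId)) {
      LOG.info("Removed previous stage attempt data for app {} shuffle {}", appId, shuffleId);
    }
    // ... continue with normal registration ...
  }
}
```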
I prefer reserving the isStageRetry param (or replacing it with the stage attempt number) for 2 reasons:
- It is more explicit for stage retry, especially when something goes wrong, like the previous data having been purged due to an expired heartbeat. With this flag, the log will indicate that the abnormal situation happened.
- For the next PR, I will introduce the stage's latest attempt to discard the older attempt's data.
- all this plumbing for logging is peculiar
- maybe there are better mechanisms to discard older data
Using the latest attempt id on the server side to check whether a send request from an older attempt is still valid; this will be finished in the next PR.
This has been included in this PR.
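A rough sketch of that server-side check; the class and method names are assumptions, not the actual Uniffle implementation planned for the follow-up PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the rejection idea discussed above; names are assumptions only.
public class LatestAttemptGuard {
  // shuffleId -> latest registered stage attempt number
  private final Map<Integer, Integer> latestAttempt = new ConcurrentHashMap<>();

  // Called when a shuffle is (re-)registered for a new stage attempt.
  public void onRegister(int shuffleId, int stageAttemptNumber) {
    latestAttempt.merge(shuffleId, stageAttemptNumber, Math::max);
  }

  // Called for every send-data request; returns false for data from an older attempt.
  public boolean accept(int shuffleId, int stageAttemptNumber) {
    Integer latest = latestAttempt.get(shuffleId);
    return latest == null || stageAttemptNumber >= latest;
  }
}
```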
Can we register a shuffle as the tuple
I think deletion of earlier shuffle data should not be synchronous in the first place! That is flawed by design. Think of TBs of shuffle data. It should be removed quickly, in constant time (e.g. an HDFS move), and cleaned up asynchronously (e.g. an HDFS delete).
I agree with you. I'm concerned about the cost of the refactor.
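A sketch of the move-quickly-then-delete-asynchronously idea; java.nio is used here purely for illustration, while a real implementation would go through the server's storage layer (local disk, HDFS):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

// Sketch of "constant-time move, asynchronous delete"; not the Uniffle storage code.
public class AsyncShuffleCleaner {
  private final ExecutorService deleter = Executors.newSingleThreadExecutor();

  public void purge(Path shuffleDir, Path trashDir) throws IOException {
    // 1. Cheap, near-constant-time step: move the whole directory out of the way so a
    //    retried stage can immediately re-register and write fresh data. ATOMIC_MOVE
    //    assumes trashDir is on the same filesystem.
    Files.createDirectories(trashDir);
    Path target = trashDir.resolve(shuffleDir.getFileName() + "." + System.nanoTime());
    Files.move(shuffleDir, target, StandardCopyOption.ATOMIC_MOVE);

    // 2. The expensive recursive delete happens in the background.
    deleter.submit(() -> {
      try (Stream<Path> paths = Files.walk(target)) {
        paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
      } catch (IOException e) {
        // Best effort: anything left over stays in the trash dir for a later sweep.
      }
    });
  }
}
```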
Force-pushed from d922716 to ae02409 (compare)
import org.apache.uniffle.common.ShuffleServerInfo;
import org.apache.uniffle.proto.RssProtos;

public class ChainShuffleHandleInfo extends ShuffleHandleInfoBase {
How about renaming it to StageAttemptShuffleHandleInfo?
        .build();
    replicaServersProto.put(replicaServerEntry.getKey(), item);
  }
  Map<Integer, RssProtos.PartitionReplicaServers> partitionToServers = new HashMap<>();
Why remove the synchronized?
  @Override
  public boolean reassignOnStageResubmit(
      int stageId, int stageAttemptNumber, int shuffleId, int numPartitions) {
    ReentrantReadWriteLock.WriteLock shuffleWriteLock = getShuffleWriteLock(shuffleId);
I think this could also be added into the StageResubmitManager.
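One possible shape of that refactor, purely as a sketch; StageResubmitManager's real fields and methods may differ:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Sketch only: one way for StageResubmitManager to own the per-shuffle write lock,
// as suggested above. Not the actual Uniffle class.
public class StageResubmitManagerSketch {
  private final Map<Integer, ReentrantReadWriteLock> shuffleLocks = new ConcurrentHashMap<>();

  private ReentrantReadWriteLock.WriteLock getShuffleWriteLock(int shuffleId) {
    return shuffleLocks.computeIfAbsent(shuffleId, id -> new ReentrantReadWriteLock()).writeLock();
  }

  // Callers hand in the reassignment work; the manager owns the locking.
  public <T> T withShuffleWriteLock(int shuffleId, Supplier<T> action) {
    ReentrantReadWriteLock.WriteLock lock = getShuffleWriteLock(shuffleId);
    lock.lock();
    try {
      return action.get();
    } finally {
      lock.unlock();
    }
  }
}
```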
-  MutableShuffleHandleInfo reassignOnBlockSendFailure(
+  ChainShuffleHandleInfo reassignOnBlockSendFailure(
This should still be MutableShuffleHandleInfo.
@@ -184,6 +184,7 @@ message ShuffleRegisterRequest {
  string user = 5;
  DataDistribution shuffleDataDistribution = 6;
  int32 maxConcurrencyPerPartitionToWrite = 7;
  int32 stageAttemptNumber = 8;
This should be solved.
Force-pushed from 5c9d9e3 to 3dd6b34 (compare)
Force-pushed from a8b70cc to 10bc42c (compare)
…at all previous data is cleared for stage retry
  private LinkedList<ShuffleHandleInfo> historyHandles;

  public StageAttemptShuffleHandleInfo(ShuffleHandleInfo shuffleServerInfo) {
    super(0, null);
Why does this have to be a ShuffleHandleInfo when this is 0 / null?
> Why does this have to be a ShuffleHandleInfo when this is 0 / null?

This has been modified; the whole PR has been migrated to #1762.
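For reference, one way to avoid the 0 / null placeholders would be to delegate to the wrapped handle. This is only a sketch under the assumptions noted in the comments; the code that actually landed in #1762 may differ:

```java
import java.util.LinkedList;

// Sketch only: it assumes ShuffleHandleInfoBase has a (shuffleId, remoteStorage)
// constructor (as implied by super(0, null)) and that ShuffleHandleInfo exposes
// getShuffleId() / getRemoteStorage(). These assumptions may not match the real API.
public class StageAttemptShuffleHandleInfo extends ShuffleHandleInfoBase {
  private ShuffleHandleInfo current;
  private final LinkedList<ShuffleHandleInfo> historyHandles = new LinkedList<>();

  public StageAttemptShuffleHandleInfo(ShuffleHandleInfo handle) {
    // Forward the wrapped handle's identity instead of the 0 / null placeholders.
    super(handle.getShuffleId(), handle.getRemoteStorage());
    this.current = handle;
  }
}
```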
What changes were proposed in this pull request?
Clear out previous stage attempt data synchronously when registering the reassigned shuffleIds.
Why are the changes needed?
Fix: #1579
If the previous stage attempt's data is still in the purge queue on the shuffle-server side, writes from the retried stage will cause unknown exceptions, so we'd better clear out all previous stage attempt data before re-registering.
This PR synchronously removes the previous stage data when the first attempt writer is initialized.
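A condensed sketch of that flow; the class and method names here are illustrative, not the exact Uniffle server code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Condensed sketch of the intended behavior; names do not match the real server code.
public class StageRetryRegistration {
  // shuffleId -> stage attempt number seen at registration time
  private final Map<Integer, Integer> registeredAttempt = new ConcurrentHashMap<>();

  public void registerShuffle(String appId, int shuffleId, int stageAttemptNumber) {
    Integer previous = registeredAttempt.put(shuffleId, stageAttemptNumber);
    if (previous != null && stageAttemptNumber > previous) {
      // Stage retry: synchronously clear the previous attempt's data (including data
      // still sitting in the purge queue) before accepting new writes, so the retried
      // stage never mixes with stale blocks.
      removeShuffleDataSync(appId, shuffleId);
    }
    // ... normal registration continues ...
  }

  private void removeShuffleDataSync(String appId, int shuffleId) {
    // Stand-in for the real synchronous deletion of the previous attempt's data.
  }
}
```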
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.