
[#1608][part-3] feat(spark3): support reading partition data from multiple reassignment servers #1615

Merged
merged 21 commits into from
Apr 17, 2024

Conversation

zuston
Member

@zuston zuston commented Apr 2, 2024

What changes were proposed in this pull request?

Support reading from partition block data reassignment servers.

Why are the changes needed?

For: #1608

The writer has been writing data to the reassignment servers, so the reader must be able to read from them as well.
BlockIds are stored on the servers that own each partition, so this PR reads blockIds from those servers while
still honoring the min-replica requirement.
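To illustrate the min-replica bookkeeping mentioned above, here is a hypothetical, much-simplified sketch of per-partition replica tracking. The class and method names are illustrative only; the actual `PartitionDataReplicaRequirementTracking` class in this PR differs.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: track how many replicas of a partition have been
// successfully read, and report when the min-replica requirement is met.
public class ReplicaRequirementSketch {
    private final int minReplica;
    // partitionId -> replicaIndex -> count of successful reads for that replica
    private final Map<Integer, Map<Integer, Integer>> succeedList = new HashMap<>();

    public ReplicaRequirementSketch(int minReplica) {
        this.minReplica = minReplica;
    }

    public void markSucceed(int partitionId, int replicaIndex) {
        succeedList
            .computeIfAbsent(partitionId, k -> new HashMap<>())
            .merge(replicaIndex, 1, Integer::sum);
    }

    // The partition is satisfied once at least minReplica distinct replicas succeeded.
    public boolean isSatisfied(int partitionId) {
        Map<Integer, Integer> replicas = succeedList.get(partitionId);
        return replicas != null && replicas.size() >= minReplica;
    }

    public static void main(String[] args) {
        ReplicaRequirementSketch tracking = new ReplicaRequirementSketch(2);
        tracking.markSucceed(0, 0);
        System.out.println(tracking.isSatisfied(0)); // false: only one replica succeeded
        tracking.markSucceed(0, 1);
        System.out.println(tracking.isSatisfied(0)); // true: min-replica of 2 met
    }
}
```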

Does this PR introduce any user-facing change?

No.

How was this patch tested?

PartitionBlockDataReassignTest integration test.

@codecov-commenter

codecov-commenter commented Apr 2, 2024

Codecov Report

Attention: Patch coverage is 66.96429%, with 37 lines in your changes missing coverage. Please review.

Project coverage is 54.97%. Comparing base (1051d26) to head (03f9293).
Report is 3 commits behind head on master.

Files Patch % Lines
...rg/apache/spark/shuffle/KryoSerializerWrapper.java 0.00% 15 Missing ⚠️
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 0.00% 7 Missing ⚠️
...fle/shuffle/manager/ShuffleManagerGrpcService.java 0.00% 4 Missing ⚠️
.../response/RssPartitionToShuffleServerResponse.java 0.00% 4 Missing ⚠️
...va/org/apache/spark/shuffle/ShuffleHandleInfo.java 93.75% 2 Missing and 1 partial ⚠️
...lient/PartitionDataReplicaRequirementTracking.java 93.75% 2 Missing ⚠️
...he/uniffle/server/buffer/ShuffleBufferManager.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1615      +/-   ##
============================================
+ Coverage     53.98%   54.97%   +0.98%     
+ Complexity     2872     2750     -122     
============================================
  Files           438      407      -31     
  Lines         24927    21215    -3712     
  Branches       2126     2014     -112     
============================================
- Hits          13456    11662    -1794     
+ Misses        10626     8812    -1814     
+ Partials        845      741     -104     



github-actions bot commented Apr 2, 2024

Test Results

 2 384 files  +21   2 384 suites  +21   4h 33m 18s ⏱️ + 1m 55s
   919 tests + 7     918 ✅ + 7   1 💤 ±0  0 ❌ ±0 
10 666 runs  +81  10 652 ✅ +81  14 💤 ±0  0 ❌ ±0 

Results for commit 297947e. ± Comparison against base commit 3ea3aaa.

♻️ This comment has been updated with latest results.

@zuston zuston changed the title [#1608][part-3] feat(spark3): support reading from partition block data reassignment servers [#1608][part-3] feat(spark3): support reading from partition multiple reassignment servers Apr 3, 2024
@zuston zuston changed the title [#1608][part-3] feat(spark3): support reading from partition multiple reassignment servers [#1608][part-3] feat(spark3): support reading from multiple reassignment servers Apr 3, 2024
@zuston zuston changed the title [#1608][part-3] feat(spark3): support reading from multiple reassignment servers [#1608][part-3] feat(spark3): support reading partition data from multiple reassignment servers Apr 8, 2024
import static org.junit.jupiter.api.Assertions.assertEquals;

/** This class is to test the mechanism of partition block data reassignment. */
public class PartitionBlockDataReassignTest extends SparkSQLTest {
Member Author

This class is the Spark integration test for the partition block data reassignment mechanism.

@zuston
Member Author

zuston commented Apr 8, 2024

PTAL @dingshun3016 @xumanbu @jerqi

}

replacement = faultyServerReplacements.get(faultyServerId);
replacement = faultyServerToReplacements.get(faultyServerId);
for (Integer partitionId : partitionIds) {
List<ShuffleServerInfo> replicaServers = partitionToServers.get(partitionId);
for (int i = 0; i < replicaServers.size(); i++) {
if (replicaServers.get(i).getId().equals(faultyServerId)) {
Contributor

There's a small problem: when a server is reassigned for the second time, faultyServerId is no longer in partitionToServers.

Member Author

Currently, server reassignment only occurs once.

import org.apache.spark.serializer.KryoSerializer;
import org.apache.spark.serializer.SerializerInstance;

public class KryoSerializerWrapper {
Contributor

We shouldn't bind to KryoSerializerWrapper.

Member Author

I want the shuffleHandleInfo to be shared between the driver and executors as serialized bytes.

Contributor

Can other RPCs also be implemented in this manner?

Member Author

Yes, similar implementations could be adopted in the shuffleManager RPC service, because the driver and executors are always bound to the same version, so there are no compatibility problems.

Member Author

@zuston zuston Apr 11, 2024

This has been refactored to avoid de/serialization performance drop @jerqi

RemoteStorageInfo remote_storage_info = 3;
string msg = 4;
string msg = 2;
bytes shuffleHandleInfoSerializableBytes = 3;
Contributor

Why do we need this?

@jerqi
Contributor

jerqi commented Apr 9, 2024

There are some points I care about:

  1. Do we support reassigning multiple times? Do we have tests for this case?
  2. Do we work well with multiple replicas?

@zuston
Member Author

zuston commented Apr 9, 2024

There are some points I care about:

  1. Do we support reassigning multiple times? Do we have tests for this case?

Multiple reassignment servers for one partition will be added in the next PR, covering the following cases:

  1. the reassigned server fails to send blocks, so the partition is reassigned again
  2. a partition is marked as a huge partition and is reassigned to multiple servers up front, with different tasks picking different reassigned servers by hash

  2. Do we work well with multiple replicas?

This needs more tests. But at the current stage, I will make multiple replicas and reassignment mutually exclusive via an extra check, which will also be added in the next PR.

Partition data reassignment is a huge change, so let's do this carefully and slowly.
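The mutual-exclusion check described above might look something like the following. This is a hypothetical sketch only — the class name `ReassignGuard`, the method, and the message are all illustrative, not the actual code planned for the next PR.

```java
// Hypothetical sketch: an extra check making multiple replicas and
// partition reassignment mutually exclusive, as described above.
public class ReassignGuard {
    public static void checkReassignAllowed(int replicaCount) {
        if (replicaCount > 1) {
            throw new IllegalStateException(
                "Partition reassignment is not yet supported with multiple replicas");
        }
    }

    public static void main(String[] args) {
        ReassignGuard.checkReassignAllowed(1); // single replica: reassignment allowed
        try {
            ReassignGuard.checkReassignAllowed(2); // multiple replicas: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```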

@@ -712,4 +712,8 @@ public boolean limitHugePartition(
}
return false;
}

public void setUsedMemory(long usedMemory) {
Contributor

This function should only be used for testing purposes?

Member Author

Yes.

@dingshun3016
Contributor

In the same task, when the failed blocks are resent successfully and the remaining blocks still need to be sent, partitionToServers needs to be replaced with the latest one in org.apache.spark.shuffle.writer.WriteBufferManager#createShuffleBlock

@zuston
Member Author

zuston commented Apr 12, 2024

In the same task, when the failed blocks are resent successfully and the remaining blocks still need to be sent, partitionToServers needs to be replaced with the latest one in org.apache.spark.shuffle.writer.WriteBufferManager#createShuffleBlock

Yes, this is planned, but I will not reuse the concept of partitionToServers. Please wait for the following PRs.

jerqi
jerqi previously approved these changes Apr 14, 2024
Contributor

@jerqi jerqi left a comment


LGTM, thanks all.

@zuston zuston requested a review from jerqi April 15, 2024 02:00
@zuston
Member Author

zuston commented Apr 15, 2024

CI failure has been fixed. PTAL @jerqi

@zuston
Member Author

zuston commented Apr 16, 2024

Ping @jerqi

Set<ShuffleServerInfo> tempSet = new HashSet<>();
tempSet.addAll(replacements);
tempSet.removeAll(servers);
servers.addAll(tempSet);
Contributor

This will keep the faulty servers in servers, is this expected?

Member Author

Yes.

The name "faulty servers" here is not accurate (but I won't change it in this PR, maybe later), because NO_BUFFER can also be thrown due to tight memory. For such cases, we'd better reassign the partition to another server for writing.

So these servers are kept in order to fetch the partial data they hold when reading.

Contributor

Why don't we read from all servers until all blocks are read? Then we would not need to add replacement server on failure here?

Member Author

Oh, I should point out that the reassignment happens on the writer side. Sorry, I still don't follow your thought.

@zuston zuston merged commit 60fce8e into apache:master Apr 17, 2024
41 checks passed
@zuston
Member Author

zuston commented Apr 17, 2024

Thanks @jerqi @dingshun3016 @xumanbu @EnricoMi for your review. Merged.

Comment on lines +47 to +50
Map<Integer, Integer> succeedReplicas = succeedList.get(partitionId);
if (succeedReplicas == null) {
succeedReplicas = new HashMap<>();
}
Contributor

This can be simplified:

    Map<Integer, Integer> succeedReplicas = succeedList.getOrDefault(partitionId, new HashMap<>());


Map<Integer, List<ShuffleServerInfo>> replicaList = inventory.get(partitionId);
Contributor

What if partitionId does not exist?

zuston added a commit that referenced this pull request May 9, 2024

### What changes were proposed in this pull request?

1. Make the write client always use the latest available assignment for subsequent writes when block reassignment happens.
2. Support multiple retries for partition reassignment.
3. Limit the max number of reassigned servers for one partition.
4. Refactor the reassign RPC.
5. Rename faultyServer -> receivingFailureServer.

#### Reassign whole process
![image](https://github.com/apache/incubator-uniffle/assets/8609142/8afa5386-be39-4ccb-9c10-95ffb3154939)

#### Always using the latest assignment

To achieve always using the latest assignment, I introduce the `TaskAttemptAssignment` to get the latest assignment for the current task. The creation of AddBlockEvent also applies the latest assignment via `TaskAttemptAssignment`.

And it will be updated by the `reassignOnBlockSendFailure` RPC.
That means the original reassign RPC response is refactored and replaced by the whole latest `shuffleHandleInfo`.
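The idea described above can be sketched roughly as follows. This is an illustrative sketch only — the real `TaskAttemptAssignment` in Uniffle differs in names, types, and structure.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a per-task-attempt cache of the latest
// partition-to-servers assignment, refreshed when a reassign response arrives.
public class TaskAttemptAssignmentSketch {
    private volatile Map<Integer, List<String>> partitionToServers = new HashMap<>();

    // Called with the full latest assignment carried by the
    // reassignOnBlockSendFailure response.
    public void update(Map<Integer, List<String>> latestAssignment) {
        this.partitionToServers = new HashMap<>(latestAssignment);
    }

    // Every new AddBlockEvent looks up its target servers here, so blocks
    // created after a reassignment automatically target the replacement servers.
    public List<String> retrieve(int partitionId) {
        return partitionToServers.get(partitionId);
    }
}
```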

### Why are the changes needed?

This PR is a subtask of #1608.

Leveraging #1615 / #1610 / #1609, we have implemented the server reassignment mechanism for when the write client encounters a failed or unhealthy server. But this is not good enough, because it does not share the faulty server state with unstarted tasks and later `AddBlockEvent`s.

### Does this PR introduce _any_ user-facing change?

Yes. 

### How was this patch tested?

Unit and integration tests.

Integration tests as follows:
1. `PartitionBlockDataReassignBasicTest` validates the basic reassignment mechanism.
2. `PartitionBlockDataReassignMultiTimesTest` tests partition reassignment with multiple retries.

---------

Co-authored-by: Enrico Minack <github@enrico.minack.dev>