Send commit concurrently in client side #59

zuston · 2022-07-16T10:44:26Z

What changes were proposed in this pull request?

Send commits concurrently on the client side.

Why are the changes needed?

I found that with the LOCALFILE storageType, waiting for the commit takes too much time. To speed it up, commits can be sent concurrently using a thread pool.

Performance Test Case
Using 1000 Spark executors (1g/1core each) to run TeraSort on 1 TB.

With the LOCALFILE storageType, it took 7.3 min.
After applying this PR, it took 6.1 min (about 16% less).

Does this PR introduce any user-facing change?

Introduces the config rss.client.data.commit.pool.size; its default value is the number of assigned shuffle servers.
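
A minimal sketch of how such a default could be resolved, assuming the setting arrives as a string map and that an absent or non-positive value means "fall back to the number of assigned shuffle servers". The class and method names are hypothetical and are not taken from the patch; only the config key comes from this description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the actual Uniffle implementation.
class CommitPoolSizeExample {
  // Config key named in the PR description.
  static final String DATA_COMMIT_POOL_SIZE = "rss.client.data.commit.pool.size";

  // Assumption: an absent or non-positive value means "use the default".
  static int resolvePoolSize(Map<String, String> conf, int assignedServerCount) {
    String raw = conf.get(DATA_COMMIT_POOL_SIZE);
    int configured = (raw == null) ? -1 : Integer.parseInt(raw.trim());
    if (configured > 0) {
      return configured;
    }
    // Default: one thread per assigned shuffle server, never fewer than one.
    return Math.max(1, assignedServerCount);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(resolvePoolSize(conf, 8)); // key not set
    conf.put(DATA_COMMIT_POOL_SIZE, "4");
    System.out.println(resolvePoolSize(conf, 8)); // explicit value
  }
}
```

With one thread per assigned server, every commit request can be in flight at the same time; a smaller explicit value caps the number of concurrent requests.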

How was this patch tested?

No need

codecov-commenter · 2022-07-16T10:58:06Z

Codecov Report

Merging #59 (e392f1e) into master (e48f74e) will decrease coverage by 0.04%.
The diff coverage is 8.57%.

@@             Coverage Diff              @@
##             master      #59      +/-   ##
============================================
- Coverage     55.21%   55.16%   -0.05%     
+ Complexity     1111     1110       -1     
============================================
  Files           148      148              
  Lines          7953     7962       +9     
  Branches        760      760              
============================================
+ Hits           4391     4392       +1     
- Misses         3321     3328       +7     
- Partials        241      242       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| .../java/org/apache/hadoop/mapreduce/RssMRConfig.java | `87.50% <ø> (ø)` | |
| ...n/java/org/apache/hadoop/mapreduce/RssMRUtils.java | `31.70% <0.00%> (-0.40%)` | ⬇️ |
| .../java/org/apache/spark/shuffle/RssSparkConfig.java | `88.88% <ø> (ø)` | |
| ...e/uniffle/client/factory/ShuffleClientFactory.java | `0.00% <ø> (ø)` | |
| ...rg/apache/uniffle/client/util/RssClientConfig.java | `0.00% <ø> (ø)` | |
| ...he/uniffle/client/impl/ShuffleWriteClientImpl.java | `25.95% <8.82%> (-0.04%)` | ⬇️ |
| .../apache/uniffle/coordinator/ClientConfManager.java | `91.54% <0.00%> (-1.41%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e48f74e...e392f1e.

jerqi · 2022-07-16T12:07:40Z

Do you have performance tests? I doubt this PR can improve performance, because in my opinion the bottleneck of the commit operation is on the shuffle server.

zuston · 2022-07-16T12:36:50Z

Yes, I tested it.
I used 1000 executors (1g/1core each) to run TeraSort on 1 TB.

In LOCALFILE mode, it took 7.3 min.
After applying this PR, it took 6.1 min.

@jerqi

zuston · 2022-07-16T12:42:05Z

> Do you have performance tests? I doubt this PR can improve performance, because in my opinion the bottleneck of the commit operation is on the shuffle server.

As far as I know, the spill-to-disk event has to be triggered from the client side, so if the previous trigger is blocked, the next one will not be triggered.

jerqi · 2022-07-16T12:59:39Z

We don't recommend using the LOCALFILE storageType because it performs poorly, but the improvement is OK with me.

jerqi · 2022-07-16T13:15:26Z

client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java

@@ -247,43 +249,57 @@ public SendShuffleDataResult sendShuffleData(String appId, List<ShuffleBlockInfo
    return new SendShuffleDataResult(successBlockIds, failedBlockIds);
  }

+  /**
+   * This method will wait until all shuffle data have been spilled


spilled -> flushed.

jerqi · 2022-07-16T13:19:35Z

> Yes, I tested it. I used 1000 executors (1g/1core each) to run TeraSort on 1 TB.
>
> In LOCALFILE mode, it took 7.3 min. After applying this PR, it took 6.1 min.
>
> @jerqi

Please put the performance test results into the "Why are the changes needed?" section.

jerqi · 2022-07-16T13:21:00Z

client/src/main/java/org/apache/uniffle/client/util/RssClientConfig.java

@@ -17,6 +17,8 @@

 package org.apache.uniffle.client.util;

+import org.apache.hadoop.io.OutputBuffer;


Why do we need this?

jerqi · 2022-07-16T13:24:56Z

client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java

  private final ForkJoinPool dataTransferPool;

  public ShuffleWriteClientImpl(String clientType, int retryMax, long retryIntervalMax, int heartBeatThreadNum,
                                int replica, int replicaWrite, int replicaRead, boolean replicaSkipEnabled,
-                                int dataTranferPoolSize) {
+                                int dataTranferPoolSize, int commitSenderPoolSize) {


We prefer the code style as below

public ShuffleWriteClientImpl(
    String clientType,
    int retryMax,
    long retryIntervalMax,
    int heartBeatThreadNum,
    int replica,
    int replicaWrite,
    int replicaRead,
    boolean replicaSkipEnabled,
    int dataTranferPoolSize,
    int commitSenderPoolSize) {

jerqi · 2022-07-16T13:27:36Z

client-mr/src/main/java/org/apache/hadoop/mapreduce/RssMRConfig.java

+  public static final String RSS_COMMIT_SENDER_POOL_SIZE =
+      MR_RSS_CONFIG_PREFIX + RssClientConfig.RSS_COMMIT_SENDER_POOL_SIZE;
+  public static final int RSS_COMMIT_SENDER_POOL_SIZE_DEFAULT_VALUE =
+      RssClientConfig.RSS_COMMIT_SENDER_POOL_SIZE_DEFAULT_VALUE;


The name's style should be consistent with data_transfer_pool_size. How about data_commit_pool_size?

jerqi · 2022-07-16T13:29:30Z

Could you update the document about this feature?

jerqi · 2022-07-16T13:43:47Z

4b5389f
In this PR, we use the method stream to replace the method parallelStream, which may be a bad choice. The method registerShuffleServer uses stream, too. Could using parallelStream in registerShuffleServer improve performance? Will it create too many ForkJoinPool instances?

zuston · 2022-07-16T13:48:58Z

If we shut down the ForkJoinPool within the scope of the method, I think it's OK.
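
A rough sketch of what a method-scoped pool could look like, not the code from this patch. It assumes the commit RPC can be modelled as a `Predicate<String>` that returns false on failure; the class name, `commitAll`, and the server names are hypothetical. It also relies on the commonly observed JDK behaviour that a parallel stream started from inside a ForkJoinPool task runs on that pool, which as far as I know is not guaranteed by the stream documentation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Predicate;

// Hypothetical illustration; not the actual ShuffleWriteClientImpl code.
class ScopedCommitPoolExample {

  // Sends a commit to every server concurrently and returns the servers that failed.
  // commitCall is an assumed stand-in for the real commit RPC.
  static Set<String> commitAll(List<String> servers, int poolSize, Predicate<String> commitCall) {
    Set<String> failed = ConcurrentHashMap.newKeySet();
    // The pool lives only for the duration of this call.
    ForkJoinPool pool = new ForkJoinPool(Math.max(1, poolSize));
    try {
      // Submitting the parallel stream as a task makes it run on this pool
      // (observed JDK behaviour) instead of the shared common pool.
      pool.submit(() -> servers.parallelStream().forEach(server -> {
        if (!commitCall.test(server)) {
          failed.add(server);
        }
      })).join();
    } finally {
      // Always release the worker threads, even if a commit throws.
      pool.shutdownNow();
    }
    return failed;
  }

  public static void main(String[] args) {
    List<String> servers = Arrays.asList("server-1", "server-2", "server-3");
    // Pretend that the commit to server-2 fails.
    Set<String> failed = commitAll(servers, servers.size(), s -> !s.equals("server-2"));
    System.out.println("failed commits: " + failed);
  }
}
```

Because the pool is created and shut down inside the method, repeated calls do not accumulate idle ForkJoinPool threads; the cost is that a new pool is created on every call.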

jerqi · 2022-07-16T14:08:17Z

> If we shut down the ForkJoinPool within the scope of the method, I think it's OK.

Ok

zuston · 2022-07-16T14:08:55Z

> We don't recommend using the LOCALFILE storageType because it performs poorly, but the improvement is OK with me.

LOCALFILE looks faster than ESS. MEMORY_LOCALFILE will be even better, since there is no need to wait for data to be flushed to disk.

zuston · 2022-07-16T14:44:50Z

Besides, I think I can submit a new PR to apply the same optimization to registerShuffleServer.

jerqi · 2022-07-16T14:50:26Z

> Besides, I think I can submit a new PR to apply the same optimization to registerShuffleServer.

We'd better have performance tests first. registerShuffleServer may not take much time, so the optimization would have less effect.

jerqi · 2022-07-16T15:43:19Z

client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java

-    });
+        });
+      }).join();
+    } catch (Exception e) {


Should we use

finally { forkJoinPool.shutdownNow(); }

My fault...

jerqi · 2022-07-16T16:20:52Z

Could you update the documentation, since this PR introduces a user-facing change?

zuston · 2022-07-17T02:23:48Z

Done @jerqi

jerqi

LGTM

Sending commit concurrently in client side

a1105a3

jerqi reviewed Jul 16, 2022

optimize

5fa3d98

jerqi reviewed Jul 16, 2022

fix

57cb2e9

jerqi reviewed Jul 16, 2022

optimize 1

9fbab16

jerqi reviewed Jul 16, 2022

zuston added 2 commits July 16, 2022 23:50

fix

16b8065

fix again

a6cf0a0

jerqi changed the title ~~Sending commit concurrently in client side~~ Send commit concurrently in client side Jul 16, 2022

Update doc

e392f1e

jerqi approved these changes Jul 17, 2022

jerqi merged commit c3616c2 into apache:master Jul 17, 2022
