
[Improvement] Skip blocks when read from memory #294

Merged — 8 commits merged into apache:master on Dec 11, 2022

Conversation

xianjingfeng (Member) commented Nov 2, 2022

What changes were proposed in this pull request?

Skip blocks that are not in the expected blockId range when reading from memory.

Why are the changes needed?

1. If we use AQE, every task will read data from all partitions.
2. If the data on the first shuffle server is incomplete, we need to read from another server once #276 is merged.

Both of the above situations lead to reading redundant data from the shuffle server.

Does this PR introduce any user-facing change?

Yes. Set rss.client.read.block.skip.strategy to BLOCKID_RANGE to enable this feature.

How was this patch tested?

Unit tests already added.
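For reference, a sketch of how this option might be enabled from a Spark client. The `spark.rss.` prefix is the conventional client-side prefix for rss options, but the exact prefix, job class, and jar name here are assumptions for illustration, not confirmed by this thread:

```shell
# Hypothetical example: enable the BLOCKID_RANGE skip strategy for a Spark job.
# Verify the config prefix against your uniffle client version before relying on it.
spark-submit \
  --conf spark.rss.client.read.block.skip.strategy=BLOCKID_RANGE \
  --class com.example.MyJob \
  my-job.jar
```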

codecov-commenter commented Nov 2, 2022

Codecov Report

Merging #294 (72195de) into master (884921b) will decrease coverage by 0.52%.
The diff coverage is 50.94%.

@@             Coverage Diff              @@
##             master     #294      +/-   ##
============================================
- Coverage     59.24%   58.71%   -0.53%     
- Complexity     1456     1614     +158     
============================================
  Files           180      194      +14     
  Lines          9631    11024    +1393     
  Branches        835      971     +136     
============================================
+ Hits           5706     6473     +767     
- Misses         3577     4167     +590     
- Partials        348      384      +36     
Impacted Files Coverage Δ
.../java/org/apache/hadoop/mapreduce/RssMRConfig.java 23.07% <ø> (ø)
...pache/hadoop/mapreduce/task/reduce/RssShuffle.java 0.00% <0.00%> (ø)
...e/uniffle/client/factory/ShuffleClientFactory.java 0.00% <0.00%> (ø)
...client/request/CreateShuffleReadClientRequest.java 0.00% <0.00%> (ø)
...rg/apache/uniffle/client/util/RssClientConfig.java 0.00% <ø> (ø)
...a/org/apache/uniffle/common/BlockSkipStrategy.java 0.00% <0.00%> (ø)
...pache/uniffle/server/ShuffleServerGrpcService.java 0.81% <0.00%> (-0.01%) ⬇️
.../org/apache/uniffle/server/ShuffleTaskManager.java 77.22% <0.00%> (ø)
...uniffle/storage/factory/ShuffleHandlerFactory.java 0.00% <0.00%> (ø)
.../storage/handler/impl/MemoryClientReadHandler.java 0.00% <0.00%> (ø)
... and 87 more


zuston (Member) commented Nov 3, 2022

Thanks for proposing this PR; overall it will benefit large-memory shuffle servers the most.

After a brief look, I have a question: won't processBlockIds and expectBlockIds take a lot of memory? They may add extra overhead, especially for frequent getInMemoryData calls. Can we send them only once for the same MemoryClientReadHandler?

xianjingfeng (Member, Author) replied:

> Can we send them only once for the same MemoryClientReadHandler?

Good idea. I will try.

frankliee (Contributor) commented Nov 4, 2022

I have two suggestions:

  1. Use a bloom filter or a bitmap instead of the complete processBlockIds and expectBlockIds, because these blockId sets can be very large while the needed data can be small.

  2. Keep the unskipped code path for clients on older versions.

xianjingfeng (Member, Author) replied:

> 1. Use a bloom filter or a bitmap instead of the complete processBlockIds and expectBlockIds.

A bloom filter is not suitable here, because it can only guarantee that an element does not exist; a hit may be a false positive. For the bitmap case, see RssUtils.serializeBitMap.

frankliee (Contributor) replied:

> A bloom filter is not suitable here, because it can only guarantee that an element does not exist.

Maybe we do not need precise skipping? We could use the set (expectBlockIds - processBlockIds) to build a bloom filter. Blocks that do not hit the bloom filter can then be skipped safely.
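The coarse-grained idea above can be sketched with a toy two-hash bloom filter (a hand-rolled stand-in for illustration; the class and method names are made up, not Uniffle code). The safety property it relies on: a bloom filter has no false negatives, so a block whose id misses the filter is guaranteed not to be in (expectBlockIds - processBlockIds) and can be skipped.

```java
import java.util.BitSet;

// Toy two-hash bloom filter over long blockIds (illustration only).
class ToyBloomFilter {
  private final BitSet bits;
  private final int size;

  ToyBloomFilter(int size) {
    this.size = size;
    this.bits = new BitSet(size);
  }

  private int h1(long v) {
    return Math.floorMod(Long.hashCode(v * 0x9E3779B97F4A7C15L), size);
  }

  private int h2(long v) {
    return Math.floorMod(Long.hashCode(Long.rotateLeft(v, 31) + 0x165667B19E3779F9L), size);
  }

  void add(long v) {
    bits.set(h1(v));
    bits.set(h2(v));
  }

  // False positives are possible; false negatives are not.
  boolean mightContain(long v) {
    return bits.get(h1(v)) && bits.get(h2(v));
  }

  public static void main(String[] args) {
    ToyBloomFilter wanted = new ToyBloomFilter(1 << 16);
    // Pretend these ids form (expectBlockIds - processBlockIds).
    for (long id = 100; id < 110; id++) {
      wanted.add(id);
    }
    // Every wanted id must hit the filter: skipping one would lose data.
    for (long id = 100; id < 110; id++) {
      if (!wanted.mightContain(id)) {
        throw new AssertionError("false negative for id " + id);
      }
    }
    System.out.println("all wanted ids hit the filter");
  }
}
```

A real implementation would size the filter from the expected id count and a target false-positive rate; the point here is only the no-false-negative guarantee that makes skipping safe.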

frankliee (Contributor) commented Nov 4, 2022

Besides, it is better to add the [ISSUE-ID] to the title.

xianjingfeng (Member, Author) replied:

> Maybe we do not need precise skipping?

I think precise skipping is better. And I will send the filter only once for the same MemoryClientReadHandler.

xianjingfeng (Member, Author) replied:

> Besides, it is better to add the [ISSUE-ID] to the title.

I have not created an issue for this PR. It is part of #129, and issue #124 covers #129. Should I use [ISSUE-124] or create another?

jerqi (Contributor) commented Nov 4, 2022

Maybe we can pass the min and max blockId instead of a bitmap?

frankliee (Contributor) commented Nov 4, 2022

> I think precise skipping is better.

The client side already has precise skipping. The bitmap of all blockIds can still be very large under data skew. Coarse-grained skipping is widely used, for example in Parquet, Spark runtime filters, and ClickHouse. Besides a bloom filter, the min-max of the blockIds is also a potential option.

xianjingfeng (Member, Author) replied:

1. I think min and max blockId are OK, but processBlockIds is discontinuous, so we may need an array.
2. If we send the filter only once for the same MemoryClientReadHandler, we need to store some state for it, which may cost a lot of memory when there are many tasks. Is this still needed?

@jerqi @frankliee @zuston

jerqi (Contributor) commented Nov 4, 2022

> Is this still needed?

It seems we don't need processedBlockIds at all. We can shrink the expected blockId ranges according to the processed blocks.

xianjingfeng (Member, Author) replied:

> We can shrink the expected blockId ranges according to the processed blocks.

Got it.

xianjingfeng (Member, Author) replied:

But the final expected blockIds are also discontinuous. I'm going to use an array to store them, like [start1, end1, start2, end2], and if endN - startN is too small, I will drop that range to keep the array small.

jerqi (Contributor) commented Nov 4, 2022

> I'm going to use an array to store them, like [start1, end1, start2, end2].

We can limit the array size and do our best to filter out the data we have already processed.
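The flat [start1, end1, start2, end2, ...] encoding discussed above can be checked with a simple scan. This is an illustrative sketch under the assumption that each range is inclusive; the class and method names are hypothetical, not the PR's actual code:

```java
// Sketch of the flat-range filter: ranges = [start1, end1, start2, end2, ...],
// each range inclusive. A block should be read iff its id falls in some range.
class BlockIdRangeFilter {
  static boolean inRanges(long blockId, long[] ranges) {
    for (int i = 0; i + 1 < ranges.length; i += 2) {
      if (blockId >= ranges[i] && blockId <= ranges[i + 1]) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    long[] ranges = {10, 20, 30, 40};
    assert inRanges(15, ranges);   // inside the first range
    assert inRanges(30, ranges);   // boundary of the second range
    assert !inRanges(25, ranges);  // in the gap: this block can be skipped
    System.out.println("range checks passed");
  }
}
```

With a size-limited, sorted array a linear scan is already cheap; a server could binary-search the starts instead if the array were allowed to grow.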

jerqi (Contributor) commented Nov 4, 2022

Maybe we need a POC to verify the effect of each method.

jerqi (Contributor) commented Nov 11, 2022

This PR can also optimize AQE performance. @leixm Maybe you are interested.

jerqi (Contributor) commented Nov 18, 2022

Maybe we can support multiple filters and let users choose the one they like. MinMax may be good for the AQE case, while a bitmap may be good for multiple replicas. We need some extra tests.

zuston (Member) commented Nov 24, 2022

A question: is the MinMax range over task ids rather than blockIds? @jerqi

zuston (Member) commented Nov 24, 2022

And why not directly use the taskIdBitmap to filter most of the data, especially for AQE?

Do you mind if I pick up this ticket to improve AQE skew performance? If you want this to also support multiple replicas, you can go on with it. @xianjingfeng

jerqi (Contributor) commented Nov 24, 2022

> Why not directly use the taskIdBitmap to filter most of the data?

A bitmap is OK. Our concern is the size of the bitmap; it needs some tests.

zuston (Member) commented Nov 24, 2022

> Our concern is the size of the bitmap.

The size of the taskIdsBitmap should be small. Actually, it only contains a limited number of task ids.

jerqi (Contributor) commented Nov 25, 2022

> The size of the taskIdsBitmap should be small.

One million tasks occupy about 125 KB of memory. If we used a blockId bitmap instead, it might occupy several MB.
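The arithmetic behind that estimate: a dense bitmap needs one bit per possible id, so one million task ids need about 125 KB, while a blockId space orders of magnitude larger runs to several MB. This is a back-of-the-envelope check for uncompressed bitmaps, not a RoaringBitmap measurement (RoaringBitmap usually compresses well below these bounds); the illustrative blockId-space size is an assumption:

```java
// Back-of-the-envelope sizes for dense (uncompressed) bitmaps.
class BitmapSizeEstimate {
  static long denseBitmapBytes(long maxId) {
    return maxId / 8;  // one bit per id
  }

  public static void main(String[] args) {
    // 1,000,000 task ids -> 125,000 bytes (~125 KB), matching the estimate above.
    assert denseBitmapBytes(1_000_000L) == 125_000L;
    // A hypothetical blockId space of 100 million ids -> ~12.5 MB stored densely.
    assert denseBitmapBytes(100_000_000L) == 12_500_000L;
    System.out.println("size estimates check out");
  }
}
```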

xianjingfeng (Member, Author) replied:

I think in most cases we only need to read from one replica, so we can give priority to AQE.

> One million tasks occupy about 125 KB of memory.

And I think this is acceptable, and the taskIdBitmap is more precise for multiple replicas. @jerqi @zuston

zuston (Member) commented Nov 25, 2022

It's OK for me. I think we could disable this taskIdBitmap filter when the job does not use AQE. And especially with AQE, the set of taskIds for one reader should not be large.

What do you think? @jerqi @xianjingfeng

xianjingfeng (Member, Author) replied:

I'm torn between taskBitmap and min-max as the multi-replica filter. Could we discuss this? If the taskBitmap is small enough, could we choose it for multiple replicas too? That would make it easier to combine AQE and multiple replicas.

Theoretically, I think min-max will be smaller and more precise for multiple replicas in practice, because processedBlockIds is basically continuous in most cases.

jerqi (Contributor) commented Dec 5, 2022

How do we combine AQE and multiple replicas? Do we need to pass two filters?

xianjingfeng (Member, Author) replied:

> Do we need to pass two filters?

We just need to pass one of them. I think the taskIdBitmap will support multiple replicas in the future. If @zuston doesn't do it, I will.

zuston (Member) commented Dec 6, 2022

Feel free to do it. I have no plan to support a multiple-replica filter.

zuston (Member) commented Dec 6, 2022

> Theoretically, I think min-max will be smaller and more precise for multiple replicas in practice.

@xianjingfeng A small question: are the min-max blockIds filter and the taskIdBitmap filter mutually exclusive?

xianjingfeng (Member, Author) replied:

> Are the min-max blockIds filter and the taskIdBitmap filter mutually exclusive?

Yes.

jerqi (Contributor) commented Dec 6, 2022

Could we add some performance tests for this feature with production jobs?

xianjingfeng (Member, Author) replied:

> Could we add some performance tests for this feature with production jobs?

I will.

xianjingfeng (Member, Author) commented Dec 9, 2022

Performance Test

Tables

Table1: 10 GB, dtypes: Array[(String, String)] = Array((v1,StringType), (k1,StringType)). Every value of k1 is the same (value = 10).

Table2: 10 records, dtypes: Array[(String, String)] = Array((k2,StringType), (v2,StringType)). It has only one record with k2 = 10.

Env

Spark resource profile: 10 executors (1 core, 4 GB each)
Shuffle server environment: 6 shuffle servers, 20 GB for the read buffer and 40 GB for the write buffer
Spark shuffle client config: storage type MEMORY_LOCALFILE_HDFS with LOCAL_ORDER
SQL: spark.sql("select * from Table1,Table2 where k1 = k2").write.mode("overwrite").parquet("xxxxxx")

Result

BITMAP and MINMAX look similar; I think the gap between them has little impact on overall performance. See the picture below.

[screenshot: sc-20221209163411]

cc @jerqi @zuston

jerqi (Contributor) commented Dec 9, 2022

OK.

jerqi (Contributor) commented Dec 10, 2022

Could you modify the description of this PR? Does it only add a range filter strategy for AQE? Will multiple replicas also be supported for filtering data in this PR?

xianjingfeng (Member, Author) replied:

> Could you modify the description of this PR?

Done.

> Will multiple replicas also be supported for filtering data in this PR?

Yes, it supports multiple replicas too.

Inline review thread on readShuffleData:

```java
@Override
public ShuffleDataResult readShuffleData() {
  if (BlockSkipStrategy.BLOCKID_RANGE.equals(blockSkipStrategy) && lastBlockId == Constants.INVALID_BLOCK_ID) {
```

A Contributor asked: Why do we check lastBlockId == Constants.INVALID_BLOCK_ID?

xianjingfeng (Member, Author): We only need to build the blockId range the first time the handler reads.

Contributor: OK, got it.

jerqi (Contributor) left a review comment:

LGTM. @zuston Do you have any other suggestions?

jerqi merged commit 55191c4 into apache:master on Dec 11, 2022
jerqi (Contributor) commented Dec 11, 2022

Merged. Thanks all @frankliee @zuston @xianjingfeng. @xianjingfeng Could you raise a follow-up PR to add some docs about this feature?

xianjingfeng (Member, Author) replied:

> Could you raise a follow-up PR to add some docs about this feature?

Yes.

xianjingfeng (Member, Author) replied:

BLOCKID_RANGE is not a good choice after all, because blockIds are not continuous. My fault. 😂 @jerqi

```java
// BlockId is a long composed of a sequence number (AtomicInteger), partitionId and taskAttemptId.
// The AtomicInteger sequence number is the first (highest) 19 bits, max value 2^19 - 1.
// partitionId is the next 24 bits, max value 2^24 - 1.
// taskAttemptId is the remaining 20 bits, max value 2^20 - 1.
public static Long getBlockId(long partitionId, long taskAttemptId, long atomicInt) {
```

Should we remove it or modify the blockId generation rule?
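A sketch of the bit layout described in the comment above (the widths come from the comment; the packing arithmetic is my reading of it, not the actual Uniffle implementation). It makes the problem concrete: because the sequence number sits in the highest bits, two consecutive blocks from the same task differ by 2^44, which is exactly why blockIds are nowhere near continuous:

```java
// Assumed BlockId layout per the comment: [ seqNo:19 | partitionId:24 | taskAttemptId:20 ]
class BlockIdLayout {
  static final int TASK_ATTEMPT_ID_BITS = 20;
  static final int PARTITION_ID_BITS = 24;

  static long getBlockId(long partitionId, long taskAttemptId, long seqNo) {
    return (seqNo << (PARTITION_ID_BITS + TASK_ATTEMPT_ID_BITS))
        | (partitionId << TASK_ATTEMPT_ID_BITS)
        | taskAttemptId;
  }

  static long getSeqNo(long blockId) {
    return blockId >>> (PARTITION_ID_BITS + TASK_ATTEMPT_ID_BITS);
  }

  static long getPartitionId(long blockId) {
    return (blockId >>> TASK_ATTEMPT_ID_BITS) & ((1L << PARTITION_ID_BITS) - 1);
  }

  static long getTaskAttemptId(long blockId) {
    return blockId & ((1L << TASK_ATTEMPT_ID_BITS) - 1);
  }

  public static void main(String[] args) {
    long id = getBlockId(5, 7, 3);
    assert getPartitionId(id) == 5;
    assert getTaskAttemptId(id) == 7;
    assert getSeqNo(id) == 3;
    // Consecutive sequence numbers are 2^44 apart, so ids are far from contiguous.
    assert getBlockId(5, 7, 4) - id == (1L << 44);
    System.out.println("layout round-trips");
  }
}
```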

jerqi (Contributor) commented Dec 12, 2022

> Should we remove it or modify the blockId generation rule?

Let's remove it. We can use the taskBitmap as the replica filter. It's hard to modify the block generation rule; that would be an incompatible change.

xianjingfeng (Member, Author) replied:

> Let's remove it.

Revert it directly, or keep some of the changes, such as BlockSkipStrategy?

xianjingfeng (Member, Author) replied:

I wonder why we put the AtomicInteger at the front of the blockId. What is the purpose of this design? @jerqi

xianjingfeng added a commit to xianjingfeng/incubator-uniffle that referenced this pull request on Dec 12, 2022.

jerqi pushed a commit that referenced this pull request on Dec 12, 2022:

This reverts commit 55191c4.

### What changes were proposed in this pull request?
Revert #294

### Why are the changes needed?
BlockId is discontinuous, so BLOCKID_RANGE is not a good choice for filtering memory data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need
jerqi (Contributor) commented Dec 12, 2022

> I wonder why we put the AtomicInteger at the front of the blockId.

It reduces the size of the RoaringBitmap: we should put the least frequently changing data in the higher bits. RoaringBitmap groups values into containers keyed by their high bits, so keeping the high bits nearly constant keeps the number of containers, and thus the bitmap, small.
