
[Improvement][AQE] Sort MapId before the data are flushed #137

Closed
jerqi opened this issue Aug 6, 2022 · 17 comments · Fixed by #293


jerqi commented Aug 6, 2022

When we use AQE, we need to use the mapId to filter out the data we don't need. If we sort by MapId before the data are flushed, we can split the data into segments; if a segment doesn't contain any data we want to read, we can skip it. So if the data is sorted by mapId, we can filter out more data and improve performance.
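The segment-skipping idea can be sketched roughly as follows. These are hypothetical names, not Uniffle's actual classes: a reader that only needs certain task ids keeps a segment only when at least one of its blocks was produced by a wanted task.

```java
import java.util.*;

// Hypothetical sketch (not Uniffle's real classes) of segment-level
// filtering: skip whole segments whose blocks contain none of the
// taskAttemptIds the reader wants.
public class SegmentFilter {
    // A segment groups several blocks; each block carries the
    // taskAttemptId of the map task that produced it.
    record Segment(List<Long> blockTaskIds) {}

    // Keep only segments that contain at least one wanted taskAttemptId.
    static List<Segment> filter(List<Segment> segments, Set<Long> wantedTaskIds) {
        List<Segment> kept = new ArrayList<>();
        for (Segment s : segments) {
            for (long id : s.blockTaskIds()) {
                if (wantedTaskIds.contains(id)) {
                    kept.add(s);
                    break;
                }
            }
        }
        return kept;
    }
}
```

The better the blocks of one task are clustered inside few segments, the more segments this filter can drop — which is why sorting by mapId before flushing helps.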


zuston commented Aug 28, 2022

Do we need to sort the data of one partition by MapId before flushing for all jobs? I think not. This would bring unnecessary cost to non-AQE-optimized stages. Maybe we could sort the partition data by MapId only when AQE's specified ShufflePartitionSpec is applied for the first time.


jerqi commented Aug 28, 2022

You are right.


zuston commented Oct 9, 2022

Have you implemented this in your internal version? If not, I'm interested in working on it. @jerqi


jerqi commented Oct 9, 2022

No. You can go ahead.

@zuston zuston changed the title [Improvement][Aqe] Sort MapId before the data are flushed [Improvement][AQE] Sort MapId before the data are flushed Oct 9, 2022
@zuston zuston self-assigned this Oct 9, 2022

zuston commented Oct 28, 2022

@jerqi

jerqi commented Oct 28, 2022

It's better to sort by MapId before the data are flushed. It won't bring much cost to non-AQE-optimized stages.


zuston commented Oct 28, 2022

> It's better to sort by MapId before the data are flushed. It won't bring much cost to non-AQE-optimized stages.

Does the data need to be sorted by mapId?


jerqi commented Oct 28, 2022

> It's better to sort by MapId before the data are flushed. It won't bring much cost to non-AQE-optimized stages.
>
> Does the data need to be sorted by mapId?

Yes, and we only need local order. With local order, we can filter out much of the data effectively.


zuston commented Oct 28, 2022

> It's better to sort by MapId before the data are flushed. It won't bring much cost to non-AQE-optimized stages.
>
> Does the data need to be sorted by mapId?
>
> Yes, and we only need local order. With local order, we can filter out much of the data effectively.

Emm... I remember you preferred to sort only the index file instead of the data file, as we discussed in the offline meeting. Did I misunderstand you?


jerqi commented Oct 28, 2022

> It's better to sort by MapId before the data are flushed. It won't bring much cost to non-AQE-optimized stages.
>
> Does the data need to be sorted by mapId?
>
> Yes, and we only need local order. With local order, we can filter out much of the data effectively.
>
> Emm... I remember you preferred to sort only the index file instead of the data file, as we discussed in the offline meeting. Did I misunderstand you?

Let me give an example:
We have three buffers to flush, containing a taskId 1 block, a taskId 2 block, and a taskId 3 block. We sort them into the order taskId 1 block, taskId 2 block, taskId 3 block, and then flush them to disk. Next we receive a taskId 2 block, a taskId 6 block, and a taskId 1 block; we sort and flush them as well, so the data on disk is now:
taskId 1 block, taskId 2 block, taskId 3 block, taskId 1 block, taskId 2 block, taskId 6 block.
The data only has local order.
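The flush-time sorting in this example can be sketched as follows (hypothetical names, not Uniffle's actual classes): each in-memory buffer is sorted by taskAttemptId just before it is flushed, so every flushed run is internally ordered even though the file as a whole is not globally sorted.

```java
import java.util.*;

// Hypothetical sketch of the LOCAL_ORDER idea: sort one buffer's blocks
// by taskAttemptId before flushing; the caller then appends this sorted
// run to the partition's data file.
public class LocalOrderFlush {
    record Block(long taskAttemptId, byte[] data) {}

    static List<Block> sortBeforeFlush(List<Block> buffer) {
        List<Block> sorted = new ArrayList<>(buffer);
        // Stable sort keeps the relative order of blocks from the same task.
        sorted.sort(Comparator.comparingLong(Block::taskAttemptId));
        return sorted;
    }
}
```

Applied to the example above, the second buffer (taskId 2, 6, 1) becomes the run taskId 1, 2, 6, which is appended after the first run — local order per run, no global order.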


zuston commented Oct 28, 2022

> taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block, taskId-2 block, taskId-6 block.

If one reader wants the data from taskId=1, it still has to read the data segment spanning taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block. The taskId-2 and taskId-3 blocks are unnecessary for this reader. Right?


jerqi commented Oct 28, 2022

> taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block, taskId-2 block, taskId-6 block.
>
> If one reader wants the data from taskId=1, it still has to read the data segment spanning taskId-1 block, taskId-2 block, taskId-3 block, taskId-1 block. The taskId-2 and taskId-3 blocks are unnecessary for this reader. Right?

Yes.


zuston commented Oct 28, 2022

This looks ineffective, and it's the same as the original block filter.


jerqi commented Oct 28, 2022

> This looks ineffective, and it's the same as the original block filter.

Actually, considering random IO, it costs about the same to read 3 records as 2 records.


zuston commented Oct 28, 2022

> This looks ineffective, and it's the same as the original block filter.
>
> Actually, considering random IO, it costs about the same to read 3 records as 2 records.

Yes. According to the problems mentioned in the proposal's design motivation section, the key point is that a lot of data is read multiple times, depending on the split number produced by AQE. From this view, we should sort the data file.


jerqi commented Oct 28, 2022

> This looks ineffective, and it's the same as the original block filter.
>
> Actually, considering random IO, it costs about the same to read 3 records as 2 records.
>
> Yes. According to the problems mentioned in the proposal's design motivation section, the key point is that a lot of data is read multiple times, depending on the split number produced by AQE. From this view, we should sort the data file.

We don't need global order; local order should be enough.


zuston commented Nov 3, 2022

#293

@jerqi jerqi closed this as completed in #293 Nov 5, 2022
jerqi pushed a commit that referenced this issue Nov 29, 2022
…ids (#358)

### What changes were proposed in this pull request?

Support getting memory data skip by upstream task ids

### Why are the changes needed?

In the current codebase, when the shuffle-server memory is large and the
job is optimized by the AQE skew rule, multiple readers of the same
partition will get the shuffle data from the same shuffle-server.

To avoid reading unused localfile/HDFS data, the PR for #137
introduced the LOCAL_ORDER mechanism to filter out most of the data.

But the MEMORY storage still suffers from this. So this PR avoids
reading unused data for one reader by filtering with an
expectedTaskIds bitmap.

And this optimization is only enabled when AQE skew is applied.
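The bitmap filtering described above can be sketched as follows (hypothetical names, not the PR's actual classes; a real shuffle server would use a Roaring bitmap, and a `java.util.BitSet` stands in here):

```java
import java.util.*;

// Hypothetical sketch of filtering buffered in-memory blocks with an
// expectedTaskIds bitmap: only blocks produced by the reader's expected
// upstream map tasks are returned.
public class MemoryDataFilter {
    record Block(int taskAttemptId, byte[] data) {}

    static List<Block> readMemoryData(List<Block> buffered, BitSet expectedTaskIds) {
        List<Block> result = new ArrayList<>();
        for (Block b : buffered) {
            // Skip blocks whose producing task is not in the bitmap.
            if (expectedTaskIds.get(b.taskAttemptId())) {
                result.add(b);
            }
        }
        return result;
    }
}
```

Unlike the LOCAL_ORDER file filter, this check runs per block in memory, so it needs no sorting at all.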

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
1. UTs

### Benchmark

#### Table
Table1: 100g, dtypes: Array[(String, String)] = Array((v1,StringType), (k1,IntegerType)). 
And all columns of k1 have the same value (value = 10)

Table2: 10 records, dtypes: Array[(String, String)] = Array((k2,IntegerType), (v2,StringType)).
And it has only one record with k2=10

#### Env
Spark Resource Profile: 10 executors(1core2g)
Shuffle-server Environment: 10 shuffle servers, 10g for buffer read and write. 
Spark Shuffle Client Config: storage type: MEMORY_LOCALFILE with LOCAL_ORDER
SQL: `spark.sql("select * from Table1,Table2 where k1 = k2").write.mode("overwrite").parquet("xxxxxx")`

#### Result
__ESS__: cost `3min`
__Uniffle without patch__: cost `11.6min` (2.1 + 9.5)
__Uniffle with patch__: cost `3.5min` (2.1 + 1.4)

Co-authored-by: xianjingfeng <583872483@qq.com>