[Improvement][AQE] Sort MapId before the data are flushed #137
Comments
Do we need to sort the data by MapID within one partition before flushing, for all jobs? I think not. This would bring unnecessary cost for non-AQE-optimized stages. Maybe we could sort the partition data by MapId only when AQE is specified.
You are right.
Have you implemented this in your internal version? If not, I'm interested in working on it. @jerqi
No. You can go ahead.
I've proposed a design for this issue: https://docs.google.com/document/d/1G0cOFVJbYLf2oX1fiadh7zi2M6DlEcjTQTh4kSkb0LA/edit?usp=sharing PTAL @jerqi
It's better to sort by MapId before the data are flushed. It won't bring too much cost for non-AQE-optimized stages.
Does the data need to be sorted by mapId?
Yes, we only need local order. With local order, we can filter out much of the data effectively.
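A minimal sketch of what "local order" means here (Python, with hypothetical names; this is not Uniffle's actual API): each flush batch of a partition is sorted by mapId before it is written, so blocks from the same map task are contiguous within that batch, while different batches may still interleave mapIds (no global order across the file).

```python
# Illustrative sketch only: hypothetical block layout, not Uniffle's real structures.
from typing import List, Tuple

Block = Tuple[int, bytes]  # (map_id, payload)

def flush_with_local_order(batch: List[Block]) -> List[Block]:
    """Sort one flush batch by map_id before it is written out.

    Each batch becomes internally ordered ("local order"); nothing is
    done across batches, so there is no global order and no extra
    merge cost at flush time.
    """
    return sorted(batch, key=lambda block: block[0])

batch = [(3, b"c"), (1, b"a"), (3, b"d"), (2, b"b")]
assert [m for m, _ in flush_with_local_order(batch)] == [1, 2, 3, 3]
```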
Emm... I remember you preferred sorting only the index file instead of the data file, as mentioned in the offline meeting. Did I misunderstand you?
Give an example: if one reader wants the data from taskId=1, it still has to read the data segment from
Yes.
This looks ineffective, and it's the same as the original block filter.
Actually, considering random IO, it costs the same time to read 3 records as 2 records.
Yes. According to the problems mentioned in the proposal's design motivation section, the key point is that a lot of data is read multiple times, depending on the split number optimized by AQE. From this view, we should sort the data file.
We don't need global order; local order should be enough.
…ids (#358)

### What changes were proposed in this pull request?

Support getting memory data skip by upstream task ids.

### Why are the changes needed?

In the current codebase, when the shuffle-server memory is large and the job is optimized by the AQE skew rule, multiple readers of the same partition will get the shuffle data from the same shuffle-server. To avoid reading unused localfile/HDFS data, PR #137 introduced the LOCAL_ORDER mechanism to filter out most of the data. But the MEMORY storage still suffers from this. So this PR avoids reading unused data for one reader by filtering with an expectedTaskIds bitmap. This optimization is only enabled when the AQE skew optimization is applied.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. UTs

### Benchmark

#### Table

- Table1: 100g, dtypes: Array[(String, String)] = Array((v1,StringType), (k1,IntegerType)). All values of k1 are the same (value = 10).
- Table2: 10 records, dtypes: Array[(String, String)] = Array((k2,IntegerType), (v2,StringType)). It has only one record with k2=10.

#### Env

- Spark resource profile: 10 executors (1 core, 2g)
- Shuffle-server environment: 10 shuffle servers, 10g for buffer read and write
- Spark shuffle client config: storage type MEMORY_LOCALFILE with LOCAL_ORDER
- SQL: spark.sql("select * from Table1,Table2 where k1 = k2").write.mode("overwrite").parquet("xxxxxx")

#### Result

- __ESS__: cost `3min`
- __Uniffle without patch__: cost `11.6min` (2.1 + 9.5)
- __Uniffle with patch__: cost `3.5min` (2.1 + 1.4)

Co-authored-by: xianjingfeng <583872483@qq.com>
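The memory-side filtering described above can be sketched as follows (Python, hypothetical names; the real implementation uses a Roaring bitmap of task attempt ids in Java): the reader passes the set of upstream task ids it expects, and the server returns only the in-memory blocks produced by those tasks.

```python
# Illustrative sketch of expectedTaskIds-style filtering; not Uniffle's actual API.
from typing import List, Set, Tuple

Block = Tuple[int, bytes]  # (task_attempt_id, payload)

def read_memory_data(blocks: List[Block], expected_task_ids: Set[int]) -> List[Block]:
    """Return only the blocks written by the tasks this reader cares about,
    skipping in-memory data that belongs to other readers of the partition."""
    return [b for b in blocks if b[0] in expected_task_ids]

blocks = [(1, b"x"), (2, b"y"), (5, b"z")]
# A reader that only expects tasks 2 and 5 skips task 1's data entirely.
assert read_memory_data(blocks, {2, 5}) == [(2, b"y"), (5, b"z")]
```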
When we use AQE, we need to use mapId to filter out the data we don't need. If we sort by MapId before the data are flushed, we can split the data into segments; if a segment doesn't contain the data we want to read, we can skip it entirely. When data is sorted by mapId, we can filter more data and improve our performance.
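The segment-skipping idea above can be sketched like this (Python, hypothetical names and layout): with local order, each segment can record the min/max mapId it contains, and a reader interested in a mapId range skips every segment whose range does not overlap it.

```python
# Illustrative sketch of range-based segment skipping; not Uniffle's real index format.
from typing import List, Tuple

Segment = Tuple[int, int, bytes]  # (min_map_id, max_map_id, data)

def segments_to_read(segments: List[Segment], start: int, end: int) -> List[Segment]:
    """Keep only segments whose map_id range overlaps the half-open range
    [start, end); everything else can be dropped without reading it."""
    return [s for s in segments if s[0] < end and s[1] >= start]

segments = [(1, 3, b"seg0"), (4, 6, b"seg1"), (7, 9, b"seg2")]
# A reader interested in map ids 4..6 only touches the middle segment.
assert segments_to_read(segments, 4, 7) == [(4, 6, b"seg1")]
```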