Long time spent in function 'listJoinResults' for array mode join probe #9078
Hello @mbasmanova, could you help check this? Thanks.
@zhli1142015 Thank you for reporting and investigating this issue. I have some questions. What is the performance of vanilla Spark on this query?
How many build-side rows are there per unique key? Is there a skew where some keys have many build-side rows while others have few? I see that the number of join output rows is ~2x the number of probe rows. Is it the case that only a small subset of probe rows match, but they match with many build-side rows? I don't have easy access to the TPC-DS dataset, hence cannot easily check these cardinalities myself.

Thank you also for proposing a solution. I see that the solution applies only to kArrayMode, but isn't this problem more generic, and may it happen in any hash mode? If that's the case, let's make sure the solution is generic as well. To help evaluate the proposed solution and, perhaps, iterate on other options, it would be helpful to create a benchmark that reproduces the issue. Would you be willing to help with that?
I see you are using std::unordered_map. It might be more efficient to use folly::F14FastMap.

I also noticed that memory for duplicateRows_ is allocated via malloc directly and therefore is not accounted for in Velox memory pools. Let's make sure we allocate memory from a pool. See StlAllocator and AlignedStlAllocator in velox/common/memory/HashStringAllocator.h.

I also noticed that you use std::shared_ptr over the vector. What's the motivation for doing that? Why not use std::vector directly?

Once we have a benchmark, it would be nice to check whether this optimization always works or if there is a regression in some cases, e.g. when the number of duplicates is low (2). Let me know how you'd like to proceed. CC: @Yuhta @xiaoxmeng
Thanks @mbasmanova for your suggestions.
In our test, when using Spark, the latency is 39 seconds, and when using Velox, the latency is 60 seconds. After applying this fix, the latency is reduced to 40 seconds.
Yes, this is the scenario I observed. I collected the sizes of all duplicate row vectors via logs. The average size is over sixty.
I feel this problem is a common issue for all hash modes too. I fixed it only for array mode, as this is the only pattern we observed in TPCDS. I can try different modes with a benchmark to see if this is a common issue. Do you think it's ok to address your comments and include a benchmark in the same pull request? Some comments from you:
The scenarios to verify would include: 1) when the number of duplicates is low (e.g., 2), and 2) when there is a high number of duplicates (e.g., 100 or more). Please let me know if there are any other cases I should cover in the benchmark.
We also need to update the address list when we erase rows. I would suggest we put the list inside the row container and avoid a second probe for duplicates.
Got it. Would you clarify a bit further about the distribution? For example, can you tell what are p25, p50, p90, p95 and max or describe the distribution in some other way?
That would be great. Thanks.
I suggest working with a single PR for now. Once we have the full solution, we can decide whether it needs to be split into multiple PRs. The first step is to figure out in which cases we have a problem and what's the best way to address all these cases without regressing in other cases.
It would be nice to write the benchmark in a way that allows us to easily test different distributions, e.g. 50% of rows have 2 dups, 35% have 10 dups, 10% have 50 dups, 5% have 100 dups (or something along these lines).
It would also be nice if we had a benchmark for erasing performance. This list will be faster to traverse but slower to update.
Hello, [this is solved]. Additionally, I've noticed that multiple threads may simultaneously allocate memory for different next-row-vectors when parallel table building is enabled. Therefore, it's essential to add a mutex to protect the HashStringAllocator. Regarding row erasing, I believe there's no need to update the content of the next-row-vectors, as rows with identical keys are always partitioned to the same partition. I added logic to release references to the next-row-vectors during erasing. Below is the benchmark comparison:
After this PR:
Based on the observations above, please let me know if you have any more comments on this.
Thank you for sharing. Apparently all keys repeat quite a bit: 34 times at minimum and 60 on average.
I assume each thread processes rows from a single "partition", i.e. there is no key overlap between threads. If that's the case, perhaps each thread can create its own set of next-row-vectors.
Would you clarify how the benchmark name should be interpreted? What do the different parts of the name mean?
I created the next-row-vector using
See #9078 (comment)
{hash mode}_{number of fields (keys + dependent fields)}_{key repetition distribution}. For example, array_1_20%:1;80%:0 means the join uses array hash mode and each build row consists of only one field; in the build-side row vector, 20% of the rows have one duplicate, while 80% of the rows have none.
I think even if we do this, they still allocate memory from the same HSA.
@zhli1142015 When building hash table in parallel, each HashTable has its own HSA, no? |
Thanks, I think I get your point. I updated the PR to remove the mutex for the HSA.
…or#9079)

Summary:

Problem: when there are a large number of rows with the same key on the build side, the `listJoinResults` function becomes very time-consuming.

Design:

`appendNextRow`: create a next-row-vector if it doesn't exist, append the row address to it, and store the address of the next-row-vector in the `nextOffset_` slot of all duplicate rows.

`listJoinResults`: to retrieve the addresses of all rows with the same key, we first obtain the address of the first row using the hash function; then, via `nextOffset_`, we retrieve the address of the next-row-vector and iterate through it to obtain the addresses of the remaining rows. We can utilize SIMD instructions to accelerate the next-row-vector access. When a row needs to be erased, if the value in its `nextOffset_` slot is not null, the row is removed from the corresponding next-row-vector and its `nextOffset_` slot is set to null.

The current design is applicable to all hash modes.

Benchmark: the results indicate that this PR can accelerate the `listJoinResults` function, with the acceleration effect becoming more pronounced as the proportion of rows with the same key increases.

Fixes facebookincubator#9078
Pull Request resolved: facebookincubator#9079
Reviewed By: mbasmanova
Differential Revision: D55428528
Pulled By: Yuhta
fbshipit-source-id: dfce20c1ecad3eaddc6c5e024a3b21a800d54965
Bug description
We observed that one join operator in TPCDS query 72 (CBO off, 1TB TPCDS, 8*8 cores) with Gluten/Velox is very slow compared with vanilla Spark runs.
Shuffle Join in TPCDS Query72
Through investigation, we found that half of the time in the join probe is spent within the `listJoinResults` function. Within this join, there is a significant number of duplicate rows on the build side. Additionally, in the current implementation, duplicate rows are linked together via the `nextRow` field: to traverse all duplicate rows, we need to access the `nextRow` field of each row to obtain the address of the next duplicate row. This approach does not look very efficient. This join operator can be reproduced by the simple query below:
Proposed solution
We suggest storing the addresses of duplicate rows in a vector, as this would speed up access to the duplicate row addresses.
Below is the join after applying the above proposal; we can see the probe latency is reduced by ~20 seconds.
Shuffle join in TPCDS Query72 after the fix
System information
Velox System Info v0.0.2
Commit: 874f1dd
CMake Version: 3.22.1
System: Linux-5.15.146.1-microsoft-standard-WSL2
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt
Relevant logs
No response