-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat: selectivity metrics (for Explain Analyze) in Hash Join #18488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
7ba1336 to
e00d993
Compare
|
Thank you, I think the implementation is correct. The only consideration is performance, Hash Join implementation is definitely on the performance critical path, so we have to be careful not to introduce additional overhead. This PR should be good to go if we can verify it has no influence on the performance. In this PR, the extra overhead is for each batch, count the I believe these metrics provide more insight than simply computing |
+1. Better to do some profiling for the classic join patterns to have a clear print. |
|
🤖 |
Kicked it off |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Considering the results, I think we should move it to |
It's running in a noisy cloud environment, and tpch_mem takes quite short time, so it might not be accurate. I’ve verified this with tpch_mem10 locally, and it actually slows down several queries. I tried to make this count distinct indices faster (and sent a PR feniljain#2), and after that I think the result looks good: |
Very interesting
Thanks for confirming!
Very curious about this PR, cause it seems you have written the same logic as mine, but using loops instead, am I missing some detail? Is it that the |
Which issue does this PR close?
Explain Analyze) in Hash Join #18409What changes are included in this PR?
Added a distinct element calculator in core hash join loop. It also works on an assumption that indices will be returned in an increasing order, I couldn't see a place where this assumption is broken, but if that's not the case, please do help me out.
Also, I am not 100% sure my implementation for
avg_fanoutis correct, so do let me know if that needs changes.Are these changes tested?
No failures in
sqllogictests/tests indatafusion/core/tests/sql/, should I add a test case for this?