rowexec: add merge hash join processor #40393
Here are the benchmarks:
When |
I would also benchmark against a segmented sort + merge join (this is what I expect the best plan to be currently).
c53519d to 213d897
Added the benchmark for Sort Chunks + Merge Joiner:
It seems like that combo is faster only when we have groups of significant size on both sides. I wonder whether I'm doing something wrong here.
213d897 to 1235809
Rebased it on top of the current master. PTAL.
Can you explain what |
Sure. If a source is not repeated, then its rows are constructed as
If a source is repeated, then its rows are constructed as
Looking into this carefully, I realized that I have so far run the benchmarks on four configs:
Can you also explain how we are generating rows for the benchmark? There should be a "number of chunks" (or "chunk size") parameter; it seems that this is hardcoded to "1 chunk" for RepeatSide=left or right, or "sqrt(inputSize)" for RepeatSide=both. Why? Also |
Yes, you're right. I thought I was doing something wrong in the benchmarks.
Note that in the
I think it would be better to generate a "group index" as the first column and a random second column. The number of groups should be a parameter so that we can benchmark both small-group and large-group cases. (I'd expect SortMerge to perform well on many small groups and badly on large groups.)
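To make the suggestion concrete, here is a rough sketch of such a generator. It is not the benchmark code from this PR: the `row` type, `makeGroupedRows`, and its parameters are hypothetical names made up for illustration, and a plain two-int struct stands in for whatever row representation the real benchmark uses.

```go
package main

import (
	"fmt"
	"math/rand"
)

// row is a hypothetical two-column row: a group index (the column the input
// is ordered on) and a random value (an equality column with no ordering).
type row struct {
	group int
	val   int
}

// makeGroupedRows returns numRows rows split into numGroups groups of
// near-equal size. The rows are sorted on group, and val is drawn uniformly
// from [0, valRange). It assumes 0 < numGroups <= numRows and valRange > 0.
func makeGroupedRows(seed int64, numRows, numGroups, valRange int) []row {
	rng := rand.New(rand.NewSource(seed))
	rows := make([]row, numRows)
	for i := range rows {
		rows[i] = row{
			// i*numGroups/numRows is non-decreasing in i, so the output
			// is already ordered on the group column.
			group: i * numGroups / numRows,
			val:   rng.Intn(valRange),
		}
	}
	return rows
}

func main() {
	// Sweep the number of groups from one large group to one row per group.
	for _, numGroups := range []int{1, 4, 8} {
		fmt.Println(numGroups, makeGroupedRows(42, 8, numGroups, 100))
	}
}
```

With the number of groups as a parameter, the same input size can be benchmarked from "one big group" through "every row is its own group", which is the range where I'd expect the two plans to trade places.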
1235809 to 0d3e5fa
I updated the benchmark as you suggested:
The latest benchmarks showed that there was not much of an improvement for |
sql: fix some typos and move some functions
This commit fixes several typos and slightly refactors the code in a
few places to expose functions for reuse.
Release note: None
rowexec: add merge hash join processor
This commit adds a new merge hash join processor, which can be used
when the inputs are ordered on only a subset of the equality columns.
It first applies the merging logic to the ordering columns alone to
find merge groups, and then performs a hash join within each merge
group. At the moment, only INNER joins are supported, and the
processor is not being planned.
Release note: None
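For readers skimming the commit message, the idea can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions, not the processor added in this PR: the `row` and `joined` types and the `groupEnd` and `mergeHashJoin` functions are hypothetical, rows are held fully in memory, there is exactly one ordered equality column (`ord`) and one unordered equality column (`key`), and the hash table is always built on the right side.

```go
package main

import "fmt"

// row is a hypothetical input row. Both inputs are assumed to be sorted on
// ord; key is an equality column with no ordering.
type row struct {
	ord     int
	key     int
	payload string
}

// joined is one output row of the INNER join.
type joined struct{ left, right row }

// groupEnd returns the index one past the last row of the run that starts
// at start and shares rows[start].ord.
func groupEnd(rows []row, start int) int {
	end := start + 1
	for end < len(rows) && rows[end].ord == rows[start].ord {
		end++
	}
	return end
}

// mergeHashJoin merges on ord to find matching merge groups, then hash
// joins on key within each pair of groups.
func mergeHashJoin(left, right []row) []joined {
	var out []joined
	l, r := 0, 0
	for l < len(left) && r < len(right) {
		switch {
		case left[l].ord < right[r].ord:
			l = groupEnd(left, l) // no matching group on the right
		case left[l].ord > right[r].ord:
			r = groupEnd(right, r) // no matching group on the left
		default:
			lEnd, rEnd := groupEnd(left, l), groupEnd(right, r)
			// Build a hash table over the right group only.
			ht := make(map[int][]row)
			for _, rr := range right[r:rEnd] {
				ht[rr.key] = append(ht[rr.key], rr)
			}
			// Probe it with every row of the left group.
			for _, lr := range left[l:lEnd] {
				for _, rr := range ht[lr.key] {
					out = append(out, joined{lr, rr})
				}
			}
			l, r = lEnd, rEnd
		}
	}
	return out
}

func main() {
	left := []row{{1, 10, "a"}, {1, 20, "b"}, {2, 10, "c"}}
	right := []row{{1, 20, "x"}, {1, 10, "y"}, {3, 10, "z"}}
	for _, j := range mergeHashJoin(left, right) {
		fmt.Println(j.left.payload, j.right.payload)
	}
	// Prints:
	// a y
	// b x
}
```

The point of the sketch is that the hash table only ever holds one merge group at a time, rather than an entire input as in a plain hash join.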