New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17517][SQL]Improve generated Code for BroadcastHashJoinExec #15071
Conversation
What is the improvement? |
For current
For some cases, we don't need to check/read/write the steam side repeatedly in such while circle, e.g.
|
Ok, how about you add your comment to the description? The JVM is pretty clever during C2, so this might not yield as much as one might think. Could you provide some performance benchmarks? |
@hvanhovell thanks very much for your suggestions. I have added my comments to the description, and I will run benchmark later to see if it works |
@hvanhovell I have added a benchmark test for this, could you please help me to review? thanks. |
BroadcastHashJoin w duplicated keys: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
------------------------------------------------------------------------------------------------ | ||
codegen = F 226 / 344 0.3 3447.1 1.0X | ||
codegen = T avoid = F 168 / 173 0.4 2565.2 1.3X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make these benchmarks bigger? I think you are mainly measuring all sorts of initialization slow downs here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check this blog post, for some context on these figures: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK,maybe now it is big enough
ping @hvanhovell |
@yaooqinn why do the fields have to be fixed length for this to work? |
@hvanhovell When |
cc @davies |
Test build #3489 has finished for PR 15071 at commit
|
@yaooqinn Can you close this because it's not long time for a long time. |
What changes were proposed in this pull request?
For current
BroadcastHashJoinExec
, we generate join code for key is not unique like this:For some cases, we don't need to check/read/write the steam side repeatedly in such while circle, e.g.
Inner Join with BuildRight
, orBuildLeft && all left side fields are fixed length
and so on. we may generate the code as below:How was this patch tested?
1. add ut
2. benchmark
3. manually check
before
after