-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Ballista: Implement scalable distributed joins #634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f597c0c to
8acdd12
Compare
8acdd12 to
6f4cfd8
Compare
|
@edrevo fyi |
| .with_repartition_joins(false) | ||
| .with_repartition_aggregations(false) | ||
| .with_physical_optimizer_rules(rules); | ||
| let config = ExecutionConfig::new().with_concurrency(2); // TODO: this is hack to enable partitioned joins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the idea here for later? I guess the repartitioning needs to be applied with concurrency=1 too to avoid inefficient plans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed https://github.com/apache/arrow-datafusion/issues/661 to discuss this
Dandandan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing 😎😎😎
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready to merge; very neat solution! 💯
Closes apache#672 rustls Closes #682 syn Closes apache#653 parking_lot closes apache#648 object_store Closes apache#625 h2 Closes apache#623 tokio Closes apache#608 mio Closes apache#597 pyo3 Closes apache#642 pyo3-build-config Closes apache#627 prost Closes apache#634 prost-types Closes apache#637 async-trait
Which issue does this PR close?
Closes #63.
This PR removes previous hacks around partitioning and now faithfully translates the DataFusion query plan, including
RepartitionExec. I have tested with TPC-H query 12 and see consistent results between DataFusion and Ballista with the 100GB data set, where each table has 8 partitions. I have tested with multiple executors as well as single executors.There is more work to do but I think this is at a good point to merge since it fixes some correctness issues.
Rationale for this change
Ballista cannot scale well without this because work is duplicated across all partitions to load the entire left side of the join into memory currently.
What changes are included in this PR?
RepartitionExecin Ballista query plans and translate them to shufflesAre there any user-facing changes?
Query plans will change.