[multistage][feature] support RelDistribution trait planning #11976
Codecov Report

@@            Coverage Diff             @@
##             master   #11976    +/-  ##
============================================
- Coverage     61.65%   61.62%   -0.03%
- Complexity     1150     1152       +2
============================================
  Files          2385     2385
  Lines        129250   129324      +74
  Branches      20007    20022      +15
============================================
+ Hits          79690    79701      +11
- Misses        43748    43818      +70
+ Partials       5812     5805       -7
 * deprecated. The idea is to associate every node with a RelDistribution derived from {@link RelNode#getInputs()}
 * or from the node itself (via hints, or special handling of the type of node in question).
 */
public class PinotRelDistributionTraitRule extends RelOptRule {
I had worked on something similar earlier this year, and had come to the conclusion that Calcite's RelDistribution and Exchange are not good enough for some of our use cases, for the following reasons:
- A given RelNode can practically have multiple RelDistributions. E.g. in a join node where both inputs are table scans partitioned by the join key, the join node can be said to be distributed on both LeftTable.key and RightTable.key. But given how RelDistribution works, we can only keep information about one of the keys. This is in spite of the fact that RelDistribution is a RelMultipleTrait; for some reason the TraitSet only keeps one RelDistribution (I forget the reasoning for this).
- I also had to build my own Exchange nodes, because IIRC I wanted "identity" exchange support (i.e. the scenario where a shuffle is not needed and partitions in the input can be mapped 1:1 to partitions in the output).
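To make the first point concrete, here is a toy model (not Calcite's or Pinot's real API; all names below are illustrative) of a trait container with a single distribution slot, showing how replacing the distribution drops the equally valid second join key:

```java
import java.util.List;

// Toy model illustrating the comment above: a join whose inputs are both
// hash-partitioned on the join key is effectively distributed on either key,
// but a trait container with a single distribution slot drops one of them.
public class DistributionDemo {

    // Hypothetical stand-in for a hash RelDistribution: a list of key indexes.
    record HashDist(List<Integer> keys) {}

    // Mimics the observed TraitSet behavior: replace() overwrites the single
    // distribution slot instead of accumulating multiple distributions.
    static final class SingleSlotTraitSet {
        private HashDist dist;

        SingleSlotTraitSet replace(HashDist d) {
            this.dist = d;
            return this;
        }

        HashDist getDistribution() {
            return dist;
        }
    }

    public static void main(String[] args) {
        HashDist leftKey = new HashDist(List.of(0));   // LeftTable.key
        HashDist rightKey = new HashDist(List.of(3));  // RightTable.key, offset by left field count

        SingleSlotTraitSet joinTraits = new SingleSlotTraitSet();
        joinTraits.replace(leftKey).replace(rightKey);

        // Only the last distribution survives; the fact that the join output is
        // also partitioned on LeftTable.key is lost to downstream planning.
        System.out.println(joinTraits.getDistribution()); // HashDist[keys=[3]]
    }
}
```

A container that accumulated a set of equivalent distributions instead of overwriting a single slot would preserve both keys; that is essentially the gap being described.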
In the linked PR above you can refer to the JSON Test File changes to see how the plan changes after my changes. I remember that I had gotten all the UTs working. I had abandoned this at the time because some of the other component design was not finalized so it was hard to get consensus on this big a change.
We can discuss this in a call perhaps.
this is a good point. i think the entire way we handle distributions and exchanges needs to be revisited
Status quo
we currently:
1. explicitly insert logical exchanges where we might require data shuffling;
2. then determine whether those exchanges are real data shuffles or possible passthroughs;
3. then determine whether to assign more or fewer servers to run either side of the exchanges
Current solution
This PR only addresses step 2: giving the planner a better idea of whether the RelDistribution is the same before and after an exchange. For this purpose, it is OK (at this point) to keep only one of the two sides for a JOIN rel.
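As a hedged sketch of the step-2 decision described above (illustrative types, not Pinot's actual classes): an exchange degenerates into a passthrough exactly when the producer's distribution already matches what the consumer requires.

```java
import java.util.List;
import java.util.Objects;

// Illustrative sketch: classify a planned exchange as a real shuffle or a
// passthrough by comparing the distribution before and after the exchange.
public class ExchangeCheck {

    enum Kind { HASH, BROADCAST, SINGLETON, RANDOM }

    // Hypothetical distribution: a kind plus hash-key indexes (empty for non-hash).
    record Distribution(Kind kind, List<Integer> keys) {}

    // The exchange can be skipped (passthrough) when the producer's output is
    // already distributed exactly as the consumer requires.
    static boolean isPassthrough(Distribution input, Distribution required) {
        return input.kind() == required.kind()
            && Objects.equals(input.keys(), required.keys());
    }

    public static void main(String[] args) {
        Distribution scanOutput = new Distribution(Kind.HASH, List.of(0));

        // Same hash keys on both sides of the exchange: no shuffle needed.
        System.out.println(isPassthrough(scanOutput, new Distribution(Kind.HASH, List.of(0)))); // true

        // Different hash keys: the exchange is a real shuffle.
        System.out.println(isPassthrough(scanOutput, new Distribution(Kind.HASH, List.of(1)))); // false
    }
}
```

Keeping only one side's keys for a JOIN rel, as this PR does, is sufficient for this binary shuffle-vs-passthrough check, even though it loses the dual-key equivalence discussed earlier in the thread.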
Needs revisit
there are several problems:
- should we explicitly insert exchanges, or should we use other abstractions?
  - IMO there are other, more "Calcite-suggested" ways to add exchange nodes, managing the Exchange insertion rather than applying it during optimization.
- should the exchange insertion be RelDistribution-based or physical-based?
  - we are mixing up what an Exchange means: it can indicate (1) altering the logical RelDistribution, (2) a potential physical layout difference, or (3) whether we can apply leaf-stage optimization
  - although (3) will be addressed by [multistage][feature] leaf planning with multi-semi join support #11937, we should consider whether to still use ExchangeNode as our abstraction or create our own to avoid confusion
- should we apply the RelDistribution trait before or after the Exchange insertion?
  - currently we have to do this after the insertion, but technically, if we address question (2), we could potentially apply it beforehand
ultimately, using Exchange was a quick decision made during the early stages of multi-stage engine development, and it might not have been the best option. it is worth taking some time to revisit.
I'm a newbie here, but IMHO the reason RelDistribution seems not to be enough for our use case is that we are not actually using Calcite as it is designed to be used. Specifically, we are not correctly using conventions to color the AST in order to decide which part is going to be executed on each node. In this talk, @devozerov indicates that Drill and Flink use Calcite in that way.
What I think we should be doing is to color the nodes we are sure how to color, and then use the optimizer to decide what to do with joins. Given that we don't inject metrics into Calcite, this is not a short-term solution, but it should be the long-term one.
👍 your analysis is absolutely correct. following up on this we will continue the discussion in
- [multistage][feature] RelDistribution-based optimization #12015 for better partition detection
- [multistage][feature] revisit RelDistribution detection logic #12012 for correctly utilizing the RelDistribution trait
ultimately, the way we use the trait is a workaround shortcut. we should really do this properly.
From the paper I thought that the convention is supposed to determine the engine. https://www.osti.gov/servlets/purl/1474637
> In addition to these properties, one of the main features of Calcite is the calling convention trait. Essentially, the trait represents the data processing system where the expression will be executed. Including the calling convention as a trait allows Calcite to meet its goal of optimizing transparently queries whose execution might span over different engines, i.e., the convention will be treated as any other physical property.
We also want to check if the data is actually colocated when the hint is applied. If it is too hard, we can follow it up in a separate PR
correct. checking and validating colocation is the next step in #12015
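The colocation validation discussed above could look roughly like the following sketch. Everything here is hypothetical (types, the "murmur" function name, the column names); it is not Pinot's actual API, only the shape of the check: both join inputs must use the same partition function and partition count, and each must be partitioned exactly on its side's join keys.

```java
import java.util.List;

// Illustrative sketch of validating the is_colocated_by_join_keys hint before
// trusting it: verify both join inputs are actually partitioned compatibly.
public class ColocationCheck {

    // Hypothetical partitioning metadata for one table: partition function
    // name, number of partitions, and the partitioning columns.
    record PartitionInfo(String function, int numPartitions, List<String> columns) {}

    // Two inputs are colocated for the join when they use the same partition
    // function with the same partition count, and each side is partitioned
    // exactly on its join keys.
    static boolean isColocated(PartitionInfo left, PartitionInfo right,
                               List<String> leftJoinKeys, List<String> rightJoinKeys) {
        return left.function().equalsIgnoreCase(right.function())
            && left.numPartitions() == right.numPartitions()
            && left.columns().equals(leftJoinKeys)
            && right.columns().equals(rightJoinKeys);
    }

    public static void main(String[] args) {
        PartitionInfo orders = new PartitionInfo("murmur", 8, List.of("customerId"));
        PartitionInfo customers = new PartitionInfo("murmur", 8, List.of("id"));

        // Colocated: same function and count, each table partitioned on its join key.
        System.out.println(isColocated(orders, customers,
            List.of("customerId"), List.of("id"))); // true

        // Not colocated: partition counts differ, so a shuffle is still required.
        PartitionInfo customers16 = new PartitionInfo("murmur", 16, List.of("id"));
        System.out.println(isColocated(orders, customers16,
            List.of("customerId"), List.of("id"))); // false
    }
}
```

If any of these conditions fails, the planner would have to fall back to a real shuffle rather than honoring the hint.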
Description
this PR covers #11831 and provides a RelDistribution for each logical plan node.

Backward Incompat
- the is_colocated_by_join_keys hint is now required for making colocated joins

Details
- the is_colocated_by_join_keys query option ensures dynamic broadcast can also benefit from the direct exchange optimization

TODO / Long-Term Plan
see: #12012