
[multistage][feature] support RelDistribution trait planning #11976

Merged
merged 9 commits into apache:master on Nov 17, 2023

Conversation

walterddr
Contributor

@walterddr walterddr commented Nov 8, 2023

Description

This PR covers #11831 and provides a RelDistribution trait for each logical plan node.

Backward Incompat

  • the is_colocated_by_join_keys hint is now required for making colocated joins
    • this should only affect semi-joins, because they are the only join type utilizing broadcast exchange that was previously pulled to act as a direct exchange.
    • inner/left/right/full joins apply colocation automatically, so the backward incompatibility should not affect them.

Details

  • Also apply optimization based on the distribution trait in the mailbox/worker assignment stage
    • Previously, direct exchange was decided based on the table partition hint; now it is decided via the distribution trait: direct exchange is applied if and only if the propagated trait matches the exchange requirement.
    • As a side effect, we need to reintroduce the is_colocated_by_join_keys query option to ensure dynamic broadcast can also benefit from the direct exchange optimization
  • Allow propagation of partition distribution trait info across the tree, to be used during the Physical Planning phase in the following scenarios (to be followed up in separate PRs):
    • unnecessary shuffling can be changed into a direct exchange (if the lists of servers are identical)
    • multiple exchanges can be collapsed
    • more leaf-stage optimization rules can be run.
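To make the if-and-only-if rule above concrete, here is a minimal, self-contained sketch; the class and method names are hypothetical and a hash distribution is reduced to just its list of key ordinals, so this is not Pinot's actual planner code:

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the direct-exchange decision: an exchange becomes a
// direct exchange if and only if the distribution trait propagated from the
// sender already matches the distribution the receiver requires.
public class DirectExchangeSketch {

  // A hash distribution is modeled here as just the list of key ordinals.
  public static boolean canBeDirect(List<Integer> propagatedKeys, List<Integer> requiredKeys) {
    // No known distribution on the sender side: we must do a real shuffle.
    if (propagatedKeys == null) {
      return false;
    }
    // Direct exchange only when the propagated trait equals the requirement.
    return Objects.equals(propagatedKeys, requiredKeys);
  }

  public static void main(String[] args) {
    // Sender already hash-distributed on column 0, receiver requires column 0: direct.
    System.out.println(canBeDirect(List.of(0), List.of(0))); // true
    // Sender distributed on column 1, receiver requires column 0: real shuffle.
    System.out.println(canBeDirect(List.of(1), List.of(0))); // false
  }
}
```
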

TODO

  • This PR only allows LEFT-side trait propagation for JOIN nodes.
  • This PR does not support trait propagation for SET-UNION / WINDOW nodes.
  • The physical plan does not contain partition function info (see the newly added ignored tests).

Long-Term Plan

see: #12012

@codecov-commenter

codecov-commenter commented Nov 8, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (62abf1f) 61.65% compared to head (64daf29) 61.62%.
Report is 8 commits behind head on master.

Files Patch % Lines
...he/pinot/query/planner/logical/LogicalPlanner.java 33.33% 2 Missing ⚠️
.../pinot/query/planner/plannode/MailboxSendNode.java 81.81% 1 Missing and 1 partial ⚠️
.../query/planner/logical/RelToPlanNodeConverter.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #11976      +/-   ##
============================================
- Coverage     61.65%   61.62%   -0.03%     
- Complexity     1150     1152       +2     
============================================
  Files          2385     2385              
  Lines        129250   129324      +74     
  Branches      20007    20022      +15     
============================================
+ Hits          79690    79701      +11     
- Misses        43748    43818      +70     
+ Partials       5812     5805       -7     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 61.58% <94.11%> (-0.03%) ⬇️
java-21 61.50% <94.11%> (+<0.01%) ⬆️
skip-bytebuffers-false 61.61% <94.11%> (-0.02%) ⬇️
skip-bytebuffers-true 61.47% <94.11%> (+<0.01%) ⬆️
temurin 61.62% <94.11%> (-0.03%) ⬇️
unittests 61.62% <94.11%> (-0.03%) ⬇️
unittests1 46.96% <94.11%> (+0.02%) ⬆️
unittests2 27.57% <0.00%> (-0.06%) ⬇️


@walterddr walterddr added feature multi-stage Related to the multi-stage query engine labels Nov 9, 2023
* deprecated. The idea is to associate every node with a RelDistribution derived from {@link RelNode#getInputs()}
* or from the node itself (via hints, or special handling of the type of node in question).
*/
public class PinotRelDistributionTraitRule extends RelOptRule {
Contributor
I had worked on something similar earlier this year, and had come to the conclusion that Calcite's RelDistribution and Exchange are not good enough for some of our use cases, for the following reasons:

  • A given RelNode can practically have multiple RelDistributions. E.g. in a join node where both inputs are table scans partitioned by the join key, the join output can be said to be distributed on both LeftTable.key and RightTable.key. But given how RelDistribution is modeled, we can only keep information about one of the keys, despite the fact that RelDistribution is a RelMultipleTrait. For some reason the TraitSet only keeps one RelDistribution (I forgot the reasoning for this)
  • I had to also build my own Exchange nodes, because iirc I wanted to have "Identity" exchange support (i.e. the scenario where shuffle is not needed and partitions in input can be mapped 1:1 to partitions in output).
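The first point can be illustrated with a small hypothetical model (this is not Calcite's API; the class and method names are invented for the example): after an equi-join of two inputs that are each partitioned on the join key, both the left and the right key ordinal are valid distribution keys of the output, yet a single-distribution trait set records only one of them.

```java
import java.util.List;

// Hypothetical mini-model (not Calcite's API) of join output distribution.
public class JoinDistributionSketch {

  // Output column ordinals that are valid distribution keys after an equi-join
  // of two inputs, each partitioned on its join key. The right input's columns
  // are appended after the left input's columns in the join output row.
  public static List<Integer> validDistributionKeys(int leftKey, int leftFieldCount, int rightKey) {
    return List.of(leftKey, leftFieldCount + rightKey);
  }

  // What a single-RelDistribution trait set would keep: only one of the keys
  // (this PR keeps only the LEFT-side trait; see the TODO section above).
  public static int keptDistributionKey(int leftKey) {
    return leftKey;
  }

  public static void main(String[] args) {
    // Left input has 3 columns and is joined on column 0; right input is joined
    // on its column 0, which lands at output ordinal 3.
    System.out.println(validDistributionKeys(0, 3, 0)); // [0, 3]
    System.out.println(keptDistributionKey(0));         // 0
  }
}
```
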

In the linked PR above you can refer to the JSON test file changes to see how the plan changes. I remember that I had gotten all the UTs working. I abandoned this at the time because some of the other component design was not finalized, so it was hard to get consensus on this big a change.

We can discuss this in a call perhaps.

Contributor Author

This is a good point. I think the entire way we handle distribution and exchanges needs to be revisited.

Status quo

We currently:

  1. explicitly insert logical exchanges wherever we might require data shuffling; and
  2. then determine whether each of those exchanges is a real data shuffle or a possible passthrough;
  3. then determine whether to assign more or fewer servers to run either side of the exchange.
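Step 2 above can be sketched roughly as follows (hypothetical names, not the engine's actual code): an inserted logical exchange degenerates into a passthrough when the sender and receiver run on the identical ordered list of servers and the data is already distributed the way the receiver requires; otherwise it stays a real shuffle.

```java
import java.util.List;

// Hypothetical sketch of classifying an inserted logical exchange.
public class ExchangeKindSketch {

  public enum Kind { SHUFFLE, PASSTHROUGH }

  public static Kind classify(List<String> senderServers, List<String> receiverServers,
      boolean distributionAlreadySatisfied) {
    if (distributionAlreadySatisfied && senderServers.equals(receiverServers)) {
      // Partitions map 1:1 between the two stages: no data movement needed.
      return Kind.PASSTHROUGH;
    }
    return Kind.SHUFFLE;
  }

  public static void main(String[] args) {
    List<String> servers = List.of("server1", "server2");
    System.out.println(classify(servers, servers, true));                  // PASSTHROUGH
    System.out.println(classify(servers, List.of("server3"), true));       // SHUFFLE
    System.out.println(classify(servers, servers, false));                 // SHUFFLE
  }
}
```
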

Current solution

This PR only addresses step 2: giving the planner a better idea of whether the RelDistribution is the same before and after an exchange. For this purpose, it is OK (at this point) to keep only one of the two sides for a JOIN Rel.

Needs revisit

There are several problems:

  1. Should we explicitly insert exchanges, or should we use other abstractions?
    • IMO there are other, more "Calcite-suggested" ways to manage Exchange insertion instead of applying it during optimization.
  2. Should the exchange insertion be RelDistribution-based or physical-based?
    • We are mixing the concepts behind Exchange usage: it can mean (1) altering the logical RelDistribution, (2) indicating there could potentially be physical layout differences, or (3) whether we can apply leaf-stage optimization.
    • Although (3) will be addressed by [multistage][feature] leaf planning with multi-semi join support #11937, we should consider whether to still use ExchangeNode as our abstraction or create our own to avoid confusion.
  3. Should we apply the RelDistribution trait before or after Exchange insertion?
    • Currently we have to do this after insertion, but if we address question (2) we can potentially apply it beforehand.

Ultimately, utilizing Exchange was a quick decision during early-stage multi-stage engine development, and it might not have been the best option. It is worth taking some time to revisit.

Contributor

I'm a newbie here, but IMHO the reason RelDistribution seems insufficient for our use case is that we are not actually using Calcite as it is designed to be used. Specifically, we are not correctly using conventions to color the AST in order to decide which part is going to be executed on each node. In this talk, @devozerov indicates that Drill and Flink use Calcite in that way.

[screenshot: slide from the talk illustrating calling conventions coloring the plan tree across engines]

What I think we should be doing is to color the nodes we are sure how to color, and then use the optimizer to decide what to do with joins. Given that we don't inject metrics into Calcite, this is not a short-term solution, but it should be the long-term solution.

Contributor Author

@walterddr walterddr Feb 7, 2024

👍 your analysis is absolutely correct. Following up on this, we will continue the discussion in

Ultimately, the way we use the trait is a workaround shortcut; we should really do this properly.

Contributor

From the paper I thought that the convention is supposed to determine the engine: https://www.osti.gov/servlets/purl/1474637

"In addition to these properties, one of the main features of Calcite is the calling convention trait. Essentially, the trait represents the data processing system where the expression will be executed. Including the calling convention as a trait allows Calcite to meet its goal of optimizing transparently queries whose execution might span over different engines i.e., the convention will be treated as any other physical property."

@walterddr walterddr added the backward-incompat Referenced by PRs that introduce or fix backward compat issues label Nov 16, 2023
@Jackie-Jiang Jackie-Jiang added release-notes Referenced by PRs that need attention when compiling the next release notes Configuration Config changes (addition/deletion/change in behavior) incompatible Indicate PR that introduces backward incompatibility and removed backward-incompat Referenced by PRs that introduce or fix backward compat issues labels Nov 17, 2023
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

We also want to check if the data is actually colocated when the hint is applied. If it is too hard, we can follow it up in a separate PR

@walterddr
Contributor Author

Correct. Checking and validating colocation is the next step in #12015.

@walterddr walterddr merged commit 09da0ea into apache:master Nov 17, 2023
19 checks passed
Labels
Configuration Config changes (addition/deletion/change in behavior) feature incompatible Indicate PR that introduces backward incompatibility multi-stage Related to the multi-stage query engine release-notes Referenced by PRs that need attention when compiling the next release notes