
[multistage][feature] support RelDistribution trait planning #11976

Merged
merged 9 commits into apache:master on Nov 17, 2023

Conversation

walterddr
Contributor

@walterddr walterddr commented Nov 8, 2023

Description

This PR covers #11831 and provides a RelDistribution trait for each logical plan node.

Backward Incompat

  • the is_colocated_by_join_keys hint is now required for making colocated joins
    • this should only affect semi-joins, because they are the only join type utilizing broadcast exchange that was previously pulled to act as a direct exchange.
    • inner/left/right/full joins apply colocation automatically, so the backward incompatibility should not affect them.

Details

  • Also apply optimization based on the distribution trait in the mailbox/worker assignment stage
    • Previously, direct exchange was decided based on the table partition hint; now it is decided via the distribution trait: direct exchange is applied if and only if the propagated trait matches the exchange requirement.
    • As a side effect, we need to reintroduce the is_colocated_by_join_keys query option to ensure dynamic broadcast can also benefit from the direct exchange optimization
  • Allow propagation of partition distribution trait info across the tree, to be used during the Physical Planning phase in the following scenarios (to be followed up in separate PRs):
    • unnecessary shuffling can be changed into a direct exchange (if the lists of servers are identical)
    • multiple exchanges can be collapsed
    • more leaf-stage optimization rules can be run.
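To make the if-and-only-if rule above concrete, here is a minimal, self-contained sketch; the class and method names are hypothetical and a hash distribution is reduced to just its list of key ordinals, so this is not Pinot's actual planner code:

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the direct-exchange decision: an exchange becomes a
// direct exchange if and only if the distribution trait propagated from the
// sender already matches the distribution the receiver requires.
public class DirectExchangeSketch {

  // A hash distribution is modeled here as just the list of key ordinals.
  public static boolean canBeDirect(List<Integer> propagatedKeys, List<Integer> requiredKeys) {
    // No known distribution on the sender side: we must do a real shuffle.
    if (propagatedKeys == null) {
      return false;
    }
    // Direct exchange only when the propagated trait equals the requirement.
    return Objects.equals(propagatedKeys, requiredKeys);
  }

  public static void main(String[] args) {
    // Sender already hash-distributed on column 0, receiver requires column 0: direct.
    System.out.println(canBeDirect(List.of(0), List.of(0))); // true
    // Sender distributed on column 1, receiver requires column 0: real shuffle.
    System.out.println(canBeDirect(List.of(1), List.of(0))); // false
  }
}
```
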

TODO

  • This PR only allows LEFT-side trait propagation for JOIN nodes.
  • This PR does not support trait propagation for SET-UNION / WINDOW nodes.
  • The physical plan does not contain partition function info (see the newly added ignored tests).

Long-Term Plan

see: #12012

@codecov-commenter

codecov-commenter commented Nov 8, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (62abf1f) 61.65% compared to head (64daf29) 61.62%.
Report is 8 commits behind head on master.

Files Patch % Lines
...he/pinot/query/planner/logical/LogicalPlanner.java 33.33% 2 Missing ⚠️
.../pinot/query/planner/plannode/MailboxSendNode.java 81.81% 1 Missing and 1 partial ⚠️
.../query/planner/logical/RelToPlanNodeConverter.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #11976      +/-   ##
============================================
- Coverage     61.65%   61.62%   -0.03%     
- Complexity     1150     1152       +2     
============================================
  Files          2385     2385              
  Lines        129250   129324      +74     
  Branches      20007    20022      +15     
============================================
+ Hits          79690    79701      +11     
- Misses        43748    43818      +70     
+ Partials       5812     5805       -7     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 61.58% <94.11%> (-0.03%) ⬇️
java-21 61.50% <94.11%> (+<0.01%) ⬆️
skip-bytebuffers-false 61.61% <94.11%> (-0.02%) ⬇️
skip-bytebuffers-true 61.47% <94.11%> (+<0.01%) ⬆️
temurin 61.62% <94.11%> (-0.03%) ⬇️
unittests 61.62% <94.11%> (-0.03%) ⬇️
unittests1 46.96% <94.11%> (+0.02%) ⬆️
unittests2 27.57% <0.00%> (-0.06%) ⬇️


@walterddr walterddr added feature multi-stage Related to the multi-stage query engine labels Nov 9, 2023
* deprecated. The idea is to associate every node with a RelDistribution derived from {@link RelNode#getInputs()}
* or from the node itself (via hints, or special handling of the type of node in question).
*/
public class PinotRelDistributionTraitRule extends RelOptRule {
Contributor
I had worked on something similar earlier this year, and had come to the conclusion that Calcite's RelDistribution and Exchange are not good enough for some of our use cases, for the following reasons:

  • A given RelNode can practically have multiple RelDistributions. E.g. in a join node where both inputs are table scans partitioned by the join key, the join output can be said to be distributed on both LeftTable.key and RightTable.key. But given how RelDistribution is modeled, we can only keep information about one of the keys, despite the fact that RelDistribution is a RelMultipleTrait. For some reason the TraitSet only keeps one RelDistribution (I forgot the reasoning for this)
  • I had to also build my own Exchange nodes, because iirc I wanted to have "Identity" exchange support (i.e. the scenario where shuffle is not needed and partitions in input can be mapped 1:1 to partitions in output).
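The first point can be illustrated with a small hypothetical model (this is not Calcite's API; the class and method names are invented for the example): after an equi-join of two inputs that are each partitioned on the join key, both the left and the right key ordinal are valid distribution keys of the output, yet a single-distribution trait set records only one of them.

```java
import java.util.List;

// Hypothetical mini-model (not Calcite's API) of join output distribution.
public class JoinDistributionSketch {

  // Output column ordinals that are valid distribution keys after an equi-join
  // of two inputs, each partitioned on its join key. The right input's columns
  // are appended after the left input's columns in the join output row.
  public static List<Integer> validDistributionKeys(int leftKey, int leftFieldCount, int rightKey) {
    return List.of(leftKey, leftFieldCount + rightKey);
  }

  // What a single-RelDistribution trait set would keep: only one of the keys
  // (this PR keeps only the LEFT-side trait; see the TODO section above).
  public static int keptDistributionKey(int leftKey) {
    return leftKey;
  }

  public static void main(String[] args) {
    // Left input has 3 columns and is joined on column 0; right input is joined
    // on its column 0, which lands at output ordinal 3.
    System.out.println(validDistributionKeys(0, 3, 0)); // [0, 3]
    System.out.println(keptDistributionKey(0));         // 0
  }
}
```
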

In the linked PR above you can refer to the JSON test file changes to see how the plan changes. I remember that I had gotten all the UTs working. I abandoned this at the time because some of the other component design was not finalized, so it was hard to get consensus on this big a change.

We can discuss this in a call perhaps.

Contributor Author

This is a good point. I think the entire way we handle distribution and exchanges needs to be revisited.

Status quo

We currently:

  1. explicitly insert logical exchanges wherever we might require data shuffling; and
  2. then determine whether each of those exchanges is a real data shuffle or a possible passthrough;
  3. then determine whether to assign more or fewer servers to run either side of the exchange.
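Step 2 above can be sketched roughly as follows (hypothetical names, not the engine's actual code): an inserted logical exchange degenerates into a passthrough when the sender and receiver run on the identical ordered list of servers and the data is already distributed the way the receiver requires; otherwise it stays a real shuffle.

```java
import java.util.List;

// Hypothetical sketch of classifying an inserted logical exchange.
public class ExchangeKindSketch {

  public enum Kind { SHUFFLE, PASSTHROUGH }

  public static Kind classify(List<String> senderServers, List<String> receiverServers,
      boolean distributionAlreadySatisfied) {
    if (distributionAlreadySatisfied && senderServers.equals(receiverServers)) {
      // Partitions map 1:1 between the two stages: no data movement needed.
      return Kind.PASSTHROUGH;
    }
    return Kind.SHUFFLE;
  }

  public static void main(String[] args) {
    List<String> servers = List.of("server1", "server2");
    System.out.println(classify(servers, servers, true));                  // PASSTHROUGH
    System.out.println(classify(servers, List.of("server3"), true));       // SHUFFLE
    System.out.println(classify(servers, servers, false));                 // SHUFFLE
  }
}
```
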

Current solution

This PR only addresses step 2: giving the planner a better idea of whether the RelDistribution is the same before and after an exchange. For this purpose, it is OK (at this point) to keep only one of the two sides for a JOIN Rel.

Needs revisit

There are several problems:

  1. Should we explicitly insert exchanges, or should we use other abstractions?
    • IMO there are other, more "Calcite-suggested" ways to manage Exchange insertion instead of applying it during optimization.
  2. Should the exchange insertion be RelDistribution-based or physical-based?
    • We are mixing the concepts behind Exchange usage: it can mean (1) altering the logical RelDistribution, (2) indicating there could potentially be physical layout differences, or (3) whether we can apply leaf-stage optimization.
    • Although (3) will be addressed by [multistage][feature] leaf planning with multi-semi join support #11937, we should consider whether to still use ExchangeNode as our abstraction or create our own to avoid confusion.
  3. Should we apply the RelDistribution trait before or after Exchange insertion?
    • Currently we have to do this after insertion, but if we address question (2) we can potentially apply it beforehand.

Ultimately, utilizing Exchange was a quick decision during early-stage multi-stage engine development, and it might not have been the best option. It is worth taking some time to revisit.

Contributor

I'm a newbie here, but IMHO the reason RelDistribution seems insufficient for our use case is that we are not actually using Calcite as it is designed to be used. Specifically, we are not correctly using conventions to color the AST in order to decide which part is going to be executed on each node. In this talk, @devozerov indicates that Drill and Flink use Calcite in that way.

[screenshot: slide from the talk illustrating calling conventions coloring the plan tree across engines]

What I think we should be doing is to color the nodes we are sure how to color, and then use the optimizer to decide what to do with joins. Given that we don't inject metrics into Calcite, this is not a short-term solution, but it should be the long-term solution.

Contributor Author

@walterddr walterddr Feb 7, 2024

👍 your analysis is absolutely correct. Following up on this, we will continue the discussion in

Ultimately, the way we use the trait is a workaround shortcut; we should really do this properly.

Contributor

From the paper I thought that the convention is supposed to determine the engine: https://www.osti.gov/servlets/purl/1474637

"In addition to these properties, one of the main features of Calcite is the calling convention trait. Essentially, the trait represents the data processing system where the expression will be executed. Including the calling convention as a trait allows Calcite to meet its goal of optimizing transparently queries whose execution might span over different engines i.e., the convention will be treated as any other physical property."

@walterddr walterddr added the backward-incompat Referenced by PRs that introduce or fix backward compat issues label Nov 16, 2023
@Jackie-Jiang Jackie-Jiang added release-notes Referenced by PRs that need attention when compiling the next release notes Configuration Config changes (addition/deletion/change in behavior) incompatible Indicate PR that introduces backward incompatibility and removed backward-incompat Referenced by PRs that introduce or fix backward compat issues labels Nov 17, 2023
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

We also want to check if the data is actually colocated when the hint is applied. If it is too hard, we can follow it up in a separate PR

@walterddr
Contributor Author

Correct. Checking and validating colocation is the next step in #12015.

@walterddr walterddr merged commit 09da0ea into apache:master Nov 17, 2023
19 checks passed
Labels
Configuration Config changes (addition/deletion/change in behavior) feature incompatible Indicate PR that introduces backward incompatibility multi-stage Related to the multi-stage query engine release-notes Referenced by PRs that need attention when compiling the next release notes