[CALCITE-2970] Add abstractConverter only between derived and required traitset by hsyuan · Pull Request #1860 · apache/calcite

hsyuan · 2020-03-18T19:33:21Z

JIRA: https://issues.apache.org/jira/browse/CALCITE-2970

Before this patch, the VolcanoPlanner couldn't distinguish traitsets derived
from child operators from traitsets required by parent operators.
AbstractConverters were added between all of these traitsets, whether they
were derived or required, which caused an explosion of the search space. e.g.

SELECT a,b,c,max(d) FROM foo GROUP BY a,b,c;
Aggregate
+-- TableScan

For a distributed system, suppose the Aggregate operator may require the
following traitsets from TableScan with exact match:

Singleton distribution
Hash distribution on a
Hash distribution on b
Hash distribution on c
Hash distribution on a,b
Hash distribution on b,c
Hash distribution on a,c
Hash distribution on a,b,c

VolcanoPlanner would add 7*7+8 = 57 abstract converters into the RelSet, e.g.
an abstractConverter between [a] and [b,c]. Even if satisfying matches are
allowed, e.g. a distribution on [a] satisfies a distribution on [a,b,c], there
are still lots of abstract converters. But we only need 8.

This patch fixes the above issue by adding state to RelSubset indicating
whether the added traitset is required or derived. A traitset can be both
required and derived. An abstract converter is added only from a derived
traitset to a required traitset.
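
The derived/required rule can be sketched in isolation. The following is a minimal illustration, not the planner's code: `Subset`, `countConverters` and `example` are hypothetical names, the bit values of `DERIVED` and `REQUIRED` are assumed, and the single derived traitset labelled "any" is an assumption about what TableScan delivers. It only shows why linking derived traitsets to required ones yields 8 converters for the example above.

```java
import java.util.ArrayList;
import java.util.List;

public class SubsetStateSketch {
  // Flag names follow the patch; the concrete bit values are assumed.
  public static final int DERIVED = 1;
  public static final int REQUIRED = 2;

  /** Hypothetical stand-in for RelSubset: a trait label plus a state bit mask. */
  public static final class Subset {
    final String traits;
    int state;

    public Subset(String traits, int state) {
      this.traits = traits;
      this.state = state;
    }

    public boolean isDerived() {
      return (state & DERIVED) == DERIVED;
    }

    public boolean isRequired() {
      return (state & REQUIRED) == REQUIRED;
    }
  }

  /** Counts converters when only derived -> required pairs of distinct subsets are linked. */
  public static int countConverters(List<Subset> subsets) {
    int count = 0;
    for (Subset from : subsets) {
      for (Subset to : subsets) {
        if (from != to && from.isDerived() && to.isRequired()) {
          count++;
        }
      }
    }
    return count;
  }

  /** One derived subset (assumed scan output) plus the eight required ones from the example. */
  public static List<Subset> example() {
    List<Subset> subsets = new ArrayList<>();
    subsets.add(new Subset("any", DERIVED));
    String[] required = {
      "singleton", "hash[a]", "hash[b]", "hash[c]",
      "hash[a,b]", "hash[b,c]", "hash[a,c]", "hash[a,b,c]"
    };
    for (String traits : required) {
      subsets.add(new Subset(traits, REQUIRED));
    }
    return subsets;
  }

  public static void main(String[] args) {
    System.out.println(countConverters(example())); // prints 8
  }
}
```

Because the state is a bit mask, a subset whose traitset is both delivered by a child and requested by a parent simply carries both flags.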

By default, when a new RelNode is added to a RelSet, its traitset is treated
as derived; when changeTraits is called, the traitset is treated as required.
Unfortunately, almost all RelNodes except AbstractConverter are added through
rule transformation. When an AbstractConverter is transformed into an
enforcing operator, e.g. PhysicalSort, the planner would still treat its
traitset as derived, which would trigger the creation of AbstractConverters
between this RelSubset and the remaining RelSubsets in the RelSet. To avoid
this issue (not a clean solution, but it works), enforcing operators and
AbstractConverter should override the isEnforcer() method to indicate that the
RelNode was added because the desired traitset was not satisfied. Users need
to judge for themselves whether to mark an operator as an enforcer.
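
As a rough sketch of how the enforcer flag could feed the subset state, assuming the planner passes `isEnforcer()` as the "required" argument when it registers a node: `Node`, `Scan`, `PhysicalSort` and `SetSketch` below are hypothetical stand-ins, not Calcite classes, and the bit values are assumed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EnforcerSketch {
  public static final int DERIVED = 1;  // assumed bit values
  public static final int REQUIRED = 2;

  /** Hypothetical stand-in for RelNode; only the enforcer flag matters here. */
  public interface Node {
    String traits();

    default boolean isEnforcer() {
      return false;
    }
  }

  /** An ordinary operator: its traitset counts as derived. */
  public static final class Scan implements Node {
    public String traits() {
      return "unsorted";
    }
  }

  /** An enforcing operator, created only because a required traitset was not satisfied. */
  public static final class PhysicalSort implements Node {
    public String traits() {
      return "sorted[a]";
    }

    @Override public boolean isEnforcer() {
      return true;
    }
  }

  /** Hypothetical stand-in for RelSet: maps a trait label to its state bit mask. */
  public static final class SetSketch {
    public final Map<String, Integer> states = new LinkedHashMap<>();

    /** Registers a node; an enforcer does not mark its traitset as derived. */
    public void add(Node node) {
      int flag = node.isEnforcer() ? REQUIRED : DERIVED;
      states.merge(node.traits(), flag, (a, b) -> a | b);
    }

    /** Records a request from a parent, as changeTraits would. */
    public void require(String traits) {
      states.merge(traits, REQUIRED, (a, b) -> a | b);
    }
  }

  public static void main(String[] args) {
    SetSketch set = new SetSketch();
    set.add(new Scan());
    set.require("sorted[a]");
    set.add(new PhysicalSort());
    System.out.println(set.states); // prints {unsorted=1, sorted[a]=2}
  }
}
```

In this sketch the sorted subset never gains the derived flag, so no further converters would fan out from it to the other subsets in the set.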

core/src/main/java/org/apache/calcite/plan/RelEnforcer.java

chunweilei · 2020-03-19T02:44:20Z

core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java

+    return (state & DERIVED) == DERIVED;
+  }
+
+  public boolean isRequired() {


I am confused here. Is a RelSubset's state either DERIVED or REQUIRED?

The traitset can be both required and derived.

Maybe we can figure out a way that makes it more clear.

I also feel it's confusing to have REQUIRE/DERIVED status on RelSubset.

chunweilei · 2020-03-19T02:46:21Z

core/src/test/java/org/apache/calcite/test/StreamTest.java

+            + "    EnumerableSort(sort0=[$4], dir0=[ASC])\n"
+            + "      EnumerableCalc(expr#0..3=[{inputs}], expr#4=[CAST($t2):VARCHAR(32) NOT NULL], proj#0..4=[{exprs}])\n"
+            + "        EnumerableInterpreter\n"
+            + "          BindableTableScan(table=[[STREAM_JOINS, ORDERS, (STREAM)]])\n"


There are some cases that changed from HashJoin to MergeJoin. Are they expected?

I think so.

I'm confused about why MergeJoin is better. Is there a problem with the cost estimation?

The cost model is orthogonal to this change.

Previously MergeJoin was not taken because it did not really try to sort inputs. In other words, MergeJoin was taken only in case both inputs were already pre-sorted (which was only happening for literal VALUES).

I guess now it "abstract-converts" non-sorted rels to the sorted state, so MergeJoin can really succeed.

@hsyuan , is it the case?

@vlsi Correct.

This change reduces the number of abstract converters. Before this patch, the EnumerableMergeJoinRule did try to convert the input convention; I guess the cost estimation was affected because there were redundant converters.

If you turn on abstract converters in the master branch, it generates the same plan.

Thanks, got your idea ~

vlsi · 2020-03-19T18:08:22Z

core/src/test/java/org/apache/calcite/test/JdbcTest.java

    checkJoinNWay(1);
    checkJoinNWay(3);
-    checkJoinNWay(6);
+    checkJoinNWay(13);


From Travis:

4.1sec, org.apache.calcite.test.JdbcTest > testJoinManyWay()

This is impressive!

vlsi · 2020-03-19T18:11:36Z

core/src/main/java/org/apache/calcite/rel/RelNode.java

+   *
+   * @return Whether it is an enforcer operator
+   */
+  boolean isEnforcer();


Does it make sense to add default implementation?

Yes, will do.

core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableConvention.java

danny0405 · 2020-03-20T04:04:44Z

core/src/main/java/org/apache/calcite/rel/RelNode.java

+   * @return Whether it is an enforcer operator
+   */
+  default boolean isEnforcer() {
+    return false;


Should we add this default method to the interface? There is already a default value in AbstractRelNode#isEnforcer.

We can remove AbstractRelNode.isEnforcer.

chunweilei · 2020-03-24T08:32:59Z

core/src/main/java/org/apache/calcite/plan/volcano/RelSet.java

+
+      // it may be required only, or both derived and required,
+      // in which case, register again.
+      if (otherSubset.isRequired()) {


The first character should be upper case?

fixed, thanks.

chunweilei · 2020-03-24T08:38:05Z

core/src/main/java/org/apache/calcite/plan/volcano/RelSet.java

+  RelSubset getOrCreateSubset(RelOptCluster cluster, RelTraitSet traits) {
+    return getOrCreateSubset(cluster, traits, false /* required */);
+  }
+


Should delete the comment /* required */?

done, thanks

xndai · 2020-03-30T23:34:41Z

core/src/main/java/org/apache/calcite/plan/Convention.java

   */
-  boolean canConvertConvention(Convention toConvention);
+  default boolean canConvertConvention(Convention toConvention) {
+    return false;


CALCITE-1148 introduced it. It is just a default value of the interface, no behavior changes.

xndai · 2020-03-30T23:35:01Z

core/src/main/java/org/apache/calcite/plan/Convention.java

-      RelTraitSet toTraits);
+  default boolean useAbstractConvertersForConversion(RelTraitSet fromTraits,
+      RelTraitSet toTraits) {
+    return true;


Why do we want to change the base method?

CALCITE-1148 introduced it as a workaround for the inefficiency. Now we can turn it on by default.

xndai · 2020-03-30T23:45:29Z

core/src/main/java/org/apache/calcite/plan/volcano/RelSet.java

  }

+  /**
+   * If the subset is required, convert derived subsets to this subset.


It's a little confusing to say "required subset" or "derived subset". I think what you mean is required/derived trait from subset.

It is OK to say that, because we call relsubset.isRequired() and relsubset.isDerived() on RelSubset.

How about we rename relsubset.isRequired() to relsubset.isTraitSetRequired()?

I think it is good to call isRequired. Because it is the state of the RelSubset.

xndai · 2020-03-30T23:50:18Z

core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java

+    return (state & DERIVED) == DERIVED;
+  }
+
+  public boolean isRequired() {


I also feel it's confusing to have REQUIRE/DERIVED status on RelSubset.

xndai · 2020-03-31T01:09:17Z

core/src/main/java/org/apache/calcite/rel/core/Sort.java

  }

+  @Override public boolean isEnforcer() {
+    return offset == null && fetch == null


Why do offset and fetch have to be null? I feel that isEnforcer should be something passed down when the RelNode is created. It's hard to tell from the operator itself whether it serves as an enforcer.

If they are not null, that means it is a LIMIT operator, which is not an enforcer. I already gave the definition of an enforcer in the RelNode interface. An enforcer should be known when it is created. If the operator can't tell by itself whether it is an enforcer, either that is a design problem or it should just be left as a non-enforcer.

This is the definition from you - "As an enforcer, the operator must be created only when required traitSet is not satisfied by its input." So it sounds to me that only the caller (a RelOptRule or the framework) that creates the RelNode would know whether the RelNode is a converter or not. If this information is not passed into the RelNode, how can we derive it from the RelNode itself? In this particular case, if Sort is used in an ORDER BY ... LIMIT ... scenario, it's still an enforcer. No?

Good question, the answer is no. The sort in your example is created regardless of whether it satisfies the parent's required trait. So it is not an enforcer.

What if it is just ORDER BY ...? So the Sort operator all of a sudden becomes an enforcer when LIMIT is not present?

Basically it is a limit operator. The parent of LIMIT doesn't require anything from it.

The collation is required by the root, same as the ORDER BY case.

No, the collation is required by LIMIT.

I know it is confusing, but it is the design (the Sort operator mixes the sort and limit operators) that leads to the confusion. We should have a separate LIMIT operator, and the SORT operator should not have limit and offset.

When we have a query select * from foo limit 5, there is still a sort operator in the plan, but it doesn't do any sort work.
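
The three cases discussed in this thread can be laid out side by side. This is a simplified sketch under stated assumptions: `SortSketch` is a hypothetical class, offset and fetch are plain `Integer` values rather than the planner's expression types, and the check uses only the part of the condition visible in the diff excerpt above, which is cut off and may contain further clauses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortEnforcerSketch {
  /** Hypothetical, simplified Sort: collation keys plus optional offset and fetch. */
  public static final class SortSketch {
    final List<String> collation;
    final Integer offset; // null when the query has no OFFSET
    final Integer fetch;  // null when the query has no LIMIT

    public SortSketch(List<String> collation, Integer offset, Integer fetch) {
      this.collation = collation;
      this.offset = offset;
      this.fetch = fetch;
    }

    /** A pure sort can act as an enforcer; anything carrying a limit is treated as a LIMIT operator. */
    public boolean isEnforcer() {
      return offset == null && fetch == null;
    }
  }

  public static void main(String[] args) {
    // ORDER BY a
    SortSketch orderBy = new SortSketch(Arrays.asList("a"), null, null);
    // ORDER BY a LIMIT 5
    SortSketch orderByLimit = new SortSketch(Arrays.asList("a"), null, 5);
    // LIMIT 5 with no ordering: still a Sort node, but it does no sort work
    SortSketch limitOnly = new SortSketch(Collections.<String>emptyList(), null, 5);

    System.out.println(orderBy.isEnforcer());      // prints true
    System.out.println(orderByLimit.isEnforcer()); // prints false
    System.out.println(limitOnly.isEnforcer());    // prints false
  }
}
```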

core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java


…ions to single collation Just add test cases for JIRA CALCITE-2593 and CALCITE-2010, which is actually fixed by f17367e (PR #1860), because parent RelNode never requires MultipleTrait from child RelNode. Close #1913

…ollations to single collation Just add test cases for JIRA CALCITE-2593 and CALCITE-2010, which is actually fixed by f17367e (PR #1860). But if we turn off abstract converter for EnumerableConvention, these problems still exist. The root cause is that EnumerableAggregate and EnumerableUnion make collation request to its children, but actually they don't require any collation. The fundamental change is fixing RelCompositeTrait, but that is a long never end discussion. Close #1914

hsyuan added the slow-tests-needed label Mar 18, 2020

chunweilei reviewed Mar 19, 2020

chunweilei reviewed Mar 19, 2020

vlsi reviewed Mar 19, 2020

danny0405 reviewed Mar 20, 2020

danny0405 reviewed Mar 20, 2020

chunweilei reviewed Mar 24, 2020

hsyuan force-pushed the master branch from e42a79b to a0ef3c9 Compare March 29, 2020 16:52

xndai reviewed Mar 31, 2020

rubenada reviewed Mar 31, 2020

xndai mentioned this pull request Apr 6, 2020

[CALCITE-3972] Allow RelBuilder to create RelNode with convention and use it for trait convert #1884

Closed

hsyuan added the LGTM-will-merge-soon label Apr 9, 2020

rubenada mentioned this pull request Apr 10, 2020

[CALCITE-3833] Support SemiJoin in EnumerableMergeJoin #1883

Merged

hsyuan added 3 commits April 12, 2020 13:30

Merge relsubset state when merging relset

342060c

Address comments

7187ad7

hsyuan closed this in f17367e Apr 12, 2020

hsyuan deleted the CALCITE-2970 branch April 12, 2020 19:15

hsyuan mentioned this pull request Apr 12, 2020

[CALCITE-2593] [CALCITE-2010] Error when transforming multiple collations to single collation #1913

Closed

hsyuan mentioned this pull request Apr 13, 2020

[CALCITE-2593] [CALCITE-2010] Plan error when transforming multiple collations to single collation #1914

Closed
