[CALCITE-2970] Add abstractConverter only between derived and required traitset#1860
[CALCITE-2970] Add abstractConverter only between derived and required traitset#1860hsyuan wants to merge 3 commits intoapache:masterfrom hsyuan:CALCITE-2970
Conversation
| return (state & DERIVED) == DERIVED; | ||
| } | ||
|
|
||
| public boolean isRequired() { |
There was a problem hiding this comment.
I am confused here. Is RelSubSet' state either DERIVED or REQUIRED?
There was a problem hiding this comment.
The traitset can be both required and derived.
There was a problem hiding this comment.
Maybe we can figure out a way that makes it more clear.
There was a problem hiding this comment.
I also feel it's confusing to have REQUIRE/DERIVED status on RelSubset.
| + " EnumerableSort(sort0=[$4], dir0=[ASC])\n" | ||
| + " EnumerableCalc(expr#0..3=[{inputs}], expr#4=[CAST($t2):VARCHAR(32) NOT NULL], proj#0..4=[{exprs}])\n" | ||
| + " EnumerableInterpreter\n" | ||
| + " BindableTableScan(table=[[STREAM_JOINS, ORDERS, (STREAM)]])\n" |
There was a problem hiding this comment.
There are some cases that changed from HashJoin to MergeJoin. Are they expected?
There was a problem hiding this comment.
I'm confused by the MergeJoin is better, is there any problem with the cost estimation ?
There was a problem hiding this comment.
The cost model is orthogonal with this change.
There was a problem hiding this comment.
Previously MergeJoin was not taken because it did not really try to sort inputs. In other words, MergeJoin was taken only in case both inputs were already pre-sorted (which was only happening for literal VALUES).
I guess now it "abstract-converts" non-sorted rels to the sorted state, so MergeJoin can really succeed.
@hsyuan , is it the case?
There was a problem hiding this comment.
This change reduces the amount the abstract converters and before this patch, the EnumerableMergeJoinRule did try to convert the input convention, i guess it is because there are redundant converters there so the cost estimation is affected.
There was a problem hiding this comment.
If you turn on abstract converter in master branch, it generates the same plan.
| checkJoinNWay(1); | ||
| checkJoinNWay(3); | ||
| checkJoinNWay(6); | ||
| checkJoinNWay(13); |
There was a problem hiding this comment.
From Travis:
4.1sec, org.apache.calcite.test.JdbcTest > testJoinManyWay()
This is impressive!
| * | ||
| * @return Whether it is an enforcer operator | ||
| */ | ||
| boolean isEnforcer(); |
There was a problem hiding this comment.
Does it make sense to add default implementation?
core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableConvention.java
Show resolved
Hide resolved
| * @return Whether it is an enforcer operator | ||
| */ | ||
| default boolean isEnforcer() { | ||
| return false; |
There was a problem hiding this comment.
Should we add this default interface ? There is already a default value in AbstractRelNode#isEnforcer.
There was a problem hiding this comment.
we can remove AbstractRelNode. isEnforcer
|
|
||
| // it may be required only, or both derived and required, | ||
| // in which case, register again. | ||
| if (otherSubset.isRequired()) { |
There was a problem hiding this comment.
The first character should be upper case?
| RelSubset getOrCreateSubset(RelOptCluster cluster, RelTraitSet traits) { | ||
| return getOrCreateSubset(cluster, traits, false /* required */); | ||
| } | ||
|
|
There was a problem hiding this comment.
Should delete the comment /* required */?
| */ | ||
| boolean canConvertConvention(Convention toConvention); | ||
| default boolean canConvertConvention(Convention toConvention) { | ||
| return false; |
There was a problem hiding this comment.
CALCITE-1148 introduced it. It is just a default value of the interface, no behavior changes.
| RelTraitSet toTraits); | ||
| default boolean useAbstractConvertersForConversion(RelTraitSet fromTraits, | ||
| RelTraitSet toTraits) { | ||
| return true; |
There was a problem hiding this comment.
Why do we want to change the base method?
There was a problem hiding this comment.
CALCITE-1148 introduced it, which is a walk-around for the inefficiency. Now we can turn it on by default.
| } | ||
|
|
||
| /** | ||
| * If the subset is required, convert derived subsets to this subset. |
There was a problem hiding this comment.
It's a little confusing to say "required subset" or "derived subset". I think what you mean is required/derived trait from subset.
There was a problem hiding this comment.
It is ok to say that. Because we will call relsubset. isRequired() and relsubset.isDerived() on RelSubSet.
There was a problem hiding this comment.
How about we rename the relsubset. isRequired() to relsubset. isTraitSetRequired()
There was a problem hiding this comment.
I think it is good to call isRequired. Because it is the state of the RelSubset.
| return (state & DERIVED) == DERIVED; | ||
| } | ||
|
|
||
| public boolean isRequired() { |
There was a problem hiding this comment.
I also feel it's confusing to have REQUIRE/DERIVED status on RelSubset.
| } | ||
|
|
||
| @Override public boolean isEnforcer() { | ||
| return offset == null && fetch == null |
There was a problem hiding this comment.
why offset and fetch have to be null? I feel that isEnforcer should be something passed down when the RelNode is created. It's hard to tell by itself that if an operator is served as an enforcer.
There was a problem hiding this comment.
If they are not null, that means it is a LIMIT operator, which is not an enforcer. I already give the definition of Enforcer in RelNode interface. An enforcer should be known when it is created, if the operator can't tell by itself it is an enforcer or not, it is either the design's problem, or just leave it as a non-enforcer.
There was a problem hiding this comment.
This is the definition from you - "As an enforcer, the operator must be created only when required traitSet is not satisfied by its input." So it sounds to me that only the caller (a RelOptRule or the framework) who creates the RelNode would know if the RelNode is a converter or not. If this information is not passed into the RelNode, how can we just derive this information from RelNode itself? In this particular case, if Sort is used in ORDER BY ... LIMIT ... scenario, it's still an enforcer. No?
There was a problem hiding this comment.
Good question, the answer is no. The sort in your example is created no matter it satisfies parent's required trait or not. So this is not an enforcer.
There was a problem hiding this comment.
What if just ORDER BY ...? So the Sort operator all the sudden become an enforcer when LIMIT is not presented?
There was a problem hiding this comment.
Basically it is a limit operator. The parent of LIMIT doesn't require anything from it.
There was a problem hiding this comment.
The collation the required by the root, same as the ORDER BY case.
There was a problem hiding this comment.
No, the collation is required by LIMIT.
There was a problem hiding this comment.
I know it is confusing, but it is the design (Sort operator mixes both sort and limit operator) that leads to the confusion. We should have a separate LIMIT operator, SORT operator should not have limit and offset.
There was a problem hiding this comment.
When we have a query select * from foo limit 5, there is still a sort operator in the plan, but it doesn't do any sort work.
core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java
Outdated
Show resolved
Hide resolved
…d traitset Before this patch, the VolcanoPlanner couldn't distinguish traitset derived from child operators and traitset required by parent operators. AbstractConverters are added between all of these traitsets no matter it is derived or required, which causes the explosion of search space. e.g. SELECT a,b,c,max(d) FROM foo GROUP BY a,b,c; Aggregate +-- TableScan For distributed system, suppose the Aggregate operator may require the following traitsets from TableScan with exact match: - Singleton distribution - Hash distribution on a - Hash distribution on b - Hash distribution on c - Hash distribution on a,b - Hash distribution on b,c - Hash distribution on a,c - Hash distribution on a,b,c VolcanoPlanner would add 7*7+8 = 57 abstract converters into the RelSet, e.g. abstractConverter between [a] and [b,c], even if the satisfying match is allowed, e.g. distribution on [a] statisfy distribution on [a,b,c], there are still lots of abstract converters. But we only need 8. This patch fixes above issue by adding state to RelSubset indicating whether the added traitset is required or derived. The traitset can be both required and derived. Only abstract converter from derived traitset to required traitset is added. By default, when adding a new RelNode to RelSet, we treat its traitset as derived, when calling changeTraits, the traitset will be treated as required. Unfortunately, almost all the RelNodes except AbstractConverter are added through rule transformation, when the AbstractConverter is transformed to a enforcing operator, e.g. PhysicalSort, the planner will still treat its traitset as derived, which will trigger the creation of AbstractConverter between this RelSubset and remaining RelSubsets in the RelSet. To avoid this issue, though not clean but work, enforcing operator and AbstactConverter should override isEnforcer() method to indicate the RelNode is added due to the desired traitset is not satisfied. The user needs to judge by his/her own whether to mark enforcing operator. Close #1860
…ions to single collation Just add test cases for JIRA CALCITE-2593 and CALCITE-2010, which is actually fixed by f17367e (PR #1860), because parent RelNode never requires MultipleTrait from child RelNode. Close #1913
…ollations to single collation Just add test cases for JIRA CALCITE-2593 and CALCITE-2010, which is actually fixed by f17367e (PR #1860). But if we turn off abstract converter for EnumerableConvention, these problems still exist. The root cause is that EnumerableAggregate and EnumerableUnion make collation request to its children, but actually they don't require any collation. The fundamental change is fixing RelCompositeTrait, but that is a long never end discussion. Close #1914
JIRA: https://issues.apache.org/jira/browse/CALCITE-2970
Before this patch, the VolcanoPlanner couldn't distinguish traitset derived
from child operators and traitset required by parent operators.
AbstractConverters are added between all of these traitsets no matter it is
derived or required, which causes the explosion of search space. e.g.
SELECT a,b,c,max(d) FROM foo GROUP BY a,b,c;
Aggregate
+-- TableScan
For distributed system, suppose the Aggregate operator may require the
following traitsets from TableScan with exact match:
VolcanoPlanner would add 7*7+8 = 57 abstract converters into the RelSet, e.g.
abstractConverter between [a] and [b,c], even if the satisfying match is
allowed, e.g. distribution on [a] statisfy distribution on [a,b,c], there are
still lots of abstract converters. But we only need 8.
This patch fixes above issue by adding state to RelSubset indicating whether
the added traitset is required or derived. The traitset can be both required
and derived. Only abstract converter from derived traitset to required traitset
is added.
By default, when adding a new RelNode to RelSet, we treat its traitset as
derived, when calling changeTraits, the traitset will be treated as required.
Unfortunately, almost all the RelNodes except AbstractConverter are added
through rule transformation, when the AbstractConverter is transformed to a
enforcing operator, e.g. PhysicalSort, the planner will still treat its
traitset as derived, which will trigger the creation of AbstractConverter
between this RelSubset and remaining RelSubsets in the RelSet. To avoid this
issue, though not clean but work, enforcing operator and AbstactConverter
should override
isEnforcer()method indicating the RelNode is added due tothe desired traitset is not satisfied. The user needs to judge by his/her own
whether to mark enforcing operator.