HIVE-24167: TPC-DS query 14 fails while generating plan for the filter #5077
return super.explainTerms(pw)
    .item("tableScanTrait", this.tableScanTrait)
    .itemIf("fromVersion", ((RelOptHiveTable) table).getHiveTableMD().getVersionIntervalFrom(),
        isNotBlank(((RelOptHiveTable) table).getHiveTableMD().getVersionIntervalFrom()));
I added some attributes that could differentiate two HiveTableScans. We may not need these ones here.
There is one potential problem in my mind. HiveTableScan doesn't retain the equivalents of TableScanDesc#getPredicateString or TableScanDesc#getRowLimit. So, we can't unify ASTNodes or RelNodes of HiveTableScan based on signatures. Otherwise, Operators will be over-unified later.
Currently, we link only HiveFilter RelNodes using RelTreeSignatures (we link HiveTableScan with its ASTNode, but don't link it with the signature). So, the existence of TableScanDesc#getPredicateString doesn't matter to us.
Also, CBO is disabled when TABLESAMPLE is used. So, we can say TableScanDesc#getRowLimit doesn't cause an issue in the world of RelNodes. I expect we could push down the row count to HiveTableScan when we support TABLESAMPLE in CBO.
So, in my current understanding, this PR doesn't cause an immediate problem.
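To illustrate the over-unification concern above, here is a minimal sketch. `ScanDesc` and both signature helpers are hypothetical stand-ins, not Hive classes; the sketch only assumes that the RelNode side sees the table while the operator side also carries a predicate string and a row limit.

```java
public class ScanSignatureSketch {
    // Hypothetical, simplified view of what a TableScan operator carries.
    static final class ScanDesc {
        final String table;
        final String predicate; // stands in for TableScanDesc#getPredicateString
        final int rowLimit;     // stands in for TableScanDesc#getRowLimit

        ScanDesc(String table, String predicate, int rowLimit) {
            this.table = table;
            this.predicate = predicate;
            this.rowLimit = rowLimit;
        }
    }

    // Signature built only from what the RelNode side retains (assumption: just the table).
    static String relLevelSignature(ScanDesc d) {
        return "TableScan(" + d.table + ")";
    }

    // Signature that also sees the operator-only attributes.
    static String operatorLevelSignature(ScanDesc d) {
        return "TableScan(" + d.table + ", " + d.predicate + ", " + d.rowLimit + ")";
    }

    public static void main(String[] args) {
        ScanDesc a = new ScanDesc("src", "key = '5'", -1);
        ScanDesc b = new ScanDesc("src", "key = '7'", -1);
        // Same RelNode-level signature, so signature-based linking would merge the two scans...
        System.out.println(relLevelSignature(a).equals(relLevelSignature(b)));
        // ...although the operators differ once the predicate is taken into account.
        System.out.println(operatorLevelSignature(a).equals(operatorLevelSignature(b)));
    }
}
```

The first comparison is true and the second is false, which is the situation where linking scans by signature would group operators that are not actually equivalent.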
planMapper.link(cond, RelTreeSignature.of(where));
planMapper.link(cond, where);
We link RelTreeSignature first so that we can safely unify multiple filters.
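A rough sketch of why linking through the signature unifies multiple filters. `ToyMapper`, `Node`, and `Signature` are hypothetical simplifications and not Hive's PlanMapper API; the only assumptions are that plan nodes are compared by identity and signatures by value.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

public class SignatureLinkSketch {
    // Hypothetical stand-in for RelTreeSignature: equal when the digest text is equal.
    static final class Signature {
        final String digest;
        Signature(String digest) { this.digest = digest; }
        @Override public boolean equals(Object o) {
            return o instanceof Signature && ((Signature) o).digest.equals(digest);
        }
        @Override public int hashCode() { return digest.hashCode(); }
    }

    // Hypothetical plan node (ASTNode or RelNode); compared by identity.
    static final class Node {
        final String name;
        Node(String name) { this.name = name; }
    }

    // Toy equivalence mapper: link(a, b) merges the groups of a and b.
    static final class ToyMapper {
        private final Map<Object, Set<Object>> groupOf = new HashMap<>();

        private Set<Object> groupFor(Object o) {
            return groupOf.computeIfAbsent(o, k -> {
                Set<Object> g = new HashSet<>();
                g.add(k);
                return g;
            });
        }

        void link(Object a, Object b) {
            Set<Object> ga = groupFor(a);
            Set<Object> gb = groupFor(b);
            if (ga == gb) return;
            ga.addAll(gb);
            for (Object member : gb) groupOf.put(member, ga);
        }

        int groupCount() {
            Set<Set<Object>> distinct = Collections.newSetFromMap(new IdentityHashMap<>());
            distinct.addAll(groupOf.values());
            return distinct.size();
        }
    }

    // Two identical filters, e.g. the same WHERE clause appearing twice in a query.
    static int groupsForTwoEqualFilters(boolean linkSignatures) {
        ToyMapper mapper = new ToyMapper();
        for (int i = 0; i < 2; i++) {
            Node cond = new Node("cond" + i);   // AST side
            Node where = new Node("where" + i); // RelNode side
            if (linkSignatures) mapper.link(cond, new Signature("Filter(key = '5')"));
            mapper.link(cond, where);
        }
        return mapper.groupCount();
    }
}
```

In this toy model, `groupsForTwoEqualFilters(true)` yields a single group because both filters meet at the shared signature, while `groupsForTwoEqualFilters(false)` leaves two separate groups.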
cool! :D
I applied two changes. I'm waiting for CI to finish.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveTableScan.java
Hey @okumin,
I was doing some testing with the proposed changes by setting the following properties on a wider scale.
set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;
It seems that some queries still fail with java.lang.RuntimeException: equivalence mapping violation.
For instance, running cbo_query23.q with the above properties set raises an exception.
mvn test -Dtest=TestTezTPCDS30TBPerfCliDriver -Dqfile="cbo_query23.q" -Dtest.output.overwrite
Can you please look at the failures to see if this patch needs to be adapted?
Also, it would be nice to do a sanity run by setting the above properties temporarily for all qtests to see if the problem disappears completely or if there are still cases that need to be handled.
Assuming that there are still failures to be addressed, please create additional qtests.
Thanks. Predicate Push-Down and Partition Condition Remover cause the problem. I modified it on my local machine, and I will explain the details later.
Thanks for the great suggestion! I definitely will do it. Now, I have run positive LLAP tests and TPC-DS, and I found that No. 64 of TPC-DS fails. I created another ticket because, while minimizing the test query, I found the problem can happen regardless of materialization or PlanMapper. I would be glad if the community could take a look. I'm still testing the remaining qtests where CTEs are involved.
I had another look at this proposal relying on RelTreeSignature, but it seems that it is not general enough to get rid of the problem for good. Changing hive.query.planmapper.link.relnodes to false makes the new tests fail.
I was trying to come up with alternatives but don't have something viable yet.
--! qt:dataset:src

set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;
set hive.query.planmapper.link.relnodes=false;
By setting this property the test fails with the same RuntimeException: equivalence mapping violation.
I'm checking the whole execution and what the flag does. With hive.query.planmapper.link.relnodes disabled, no aux signature is created, and no merge across Operators happens. I feel this is not consistent with the description, "Whether to link Calcite nodes to runtime statistics".
I guess the direct problem will be resolved if we skip linking RelNodes with Operators when hive.query.planmapper.link.relnodes=false is configured. That sounds more consistent with the description.
I'm putting my additional notes here. I tried to sketch some to-be solutions, but they may not be very easy to implement.
https://docs.google.com/document/d/1LCST23cSBZglBzjhnCqHlpcLv6xrXRBxU_wszSDAk9w/edit?usp=sharing
I guess hive.query.planmapper.link.relnodes can be implemented like this. @kgyrtkirk might have some more context.
okumin@cdab2c1
I found this fails if we disable hive.query.planmapper.link.relnodes.
create table src (key string, value string);
set hive.cbo.fallback.strategy=NEVER;
set hive.query.planmapper.link.relnodes=false;
set hive.optimize.ppd=false;
EXPLAIN
SELECT * FROM src WHERE key = '5'
UNION ALL
SELECT * FROM src WHERE key = '5';
@zabetak As a potential workaround, I'm wondering if it makes sense to relax the validation and disable features with which PlanMapper is involved when "equivalence mapping violation" happens.
I presume PlanMapper succeeds in 99.9% cases as most qtests succeed. Currently, the remaining 0.1% fails and it sounds like an excessive penalty. We can keep the validation for qtests so that we can minimize the risk of degradation.
In my feeling, it could be possible to decrease the 0.1% to 0.05%, but could be tough to achieve 0% with a single patch.
okumin@dd5be15
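The relaxed-validation idea could look roughly like the sketch below. `MappingViolation`, `tryLink`, and the `strict` switch are hypothetical names introduced only for illustration; this is not the code in the commit above, just the shape of "fail in qtests, degrade gracefully elsewhere".

```java
public class RelaxedValidationSketch {
    // Hypothetical exception type; Hive reports this case as a RuntimeException
    // with the message "equivalence mapping violation".
    static final class MappingViolation extends RuntimeException {
        MappingViolation(String message) { super(message); }
    }

    /**
     * Runs the linking step and returns true when the mapping is usable.
     * In strict mode (e.g. qtests) the violation propagates so regressions stay visible;
     * otherwise false is returned so the caller can disable the features
     * that depend on the mapping instead of failing the whole query.
     */
    static boolean tryLink(Runnable linkStep, boolean strict) {
        try {
            linkStep.run();
            return true;
        } catch (MappingViolation e) {
            if (strict) {
                throw e;
            }
            return false;
        }
    }
}
```

With this shape, a violation in production only turns off the mapping-dependent features for that query, while test runs keep the strict check.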
I also think the current one could fail somewhere, especially if we follow minor options. Now, I have gained some additional experience and knowledge, and may come up with the to-be approach. I will work on it tomorrow (I am unavailable today because we host a tech event including a Hive 4 session in my city!) if I don't see very viable options here. P.S. checking the intention and purposes of hive.query.planmapper.link.relnodes
While running negative tests with |
What's the current status of this PR? It looks like we have some problems with the testing.
@dengzhhu653 @zabetak Thanks. I understand we need some more consensus on the expectations of HIVE-24167. We know the current PR doesn't work with some sets of Hive parameters. I have a question here: "Should HIVE-24167 let any SQL query with any set of parameters run successfully?" If the answer is yes, I recommend avoiding the current approach, "let the entire query crash", and fixing known problems incrementally in follow-up tickets, because it sounds hard to immediately find the perfect solution. If the answer is no, we may need a consensus on what queries/parameters we should support in HIVE-24167, e.g. we prioritize a set of default parameters.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
What changes were proposed in this pull request?
Link Calcite's RelNodes with the same RelTreeSignature.
https://issues.apache.org/jira/browse/HIVE-24167
Why are the changes needed?
It is known that the current unification flow could fail when CTEs are materialized. I summarized the problem in this gist.
This PR implements the option "Combine RelNodes with RelTreeSignature" in the gist.
See also the PR where I implemented another option, "Don't pull groups of Operators by signatures".
Does this PR introduce any user-facing change?
No.
Is the change a dependency upgrade?
No.
How was this patch tested?
I added a new test case to reproduce the issue and restore query14 of TPC-DS which fails on the current master.