-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-24167: TPC-DS query 14 fails while generating plan for the filter #5037
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minimal change request, and comment
ql/src/java/org/apache/hadoop/hive/ql/plan/mapper/PlanMapper.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/plan/mapper/PlanMapper.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/plan/mapper/PlanMapper.java
Outdated
Show resolved
Hide resolved
…n StatsRulesProcFactory" This reverts commit 393bc43.
666e270
to
0430662
Compare
0430662
to
714bdf8
Compare
…tatsRulesProcFactory
714bdf8
to
e8b694f
Compare
Quality Gate passedThe SonarCloud Quality Gate passed, but some issues were introduced. 1 New issue |
} | ||
|
||
if (!mayMerge) { | ||
throw new RuntimeException(String.format( | ||
"Failed to link %s and %s. This error mostly means a bug of Hive", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: An uncaught RuntimeException
almost always denotes a bug of the application. I don't think we need to make the exception that verbose stating the obvious.
Adding the operators which led to exception in the message is good idea, makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this error message pretty cloudy on what the issue is...
I think the sentecte "Equivalence mapping violation" should be there - as that's what the problem is...we have 2 groups; not allowed to put them into the same equiv group -> big trouble...
registerSignature(o1); | ||
registerSignature(o2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linking operators implicitly (using signatures) was added by @kgyrtkirk in HIVE-18926. This change reverts that logic and takes signatures out of the equation for determining equivalent groups.
My understanding is that if this change goes in operators can be linked with signatures only explicitly. I don't see test failures and .q.out changes due to this so it seems that the change does not have a big impact on existing use-cases. Moreover, it fixes some known compilation crashes so I consider this an improvement over the current situation.
I don't see a significant risk in merging this change but I will let @kgyrtkirk comment in case I am missing something.
Assuming that we do not allow implicit linking via signatures I don't know what's the point of registering the signature at this stage. My understanding is that the cache is meant to improve performance not a means to perform equivalence checks. If we don't make use of the signature here then I don't think we should compute it. @okumin although you added some comments justifying the registration I would appreciate some additional clarifications regarding these additions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change reverts that logic and takes signatures out of the equation for determining equivalent groups.
haven't read all of it - but if thats true - please don't do that...
looking at the volume of changes - it must be changing something fundamental...
what's the issue - is it an equiv violation? can you put the stacktrace somewhere?
I think that should be fixed without altering this part - it might be a missing link; or a missing signature annotation somewhere - you should not change this at all...
all these things could enable to back-map runtime statistict up to the calcite join planning phase...but I think that feature was never completed/enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to get a failing test for these changes you would need 2 ops which have conflicting signatures stored into the metastore ; loaded back and they might get applied incorrectly...
possibly pretty hard to do...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that we do not allow implicit linking via signatures I don't know what's the point of registering the signature at this stage
That's because some optimization can mutate Operators, changing their true signatures. One example I remember is TableScanPPD. I'm not 100% sure which signatures should be used in that case, but it sounds more reasonable and predictable for me to use ones before optimization than ones in the middle of optimization. I would say Operators should not be mutable for the signing purpose, but it is a too fundamental change.
I will make a reply to @kgyrtkirk in another thread as I guess those comments were written before reading my gist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed mentioning what are affected by such optimizations. Many test cases on the following revision failed with equivalence mapping violation
.
217b26d
That's because it tries to LINK a pre-optimized operator with a post-optimized operator.
- Before optimization
- Operator A has a signature SA and is grouped as GA
- Operator B has a signature SB and is grouped as GB
- AuxSignatureLinker doesn't MERGE GA and GB as A and B have different signatures. Note that non aux signatures are not cached here
- After optimization
- Operator B is optimized, and its signature becomes SA
- StatsRulePsrocFactory tries to LINK GA and GB as A and B share the same signature SA -> equivalence mapping violation
Potentially, GA and GB are actually mergeable though the current implementation doesn't do so.
WITH materialized_cte AS ( | ||
SELECT key, value FROM src WHERE key != '100' | ||
), | ||
another_materialized_cte AS ( | ||
SELECT key, value FROM src WHERE key != '100' | ||
) | ||
SELECT a.key, a.value, b.key, b.value | ||
FROM materialized_cte a | ||
JOIN another_materialized_cte b ON a.key = b.key | ||
ORDER BY a.key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is in the compilation phase so I don't think we need to actually run the query. Consider dropping the execution. If you opt to keep it then I would suggest crafting a much simpler test (without 1K lines in the output). It would be nice to keep the execution time of our test suite as low as possible. Moreover, it is not easy to see what the src
table contains so verifying that the result is indeed correct is cumbersome.
EXPLAIN CBO | ||
WITH materialized_cte AS ( | ||
SELECT key, value FROM src WHERE key != '100' | ||
), | ||
another_materialized_cte AS ( | ||
SELECT key, value FROM src WHERE key != '100' | ||
) | ||
SELECT a.key, a.value, b.key, b.value | ||
FROM materialized_cte a | ||
JOIN another_materialized_cte b ON a.key = b.key | ||
ORDER BY a.key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding (or replacing this with) a traditional EXPLAIN
where we can see the effect of hive.optimize.cte.materialize.threshold
and the full plan for the materialized ctes.
registerSignature(o1); | ||
registerSignature(o2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change reverts that logic and takes signatures out of the equation for determining equivalent groups.
haven't read all of it - but if thats true - please don't do that...
looking at the volume of changes - it must be changing something fundamental...
what's the issue - is it an equiv violation? can you put the stacktrace somewhere?
I think that should be fixed without altering this part - it might be a missing link; or a missing signature annotation somewhere - you should not change this at all...
all these things could enable to back-map runtime statistict up to the calcite join planning phase...but I think that feature was never completed/enabled
private void registerSignature(Object o) { | ||
if (o instanceof Operator) { | ||
getSignatureOf((Operator<?>) o); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the point of this method? the cache will compute it if needed...what the point of pre-registering - if that's needed its not a cache anymore!
// Caches signatures on the first access. A signature of an Operator could change as optimizers could mutate it, | ||
// keeping its semantics. The current implementation caches the signature before optimizations so that we can | ||
// link Operators with their signatures at consistent timing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove this comment and don't pre-heat caches unless it gives real benefits
the purpose of the cache is not something like this ; but to prevent O(N^2)
computation ( and also to reduce storage size by making it more serialize friendly...)
registerSignature(o1); | ||
registerSignature(o2); | ||
final EquivGroup linkedGroup1 = objectMap.get(o1); | ||
final EquivGroup linkedGroup2 = objectMap.get(o2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes will weaken this stuff - so that 2 OPs which have the same signature will be allowed -> that's not good....
its not causing much issues yet - because almost all existing operators have good signatures; but if that degrades...runtime stats will be applied incorrectly
groups.add(targetGroup); | ||
targetGroup.add(o1); | ||
targetGroup.add(o2); | ||
if (linkedGroup1 == null || linkedGroup2 == null || linkedGroup1 == linkedGroup2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's all these changes do better than the old?
} | ||
|
||
if (!mayMerge) { | ||
throw new RuntimeException(String.format( | ||
"Failed to link %s and %s. This error mostly means a bug of Hive", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this error message pretty cloudy on what the issue is...
I think the sentecte "Equivalence mapping violation" should be there - as that's what the problem is...we have 2 groups; not allowed to put them into the same equiv group -> big trouble...
registerSignature(o1); | ||
registerSignature(o2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to get a failing test for these changes you would need 2 ops which have conflicting signatures stored into the metastore ; loaded back and they might get applied incorrectly...
possibly pretty hard to do...
just seen the details you linked to the PR. I believe the |
I guess this mentions the link on creating a FilterOperator?
I guess the SWO stands for SharedWorkOptimizer. With CTE materialization, each CTE is compiled separately. It means Just for a note. We use CTE because we need to support some cases which SharedWorkOptimizer doesn't support. An example is complex CTE + UNION ALL. We may replace CTE materialization with SharedWorkOptimizer when it becomes more mature. |
got it now - I was thinking its the other way around I was thinking about this a bit in the last couple days; here is a summary of my thoughts:
from the listed approaches in the doc:
note: you could reachable on the ASF slack as well - we can talk on |
@kgyrtkirk Could you please take a look and see if it seems feasible? If we prefer #5077 , I will void this PR and try to get it merged. |
Let me close this one because we will likely go with #5077 or another approach. |
What changes were proposed in this pull request?
Make it possible for PlanMapper to link related Operators or RelNodes correctly.
For your reference, I made a Gist page explaining the behaviors of PlanMapper and alternative approaches in my mind.
https://gist.github.com/okumin/b111fe0a911507bdf6a7204f49b9cb72
See also the PR where I implemented another option, "Combine RelNodes with RelTreeSignature".
Why are the changes needed?
No. 14 of TPC-DS or other complex queries with CTE materialized fail due to this bug.
https://issues.apache.org/jira/browse/HIVE-24167
Does this PR introduce any user-facing change?
No. This is a bug fix.
Is the change a dependency upgrade?
No.
How was this patch tested?
I added a new test case, which fails with the current master branch, and restored No. 14 of TPC-DS.