Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-24167: TPC-DS query 14 fails while generating plan for the filter #5077

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

okumin
Copy link
Contributor

@okumin okumin commented Feb 9, 2024

What changes were proposed in this pull request?

Link Calcite's RelNodes with the same RelTreeSignature.
https://issues.apache.org/jira/browse/HIVE-24167

Why are the changes needed?

It is known that the current unification flow could fail when CTEs are materialized. I summarized the problem in this gist.

This PR implements the option "Combine RelNodes with RelTreeSignature" in the gist.

See also the PR where I implemented another option, "Don't pull groups of Operators by signatures".

Does this PR introduce any user-facing change?

No.

Is the change a dependency upgrade?

No.

How was this patch tested?

I added a new test case to reproduce the issue and restore query14 of TPC-DS which fails on the current master.

Copy link

sonarcloud bot commented Feb 11, 2024

Quality Gate Passed Quality Gate passed

Issues
5 New issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

return super.explainTerms(pw)
.item("tableScanTrait", this.tableScanTrait)
.itemIf("fromVersion", ((RelOptHiveTable) table).getHiveTableMD().getVersionIntervalFrom(),
isNotBlank(((RelOptHiveTable) table).getHiveTableMD().getVersionIntervalFrom()));
Copy link
Contributor Author

@okumin okumin Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some attributes that could differentiate two HiveTableScans. We may not need these ones here.

There is one potential problem in my mind. HiveTableScan doesn't retain the equivalents of TableScanDesc#getPredicateString or TableScanDesc#getRowLimit. So, we can't unify ASTNodes or RelNodes of HiveTableScan based on signatures. Otherwise, Operators will be over-unified later.
Currently, we link only HiveFilter RelNodes using RelTreeSignatures(we link HiveTableScan with its ASTNode, but don't link it with the signature). So, the existence of TableScanDesc#getPredicateString doesn't matter to us.
Also, CBO is disabled when TABLESAMPLE is used. So, we can say TableScanDesc#getRowLimit doesn't cause an issue in the world of RelNodes. I expect we could push down the row count to HiveTableScan when we support TABLESAMPLE in CBO.
So, in my current understanding, this PR doesn't cause an immediate problem.

@okumin okumin changed the title HIVE-24167: TPC-DS query 14 fails while generating plan for the filter Don't review Feb 11, 2024
planMapper.link(cond, RelTreeSignature.of(where));
planMapper.link(cond, where);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We link RelTreeSignature first so that we can safely unify multiple filters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! :D

@okumin okumin changed the title Don't review HIVE-24167: TPC-DS query 14 fails while generating plan for the filter Feb 11, 2024
@zhangbutao zhangbutao marked this pull request as ready for review February 15, 2024 00:09
@zhangbutao zhangbutao marked this pull request as draft February 15, 2024 00:13
@okumin
Copy link
Contributor Author

okumin commented Apr 8, 2024

I applied two changes. I'm waiting for CI to finish.

  1. Rebased this branch based on the latest master
  2. Added b11e3f8

Copy link
Contributor

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @okumin,

I was doing some testing with the proposed changes by setting the following properties in a wider scale.

set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;

It seems that some queries still fail with java.lang.RuntimeException: equivalence mapping violation.

For instance running cbo_query23.q with the above properties set raises an exception.

mvn test -Dtest=TestTezTPCDS30TBPerfCliDriver -Dqfile="cbo_query23.q" -Dtest.output.overwrite

Can you please look at the failures to see if this patch needs to be adapted.

Also it would be nice to do a sanity run by setting the above properties temporarily for all qtests to see if the problem disappears completely or if there are still cases that need to be handled.

Assuming that there are still failures to be address please create additional qtests.

Copy link

sonarcloud bot commented Apr 19, 2024

Quality Gate Passed Quality Gate passed

Issues
5 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@okumin
Copy link
Contributor Author

okumin commented Apr 22, 2024

@zabetak

For instance running cbo_query23.q with the above properties set raises an exception.

Thanks. Predicate Push-Down and Partition Condition Remover cause the problem. I modified it on my local machine, and I will explain the details later.

Also it would be nice to do a sanity run by setting the above properties temporarily for all qtests to see if the problem disappears completely or if there are still cases that need to be handled.

Thanks for the great suggestion! I definitely will do it. Now, I have run positive LLAP tests and TPC-DS, and I found that No. 64 of TPD-DS fails. I created another ticket because I found the problem can happen regardless of materialization or PlamMapper while I was minimizing the test query. I would be glad if the community could take a look.
#5207

I'm still testing the remaining qtests where CTEs are involved.

Copy link
Contributor

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had another look in this proposal relying on RelTreeSignature but it seems that it is not general enough to get rid of the problem for good. Changing the hive.query.planmapper.link.relnodes to false makes the new tests fail.

I was trying to come up with alternatives but don't have something viable yet.

--! qt:dataset:src

set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set hive.query.planmapper.link.relnodes=false;

By setting this property the test fails with the same RuntimeException: equivalence mapping violation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking the whole execution and what the flag does. Disabling hive.query.planmapper.link.relnodes, no aux signature is created, and no merge across Operators doesn't happen. I feel it is not consistent with the description, Whether to link Calcite nodes to runtime statistics.
I guess the direct problem will be resolved if we skip linking RelNodes with Operators when hive.query.planmapper.link.relnodes=false is configured. That sounds more consistent with the description.

I'm putting my additional notes here. I tried to put some to-be solutions but it could not be very easy.
https://docs.google.com/document/d/1LCST23cSBZglBzjhnCqHlpcLv6xrXRBxU_wszSDAk9w/edit?usp=sharing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess hive.query.planmapper.link.relnodes can be implemented like this. @kgyrtkirk might have some more contexts.
okumin@cdab2c1

Copy link
Contributor Author

@okumin okumin May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this fails if we disable hive.query.planmapper.link.relnodes.

create table src (key string, value string);

set hive.cbo.fallback.strategy=NEVER;
set hive.query.planmapper.link.relnodes=false;
set hive.optimize.ppd=false;

EXPLAIN
SELECT * FROM src WHERE key = '5'
UNION ALL
SELECT * FROM src WHERE key = '5';

@zabetak As a potential workaround, I'm wondering if it makes sense to relax the validation and disable features with which PlanMapper is involved when "equivalence mapping violation" happens.
I presume PlanMapper succeeds in 99.9% cases as most qtests succeed. Currently, the remaining 0.1% fails and it sounds like an excessive penalty. We can keep the validation for qtests so that we can minimize the risk of degradation.
In my feeling, it could be possible to decrease the 0.1% to 0.05%, but could be tough to achieve 0% with a single patch.
okumin@dd5be15

@okumin
Copy link
Contributor Author

okumin commented Apr 24, 2024

I also think the current one could fail somewhere especially if we follow minor options. Now, I have achieved some additional experience and knowledge, and may come up with the to-be approach. I will work on it tomorrow(I am unavailable today because we host a tech event including a Hive 4 session in my city!) if I don't see very viable options here.

P.S. checking the intention and purposes of hive.query.planmapper.link.relnodes

@okumin
Copy link
Contributor Author

okumin commented May 3, 2024

While running negative tests with hive.optimize.cte.materialize.threshold=1 and hive.optimize.cte.materialize.full.aggregate.only=false, I found another error. I created another ticket because it is also unrelated to the PlanMapper.
#5232

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants