Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-23939: SharedWorkOptimizer: take the union of columns in mergeable TableScans #1324

Merged
merged 3 commits into from Aug 2, 2020

Conversation

kasakrisz
Copy link
Contributor

@kasakrisz kasakrisz commented Jul 27, 2020

Testing done:

mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestTezPerfCliDriver -Dqfile=query9.q -pl itests/qtest -Pitests
mvn test -Dtest.output.overwrite -DskipSparkTests -Dtest=TestTezPerfConstraintsCliDriver -Dqfile=query9.q -pl itests/qtest -Pitests

@HunterL
Copy link
Contributor

HunterL commented Jul 29, 2020

So looking through the Q tests (specifically auto_join_reordering_values.q.out) we have
explain select s.s_store_sk from store_n0 s join store_sales_n0 ss on (s.s_store_sk = ss.ss_store_sk) join store_n0 s1 on (s1.s_store_sk = ss.ss_store_sk) where s.s_floor_space > 1000

In the explain plan this produces
filterExpr: (((s_floor_space > 1000) and s_store_sk is not null) or s_store_sk is not null)
Logically, this filter expression can be (will be?) reduced down to s_store_sk is not null
e.g. ([P && Q] || Q) => Q
This isn't correct since its going to ignore the where s.s_floor_space > 1000

It seems to me that the correct behavior would be to either...

  1. Continue to combine them naively but with an and instead of an or and allow something downstream to optimize it out.
    e.g.
    ([P && Q] && Q) which will get reduced down to (P && Q)
    Thus the filter expression produced would be (((s_floor_space > 1000) and s_store_sk is not null) and s_store_sk is not null)

  2. Optimize the statement here so that when they are combined you just get (P && Q)
    Thus the filter expression would be ((s_floor_space > 1000) and s_store_sk is not null)

There are similar issues in other Q tests but I chose this one for simplicity, if this isn't clear I'd be happy to clarify more.

@kasakrisz
Copy link
Contributor Author

@HunterL
Thanks for reviewing this patch.
The expression

filterExpr: (((s_floor_space > 1000) and s_store_sk is not null) or s_store_sk is not null)

is a result of merging two TableScanOperators. Both of them are scanning the same table: alias: s but they had different filterExpr.
TS1: ((s_floor_space > 1000) and s_store_sk is not null)
TS2: s_store_sk is not null

SharedWorkOptimizer naively combined the filter expressions using or because we need the union of the records produced by both TS. You are right in this particular case the filter expression could be reduced to s_store_sk is not null

The new TS has two children

TableScan
  alias: s
  filterExpr: (((s_floor_space > 1000) and s_store_sk is not null) or s_store_sk is not null) (type: boolean)
  Filter Operator
    predicate: ((s_floor_space > 1000) and s_store_sk is not null) (type: boolean)
    ...
  Filter Operator
    predicate: s_store_sk is not null (type: boolean)
    ...

both of them are Filter operators which are the root of subtrees to broadcast the proper subset of records to each reducer edge (Reducer 2 and Reducer 3)

If and were used for combining the filter expressions of TS operators the branch which does not have the filter s_floor_space > 1000 would loose a subset of records.

@jcamachor jcamachor merged commit 933c023 into apache:master Aug 2, 2020
@kasakrisz kasakrisz deleted the HIVE-23939-master-swo-union-cols branch May 4, 2021 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants