[CALCITE-3367] Add AntiJoinRule to convert project-filter-join-aggreg… #1469

Closed
wants to merge 2 commits into from

jinxing64
Contributor

@jinxing64 jinxing64 commented Sep 22, 2019

…ate into anti-join

This PR proposes to add AntiJoinRule to convert project-filter-join-aggregate into anti-join.

The idea is from SemiJoinRule.

I found this issue while resolving CALCITE-3363 in #1466:
I could not construct an anti-join operator from a SQL string.
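
For readers following along, here is a minimal sketch of the kind of rewrite being proposed. It runs on SQLite through Python's `sqlite3` module, not on Calcite. The exact operator shape is an assumption read off the PR title: a Project over a Filter (assumed here to be an `IS NULL` test) over a Join (assumed to be a LEFT join) over an Aggregate. The `emp`/`dept` rows are made-up data, and `NOT EXISTS` stands in for the anti-join operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, deptno INTEGER);
    CREATE TABLE dept (deptno INTEGER);
    INSERT INTO emp VALUES ('alice', 10), ('bob', 20), ('carol', 30);
    INSERT INTO dept VALUES (10), (10), (30);
""")

# Assumed source pattern: Project <- Filter(IS NULL) <- LEFT Join <- Aggregate.
pattern_sql = """
    SELECT e.name
    FROM emp e
    LEFT JOIN (SELECT deptno FROM dept GROUP BY deptno) d
      ON e.deptno = d.deptno
    WHERE d.deptno IS NULL
    ORDER BY e.name
"""

# Target: an anti-join, written here as NOT EXISTS.
anti_sql = """
    SELECT e.name
    FROM emp e
    WHERE NOT EXISTS (SELECT 1 FROM dept d WHERE d.deptno = e.deptno)
    ORDER BY e.name
"""

pattern_rows = conn.execute(pattern_sql).fetchall()
anti_rows = conn.execute(anti_sql).fetchall()
print(pattern_rows)  # employees whose deptno has no match in dept
print(anti_rows)
```

On this data both queries return only the employee whose `deptno` is missing from `dept`, which is what makes the rewrite valid.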

@hsyuan
Member

hsyuan commented Sep 24, 2019

I don't like the idea of SemiJoinRule, which is counterintuitive. An IN or existential subquery in a boolean context should be decorrelated directly into a SEMI/ANTI SEMI join, rather than being transformed into a Join/Outer Join first and then into a SEMI/ANTI join.

@danny0405
Contributor

> I don't like the idea of SemiJoinRule, which is counter intuitive. A boolean context IN or existential subquery should be decorrelated directly into SEMI/ANTI SEMI join, rather than transforming to Join/Outer Join, then to SEMI/ANTI join.

I'm +1 on this idea; we may need some refactoring of RelDecorrelator.

@julianhyde
Contributor

We should definitely keep SemiJoinRule around. People may still write queries that do semi-joins without using IN or EXISTS.

I agree that we could have a direct path from IN/EXISTS to semi- or anti-semi Join. As long as it doesn't add too much code. (If it adds code, that code will probably be wrong at first, because IN/EXISTS are very complicated, and will definitely need maintaining.)
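
One well-known source of that complexity is NULL handling: `NOT IN` and `NOT EXISTS` disagree as soon as the subquery produces a NULL, so `NOT IN` cannot be mapped to a plain anti-join without extra care. The sketch below shows this on SQLite via Python's `sqlite3`, not on Calcite; the tables and rows are made-up data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, deptno INTEGER);
    CREATE TABLE dept (deptno INTEGER);
    INSERT INTO emp VALUES ('alice', 10), ('bob', 20);
    INSERT INTO dept VALUES (10), (NULL);
""")

# NOT IN: for bob, "20 NOT IN (10, NULL)" evaluates to UNKNOWN,
# so the row is filtered out.
not_in_rows = conn.execute("""
    SELECT name FROM emp
    WHERE deptno NOT IN (SELECT deptno FROM dept)
    ORDER BY name
""").fetchall()

# NOT EXISTS (anti-join semantics): the NULL in dept never matches,
# so bob is kept.
not_exists_rows = conn.execute("""
    SELECT e.name FROM emp e
    WHERE NOT EXISTS (SELECT 1 FROM dept d WHERE d.deptno = e.deptno)
    ORDER BY e.name
""").fetchall()

print(not_in_rows)
print(not_exists_rows)
```

A direct IN/EXISTS-to-join path would need to preserve this difference, for example by checking nullability before choosing an anti-join.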

@hsyuan
Member

hsyuan commented Sep 25, 2019

I agree that people may still write queries with a direct SEMI/ANTI join, but what SemiJoinRule does is transform a join on top of an aggregate into a semi-join. In many cases, that pattern is generated by decorrelating an IN/EXISTS subquery (in a boolean context).

@hsyuan
Member

hsyuan commented Sep 25, 2019

Julian, the rule you are talking about is one that transforms a SEMI/ANTI join into a join on top of an aggregate, which is indeed needed if we want to do join reordering (in case the inner rel is larger). However, I don't think we have that rule.

@julianhyde
Contributor

I am talking about the current rule, which transforms Join(Scan(emp), Aggregate(Scan(dept))) into Join(SEMI, Scan(emp), Scan(dept)).

Ideally it would also match Join(Scan(emp), Scan(dept)) (i.e. without the Aggregate) provided that there was a uniqueness constraint.

The other direction is also valid, but as you say, we don't have that rule.
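
A minimal sketch of the equivalence behind the current rule, run on SQLite through Python's `sqlite3` rather than on Calcite. The `emp`/`dept` rows are made-up data, the `GROUP BY` subquery plays the role of `Aggregate(Scan(dept))`, and `EXISTS` stands in for `Join(SEMI, ...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, deptno INTEGER);
    CREATE TABLE dept (deptno INTEGER);
    INSERT INTO emp VALUES ('alice', 10), ('bob', 20), ('carol', 30);
    INSERT INTO dept VALUES (10), (10), (30);
""")

# Join(Scan(emp), Aggregate(Scan(dept))), keeping only emp columns.
join_agg_rows = conn.execute("""
    SELECT e.name
    FROM emp e
    JOIN (SELECT deptno FROM dept GROUP BY deptno) d
      ON e.deptno = d.deptno
    ORDER BY e.name
""").fetchall()

# Join(SEMI, Scan(emp), Scan(dept)), written as EXISTS.
semi_rows = conn.execute("""
    SELECT e.name
    FROM emp e
    WHERE EXISTS (SELECT 1 FROM dept d WHERE d.deptno = e.deptno)
    ORDER BY e.name
""").fetchall()

print(join_agg_rows)
print(semi_rows)
```

Both forms return each matching employee once, even though `dept` contains a duplicate `deptno`.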

@hsyuan
Member

hsyuan commented Sep 25, 2019

> Ideally it would also match Join(Scan(emp), Scan(dept)) (i.e. without the Aggregate) provided that there was a uniqueness constraint.

If there is a uniqueness constraint, what is the value of transforming into a semijoin? I don't think there is any benefit.

@julianhyde
Contributor

Semi-join may have a lower cost implementation than a regular Join. It’s worth knowing that you have one.

Also, because it projects fewer columns, it may match materialized views that an equivalent ordinary join would not.

Lastly, if your plan is using a semi-join operator you can safely weaken its input constraint from must-be-unique to should-be-unique.
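
To illustrate the must-be-unique versus should-be-unique point, the sketch below compares an ordinary join with a semi-join when the supposedly unique `dept.deptno` contains one duplicate. It uses SQLite via Python's `sqlite3`, not Calcite; the data is made up and `EXISTS` stands in for the semi-join operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, deptno INTEGER);
    CREATE TABLE dept (deptno INTEGER);  -- no uniqueness constraint enforced
    INSERT INTO emp VALUES ('alice', 10), ('bob', 20), ('carol', 30);
    INSERT INTO dept VALUES (10), (10), (30);  -- 10 appears twice
""")

# Ordinary join without an Aggregate: correct only if dept.deptno is unique.
inner_rows = conn.execute("""
    SELECT e.name
    FROM emp e
    JOIN dept d ON e.deptno = d.deptno
    ORDER BY e.name
""").fetchall()

# Semi-join: each emp row is emitted at most once, duplicates or not.
semi_rows = conn.execute("""
    SELECT e.name
    FROM emp e
    WHERE EXISTS (SELECT 1 FROM dept d WHERE d.deptno = e.deptno)
    ORDER BY e.name
""").fetchall()

print(inner_rows)
print(semi_rows)
```

The ordinary join repeats `alice` once per duplicate, while the semi-join result is unaffected, so a plan that uses a semi-join tolerates an input that is only approximately unique.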

@hsyuan
Member

hsyuan commented Sep 25, 2019

> Also, because it projects fewer columns, it may match materialized views that an equivalent ordinary join would not.

That is indeed true: an inner/outer join in Calcite must output all of its input columns, not just some of them.

@jinxing64
Contributor Author

jinxing64 commented Sep 25, 2019

@julianhyde @hsyuan @danny0405
Thanks a lot for the discussion; it helped my understanding a great deal.

> A boolean context IN or existential subquery should be decorrelated directly into SEMI/ANTI SEMI join.

I think this idea makes a lot of sense. It is more straightforward to transform IN/EXISTS into a SEMI/ANTI join during decorrelation.

But SemiJoinRule also provides a way to optimize Join(Scan(emp), Aggregate(Scan(dept))) into a simple semi-join, which performs no aggregation and only needs to output the columns of the left-hand side. Consider the scenario where dept is almost unique but not exactly unique: removing the Aggregate brings some benefit.

@jinxing64
Contributor Author

My question is: if SemiJoinRule deserves to be kept, what about the idea in this PR?
If AntiJoinRule doesn't make much sense, I'm OK with holding or closing it.
If there is interest in it, I can keep refining it, and you can comment when you have time.
Thanks a lot!

@hsyuan
Member

hsyuan commented Sep 25, 2019

The AntiJoinRule does make sense and it does provide an optimization. However, apart from plans generated from NOT IN/NOT EXISTS subqueries, I don't know how many people or BI tools actually write queries with that pattern.

@jinxing64
Contributor Author

Thanks @hsyuan
I understand and agree with your concern. The matching pattern in AntiJoinRule is strict and limited.
I will close this PR.

> A boolean context IN or existential subquery should be decorrelated directly into SEMI/ANTI SEMI join

I filed a JIRA issue proposing a direct path from IN/EXISTS to semi-join or anti-join during decorrelation. Please take a look when you have time.

@jinxing64 jinxing64 closed this Sep 26, 2019
4 participants