Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-4232: Support for EXCEPT and INTERSECT set operator #2599

Merged
merged 5 commits into from
Feb 12, 2023

Conversation

Leon-WTF
Copy link
Contributor

@Leon-WTF Leon-WTF commented Jul 17, 2022

DRILL-4232: Support for EXCEPT and INTERSECT set operator

Description

Can have hash set operator and sorted set operator, only implement hash version in this PR.
Compute number of left-input duplicates(numLeft, probe side) and number of right-input duplicates(numRight, build side) for each same tuple:
INTERSECT: if numRight > 0 and numLeft > 0, output one tuple
INTERSECT ALL: if numRight > 0 and numLeft > 0, output min(numLeft,numRight) tuples
EXCEPT: if numRight = 0 and numLeft > 0, output one tuple
EXCEPT ALL: if numLeft>=numRight, output numLeft - numRight tuples

Documentation

TODO

Testing

See TestSetOp

@Leon-WTF Leon-WTF marked this pull request as draft July 17, 2022 12:01
@cgivre
Copy link
Contributor

cgivre commented Jul 20, 2022

@Leon-WTF You might want to wait until #2602 is merged before you continue this.

@Leon-WTF
Copy link
Contributor Author

Leon-WTF commented Jul 24, 2022

@Leon-WTF You might want to wait until #2602 is merged before you continue this.

@cgivre Thanks for the info, #2602 is very needed, I will focus on implementing the operator firstly, and adapt to the new Calcite when it's merged.

@Leon-WTF Leon-WTF requested a review from cgivre July 24, 2022 07:55
@Leon-WTF Leon-WTF force-pushed the set_operation branch 2 times, most recently from 3a94b32 to 4966db9 Compare August 27, 2022 09:18
@cgivre
Copy link
Contributor

cgivre commented Aug 28, 2022

@Leon-WTF Is this ready for review?

@Leon-WTF
Copy link
Contributor Author

Leon-WTF commented Aug 28, 2022

@Leon-WTF Is this ready for review?

@cgivre Not yet, I'm handling the EXCEPT case, it needs to remove the duplicate records for probe side(for other three cases, it's like semi-join, I just added the records num to the hash map), I'm trying to add an Agg phase after setop phase. Any suggestion on this?

@cgivre
Copy link
Contributor

cgivre commented Sep 22, 2022

Hi @Leon-WTF how is this coming?

@cgivre cgivre added enhancement PRs that add a new functionality to Drill doc-impacting PRs that affect the documentation labels Oct 2, 2022
@cgivre
Copy link
Contributor

cgivre commented Oct 2, 2022

@Leon-WTF I don't know if you saw this or not, but #2602 has been merged. We are actually running on Calcite 1.32 now!

@Leon-WTF
Copy link
Contributor Author

@Leon-WTF I don't know if you saw this or not, but #2602 has been merged. We are actually running on Calcite 1.32 now!

@cgivre Sorry for the late response. Yes, I have merged that in. I'm almost done and I'm adding ut.

@cgivre
Copy link
Contributor

cgivre commented Oct 16, 2022

HI @Leon-WTF
I saw that you added unit tests. Is this PR ready for review? If not, what is remaining?

@Leon-WTF
Copy link
Contributor Author

HI @Leon-WTF I saw that you added unit tests. Is this PR ready for review? If not, what is remaining?

@cgivre I think it's just still need more ut to test more completely and fix bugs while adding ut. By the way, do we have any plan to release drill 2.0? I will try to catch that plan.

@cgivre
Copy link
Contributor

cgivre commented Oct 24, 2022

HI @Leon-WTF I saw that you added unit tests. Is this PR ready for review? If not, what is remaining?

@cgivre I think it's just still need more ut to test more completely and fix bugs while adding ut. By the way, do we have any plan to release drill 2.0? I will try to catch that plan.

We are getting ready to release Drill 1.20.3. Once that's done, I'd like to start discussions around Drill 2.0. There have been a lot of major work in Drill and I'd like to see that getting used.

Copy link
Member

@vvysotskyi vvysotskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Leon-WTF, thanks for adding this functionality! I have added several code review comments to address before we can merge it.

Leon-WTF pushed a commit to Leon-WTF/drill that referenced this pull request Jan 15, 2023
@Leon-WTF Leon-WTF changed the title DRILL-4232 Support for EXCEPT and INTERSECT set operator DRILL-4232: Support for EXCEPT and INTERSECT set operator Jan 15, 2023
@Leon-WTF Leon-WTF force-pushed the set_operation branch 4 times, most recently from 49a9307 to 7317fad Compare January 29, 2023 04:21
@Leon-WTF
Copy link
Contributor Author

Leon-WTF commented Feb 3, 2023

@vvysotskyi Hi, any more comments on this PR?

Copy link
Member

@vvysotskyi vvysotskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring and addressing code review comments. I have found several places to improve, and it will be ready to go.

ImmutableBitSet.range(0, drillExceptRel.getInput(0).getRowType().getFieldList().size()), ImmutableList.of(), ImmutableList.of());
call.transformTo(drillExceptRel.copy(ImmutableList.of(aggNode, drillExceptRel.getInput(1)), true));
} else {
call.transformTo(new DrillAggregateRel(drillExceptRel.getCluster(), drillExceptRel.getTraitSet(), drillExceptRel.copy(true),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add aggregate on top of except?

Copy link
Contributor Author

@Leon-WTF Leon-WTF Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvysotskyi It's for performance, if the data cardinality is high, aggregate before except may not reduce many data, if the data after except left are few, aggregate after except will only handle few data which is faster than before except. This may be choosen by statistics info + CBO automaticlly in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But output values should be already distinct after the execution of except operator, so the aggregate will do nothing.

Copy link
Contributor Author

@Leon-WTF Leon-WTF Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not distinct for left table after except operator. I choosed to reuse an aggregate operator to do the distinct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ok, if it is specific to our implementation of except operator, aggregation added here possibly could be removed by other Calcite rules which assume that results would be already distinct.

I think it would be better to add an aggregation when converting it to physical rel nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yes, I will refactor that.

Copy link
Contributor Author

@Leon-WTF Leon-WTF Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvysotskyi One more question about moving converting rule to physical phase, I need to add physical agg rel node, so needs to add both hash(distribute by all keys/single key) and stream agg, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drill doesn't use streaming aggregate for distinct calls, so only hash agg should be enough.

Copy link
Contributor Author

@Leon-WTF Leon-WTF Feb 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drill doesn't use streaming aggregate for distinct calls, so only hash agg should be enough.

@vvysotskyi I see it checks aggregate.containsDistinctCall() in StreamAggPrule, but It will generate steam agg for sql like "select a,b,c from foo group by a,b,c".

@cgivre
Copy link
Contributor

cgivre commented Feb 9, 2023

Hey @Leon-WTF Any chance you could address @vvysotskyi 's comments soon. This is one of the last PRs slated to be merged for the next release.

@Leon-WTF
Copy link
Contributor Author

Hey @Leon-WTF Any chance you could address @vvysotskyi 's comments soon. This is one of the last PRs slated to be merged for the next release.

I will do it by this weekend, is that ok?

@cgivre
Copy link
Contributor

cgivre commented Feb 10, 2023

@Leon-WTF This weekend would be great! Very excited to get this merged.

Copy link
Member

@vvysotskyi vvysotskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, +1

@cgivre
Copy link
Contributor

cgivre commented Feb 12, 2023

@Leon-WTF Thanks for this. @vvysotskyi Thanks for the review. Merging now!

@cgivre cgivre merged commit c38091f into apache:master Feb 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-impacting PRs that affect the documentation enhancement PRs that add a new functionality to Drill
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants