ARROW-13268 [C++][Compute] Add ExecNode for semi and anti-semi join #10845
Yes, thanks @michalursa. I missed this!
I see... The
  if (input_counter_.Cancel()) {
    finished_.MarkFinished();
  } else if (output_counter_.Cancel()) {
    finished_.MarkFinished();
  }
and I was wondering why both cases had the same code path. I thought they could be combined into a single statement. So, do you mean to say that
This was my thought process.
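To make the question about the two identical branches concrete, here is a minimal sketch. `ToyCounter` and `ToyNode` are hypothetical stand-ins written for this illustration (they are not Arrow's classes), and the sketch assumes `Cancel()` returns true only for the first caller. Under that assumption, the `else if` chain and a single `||` condition behave the same way, because `||` short-circuits and skips the second `Cancel()` whenever the first one succeeds.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a cancellable counter (not Arrow's implementation).
class ToyCounter {
 public:
  // Assumption: returns true only for the first call that cancels the counter.
  bool Cancel() { return !cancelled_.exchange(true); }
  bool cancelled() const { return cancelled_.load(); }

 private:
  std::atomic<bool> cancelled_{false};
};

// Hypothetical node holding the two counters named in the comment above.
struct ToyNode {
  ToyCounter input_counter_;
  ToyCounter output_counter_;
  bool finished_ = false;  // stands in for finished_.MarkFinished()

  // Shape quoted in the comment: two branches with identical bodies.
  void StopBranching() {
    if (input_counter_.Cancel()) {
      finished_ = true;
    } else if (output_counter_.Cancel()) {
      finished_ = true;
    }
  }

  // Combined shape: output_counter_.Cancel() runs only if the first call
  // returned false, exactly as in the else-if version.
  void StopCombined() {
    if (input_counter_.Cancel() || output_counter_.Cancel()) {
      finished_ = true;
    }
  }
};
```

In both variants the output counter is left untouched when cancelling the input counter succeeds, so the choice between them is about readability rather than behavior, at least for this toy model.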
Sure!
Sure!
I will think about this one! :-)
I'll make this void!
Sure!
Yes, it is a copy from the GroupBy impl.
I see... but it looks like Pandas treats null/NaN/NA as a valid key, and users who want to exclude those values have to drop them explicitly.
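The two null-key policies being contrasted can be sketched as follows. This is only an illustration under stated assumptions: `Key`, `FilterLeftRows`, `SemiJoin`, `AntiSemiJoin`, and the `nulls_match` flag are hypothetical names invented here, not part of this PR or of Arrow. With `nulls_match = true`, a null key on the left matches a null key on the right (the behavior the comment attributes to Pandas); with `false`, null keys never match.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical key type: std::nullopt models a null key.
using Key = std::optional<int>;

// Returns indices of left rows whose "has a match on the right" status
// equals keep_matched. Illustrative helper, not Arrow's implementation.
std::vector<std::size_t> FilterLeftRows(const std::vector<Key>& left,
                                        const std::vector<Key>& right,
                                        bool nulls_match, bool keep_matched) {
  // Build side: collect the distinct non-null keys and note whether a null exists.
  std::unordered_set<int> right_values;
  bool right_has_null = false;
  for (const Key& k : right) {
    if (k.has_value()) {
      right_values.insert(*k);
    } else {
      right_has_null = true;
    }
  }
  // Probe side: each left row is emitted at most once.
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < left.size(); ++i) {
    const Key& k = left[i];
    const bool matched = k.has_value() ? right_values.count(*k) > 0
                                       : (nulls_match && right_has_null);
    if (matched == keep_matched) out.push_back(i);
  }
  return out;
}

// Semi join: left rows with at least one match on the right.
std::vector<std::size_t> SemiJoin(const std::vector<Key>& left,
                                  const std::vector<Key>& right, bool nulls_match) {
  return FilterLeftRows(left, right, nulls_match, /*keep_matched=*/true);
}

// Anti-semi join: left rows with no match on the right.
std::vector<std::size_t> AntiSemiJoin(const std::vector<Key>& left,
                                      const std::vector<Key>& right, bool nulls_match) {
  return FilterLeftRows(left, right, nulls_match, /*keep_matched=*/false);
}
```

Exposing the policy as an explicit flag keeps the choice visible to callers instead of baking one convention into the join.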
0065e8b
to
c2cf1d4
Compare
034f5a7
to
674eb70
Compare
This reverts commit 0a3bcbf.
# Conflicts:
#   cpp/src/arrow/CMakeLists.txt
#   cpp/src/arrow/compute/exec/exec_plan.cc
#   cpp/src/arrow/compute/exec/exec_plan.h
#   cpp/src/arrow/compute/exec/plan_test.cc
@lidavidm I added a simple verification to the tests and added the changes discussed.
Thanks for this! Just one question about the input parameter validation.
@@ -39,7 +39,7 @@ Status ValidateJoinInputs(const std::shared_ptr<Schema>& left_schema,
                           const std::shared_ptr<Schema>& right_schema,
                           const std::vector<int>& left_keys,
                           const std::vector<int>& right_keys) {
-  if (left_keys.size() != right_keys.size()) {
+  if (left_keys.size() != right_keys.size() && left_keys.size() > 0) {
Is it valid to join with no keys?
AFAIK I don't think it's valid. We'd need some indexer if no key columns are specified
Sorry I replied by email and it seems to have gotten messed up. Then maybe this should be
  if (left_keys.size() == 0) { return Status::Invalid(...); }
  if (left_keys.size() != right_keys.size()) { ... }
?
Sorry I missed this. I added the changes now
Ah ok, I wasn't sure how to read the check here. Maybe instead of > 0, there should be a separate check for keys.size() == 0 that returns an error?
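As a rough sketch of the suggested validation order: the code below uses a hypothetical `ToyStatus` type and a hypothetical `ValidateKeyCounts` function so that it compiles on its own; Arrow's real `Status` class and the PR's `ValidateJoinInputs` are richer. It rejects an empty key list first and only then compares the two sizes. Note that the combined condition `left_keys.size() != right_keys.size() && left_keys.size() > 0` never fires when the left key list is empty, so an empty left list with a non-empty right list would pass that particular check.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal status type, for illustration only.
struct ToyStatus {
  bool ok;
  std::string message;

  static ToyStatus OK() { return {true, ""}; }
  static ToyStatus Invalid(std::string msg) { return {false, std::move(msg)}; }
};

// Hypothetical helper that checks only the key counts, in the order
// proposed in the review comment.
ToyStatus ValidateKeyCounts(const std::vector<int>& left_keys,
                            const std::vector<int>& right_keys) {
  // A join with no key columns is rejected up front.
  if (left_keys.empty()) {
    return ToyStatus::Invalid("at least one join key is required");
  }
  // Only then is a size mismatch between the two sides reported.
  if (left_keys.size() != right_keys.size()) {
    return ToyStatus::Invalid("left and right key counts differ");
  }
  return ToyStatus::OK();
}
```

Splitting the checks also lets each failure carry its own error message, which a single combined condition cannot do.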
Thanks for this @nirandaperera. @westonpace and @michalursa any other comments?
A few nits
cpp/src/arrow/compute/exec/util.h
@@ -19,6 +19,8 @@

 #include <atomic>
 #include <cstdint>
+#include <thread>
Hmm, I hate to add this in late but I didn't notice it earlier. So far we have managed to keep <thread> out of the public API surface. Is there any way you can push this into util.cc? This utility (arrow/cpp/src/arrow/util/io_util.h, line 346 in 4591d76):
  uint64_t GetThreadId();
sure. that can be done :-)
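One way to keep `<thread>` out of a header is to declare a plain function that returns an integer id and include `<thread>` only in the implementation file. The sketch below puts both halves in one listing for brevity; the `toy` namespace and `CurrentThreadId` are hypothetical names, and hashing `std::thread::id` is just one possible way to obtain an integer (it is an assumption of this sketch, not a description of how Arrow's `GetThreadId` works).

```cpp
#include <cassert>

// ---- what would live in the header (hypothetical util.h) ----
// Only <cstdint> is needed here; <thread> does not leak to includers.
#include <cstdint>

namespace toy {
// Returns an integer that is stable for the calling thread.
std::uint64_t CurrentThreadId();
}  // namespace toy

// ---- what would live in the implementation file (hypothetical util.cc) ----
#include <functional>
#include <thread>

namespace toy {
std::uint64_t CurrentThreadId() {
  // std::hash<std::thread::id> is provided by <thread>; equal ids hash equally.
  return static_cast<std::uint64_t>(
      std::hash<std::thread::id>{}(std::this_thread::get_id()));
}
}  // namespace toy
```

Callers then depend only on the integer-returning declaration, so the header stays light and the threading header remains an implementation detail.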
  // TODO(niranda) re-enable this!
  // if (GrouperFastImpl::CanUse(descrs)) {
  //   return GrouperFastImpl::Make(descrs, ctx);
  // }
Can you make a JIRA for this?
Oh!!! I completely forgot about this TBH! :-( This needs to be added before merging this PR. I was waiting for #10858 PR to be merged to add this change!
  g.ExpectFind("[[3], [3]]", "[0, 0]");

  g.ExpectFind("[[3], [3]]", "[0, 0]");
Did you mean to run an identical test twice? Not sure if this is a copy/paste or you are testing for idempotence/deterministic behavior.
I believe that was the intent. I was following the tests for the Consume method.
arrow/cpp/src/arrow/compute/kernels/hash_aggregate_test.cc, lines 438 to 440 in c9c97bd:
  g.ExpectConsume("[[3], [3]]", "[0, 0]");

  g.ExpectConsume("[[3], [3]]", "[0, 0]");
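To illustrate why repeating an identical call can be a meaningful test, here is a toy model. `ToyGrouper` is a hypothetical class written for this sketch, and its semantics are an assumption: `Consume` assigns group ids to unseen keys in first-seen order, while `Find` is read-only and returns -1 for unknown keys. Arrow's actual `Grouper` interface may differ. Running the same call twice then checks that the first call did not change the mapping in a way that alters the second result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical toy grouper over int keys (not Arrow's Grouper).
class ToyGrouper {
 public:
  // Assigns ids to unseen keys in first-seen order; returns the id of each key.
  std::vector<std::int32_t> Consume(const std::vector<int>& keys) {
    std::vector<std::int32_t> ids;
    ids.reserve(keys.size());
    for (int k : keys) {
      // emplace leaves an existing entry untouched, so ids stay stable.
      auto it = ids_.emplace(k, static_cast<std::int32_t>(ids_.size())).first;
      ids.push_back(it->second);
    }
    return ids;
  }

  // Read-only lookup: never inserts; unknown keys map to -1 (an assumption
  // of this sketch).
  std::vector<std::int32_t> Find(const std::vector<int>& keys) const {
    std::vector<std::int32_t> ids;
    ids.reserve(keys.size());
    for (int k : keys) {
      auto it = ids_.find(k);
      ids.push_back(it == ids_.end() ? -1 : it->second);
    }
    return ids;
  }

  std::size_t num_groups() const { return ids_.size(); }

 private:
  std::unordered_map<int, std::int32_t> ids_;
};
```

In this model, consuming `{3, 3}` twice yields ids `{0, 0}` both times with a single group, and `Find` on the same keys afterwards returns the same ids without adding groups.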
@@ -0,0 +1,18 @@
+// Licensed to the Apache Software Foundation (ASF) under one
Remove this file?
yes, and I will add a JIRA for this
I see that the #10858 PR is merged now. I will add the changes and rebase this ASAP.
I added the
@michalursa I see that the new PR #11150 contains these test cases. I am wondering if this is something you encountered previously.
Are you able to merge the other PR on top of this one and see if that makes any difference? If so, we could merge the two, one immediately after the other.
Some of this ended up being pulled into ARROW-13642/#11047, which is now merged, so closing this PR. Thanks @nirandaperera!
Adding prelim impl of semi joins