
ARROW-13268 [C++][Compute] Add ExecNode for semi and anti-semi join #10845

Closed
wants to merge 43 commits

Conversation

nirandaperera
Contributor

Adding a preliminary implementation of semi joins

@nirandaperera nirandaperera marked this pull request as draft July 30, 2021 20:06

@michalursa
Contributor

michalursa commented Jul 30, 2021

  1. The comment about UINT32_MAX probably needs to be updated

  2. It seems to me that ConsumeCachedProbeBatches is only called for a single thread index - the one for the thread that reaches completion of build_counter_.

  3. Because in StopProducing calls to Cancel() on two AtomicCounters are connected with ||, finished_.MarkFinished() can be called twice (first thread gets true from first counter Cancel() call and some other thread from the second Cancel() call). Also, shouldn't we always Cancel() both counters?

  4. I wonder what happens with an empty input on build side?

  5. I think we will only have one class for hash join. It should be fine to call it HashJoinNode and throw Status::NotImplemented() for join types outside of semi, anti-semi. JoinType enum could have elements from all other types as well. Also JoinType enum would be better as "enum class" although Arrow C++ probably has some policy about enums.

  6. I would rename build_side_complete_ to hash_table_built_ or hash_table_build_complete_. Currently I get it confused with build_counter_ checks, where one means all build side input batches consumed by local state hash tables, and the other means hash table merge is complete.

  7. Also it would be nice to tie these two conditions above to futures, so that a merge task and tasks to process cached probe-side batches could be generated and scheduled to execute once these futures are complete. But the futures are not critical at this point, just something nice to have.

  8. Status returned from CacheProbeBatch is always OK()

  9. We probably don't support DictionaryArray in key columns in the code as it is right now, we should check and return Status::NotImplemented() when making hash join node (or make sure it works). Also there could be a scenario where one side of the join uses DictionaryArray while the other uses Array with the same underlying type for keys to compare.

  10. In BuildSideMerge() ARROW_DCHECK(state->grouper). Perhaps it is a copy-paste from group by node, but it would be good to have a comment about why it is not possible to have states 0 and 2 initialized but not 1. This is not obvious. And maybe it should just be relaxed to skip processing if the local thread state with a given index is not initialized.

  11. TotalReached() method added to AtomicCounter is not used anywhere.

  12. There is a problem with null key. I believe in hash join with equality condition it should be that "null != null" (and there is usually a separate comparison operator that treats nulls as equal), while in group by "null==null" when matching groups. We should have a comment about it and document it for the users (maybe we don't have documentation strings for exec nodes yet). If needed we would have to filter out null keys separately from Grouper.

@nirandaperera
Contributor Author

nirandaperera commented Aug 2, 2021

2. It seems to me that ConsumeCachedProbeBatches is only called for a single thread index - the one for the thread that reaches completion of build_counter_.

Yes, thanks @michalursa. I missed this!

3. Because in StopProducing calls to Cancel() on two AtomicCounters are connected with ||, finished_.MarkFinished() can be called twice (first thread gets true from first counter Cancel() call and some other thread from the second Cancel() call). Also, shouldn't we always Cancel() both counters?

I see... The GroupByNode had this,

    if (input_counter_.Cancel()) {
      finished_.MarkFinished();
    } else if (output_counter_.Cancel()) {
      finished_.MarkFinished();
    }

and I was wondering why both cases had the same code path. I thought it could be combined into a single statement.

So, do you mean to say that finished_.MarkFinished() should be called if build_counter_.Cancel() && out_counter_.Cancel()?
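For illustration, something like this is what I'm picturing (just a sketch; stopped_ would be a hypothetical new member, and how it combines with the existing counters is an open question):

    // Sketch only: assumes the existing members
    //   AtomicCounter build_counter_, out_counter_;
    //   Future<> finished_;
    // plus a hypothetical new flag:
    //   std::atomic<bool> stopped_{false};   // needs <atomic>
    void StopProducing() override {
      // Always cancel both counters, regardless of which one is cancelled first.
      build_counter_.Cancel();
      out_counter_.Cancel();
      // Guard MarkFinished() with a separate flag so it runs exactly once,
      // even when several threads race into StopProducing().
      if (!stopped_.exchange(true)) {
        finished_.MarkFinished();
      }
    }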

4. I wonder what happens with an empty input on build side?

This was my thought process: build_counter_ initially has -1 as its total, so until the build input signals InputFinished with 0, probe batches will be cached. Once it receives 0, it toggles build_side_complete_ and probe batches will be queried against an empty hash map.
We could actually return a NullArray from the Grouper::Find method early, if the hash map is empty (rough sketch below). WDYT?
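Something like this, maybe (just a sketch; the Find signature, num_groups_, and FindImpl are assumptions on my part):

    // Sketch only: early exit in Grouper::Find when the build side was empty.
    // Every probe key is a miss, so return an all-null group-id array.
    Result<Datum> Find(const ExecBatch& batch) {
      if (num_groups_ == 0) {
        ARROW_ASSIGN_OR_RAISE(auto null_ids,
                              MakeArrayOfNull(uint32(), batch.length));
        return Datum(std::move(null_ids));
      }
      return FindImpl(batch);  // regular hash lookup (hypothetical helper)
    }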

5. I think we will only have one class for hash join. It should be fine to call it HashJoinNode and throw Status::NotImplemented() for join types outside of semi, anti-semi. JoinType enum could have elements from all other types as well. Also JoinType enum would be better as "enum class" although Arrow C++ probably has some policy about enums.

Sure!
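Something along these lines, perhaps (just a sketch; the enum members and the helper name are placeholders):

    // Sketch only: JoinType as an enum class, with unsupported types rejected
    // when the HashJoinNode is created.
    enum class JoinType {
      LEFT_SEMI,
      LEFT_ANTI,
      INNER,
      LEFT_OUTER,
      RIGHT_OUTER,
      FULL_OUTER,
    };

    Status ValidateJoinType(JoinType join_type) {
      switch (join_type) {
        case JoinType::LEFT_SEMI:
        case JoinType::LEFT_ANTI:
          return Status::OK();
        default:
          return Status::NotImplemented("only semi and anti-semi joins are supported");
      }
    }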

6. I would rename build_side_complete_ to hash_table_built_ or hash_table_build_complete_. Currently I get it confused with build_counter_ checks, where one means all build side input batches consumed by local state hash tables, and the other means hash table merge is complete.

Sure!

7. Also it would be nice to tie these two conditions above to futures, so that a merge task and tasks to process cached probe-side batches could be generated and scheduled to execute once these futures are complete. But the futures are not critical at this point, just something nice to have.

I will think about this one! :-)

8. Status returned from CacheProbeBatch is always OK()

I'll make this void!

9. We probably don't support DictionaryArray in key columns in the code as it is right now, we should check and return Status::NotImplemented() when making hash join node (or make sure it works). Also there could be a scenario where one side of the join uses DictionaryArray while the other uses Array with the same underlying type for keys to compare.

Sure!
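For the check at node-creation time, maybe something like this (just a sketch; the helper name is a placeholder):

    // Sketch only: reject dictionary-encoded key columns until they are
    // actually supported by the hash join node.
    Status CheckKeyTypes(const Schema& schema, const std::vector<int>& key_indices) {
      for (int i : key_indices) {
        const auto& field = *schema.field(i);
        if (field.type()->id() == Type::DICTIONARY) {
          return Status::NotImplemented("dictionary-encoded join key column: ",
                                        field.name());
        }
      }
      return Status::OK();
    }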

10. In BuildSideMerge() ARROW_DCHECK(state->grouper). Perhaps it is a copy-paste from group by node, but it would be good to have a comment about why it is not possible to have states 0 and 2 initialized but not 1. This is not obvious. And maybe it should just be relaxed to skip processing if the local thread state with a given index is not initialized.

Yes, it is a copy from the GroupBy impl.
Ah, good catch! That is something I didn't think about! Are we talking about a case like this?
Ex: 4 threads, but only 1 input batch. So, before/while the other thread-local states are initialized, thread 0 receives the batch and calls BuildSideMerge(). The other states could then be null, and ideally we could just continue the loop in that case (because it is guaranteed that those states won't receive any more batches, since build_counter_ has already completed), roughly as in the sketch below.
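Roughly (just a sketch; local_states_, merged_grouper_ and MergeInto are placeholder names):

    // Sketch only: skip thread-local states that never received a build-side
    // batch. Once build_counter_ has completed, those states cannot receive
    // any more input, so there is nothing to merge from them.
    Status BuildSideMerge() {
      for (auto& state : local_states_) {
        if (!state.grouper) continue;  // this thread never saw a build batch
        ARROW_RETURN_NOT_OK(MergeInto(merged_grouper_.get(), state.grouper.get()));
      }
      return Status::OK();
    }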

11. TotalReached() method added to AtomicCounter is not used anywhere.

12. There is a problem with null key. I believe in hash join with equality condition it should be that "null != null" (and there is usually a separate comparison operator that treats nulls as equal), while in group by "null==null" when matching groups. We should have a comment about it and document it for the users (maybe we don't have documentation strings for exec nodes yet). If needed we would have to filter out null keys separately from Grouper.

I see... but it looks like pandas treats null/NaN/NA as a valid key, and users have to explicitly drop NA values if they want them excluded.
https://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values
I started a thread on this in Zulip https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Null.20values.20as.20keys
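If we do end up excluding null keys on the probe side, a naive way to flag the affected rows could look like this (just a sketch; a validity-bitmap scan would be faster in practice):

    // Sketch only: mark probe rows whose key columns contain any null, so they
    // can be treated as non-matching under "null != null" join semantics.
    std::vector<bool> RowsWithNullKeys(const RecordBatch& batch,
                                       const std::vector<int>& key_indices) {
      std::vector<bool> has_null(batch.num_rows(), false);
      for (int col : key_indices) {
        const auto& key = *batch.column(col);
        if (key.null_count() == 0) continue;
        for (int64_t row = 0; row < batch.num_rows(); ++row) {
          if (key.IsNull(row)) has_null[row] = true;
        }
      }
      return has_null;
    }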

@nirandaperera nirandaperera marked this pull request as ready for review August 4, 2021 23:00
# Conflicts:
#	cpp/src/arrow/CMakeLists.txt
#	cpp/src/arrow/compute/exec/exec_plan.cc
#	cpp/src/arrow/compute/exec/exec_plan.h
#	cpp/src/arrow/compute/exec/plan_test.cc
@nirandaperera
Contributor Author

@lidavidm I added a simple verification to the tests and added the changes discussed.

Member

@lidavidm lidavidm left a comment


Thanks for this! Just one question about the input parameter validation.

@@ -39,7 +39,7 @@ Status ValidateJoinInputs(const std::shared_ptr<Schema>& left_schema,
                           const std::shared_ptr<Schema>& right_schema,
                           const std::vector<int>& left_keys,
                           const std::vector<int>& right_keys) {
-  if (left_keys.size() != right_keys.size()) {
+  if (left_keys.size() != right_keys.size() && left_keys.size() > 0) {
Member


Is it valid to join with no keys?

Contributor Author


AFAIK I don't think it's valid. We'd need some indexer if no key columns are specified

Member


Sorry I replied by email and it seems to have gotten messed up. Then maybe this should be

    if (left_keys.size() == 0) { return Status::Invalid(...); }
    if (left_keys.size() != right_keys.size()) { ... }

?

Contributor Author


Sorry I missed this. I added the changes now

cpp/src/arrow/compute/exec/hash_join_node.cc (outdated; resolved)
@lidavidm
Member

lidavidm commented Aug 28, 2021 via email

@lidavidm
Member

Thanks for this @nirandaperera. @westonpace and @michalursa any other comments?

Member

@westonpace westonpace left a comment


A few nits

@@ -19,6 +19,8 @@

#include <atomic>
#include <cstdint>
#include <thread>
Member


Hmm, I hate to add this in late but I didn't notice it earlier. So far we have managed to keep <thread> out of the public API surface. Is there any way you can push this into util.cc? This utility (uint64_t GetThreadId();) can probably prevent you from having to resort to pimpl.

Contributor Author


sure. that can be done :-)
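Roughly like this, I suppose (just a sketch; the actual hashing of the thread id is open):

    // util.h -- no <thread> in the public header
    #include <cstdint>

    uint64_t GetThreadId();

    // util.cc -- <thread> stays in the .cc file
    #include <functional>
    #include <thread>

    uint64_t GetThreadId() {
      // Hash the opaque std::thread::id into a 64-bit value.
      return static_cast<uint64_t>(
          std::hash<std::thread::id>()(std::this_thread::get_id()));
    }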

Comment on lines 2182 to 2185
// TODO(niranda) re-enable this!
// if (GrouperFastImpl::CanUse(descrs)) {
// return GrouperFastImpl::Make(descrs, ctx);
// }
Member


Can you make a JIRA for this?

Contributor Author


Oh!!! I completely forgot about this, TBH! :-( This needs to be added before merging this PR. I was waiting for PR #10858 to be merged before adding this change!

Comment on lines +447 to +449
g.ExpectFind("[[3], [3]]", "[0, 0]");

g.ExpectFind("[[3], [3]]", "[0, 0]");
Member


Did you mean to run an identical test twice? Not sure if this is a copy/paste or you are testing for idempotence/deterministic behavior.

Contributor Author


I believe that was the intent. I was following the tests for the Consume method.

g.ExpectConsume("[[3], [3]]", "[0, 0]");
g.ExpectConsume("[[3], [3]]", "[0, 0]");

@@ -0,0 +1,18 @@
// Licensed to the Apache Software Foundation (ASF) under one
Member


Remove this file?

Contributor Author


yes, and I will add a JIRA for this

michalursa added a commit to michalursa/arrow that referenced this pull request Sep 15, 2021
michalursa added a commit to michalursa/arrow that referenced this pull request Sep 15, 2021
michalursa added a commit to michalursa/arrow that referenced this pull request Sep 16, 2021
nealrichardson pushed a commit to michalursa/arrow that referenced this pull request Sep 23, 2021
@nirandaperera
Contributor Author

I see that PR #10858 is merged now. I will add the changes and rebase this ASAP.

michalursa added a commit to michalursa/arrow that referenced this pull request Sep 29, 2021
@nirandaperera
Contributor Author

nirandaperera commented Sep 30, 2021

I added GrouperFastImpl::Find. I tried reusing the GrouperFastImpl::ConsumeImpl method, but it looks like parallel test cases are failing. Here is a local stack trace I get:

__GI_raise 0x00007fa593a0818b
__GI_abort 0x00007fa5939e7859
arrow::util::CerrLog::~CerrLog logging.cc:72
arrow::util::CerrLog::~CerrLog logging.cc:74
arrow::util::ArrowLog::~ArrowLog logging.cc:250
arrow::util::TempVectorStack::release util.h:101
arrow::util::TempVectorHolder<unsigned char>::~TempVectorHolder util.h:119
arrow::compute::Hashing::HashMultiColumn key_hash.cc:274
arrow::compute::internal::(anonymous namespace)::GrouperFastImpl::ConsumeImpl<true> hash_aggregate.cc:746
arrow::compute::internal::(anonymous namespace)::GrouperFastImpl::Find hash_aggregate.cc:804
arrow::compute::HashSemiJoinNode<false>::ConsumeProbeBatch hash_join_node.cc:373
arrow::compute::HashSemiJoinNode<false>::ConsumeCachedProbeBatches()::{lambda()#1}::operator()() hash_join_node.cc:314
arrow::internal::FnOnce<void ()>::FnImpl<arrow::compute::HashSemiJoinNode<false>::ConsumeCachedProbeBatches()::{lambda()#1}>::invoke() functional.h:152
arrow::internal::FnOnce<void ()>::operator()() && functional.h:140
arrow::internal::WorkerLoop thread_pool.cc:176
arrow::internal::ThreadPool::<lambda()>::operator()(void) const thread_pool.cc:336
std::__invoke_impl<void, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> >(std::__invoke_other, arrow::internal::ThreadPool::<lambda()> &&) invoke.h:60
std::__invoke<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> >(arrow::internal::ThreadPool::<lambda()> &&) invoke.h:95
std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) thread:244
std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > >::operator()(void) thread:251
std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::<lambda()> > > >::_M_run(void) thread:195
<unknown> 0x00007fa594612de4
start_thread 0x00007fa593995609
clone 0x00007fa593ae4293

@michalursa I see that the new PR #11150 contains these test cases. I am wondering if this is something you encountered previously.

@lidavidm
Member

Are you able to merge the other PR on top of this one and see if that makes any difference? If so, we could merge the two, one immediately after the other.

@lidavidm
Member

Some of this ended up being pulled into ARROW-13642/#11047 which is now merged, so closing this PR. Thanks @nirandaperera!
