ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins #12289
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
Starting to review this. Still need to go through `hash_join.cc`, but I have some initial comments from the periphery.
```diff
@@ -92,7 +92,7 @@ class TempVectorStack {
   Status Init(MemoryPool* pool, int64_t size) {
     num_vectors_ = 0;
     top_ = 0;
-    buffer_size_ = size;
+    buffer_size_ = PaddedAllocationSize(size) + kPadding + 2 * sizeof(uint64_t);
```
Why are we padding here?
It's weird if you `Init` the `TempVectorStack` with one size and then it segfaults if you try to `alloc` that much memory. That's because `alloc` bumps the stack by `PaddedAllocationSize(size) + 2 * sizeof(uint64_t)`.
```diff
@@ -114,7 +114,7 @@ Status BlockedBloomFilter::CreateEmpty(int64_t num_rows_to_insert, MemoryPool* p
 }

 template <typename T>
-void BlockedBloomFilter::InsertImp(int64_t num_rows, const T* hashes) {
+NO_TSAN void BlockedBloomFilter::InsertImp(int64_t num_rows, const T* hashes) {
```
Is this still needed after we fixed other TSAN related issues?
Yes, I had to make `blocks` atomic to make TSAN go away, which we don't want to do.
```cpp
[&](std::function<Status(size_t)> func) -> Status {
  return tp->Spawn([&, func] {
    size_t tid = thread_indexer();
    std::ignore = func(tid);
```
Why ignore it? Return it if it isn't ok.
The lambda expression needs to return void here. I can maybe DCHECK_OK it.
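The constraint can be sketched like this (hypothetical `Status`/`Spawn`, not Arrow's exact API): a scheduler that only accepts a `void()` task gives a `Status` returned inside the task body nowhere to go, so the caller must either discard it (`std::ignore`), assert on it (the `DCHECK_OK` option), or stash it in shared state.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Toy Status type standing in for arrow::Status (an assumption, not the
// real class).
struct Status {
  bool ok_ = true;
  std::string msg_;
  bool ok() const { return ok_; }
  static Status OK() { return {}; }
};

// Toy "scheduler" that runs the task inline; a real thread pool would
// enqueue it. The key point is the void() signature.
void Spawn(std::function<void()> task) { task(); }

// Since the spawned lambda must return void, the task's Status is captured
// into *out instead of being returned (DCHECK_OK would just assert here).
void RunTask(Status (*func)(size_t), size_t tid, Status* out) {
  Spawn([func, tid, out] { *out = func(tid); });
}
```

Capturing into shared state is the variant that still lets the caller observe the error, which is why simply ignoring the `Status` drew the review question above.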
```cpp
HashJoinImpl* bloom_filter_pushdown_target = nullptr;
std::vector<int> key_input_map;

bool bloom_filter_does_not_apply_to_join =
```
So is the idea here to measure the overhead cost of building the bloom filter?
Well, we benchmark to see what kind of performance impact the Bloom filter has. But since we currently only do early elimination of rows, and only build the filter on the build side, some types of joins are disqualified, so we check that here.
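The kind of check being described might look like the sketch below (hypothetical enum and predicate, not Arrow's exact `JoinType` handling): a Bloom filter built from build-side keys can only be used to drop probe-side rows early, which is unsound whenever unmatched probe-side rows must still appear in the output.

```cpp
#include <cassert>

// Hypothetical join-type enum for illustration.
enum class JoinType {
  INNER,
  LEFT_SEMI,
  LEFT_ANTI,
  LEFT_OUTER,
  RIGHT_OUTER,
  FULL_OUTER,
};

bool BloomFilterAppliesToJoin(JoinType type) {
  switch (type) {
    // These join types emit probe-side rows even when they have no match on
    // the build side, so discarding probe rows the filter rejects would
    // change the result.
    case JoinType::LEFT_ANTI:
    case JoinType::LEFT_OUTER:
    case JoinType::FULL_OUTER:
      return false;
    default:
      return true;
  }
}
```

This matches the later observation in the thread that the Bloom filter is disabled for full outer joins.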
```cpp
    {"node.detail", ToString()},
    {"node.kind", kind_name()}});
END_SPAN_ON_FUTURE_COMPLETION(span_, finished(), this);

std::pair<HashJoinImpl*, std::vector<int>> GetPushdownTarget() {
```
Since we operate on these pairs everywhere, why not create:

```cpp
struct BloomFilterTarget {
  HashJoinImpl* join_impl;
  std::vector<int> column_map;
};
```

It also takes a bit of reading to figure out what the purpose of `column_map` is, so this could be a place to briefly describe that.
I only use the pair in one spot as far as I can tell. I just use it so that I can use `std::tie` on whoever calls `GetPushdownTarget`. I did add a big comment though.
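The `std::tie` pattern being referred to can be sketched as follows (hypothetical stand-in types, not the real `HashJoinImpl`): returning a `std::pair` lets the single call site unpack both values into existing locals in one line.

```cpp
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

struct JoinImpl {};  // stand-in for HashJoinImpl

// Returns the join to push the Bloom filter down to, plus a column map
// (here a made-up permutation) relating this join's key columns to the
// target join's input columns.
std::pair<JoinImpl*, std::vector<int>> GetPushdownTarget(JoinImpl* target) {
  return {target, {0, 2, 1}};
}
```

At the call site, `std::tie(pushdown_target, column_map) = GetPushdownTarget(...)` assigns both members at once, which is the convenience the pair buys over a named struct.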
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor suggestions but overall I think this is pretty much ready to go once #13091 merges.
Let's rebase on top of master now that the thread scheduler issue is in. Then, assuming CI passes, I think this is ready to go.
OK, I think this is good now. The various failures seem to be caused by two existing issues, the latter of which is being addressed in #13101.
I played around with this today and saw similar. Some Bloom filter combinations were just very slow on my laptop. In general, though, I didn't see any true deadlock, although I can never fully rule that out.
Just to be clear, it's not the Bloom filter that's slow (the slow parts tend to be full outer joins, where the Bloom filter is disabled). It seems to be related to residual filters being slow, in particular `KeyEncoder::DecodeNulls`.
Yes. The slowdown happened in `ProbeQueuedBatches`, which would make sense (I think) if it was related to residual filters.
Benchmark runs are scheduled for baseline = 6faee47 and contender = 0742f78. 0742f78 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins

This adds Bloom filter pushdown between hash join nodes.

Closes apache#12289 from save-buffer/sasha_bloom_pushdown

Lead-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
Co-authored-by: michalursa <michal@ursacomputing.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>