Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins #12289

Closed
wants to merge 42 commits into from

Conversation

save-buffer
Copy link
Contributor

This adds Bloom filter pushdown between hash join nodes.

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@save-buffer save-buffer marked this pull request as draft January 28, 2022 23:11
@save-buffer save-buffer changed the title [C++] Bloom filter pushdown for Hash Join ARROW-15498: [C++][Compute] Implement Bloom filter pushdown between hash joins Jan 28, 2022
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@westonpace westonpace self-requested a review February 1, 2022 18:33
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 3 times, most recently from 800a969 to ed4d9dc Compare February 8, 2022 18:29
@save-buffer save-buffer marked this pull request as ready for review April 12, 2022 21:56
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 4 times, most recently from c9b57e0 to 02a003a Compare April 25, 2022 20:01
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to review this. Still need to go through hash_join.cc but I have some initial comments from the periphery.

cpp/src/arrow/CMakeLists.txt Show resolved Hide resolved
cpp/src/arrow/compute/exec/partition_util.h Show resolved Hide resolved
@@ -92,7 +92,7 @@ class TempVectorStack {
Status Init(MemoryPool* pool, int64_t size) {
num_vectors_ = 0;
top_ = 0;
buffer_size_ = size;
buffer_size_ = PaddedAllocationSize(size) + kPadding + 2 * sizeof(uint64_t);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we padding here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird if you Init the TempVectorStack with one size and then it segfaults if you try to alloc that much memory. That's because alloc bumps the stack by PaddedAllocationSize(size) + 2 * sizeof(uint64_t)

@@ -114,7 +114,7 @@ Status BlockedBloomFilter::CreateEmpty(int64_t num_rows_to_insert, MemoryPool* p
}

template <typename T>
void BlockedBloomFilter::InsertImp(int64_t num_rows, const T* hashes) {
NO_TSAN void BlockedBloomFilter::InsertImp(int64_t num_rows, const T* hashes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed after we fixed other TSAN related issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had to make blocks atomic to make TSAN go away, which we don't want to do

[&](std::function<Status(size_t)> func) -> Status {
return tp->Spawn([&, func] {
size_t tid = thread_indexer();
std::ignore = func(tid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore it? Return it if it isn't ok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lambda expression needs to return void here. I can maybe DCHECK_OK it.

HashJoinImpl* bloom_filter_pushdown_target = nullptr;
std::vector<int> key_input_map;

bool bloom_filter_does_not_apply_to_join =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the idea here to measure the overhead cost of building the bloom filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we benchmark to see what kind of performance impact the Bloom filter has.
But since we currently only do early-elimination of rows, and only build on the build side, this disqualifies some types of joins, so we check that here.

cpp/src/arrow/compute/exec/hash_join.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/hash_join.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/hash_join.cc Outdated Show resolved Hide resolved
{"node.detail", ToString()},
{"node.kind", kind_name()}});
END_SPAN_ON_FUTURE_COMPLETION(span_, finished(), this);
std::pair<HashJoinImpl*, std::vector<int>> GetPushdownTarget() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we operate on these pairs everywhere why not create:

struct BloomFilterTarget {
  HashJoinImpl* join_impl;
  std::vector<int> column_map;
};

It's also takes a bit of reading to figure out what the purpose of column_map is so this could be a place to briefly describe that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only use the pair in one spot as far as I can tell. I just use it so that I can use std::tie on whoever calls GetPushdownTarget. I did add a big comment though

@westonpace westonpace self-requested a review April 28, 2022 23:55
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 2 times, most recently from 139dcae to 0575ddf Compare May 5, 2022 21:09
@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 3 times, most recently from 653f13b to 619f098 Compare May 9, 2022 18:19
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions but overall I think this is pretty much ready to go once #13091 merges.

cpp/src/arrow/compute/exec/bloom_filter_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/exec_plan.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/exec_plan.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/hash_join.cc Show resolved Hide resolved
cpp/src/arrow/compute/exec/hash_join.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/hash_join.cc Show resolved Hide resolved
cpp/src/arrow/compute/exec/util.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/task_util.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/task_util.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/exec/hash_join_node_test.cc Outdated Show resolved Hide resolved
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rebase on top of master now that the thread scheduler issue is in. Then, assuming CI passes, I think this is ready to go.

@save-buffer save-buffer force-pushed the sasha_bloom_pushdown branch 2 times, most recently from 4e0621a to ede7599 Compare May 16, 2022 20:25
@save-buffer
Copy link
Contributor Author

OK I think this is good now. The various failures seem to be because of :

tests/test_plasma.py::TestPlasmaClient::test_connection_failure_raises_exception

And

685  CMake Error at cmake_modules/ThirdpartyToolchain.cmake:239 (find_package):
686  Could not find a package configuration file provided by "GTest" (requested
687  version 1.10.0) with any of the following names:
688
689    GTestConfig.cmake
690    gtest-config.cmake
691
692  Add the installation prefix of "GTest" to CMAKE_PREFIX_PATH or set
693  "GTest_DIR" to a directory containing one of the above files.  If "GTest"
694  provides a separate development package or SDK, be sure it has been
695  installed.
696

The latter of which is being addressed in #13101

@westonpace
Copy link
Member

I played around with this today and saw similar. Some bloom filter combinations were just very slow on my laptop. In general though I didn't see any true deadlock though I can never fully rule that out.

@save-buffer
Copy link
Contributor Author

Just to be clear, it's not the Bloom filter that's slow (the slow parts tend to be full outer joins, where Bloom filter is disabled). It seems to be related to residual filters being slow, in particular KeyEncoder::DecodeNulls

@westonpace
Copy link
Member

Yes. The slowdown happened in ProbeQueuedBatches which would make sense (I think) if it was related to residual filters.

@ursabot
Copy link

ursabot commented May 18, 2022

Benchmark runs are scheduled for baseline = 6faee47 and contender = 0742f78. 0742f78 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.98% ⬆️1.63%] test-mac-arm
[Failed ⬇️0.0% ⬆️11.79%] ursa-i9-9960x
[Finished ⬇️1.66% ⬆️0.32%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0742f78a ec2-t3-xlarge-us-east-2
[Failed] 0742f78a test-mac-arm
[Failed] 0742f78a ursa-i9-9960x
[Finished] 0742f78a ursa-thinkcentre-m75q
[Finished] 6faee474 ec2-t3-xlarge-us-east-2
[Finished] 6faee474 test-mac-arm
[Finished] 6faee474 ursa-i9-9960x
[Finished] 6faee474 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented May 18, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

siddhantrao23 pushed a commit to siddhantrao23/arrow that referenced this pull request Jun 28, 2022
…ash joins

This adds Bloom filter pushdown between hash join nodes.

Closes apache#12289 from save-buffer/sasha_bloom_pushdown

Lead-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
Co-authored-by: michalursa <michal@ursacomputing.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants