GH-31769: [C++][Acero] Add spilling for hash join #13669

save-buffer · 2022-07-21T05:03:32Z

Adds support for spilling data to disk during hash join.

Closes: [C++] Support hash-join on larger than memory datasets #31769

github-actions · 2022-07-21T05:04:30Z

https://issues.apache.org/jira/browse/ARROW-16389

westonpace

Exciting to see this starting to come together. Not looking through in detail yet but picking at some of the points I suspect are going to be more contentious so we can start the conversation.

cpp/src/arrow/memory_pool_internal.h

cpp/src/arrow/compute/exec/spilling_util.cc

westonpace

Starting to poke around the edges of this PR. Can you explain to me the relationship between SpillingJoin and HashJoinNode?

cpp/src/arrow/compute/light_array.h

cpp/src/arrow/util/io_util.h

cpp/src/arrow/memory_pool.h

cpp/src/arrow/compute/light_array.h

cpp/src/arrow/compute/exec/accumulation_queue.cc

cpp/src/arrow/compute/exec/spilling_util.h

cpp/src/arrow/util/atomic_util.h

westonpace · 2022-08-11T23:30:24Z

cpp/src/arrow/compute/exec/spilling_join.h

+            using OutputBatchCallback = std::function<void(int64_t, ExecBatch)>;
+            using BuildFinishedCallback = std::function<Status(size_t)>;
+            using FinishedCallback = std::function<void(int64_t)>;
+            using RegisterTaskGroupCallback = std::function<int(
+                std::function<Status(size_t, int64_t)>, std::function<Status(size_t)>)>;
+            using StartTaskGroupCallback = std::function<Status(int, int64_t)>;
+            using PauseProbeSideCallback = std::function<void(int)>;
+            using ResumeProbeSideCallback = std::function<void(int)>;
+            using AbortContinuationImpl = std::function<void()>;
+
+            struct CallbackRecord
+            {
+                OutputBatchCallback output_batch_callback;
+                BuildFinishedCallback build_finished_callback;
+                FinishedCallback finished_callback;
+                RegisterTaskGroupCallback register_task_group_;
+                StartTaskGroupCallback start_task_group_callback;
+                PauseProbeSideCallback pause_probe_side_callback;
+                AbortContinuationImpl abort_callback;
+            };


Can we just use a pure virtual class at this point?

class HashJoinExternals {
  virtual void OutputBatch(int64_t, ExecBatch) = 0;
  // ...
};

save-buffer · 2022-08-26

I personally find these pure-virtual classes cumbersome to deal with because they move the callback definitions away from the site where I invoke Init. They're also less flexible and don't let me reuse functions: HashJoinImpl and SpillingHashJoin share a lot of the same callbacks, so I can just assign the same functions to both callback records.
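To make the trade-off in this thread concrete, here is a minimal sketch contrasting the two styles. Every name in it (`Callbacks`, `PlainJoin`, `SpillingJoinSketch`, `JoinExternals`, `CollectingExternals`, `RunBothWithSharedCallbacks`) is a hypothetical stand-in rather than a class from the PR, and batches are reduced to plain integers so the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Style 1: a record of std::function callbacks (simplified, hypothetical).
struct Callbacks {
  std::function<void(int64_t)> output_batch;
  std::function<void(int64_t)> finished;
};

// Two unrelated consumers that each accept a callback record at Init time.
class PlainJoin {
 public:
  void Init(Callbacks cb) { cb_ = std::move(cb); }
  void Run() {
    assert(cb_.output_batch && cb_.finished);  // callbacks must be set
    cb_.output_batch(1);
    cb_.finished(1);
  }

 private:
  Callbacks cb_;
};

class SpillingJoinSketch {
 public:
  void Init(Callbacks cb) { cb_ = std::move(cb); }
  void Run() {
    assert(cb_.output_batch && cb_.finished);
    cb_.output_batch(2);
    cb_.output_batch(3);
    cb_.finished(2);
  }

 private:
  Callbacks cb_;
};

// Style 2: the pure-virtual interface suggested in the review.
class JoinExternals {
 public:
  virtual ~JoinExternals() = default;
  virtual void OutputBatch(int64_t batch) = 0;
  virtual void Finished(int64_t num_batches) = 0;
};

// An implementation has to live in its own class, away from the Init call.
class CollectingExternals : public JoinExternals {
 public:
  void OutputBatch(int64_t batch) override { batches.push_back(batch); }
  void Finished(int64_t num_batches) override { total = num_batches; }
  std::vector<int64_t> batches;
  int64_t total = 0;
};

// With style 1, one record is defined next to the Init calls and reused
// by both consumers without any shared base class.
std::vector<int64_t> RunBothWithSharedCallbacks() {
  std::vector<int64_t> seen;
  Callbacks shared;
  shared.output_batch = [&seen](int64_t batch) { seen.push_back(batch); };
  shared.finished = [](int64_t) {};
  PlainJoin plain;
  SpillingJoinSketch spilling;
  plain.Init(shared);
  spilling.Init(shared);
  plain.Run();
  spilling.Run();
  return seen;
}
```

The interface style gives compile-time checking that every hook is implemented, while the record style keeps the wiring local and lets the same functions be assigned to several records; which matters more is the judgment call being debated above.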

westonpace · 2022-08-11T23:57:07Z

@marsupialtail do you mind taking a look at spilling_file (and other parts of the PR if interested)? I'm curious to get your feedback since you experimented with direct I/O as well.

marsupialtail · 2022-08-12T06:02:26Z

cpp/src/arrow/compute/exec/spilling_util.cc

+    }
+
+    if(pwritev(handle, ios.data(), static_cast<int>(ios.size()), info.start) == -1)
+        return Status::IOError("Failed to spill!");


I seem to recall a discussion where we talked about the performance of pwritev versus things like io_uring, where you were able to saturate NVMe SSD bandwidth. Were you able to saturate the SSD with pwritev? I understand that when you are spilling many batches there might be many pwritev calls happening at the same time. Still, I am curious how the performance compares to io_uring. This is to satisfy my (and maybe other people's) curiosity, not to point out a problem with your code.

save-buffer · 2022-08-26

I think you are conflating two things: the I/O command (pwritev) and the interface used to invoke it (syscall vs. io_uring). io_uring lets you kick off a pwritev by writing into a ring buffer and issuing a memory barrier, after which it is executed on a kernel-mode thread. As a normal syscall, pwritev is synchronous, but I'm invoking it on a different user-space thread in order to emulate asynchrony, so the net effect should be the same (just with more cumbersome code). I am using pwritev in both scenarios, just invoking it in two different ways.

That said, I will add a benchmark.
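The approach described in this reply, a blocking pwritev issued from another user-space thread, can be sketched roughly as follows. This is an illustration under stated assumptions and not the PR's code: `WriteAllAt` and `WriteAllAtAsync` are hypothetical names, `std::async` stands in for whatever thread pool the real implementation uses, and a short write is simply reported as failure rather than retried. It targets POSIX systems that provide `pwritev`.

```cpp
#include <sys/types.h>
#include <sys/uio.h>  // pwritev, struct iovec
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Gather-write all buffers at `offset` with a single pwritev call.
// Returns true only if every byte was written (assumption: a short
// write counts as failure instead of being retried).
bool WriteAllAt(int fd, const std::vector<std::string>& buffers, off_t offset) {
  assert(!buffers.empty());
  std::vector<iovec> ios;
  std::size_t total = 0;
  for (const std::string& buffer : buffers) {
    iovec io;
    io.iov_base = const_cast<char*>(buffer.data());
    io.iov_len = buffer.size();
    ios.push_back(io);
    total += buffer.size();
  }
  ssize_t written =
      pwritev(fd, ios.data(), static_cast<int>(ios.size()), offset);
  return written >= 0 && static_cast<std::size_t>(written) == total;
}

// Emulates asynchrony: the synchronous syscall runs on another thread and
// the caller gets a future it can wait on later.
std::future<bool> WriteAllAtAsync(int fd, std::vector<std::string> buffers,
                                  off_t offset) {
  return std::async(std::launch::async,
                    [fd, buffers = std::move(buffers), offset] {
                      return WriteAllAt(fd, buffers, offset);
                    });
}
```

The caller keeps running after `WriteAllAtAsync` returns and only blocks when it calls `get()` on the future, which is the "net effect" the reply refers to; whether this matches io_uring in throughput is exactly what the promised benchmark would have to show.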

marsupialtail · 2022-08-12T06:04:29Z

cpp/src/arrow/compute/exec/spilling_util.cc

+#ifdef __ANDROID__
+        const char *backup = "/data/local/tmp/";
+#else
+        const char *backup = "/tmp/";


What if I want to spill to an attached NVMe SSD that is mounted on its own directory? For example, on AWS instances with an NVMe SSD you usually mount it to a directory called /data or something similar.

save-buffer · 2022-08-26

For now you can set one of the environment variables below, but eventually we'll flesh out QueryOptions to allow you to specify more options such as the temp directory.

const char *selectors[] = { "TMPDIR", "TMP", "TEMP", "TEMPDIR" };
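As a rough illustration of how such a lookup could work, the sketch below checks the listed variables and falls back to the hard-coded default from the diff. The function name `ChooseSpillDirectory` is hypothetical, and the details (first non-empty variable wins in the listed order, a trailing slash is appended) are assumptions, since the thread does not show the PR's actual lookup logic.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Picks a directory for spill files (hypothetical helper).
// Assumption: variables are tried in the listed order and the first
// non-empty one wins; otherwise the platform default is used.
std::string ChooseSpillDirectory() {
  const char* selectors[] = {"TMPDIR", "TMP", "TEMP", "TEMPDIR"};
  for (const char* name : selectors) {
    const char* value = std::getenv(name);
    if (value != nullptr && value[0] != '\0') {
      std::string dir(value);
      if (dir.back() != '/') dir.push_back('/');  // normalize trailing slash
      return dir;
    }
  }
#ifdef __ANDROID__
  return "/data/local/tmp/";
#else
  return "/tmp/";
#endif
}
```

Under these assumptions, running with `TMPDIR=/data` in the environment would direct spill files to the NVMe mount from the question above without any code change.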

westonpace

Thanks for the contribution. I managed to get a pretty thorough look today so I hope future reviews can be faster (try not to squash changes past this point so it's easier for me to see what you've changed). I've got some suggestions.

Also, at the moment, when I run locally, both the benchmark and the unit tests hang forever. I am attempting to debug further but haven't looked into it much yet.

cpp/src/arrow/compute/exec/accumulation_queue.cc

cpp/src/arrow/compute/exec/accumulation_queue.h

cpp/src/arrow/compute/exec/accumulation_queue.cc

cpp/src/arrow/compute/exec/spilling_benchmark.cc

cpp/src/arrow/compute/exec/spilling_test.cc

westonpace

We spoke about this briefly offline, but I'll summarize here as well. These changes are roughly what we want. We need to figure out the CI failures at this point and refine the benchmarks & testing so they don't take so long in certain environments. Then we can go through, straighten out any last rough edges, and get this merged in.

github-actions · 2023-01-20T23:47:52Z

Closes: [C++] Support hash-join on larger than memory datasets #31769

github-actions · 2023-01-20T23:47:54Z

⚠️ GitHub issue #31769 has been automatically assigned in GitHub to PR creator.

amol- · 2023-03-30T17:13:55Z

Closing because it has been untouched for a while. If it's still relevant, feel free to reopen and move it forward 👍

vkhodygo · 2023-08-02T21:40:31Z

@save-buffer Any news regarding this one?

save-buffer marked this pull request as draft July 21, 2022 05:03

github-actions bot added the Component: C++ label Jul 21, 2022

save-buffer force-pushed the sasha_spilling2 branch from f0cd541 to 07fc888 Compare July 21, 2022 05:26

westonpace reviewed Jul 21, 2022

cpp/src/arrow/memory_pool_internal.h (outdated)

cpp/src/arrow/compute/exec/spilling_util.cc (outdated)

cpp/src/arrow/compute/exec/spilling_util.cc (outdated)

save-buffer force-pushed the sasha_spilling2 branch 3 times, most recently from fad418d to a1b3b13 Compare July 26, 2022 02:50

save-buffer force-pushed the sasha_spilling2 branch 2 times, most recently from 62fa41a to d858c9f Compare August 3, 2022 22:21

save-buffer force-pushed the sasha_spilling2 branch 6 times, most recently from fcb3bf2 to 8d527a3 Compare August 11, 2022 23:22

westonpace reviewed Aug 11, 2022

westonpace self-requested a review August 11, 2022 23:33

marsupialtail reviewed Aug 12, 2022

save-buffer force-pushed the sasha_spilling2 branch 10 times, most recently from 8f07030 to 8f97bb2 Compare August 18, 2022 22:09

save-buffer force-pushed the sasha_spilling2 branch 3 times, most recently from 96c2370 to 8dcf8c7 Compare September 21, 2022 23:59

save-buffer force-pushed the sasha_spilling2 branch 3 times, most recently from fd7f00e to 47a1fca Compare September 24, 2022 02:27

save-buffer force-pushed the sasha_spilling2 branch from 47a1fca to 2a45453 Compare January 6, 2023 21:19

save-buffer marked this pull request as ready for review January 6, 2023 21:20

save-buffer force-pushed the sasha_spilling2 branch from 2ba2f25 to 2620651 Compare January 9, 2023 20:03

westonpace requested changes Jan 9, 2023

asfimport mentioned this pull request Jan 9, 2023

[C++] Support hash-join on larger than memory datasets #31769

Open

save-buffer added 5 commits January 12, 2023 14:00

Implement spilling for Hash Join

9e683a4

Make my poor code completely unreadable

52df6bf

Some win32 fixes

d8291d3

Fix more windows errors

12f3b5b

Respond to Weston comments

5cb8c50

save-buffer force-pushed the sasha_spilling2 branch from 78f3398 to 5cb8c50 Compare January 12, 2023 22:00

ARROW_EXPORT some stuff to hopefully fix windows

98a912a

save-buffer requested a review from westonpace January 12, 2023 23:26

save-buffer added 2 commits January 12, 2023 16:54

More windows nonsense

d47fe5b

Change number of tests to see if it passes CI

81708bd

westonpace requested changes Jan 20, 2023

westonpace changed the title ~~ARROW-16389: [C++][Acero] Add spilling for hash join~~ GH-31769: [C++][Acero] Add spilling for hash join Jan 20, 2023

amol- closed this Mar 30, 2023

westonpace mentioned this pull request Apr 12, 2023

[Python] Table.join() produces incorrect results for large inputs #34474

Closed

vkhodygo mentioned this pull request Aug 19, 2024

GH-43495: [C++][Compute] Widen the row offset of the row table to 64-bit #43389

Merged
