New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDataset #9670
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super-slick! LGTM once you've addressed those todos. I left a nit about something that's not obvious at first read.
cpp/src/arrow/dataset/file_base.cc
Outdated
// /num=1/al=be/ | ||
// /num=1/al=be/dat.par | ||
struct SubtreeImpl { | ||
using expression_code = char32_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: any particular reason for char32_t over say uint32_t or size_t (since it is a vector index)? And in any case, it doesn't match the static_cast below on line 176.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's to work with std::basic_string, but then I'm curious why you're favoring that over std::vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - to get the lexicographic sort down below. I think future readers might appreciate a note about that since it's not immediately obvious + is an unusual choice of types otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: basic_string has the short string approximation in most standard libraries (so a string with as many as 4 expression_code
s will probably be stored without allocation) and supports hashing out of the box
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this hack would be replaced by a single buffer of expression codes indexed into by SubtreeImpl::Encoded::partition_expression
and friends. This would give us a guarantee that even "deep" partitionings would not require lots of small buffers at the cost of complicating hashing and comparison
On a quick benchmark, this is a two orders of magnitude speedup! Though I'm also going to test some other scenarios + add unit tests. Before:
After:
|
4a1625c
to
b13b060
Compare
Before:
After:
|
bbcc910
to
f34b617
Compare
f34b617
to
ef88b83
Compare
// Drill down to a subtree. | ||
BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Drill down to a subtree. | |
BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90))); | |
// Drill down to a subtree. | |
BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90))); | |
// Drill down but not to a subtree | |
BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("b"), literal(90))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
Before:
GetAllFragments 77956704 ns 77958109 ns 18
GetFilteredFragments/single_dir 504208733 ns 504213184 ns 3
GetFilteredFragments/multi_dir 501361456 ns 501370822 ns 3
GetFilteredFragments/single_file 1135849271 ns 1135862409 ns 1
GetFilteredFragments/range 507525698 ns 507533638 ns 3
After:
GetAllFragments 4821049 ns 4821404 ns 264
GetFilteredFragments/single_dir 5360703 ns 5361104 ns 252
GetFilteredFragments/multi_dir 406617239 ns 406644158 ns 3
GetFilteredFragments/single_file 11986866 ns 11987648 ns 116
GetFilteredFragments/range 68938442 ns 68942840 ns 20
ef88b83
to
b58b6f0
Compare
@ursabot please benchmark |
MacOS tests are fixed now that the sorting on subtrees is fully defined. |
Benchmark runs are scheduled for baseline = 2d140c3 and contender = 2aee5b6. Results will be available as each benchmark for each run completes: |
+1, merging |
…taset TODO: - [ ] Add benchmarks to be sure this provides an advantage - [ ] Move Forest to dataset_internal.h or so, since it is not used anywhere else - [ ] Unit test SubtreeImpl, add more comments Closes apache#9670 from bkietz/8658-Implement-subtree-pruning Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
…taset TODO: - [ ] Add benchmarks to be sure this provides an advantage - [ ] Move Forest to dataset_internal.h or so, since it is not used anywhere else - [ ] Unit test SubtreeImpl, add more comments Closes apache#9670 from bkietz/8658-Implement-subtree-pruning Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
TODO: