ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDataset #9670

bkietz · 2021-03-10T15:50:26Z

TODO:

- [ ] Add benchmarks to be sure this provides an advantage
- [ ] Move Forest to dataset_internal.h or so, since it is not used anywhere else
- [ ] Unit test SubtreeImpl, add more comments

github-actions · 2021-03-10T16:25:23Z

https://issues.apache.org/jira/browse/ARROW-8658

lidavidm

This is super-slick! LGTM once you've addressed those todos. I left a nit about something that's not obvious at first read.

lidavidm · 2021-03-10T16:17:41Z

cpp/src/arrow/dataset/file_base.cc

+//   /num=1/al=be/
+//   /num=1/al=be/dat.par
+struct SubtreeImpl {
+  using expression_code = char32_t;


nit: any particular reason for char32_t over say uint32_t or size_t (since it is a vector index)? And in any case, it doesn't match the static_cast below on line 176.

I guess it's to work with std::basic_string, but then I'm curious why you're favoring that over std::vector.

Ah - to get the lexicographic sort down below. I think future readers might appreciate a note about that since it's not immediately obvious + is an unusual choice of types otherwise.

bkietz · 2021-03-10

Also: basic_string has the short string optimization in most standard libraries (so a string with as many as 4 expression_codes will probably be stored without allocation) and supports hashing out of the box.
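To make the point of this thread concrete, here is a minimal sketch of the encoding trick, not the code in file_base.cc. It assumes each partition expression can be interned to a small integer code; `ExpressionInterner`, `EncodedPath`, `Encode`, and `IsPrefix` are hypothetical names, and expressions are represented as plain strings such as "num=1" purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using expression_code = char32_t;
// std::basic_string<char32_t> is std::u32string: operator< is lexicographic
// and std::hash is specialized for it, so no custom comparator or hasher is needed.
using EncodedPath = std::basic_string<expression_code>;

// Hypothetical interner: maps each distinct partition expression (plain text
// here) to the next unused integer code.
class ExpressionInterner {
 public:
  expression_code GetOrInsert(const std::string& expr) {
    const auto next = static_cast<expression_code>(codes_.size());
    // emplace keeps the existing code if `expr` was already interned.
    return codes_.emplace(expr, next).first->second;
  }

 private:
  std::unordered_map<std::string, expression_code> codes_;
};

// Encode a path given as its sequence of partition expressions, outermost first.
EncodedPath Encode(ExpressionInterner& interner,
                   const std::vector<std::string>& segments) {
  EncodedPath out;
  for (const auto& segment : segments) {
    out.push_back(interner.GetOrInsert(segment));
  }
  return out;
}

// True if `dir` is an ancestor of, or equal to, `path`.
bool IsPrefix(const EncodedPath& dir, const EncodedPath& path) {
  return path.compare(0, dir.size(), dir) == 0;
}
```

Sorting a `std::vector<EncodedPath>` with plain `std::sort` then places every directory before its descendants, because a string compares less than any longer string it is a prefix of.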

Ideally this hack would be replaced by a single buffer of expression codes indexed into by SubtreeImpl::Encoded::partition_expression and friends. This would guarantee that even "deep" partitionings do not require lots of small buffers, at the cost of complicating hashing and comparison.
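The single-buffer alternative described above might look roughly like the following. This is a sketch under stated assumptions, not anything from the PR: `FlatPaths` and its members are hypothetical, and the hash is an FNV-1a-style mix over whole codes chosen only to show that hashing has to be written by hand once `basic_string` is gone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using expression_code = std::uint32_t;

// Hypothetical flat storage: every path lives in one shared code buffer and
// is addressed by its index; offsets_[i] marks where path i starts.
class FlatPaths {
 public:
  // Appends a path and returns its index. Pointers previously obtained from
  // begin()/end() may be invalidated by this call.
  std::size_t Append(const std::vector<expression_code>& path) {
    offsets_.push_back(codes_.size());
    codes_.insert(codes_.end(), path.begin(), path.end());
    return offsets_.size() - 1;
  }

  std::size_t size() const { return offsets_.size(); }

  const expression_code* begin(std::size_t i) const {
    assert(i < offsets_.size());
    return codes_.data() + offsets_[i];
  }

  const expression_code* end(std::size_t i) const {
    assert(i < offsets_.size());
    const std::size_t stop =
        (i + 1 < offsets_.size()) ? offsets_[i + 1] : codes_.size();
    return codes_.data() + stop;
  }

  // Lexicographic comparison, which basic_string provided for free.
  bool Less(std::size_t l, std::size_t r) const {
    return std::lexicographical_compare(begin(l), end(l), begin(r), end(r));
  }

  // Hand-written hash over the codes of path i (FNV-1a-style constants).
  std::uint64_t Hash(std::size_t i) const {
    std::uint64_t h = 14695981039346656037ULL;
    for (const expression_code* p = begin(i); p != end(i); ++p) {
      h = (h ^ *p) * 1099511628211ULL;
    }
    return h;
  }

 private:
  std::vector<expression_code> codes_;
  std::vector<std::size_t> offsets_;
};
```

The trade-off is visible in the sketch: there are only two allocations regardless of partitioning depth, but `Less` and `Hash` now have to be maintained alongside the container.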

lidavidm · 2021-03-10T22:06:05Z

On a quick benchmark, this is a two orders of magnitude speedup! Though I'm also going to test some other scenarios + add unit tests.

Before:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
GetAllFragments        83434318 ns     83433416 ns           17
GetFilteredFragments  502654017 ns    502651704 ns            3

After:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
GetAllFragments        72942933 ns     72941992 ns           19
GetFilteredFragments    5630666 ns      5630548 ns          250

lidavidm · 2021-03-11T00:35:08Z

Before:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
GetAllFragments                    77918155 ns     77914414 ns           18
GetFilteredFragments/single_dir   504830505 ns    504788164 ns            3
GetFilteredFragments/single_file 1135530755 ns   1135494963 ns            1
GetFilteredFragments/range        509878962 ns    509855374 ns            3

After:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
GetAllFragments                    67112654 ns     67106684 ns           21
GetFilteredFragments/single_dir     5564513 ns      5564443 ns          246
GetFilteredFragments/single_file   12560697 ns     12560479 ns          111
GetFilteredFragments/range         71451702 ns     71449949 ns           19

bkietz · 2021-03-11T13:31:15Z

cpp/src/arrow/dataset/file_benchmark.cc

+// Drill down to a subtree.
+BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90)));


Suggested change

-// Drill down to a subtree.
-BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90)));
+// Drill down to a subtree.
+BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90)));
+// Drill down, but not to a subtree.
+BENCHMARK_CAPTURE(GetFilteredFragments, multi_dir, equal(field_ref("b"), literal(90)));
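The difference between the two cases can be illustrated with a toy model. This is a sketch only: it assumes a two-level layout with key "a" on the outer directories and key "b" on the inner ones (the benchmark's actual dataset layout is not shown in this conversation), and `Node`, `EqualityFilter`, `Stats`, `MakeDataset`, and `Scan` are hypothetical names. It needs C++17 because `Node` holds a `std::vector<Node>`.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a directory or file carrying one partition
// expression of the form key == value. Files use an empty key.
struct Node {
  std::string key;
  int value;
  bool is_file;
  std::vector<Node> children;
};

struct EqualityFilter {
  std::string key;
  int value;
};

struct Stats {
  int visited = 0;     // nodes whose expression was checked
  int files_kept = 0;  // files that survived the filter
};

void Visit(const Node& node, const EqualityFilter& filter, Stats* stats) {
  ++stats->visited;
  // Prune: this node contradicts the filter, so nothing below it can match.
  if (node.key == filter.key && node.value != filter.value) return;
  if (node.is_file) {
    ++stats->files_kept;
    return;
  }
  for (const auto& child : node.children) Visit(child, filter, stats);
}

// Builds num_a outer directories, each with num_b inner directories that
// hold a single file.
std::vector<Node> MakeDataset(int num_a, int num_b) {
  std::vector<Node> roots;
  for (int a = 0; a < num_a; ++a) {
    Node dir_a{"a", a, false, {}};
    for (int b = 0; b < num_b; ++b) {
      Node dir_b{"b", b, false, {}};
      dir_b.children.push_back(Node{"", 0, true, {}});
      dir_a.children.push_back(std::move(dir_b));
    }
    roots.push_back(std::move(dir_a));
  }
  return roots;
}

Stats Scan(const std::vector<Node>& roots, const EqualityFilter& filter) {
  Stats stats;
  for (const auto& root : roots) Visit(root, filter, &stats);
  return stats;
}
```

In this model a filter on "a" discards whole outer directories after one check each, while a filter on "b" still has to look inside every outer directory before it can discard anything.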

Before:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
GetAllFragments                    77956704 ns     77958109 ns           18
GetFilteredFragments/single_dir   504208733 ns    504213184 ns            3
GetFilteredFragments/multi_dir    501361456 ns    501370822 ns            3
GetFilteredFragments/single_file 1135849271 ns   1135862409 ns            1
GetFilteredFragments/range        507525698 ns    507533638 ns            3

After:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
GetAllFragments                     4821049 ns      4821404 ns          264
GetFilteredFragments/single_dir     5360703 ns      5361104 ns          252
GetFilteredFragments/multi_dir    406617239 ns    406644158 ns            3
GetFilteredFragments/single_file   11986866 ns     11987648 ns          116
GetFilteredFragments/range         68938442 ns     68942840 ns           20

lidavidm · 2021-03-11T16:13:06Z

@ursabot please benchmark

ursabot · 2021-03-11T16:13:12Z

lidavidm · 2021-03-12T14:01:04Z

MacOS tests are fixed now that the sorting on subtrees is fully defined.
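The comment above does not spell out the fix. One common cause of platform-dependent results is that `std::sort` leaves the relative order of equivalent elements unspecified, so different standard libraries may order ties differently. The sketch below shows the general remedy of adding a tie-breaker so that no two distinct elements compare equivalent; `Encoded`, `fragment_index`, and both comparators are hypothetical and not taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record: an encoded path plus the index of the fragment it
// was derived from.
struct Encoded {
  std::u32string path;
  int fragment_index;
};

// Under-specified: two records with the same path are equivalent, so
// std::sort may leave them in either order.
bool ByPathOnly(const Encoded& l, const Encoded& r) { return l.path < r.path; }

// Fully defined: ties on path are broken by fragment_index, so the sorted
// order is the same on every platform as long as indices are unique.
bool ByPathThenIndex(const Encoded& l, const Encoded& r) {
  return std::tie(l.path, l.fragment_index) <
         std::tie(r.path, r.fragment_index);
}
```

With `ByPathThenIndex`, sorting the same input always yields the same sequence, which is what a test comparing against a fixed expected order needs.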

ursabot · 2021-03-12T14:01:23Z

Benchmark runs are scheduled for baseline = 2d140c3 and contender = 2aee5b6. Results will be available as each benchmark for each run completes:
[Finished] ursa-dgx1: https://conbench.ursa.dev/compare/runs/95a13932-219e-476f-8e59-e294dcbd043a...faa3101a-5f6a-4294-a557-6603e3254e97/
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/05118d70-8b4e-4b89-98ab-b7a429795df7...25503000-f5ab-4a11-9c2b-067af3a7dbd7/
[Finished] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/3424e2a0-2ae2-42a3-a0e6-b96d4fff1cf1...db4648f5-2ea4-4be2-87f4-7115b69eed5e/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/896988eb-8712-40fd-a2d9-f3f28e2e57ee...a964c74b-85a2-4dd8-b4f9-c14acd2d7001/

bkietz · 2021-03-12T15:54:31Z

+1, merging

…taset

TODO:

- [ ] Add benchmarks to be sure this provides an advantage
- [ ] Move Forest to dataset_internal.h or so, since it is not used anywhere else
- [ ] Unit test SubtreeImpl, add more comments

Closes apache#9670 from bkietz/8658-Implement-subtree-pruning

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

bkietz requested a review from lidavidm March 10, 2021 15:50

github-actions bot added the Component: C++ label Mar 10, 2021

lidavidm approved these changes Mar 10, 2021

ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDa…

6baf5f4

…taset

lidavidm force-pushed the 8658-Implement-subtree-pruning branch from 4a1625c to b13b060 Compare March 11, 2021 00:33

lidavidm force-pushed the 8658-Implement-subtree-pruning branch 5 times, most recently from bbcc910 to f34b617 Compare March 11, 2021 03:30

lidavidm changed the title ~~ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDataset WIP~~ ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDataset Mar 11, 2021

lidavidm force-pushed the 8658-Implement-subtree-pruning branch from f34b617 to ef88b83 Compare March 11, 2021 12:42

lidavidm added 4 commits March 11, 2021 08:38

ARROW-8658: [C++][Dataset] Move Forest to arrow/dataset

9d3d353

ARROW-8658: [C++][Dataset] Add GetFragments benchmark

ce16c1a

ARROW-8658: [C++][Dataset] Add SubtreeImpl tests

da07078

ARROW-8658: [C++][Dataset] Make Forest internal

b58b6f0

lidavidm force-pushed the 8658-Implement-subtree-pruning branch from ef88b83 to b58b6f0 Compare March 11, 2021 13:52

ARROW-8658: [C++][Dataset] Fully define sort of encoded subtrees

2aee5b6

bkietz closed this in 26fc751 Mar 12, 2021

bkietz deleted the 8658-Implement-subtree-pruning branch March 12, 2021 16:00

asfimport mentioned this pull request Mar 12, 2021

[C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments #24819

Closed

whyzdev mentioned this pull request Mar 7, 2023

[C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters #31174

Open
