ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDataset #9670

Closed
wants to merge 6 commits into from


bkietz
Member

@bkietz bkietz commented Mar 10, 2021

TODO:

  • Add benchmarks to be sure this provides an advantage
  • Move Forest to dataset_internal.h or so, since it is not used anywhere else
  • Unit test SubtreeImpl, add more comments

@bkietz bkietz requested a review from lidavidm March 10, 2021 15:50
Member

@lidavidm lidavidm left a comment

This is super-slick! LGTM once you've addressed those todos. I left a nit about something that's not obvious at first read.

// /num=1/al=be/
// /num=1/al=be/dat.par
struct SubtreeImpl {
using expression_code = char32_t;
Member

nit: any particular reason for char32_t over say uint32_t or size_t (since it is a vector index)? And in any case, it doesn't match the static_cast below on line 176.

Member

I guess it's to work with std::basic_string, but then I'm curious why you're favoring that over std::vector.

Member

Ah - to get the lexicographic sort down below. I think future readers might appreciate a note about that since it's not immediately obvious + is an unusual choice of types otherwise.

Member Author

Also: basic_string has the short string optimization in most standard libraries (so a string with as many as 4 expression_codes will probably be stored without allocation) and supports hashing out of the box.

Member Author

@bkietz bkietz Mar 10, 2021

Ideally this hack would be replaced by a single buffer of expression codes indexed into by SubtreeImpl::Encoded::partition_expression and friends. This would guarantee that even "deep" partitionings do not require lots of small buffers, at the cost of complicating hashing and comparison.
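
The reasons given in this thread for encoding paths as std::basic_string<char32_t> (lexicographic sorting, hashing out of the box) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's SubtreeImpl: ExpressionCoder, CodePath, Encode, IsAncestor, SortedPaths and CountDistinct are hypothetical names, and partition expressions are stood in for by plain strings such as "num=1".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical names for illustration only; not the PR's SubtreeImpl.
using expression_code = char32_t;
using CodePath = std::basic_string<expression_code>;  // same type as std::u32string

// Maps each distinct partition expression (a plain string here, e.g. "num=1")
// to a small integer code, assigned in order of first appearance.
class ExpressionCoder {
 public:
  expression_code GetOrInsert(const std::string& expr) {
    auto it = codes_.find(expr);
    if (it != codes_.end()) return it->second;
    assert(codes_.size() < 0xFFFFFFFFu && "ran out of expression codes");
    auto code = static_cast<expression_code>(codes_.size());
    codes_.emplace(expr, code);
    return code;
  }

 private:
  std::unordered_map<std::string, expression_code> codes_;
};

// Encodes a path such as {"num=1", "al=be"} as a string of expression codes.
CodePath Encode(ExpressionCoder& coder, const std::vector<std::string>& segments) {
  CodePath out;
  for (const auto& segment : segments) out.push_back(coder.GetOrInsert(segment));
  return out;
}

// True if `ancestor` is a proper prefix of `descendant`.
bool IsAncestor(const CodePath& ancestor, const CodePath& descendant) {
  return ancestor.size() < descendant.size() &&
         descendant.compare(0, ancestor.size(), ancestor) == 0;
}

// basic_string's operator< is lexicographic, so a plain std::sort places each
// directory before all of its descendants.
std::vector<CodePath> SortedPaths(std::vector<CodePath> paths) {
  std::sort(paths.begin(), paths.end());
  return paths;
}

// std::hash is specialized for std::u32string, so no custom hasher is needed.
std::size_t CountDistinct(const std::vector<CodePath>& paths) {
  std::unordered_set<CodePath> seen(paths.begin(), paths.end());
  return seen.size();
}
```

Because a prefix compares less than any longer string that starts with it, sorting groups every directory with its descendants in one contiguous run, with the directory first. A std::vector<uint32_t> would sort the same way, but would need a hand-written hasher.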

@lidavidm
Member

On a quick benchmark, this is a two orders of magnitude speedup! Though I'm also going to test some other scenarios + add unit tests.

Before:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
GetAllFragments        83434318 ns     83433416 ns           17
GetFilteredFragments  502654017 ns    502651704 ns            3

After:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
GetAllFragments        72942933 ns     72941992 ns           19
GetFilteredFragments    5630666 ns      5630548 ns          250

@lidavidm lidavidm force-pushed the 8658-Implement-subtree-pruning branch from 4a1625c to b13b060 Compare March 11, 2021 00:33
@lidavidm
Member

Before:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
GetAllFragments                    77918155 ns     77914414 ns           18
GetFilteredFragments/single_dir   504830505 ns    504788164 ns            3
GetFilteredFragments/single_file 1135530755 ns   1135494963 ns            1
GetFilteredFragments/range        509878962 ns    509855374 ns            3

After:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
GetAllFragments                    67112654 ns     67106684 ns           21
GetFilteredFragments/single_dir     5564513 ns      5564443 ns          246
GetFilteredFragments/single_file   12560697 ns     12560479 ns          111
GetFilteredFragments/range         71451702 ns     71449949 ns           19

@lidavidm lidavidm force-pushed the 8658-Implement-subtree-pruning branch 5 times, most recently from bbcc910 to f34b617 Compare March 11, 2021 03:30
@lidavidm lidavidm changed the title ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDataset WIP ARROW-8658: [C++][Dataset] Implement subtree pruning for FileSystemDataset Mar 11, 2021
@lidavidm lidavidm force-pushed the 8658-Implement-subtree-pruning branch from f34b617 to ef88b83 Compare March 11, 2021 12:42
Comment on lines +75 to +76
// Drill down to a subtree.
BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90)));
Member Author

Suggested change
- // Drill down to a subtree.
- BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90)));
+ // Drill down to a subtree.
+ BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("a"), literal(90)));
+ // Drill down but not to a subtree
+ BENCHMARK_CAPTURE(GetFilteredFragments, single_dir, equal(field_ref("b"), literal(90)));

Member

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
Before:
GetAllFragments                    77956704 ns     77958109 ns           18
GetFilteredFragments/single_dir   504208733 ns    504213184 ns            3
GetFilteredFragments/multi_dir    501361456 ns    501370822 ns            3
GetFilteredFragments/single_file 1135849271 ns   1135862409 ns            1
GetFilteredFragments/range        507525698 ns    507533638 ns            3
After:
GetAllFragments                     4821049 ns      4821404 ns          264
GetFilteredFragments/single_dir     5360703 ns      5361104 ns          252
GetFilteredFragments/multi_dir    406617239 ns    406644158 ns            3
GetFilteredFragments/single_file   11986866 ns     11987648 ns          116
GetFilteredFragments/range         68938442 ns     68942840 ns           20
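
The gap between single_dir and multi_dir in these numbers is consistent with how subtree pruning behaves: a filter on the outermost partition field can rule out whole directories at the top level, while a filter on an inner field still has to descend into every outer directory. A minimal sketch of that effect under a simplified model (Node, Filter, MakeForest and CountMatches are hypothetical and do not reflect Arrow's dataset API; each node carries a single `key == value` guarantee, and std::vector of an incomplete type assumes C++17):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model: every directory guarantees `key == value`
// for everything beneath it. Nodes without children stand in for fragments.
struct Node {
  std::string key;
  int value;
  std::vector<Node> children;
};

// An equality filter, e.g. a == 90.
struct Filter {
  std::string key;
  int value;
};

// Counts fragments that may satisfy the filter. `*visited` is incremented for
// every node inspected, to show how much of the tree was touched.
std::size_t CountMatches(const Node& node, const Filter& filter, std::size_t* visited) {
  assert(visited != nullptr);
  ++*visited;
  // The guarantee contradicts the filter: skip this node and its whole subtree.
  if (node.key == filter.key && node.value != filter.value) return 0;
  if (node.children.empty()) return 1;
  std::size_t matches = 0;
  for (const auto& child : node.children) {
    matches += CountMatches(child, filter, visited);
  }
  return matches;
}

std::size_t CountMatches(const std::vector<Node>& roots, const Filter& filter,
                         std::size_t* visited) {
  std::size_t matches = 0;
  for (const auto& root : roots) matches += CountMatches(root, filter, visited);
  return matches;
}

// Builds a two-level layout /a=<i>/b=<j>/ for i in [0, num_a), j in [0, num_b).
std::vector<Node> MakeForest(int num_a, int num_b) {
  std::vector<Node> roots;
  for (int i = 0; i < num_a; ++i) {
    Node dir{"a", i, {}};
    for (int j = 0; j < num_b; ++j) dir.children.push_back(Node{"b", j, {}});
    roots.push_back(std::move(dir));
  }
  return roots;
}
```

With a 10 by 10 layout, filtering on `a` inspects the 10 top-level directories plus the 10 children of the one that matches, whereas filtering on `b` inspects all 110 nodes even though both filters select 10 fragments.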

@lidavidm lidavidm force-pushed the 8658-Implement-subtree-pruning branch from ef88b83 to b58b6f0 Compare March 11, 2021 13:52
@lidavidm
Member

@ursabot please benchmark

@ursabot

ursabot commented Mar 11, 2021

Benchmark runs are scheduled for baseline = 2d140c3 and contender = b58b6f0. Results will be available as each benchmark for each run completes:
[Finished] ursa-dgx1: https://conbench.ursa.dev/compare/runs/95a13932-219e-476f-8e59-e294dcbd043a...cf33179f-4a93-49f0-a69b-75bc00d529af/
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/05118d70-8b4e-4b89-98ab-b7a429795df7...abae6596-e181-4a39-8047-7920eb1e2d07/
[Finished] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/3424e2a0-2ae2-42a3-a0e6-b96d4fff1cf1...fd7d3c23-3dfd-417d-977a-11f5ff7990c1/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/896988eb-8712-40fd-a2d9-f3f28e2e57ee...469fa6ab-e8de-4e06-b0c0-b887dd79ca77/
If you have Ursa Computing Inc's Buildkite access, you can also view benchmark runs logs using these links:
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks/builds/1208
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks-ec2-t3-large/builds/51
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks-ec2-t3-xlarge/builds/53
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks-ec2-t3-xlarge/builds/52
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks-dgx/builds/702
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks/builds/1209
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks-dgx/builds/703
[Finished] https://buildkite.com/ursa-computing/arrow-run-benchmarks-ec2-t3-large/builds/52

@lidavidm
Member

MacOS tests are fixed now that the sorting on subtrees is fully defined.
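
One common way a sort ends up platform-dependent is a comparator that leaves some elements tied: std::sort is not stable, so different standard library implementations may order tied elements differently. Whether that was the exact cause here is an assumption. The sketch below only illustrates the general fix of breaking ties so that the order is fully defined; Entry, its fields, TotalLess and SortFullyDefined are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical record: `depth` is the primary sort key, `index` is a unique
// identifier used only to break ties.
struct Entry {
  int depth;
  std::size_t index;
};

// Strict total order: no two distinct entries compare as equivalent, so the
// result of std::sort does not depend on the library's sorting algorithm.
bool TotalLess(const Entry& l, const Entry& r) {
  return std::tie(l.depth, l.index) < std::tie(r.depth, r.index);
}

std::vector<Entry> SortFullyDefined(std::vector<Entry> entries) {
  std::sort(entries.begin(), entries.end(), TotalLess);
  assert(std::is_sorted(entries.begin(), entries.end(), TotalLess));
  return entries;
}
```

Comparing on `depth` alone would leave entries of equal depth in an unspecified relative order; adding the unique `index` as a second key removes that freedom.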

@ursabot

ursabot commented Mar 12, 2021

@bkietz
Member Author

bkietz commented Mar 12, 2021

+1, merging

@bkietz bkietz closed this in 26fc751 Mar 12, 2021
@bkietz bkietz deleted the 8658-Implement-subtree-pruning branch March 12, 2021 16:00
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…taset

TODO:
- [ ] Add benchmarks to be sure this provides an advantage
- [ ] Move Forest to dataset_internal.h or so, since it is not used anywhere else
- [ ] Unit test SubtreeImpl, add more comments

Closes apache#9670 from bkietz/8658-Implement-subtree-pruning

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
3 participants