
ARROW-16716: [C++] Add Benchmarks for ProjectNode #13314

Merged
merged 10 commits into apache:master from iChauster:projection_benchmark, Jun 10, 2022

Conversation

@iChauster iChauster commented Jun 3, 2022

Create project_benchmark.cc and add to CMakeLists

If anyone can shed some light or guidance on the comments I made below, that would be extremely helpful!

@github-actions
github-actions bot commented Jun 3, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@iChauster
Contributor Author

iChauster commented Jun 3, 2022

Two other general questions:

  • How do we measure the "data rate", i.e. assess performance in bytes/sec, given that we currently seem to measure in rows and batches?
  • EDIT: I ended up using a DenseThreadRange; let me know if this is the correct approach. I saw some examples of controlling the threads used in ExecuteScalarExpressionOverhead. How can we control threads per core, so we can have benchmarks for 1 thread and for 1 thread per core?

@iChauster
Contributor Author

Hi @westonpace, following up on our JIRA conversation with @icexelloss, here is the open PR for Projection Benchmarks. I think it is ready for review (however, I cannot request it as a first time contributor).

Member

@westonpace westonpace left a comment


Thank you so much for creating this. I didn't expect to see much new information for a basic node but, as usual, I was wrong.

I think it's interesting to compare the results here with the results from the scalar expression overhead. Also, I am interested in letting the engine manage the threading instead of google benchmark to make sure we are doing that well.

In theory the performance should be more or less the same, but I am getting considerably worse results. On my system I can maybe get up to 2/3 billion rows per second, but in the baseline ExecuteScalarExpressionOverhead I am getting up to 8 billion rows per second.

I agree the culprit seems to boil down to the source/sink nodes. Can you try setting up a harness to run the node in isolation? Also, if you do this, can you make it a separate benchmark instead of replacing what you have already? I think both would be useful.

Review comments on cpp/src/arrow/compute/exec/CMakeLists.txt and cpp/src/arrow/compute/exec/project_benchmark.cc (resolved; some outdated).
Comment on lines 40 to 41:

```cpp
->DenseThreadRange(1, std::thread::hardware_concurrency(),
                   std::thread::hardware_concurrency())
```
Member


I think it might be more interesting to actually not set threading here with google benchmark, but instead make sure the compute engine is configured to run in parallel on the CPU thread pool (which will use std::thread::hardware_concurrency threads).

Contributor Author


Is this something we set explicitly within the execution context?

Member


If you do this...

```cpp
ExecContext ctx(default_memory_pool(), arrow::internal::GetCpuThreadPool());
```

...then it will use one thread per core (i.e. std::thread::hardware_concurrency). You can create your own thread pool if you need finer-grained control over the number of threads (but I don't think that is necessary here).

Further review comments on cpp/src/arrow/compute/exec/project_benchmark.cc (outdated; resolved).
@iChauster
Contributor Author

Hi @westonpace, I was able to debug everything -- InputFinished references output_[0], which was null for our ProjectNode, so I added a dummy sink to take care of that.

I did observe a speedup, but it does not exactly match my outputs for the ExpressionOverhead testing I have on my laptop. Let me know if I missed something in my code.

@iChauster iChauster requested a review from westonpace June 7, 2022 14:26
@westonpace
Member

> I did observe a speedup, but it does not exactly match my outputs for the ExpressionOverhead testing I have on my laptop. Let me know if I missed something in my code.

So I guess the problem is that if we call InputReceived manually then we do not get any parallelism (that is today handled by the source node). So I think we will need to do that manually as well.

We could manually schedule in the benchmark by creating a new TaskScheduler. Here is a rough example that could be cleaned up. It's a bit complex, but we could start to share this logic if we are going to test all the nodes.

@iChauster
Contributor Author

iChauster commented Jun 7, 2022

> > I did observe a speedup, but it does not exactly match my outputs for the ExpressionOverhead testing I have on my laptop. Let me know if I missed something in my code.
>
> So I guess the problem is that if we call InputReceived manually then we do not get any parallelism (that is today handled by the source node). So I think we will need to do that manually as well.
>
> We could manually schedule in the benchmark by creating a new TaskScheduler. Here is a rough example that could be cleaned up. It's a bit complex, but we could start to share this logic if we are going to test all the nodes.

Yes, I had an inkling it had to do with the batch delivery, especially with some of the metrics being slower than the source + projection + sink versions.

By the way, I found that the TaskScheduler is actually less performant than the for-loop implementation for some reason; does it compete with ExpressionOverhead on your setup?

Member

@westonpace westonpace left a comment


> By the way, I found that the TaskScheduler is actually less performant than the for-loop implementation for some reason; does it compete with ExpressionOverhead on your setup?

It does not compete with expression overhead on my system, but it is pretty similar to projection overhead. Either way, I don't think we need to understand all the performance issues to finish the benchmark. Can you add the task scheduler implementation to this PR? Then we can probably merge it without too much more investigation.

Review comment on cpp/src/arrow/compute/exec/project_benchmark.cc (outdated; resolved).
@iChauster iChauster requested a review from westonpace June 8, 2022 14:04
@iChauster
Contributor Author

@westonpace, excellent. I had some very minor cleanup on your task scheduler implementation, but I think this is ready to be merged.

@iChauster
Contributor Author

iChauster commented Jun 8, 2022

If you could also point me in the right direction for #13302, that would be extremely helpful -- thank you so much Weston!

EDIT: Got some help and was able to make substantial progress!

Member

@westonpace westonpace left a comment


Thanks for adding this. I think it will be very helpful for understanding our scheduling. Just a few minor style issues and we can merge this.

Review comments on cpp/src/arrow/compute/exec/project_benchmark.cc (outdated; resolved).
@iChauster
Contributor Author

Alright, if that is all, I think this is ready to go; benchmarks look good on my side.

Thanks @westonpace for all the help!

@iChauster iChauster requested a review from westonpace June 9, 2022 14:08
Member

@westonpace westonpace left a comment


Thanks for the cleanup. This looks good now. There were a few legitimate MSVC warnings (MSVC is rather pedantic about 64->32 bit conversions). I fixed them real quick and will merge on green.

CC @save-buffer this might be an interesting benchmark to take a look at when we work on scheduling. Our current scheduling appears to be introducing contention which has compounding effects with the contention in the kernel/function layer (just a theory).

@save-buffer
Contributor

Interesting, yes, I'll definitely take a look at this benchmark. I'd also be interested in comparing this to just using OpenMP; I may draw that up at some point as well. As of right now, I think ThreadIndexer may be a prime suspect for at least some of the overhead.

@iChauster
Contributor Author

iChauster commented Jun 10, 2022

Excellent! Once this is merged, I will open up another PR for benchmarks like these on filter_node. Thanks again!

@westonpace westonpace merged commit c563a00 into apache:master Jun 10, 2022
@iChauster iChauster deleted the projection_benchmark branch June 10, 2022 18:07
@iChauster
Contributor Author


```
Benchmark                                                                    Time      CPU  Iterations  UserCounters...
ProjectionOverheadIsolated/complex_expression/batch_size:1000/real_time      3295929 ns 507918 ns 229   batches_per_second=303.405k/s rows_per_second=303.405M/s
ProjectionOverheadIsolated/complex_expression/batch_size:10000/real_time     704499 ns  88055 ns  1010  batches_per_second=141.945k/s rows_per_second=1.41945G/s
ProjectionOverheadIsolated/complex_expression/batch_size:100000/real_time    795424 ns  80258 ns  1005  batches_per_second=12.5719k/s rows_per_second=1.25719G/s
ProjectionOverheadIsolated/complex_expression/batch_size:1000000/real_time   2359050 ns 61147 ns  341   batches_per_second=423.899/s rows_per_second=423.899M/s
ProjectionOverheadIsolated/simple_expression/batch_size:1000/real_time       1801604 ns 535378 ns 383   batches_per_second=555.061k/s rows_per_second=555.061M/s
ProjectionOverheadIsolated/simple_expression/batch_size:10000/real_time      988456 ns  164531 ns 723   batches_per_second=101.168k/s rows_per_second=1011.68M/s
ProjectionOverheadIsolated/simple_expression/batch_size:100000/real_time     914617 ns  91048 ns  772   batches_per_second=10.9335k/s rows_per_second=1093.35M/s
ProjectionOverheadIsolated/simple_expression/batch_size:1000000/real_time    1344189 ns 68767 ns  541   batches_per_second=743.943/s rows_per_second=743.943M/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:1000/real_time    959498 ns  181219 ns 733   batches_per_second=1042.21k/s rows_per_second=1042.21M/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:10000/real_time   112092 ns  36202 ns  6196  batches_per_second=892.123k/s rows_per_second=8.92123G/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:100000/real_time  29696 ns   21104 ns  23557 batches_per_second=336.748k/s rows_per_second=33.6748G/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:1000000/real_time 13717 ns   7065 ns   50857 batches_per_second=72.9028k/s rows_per_second=72.9028G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:1000/real_time     674057 ns  121116 ns 1103  batches_per_second=1.48355M/s rows_per_second=1.48355G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:10000/real_time    101333 ns  35459 ns  8739  batches_per_second=986.846k/s rows_per_second=9.86846G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:100000/real_time   42491 ns   32605 ns  16796 batches_per_second=235.343k/s rows_per_second=23.5343G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:1000000/real_time  13267 ns   7679 ns   49470 batches_per_second=75.3763k/s rows_per_second=75.3763G/s
ProjectionOverhead/complex_expression/batch_size:1000/real_time              9110565 ns 681856 ns 102   batches_per_second=109.763k/s rows_per_second=109.763M/s
ProjectionOverhead/complex_expression/batch_size:10000/real_time             1341879 ns 114700 ns 559   batches_per_second=74.5224k/s rows_per_second=745.224M/s
ProjectionOverhead/complex_expression/batch_size:100000/real_time            969770 ns  74182 ns  715   batches_per_second=10.3117k/s rows_per_second=1031.17M/s
ProjectionOverhead/complex_expression/batch_size:1000000/real_time           4024855 ns 103623 ns 262   batches_per_second=248.456/s rows_per_second=248.456M/s
ProjectionOverhead/simple_expression/batch_size:1000/real_time               6629139 ns 626455 ns 130   batches_per_second=150.849k/s rows_per_second=150.849M/s
ProjectionOverhead/simple_expression/batch_size:10000/real_time              1078706 ns 136185 ns 666   batches_per_second=92.7037k/s rows_per_second=927.037M/s
ProjectionOverhead/simple_expression/batch_size:100000/real_time             978334 ns  78697 ns  718   batches_per_second=10.2215k/s rows_per_second=1022.15M/s
ProjectionOverhead/simple_expression/batch_size:1000000/real_time            2060913 ns 106348 ns 585   batches_per_second=485.222/s rows_per_second=485.222M/s
ProjectionOverhead/zero_copy_expression/batch_size:1000/real_time            5082465 ns 174192 ns 100   batches_per_second=196.755k/s rows_per_second=196.755M/s
ProjectionOverhead/zero_copy_expression/batch_size:10000/real_time           771846 ns  43574 ns  1122  batches_per_second=129.56k/s rows_per_second=1.2956G/s
ProjectionOverhead/zero_copy_expression/batch_size:100000/real_time          118740 ns  23673 ns  5832  batches_per_second=84.2174k/s rows_per_second=8.42174G/s
ProjectionOverhead/zero_copy_expression/batch_size:1000000/real_time         40628 ns   17786 ns  16876 batches_per_second=24.6135k/s rows_per_second=24.6135G/s
ProjectionOverhead/ref_only_expression/batch_size:1000/real_time             6957773 ns 195333 ns 117   batches_per_second=143.724k/s rows_per_second=143.724M/s
ProjectionOverhead/ref_only_expression/batch_size:10000/real_time            733865 ns  32280 ns  992   batches_per_second=136.265k/s rows_per_second=1.36265G/s
ProjectionOverhead/ref_only_expression/batch_size:100000/real_time           104410 ns  20485 ns  6822  batches_per_second=95.7764k/s rows_per_second=9.57764G/s
ProjectionOverhead/ref_only_expression/batch_size:1000000/real_time          35838 ns   15927 ns  18883 batches_per_second=27.9036k/s rows_per_second=27.9036G/s
```
