
ARROW-16716: [C++] Add Benchmarks for ProjectNode #13314

Merged
merged 10 commits into apache:master from iChauster:projection_benchmark, Jun 10, 2022

Conversation

@iChauster iChauster commented Jun 3, 2022

Create project_benchmark.cc and add to CMakeLists

If anyone can shed some light or guidance on the comments I made below, that would be extremely helpful!

@github-actions
github-actions bot commented Jun 3, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@iChauster
Contributor Author

iChauster commented Jun 3, 2022

Two other general questions:

  • How do we measure the "data rate", i.e. assess performance in bytes/sec, given that we currently seem to measure in rows and batches?
  • EDIT: I ended up using a DenseThreadRange; let me know if this is the correct approach. I saw some examples of controlling the threads used in ExecuteScalarExpressionOverhead. How can we control threads per core, so we can have benchmarks for 1 thread and for 1 thread per core?

@iChauster
Contributor Author

Hi @westonpace, following up on our JIRA conversation with @icexelloss, here is the open PR for Projection Benchmarks. I think it is ready for review (however, I cannot request it as a first time contributor).

Member

@westonpace westonpace left a comment


Thank you so much for creating this. I didn't expect to see much new information for a basic node but, as usual, I was wrong.

I think it's interesting to compare the results here with the results from the scalar expression overhead. Also, I am interested in letting the engine manage the threading instead of google benchmark to make sure we are doing that well.

In theory the performance should be more or less the same, but I am getting considerably worse results. On my system I can maybe get up to 2/3 billion rows per second, but in the baseline ExecuteScalarExpressionOverhead I am getting up to 8 billion rows per second.

I agree the culprit seems to boil down to the source/sink nodes. Can you try setting up a harness to run the node in isolation? Also, if you do this, can you make it a separate benchmark instead of replacing what you have already? I think both would be useful.

Review comments on cpp/src/arrow/compute/exec/CMakeLists.txt and cpp/src/arrow/compute/exec/project_benchmark.cc (resolved; some outdated).
Comment on lines 40 to 41:

```cpp
->DenseThreadRange(1, std::thread::hardware_concurrency(),
                   std::thread::hardware_concurrency())
```
Member


I think it might be more interesting to actually not set threading here with google benchmark, but instead make sure the compute engine is configured to run in parallel on the CPU thread pool (which will use std::thread::hardware_concurrency threads).

Contributor Author


Is this something we set explicitly within the execution context?

Member


If you do this...

```cpp
ExecContext ctx(default_memory_pool(), arrow::internal::GetCpuThreadPool());
```

...then it will use one thread per core (i.e. std::thread::hardware_concurrency). You can create your own thread pool if you need finer-grained control over the number of threads (but I don't think that is necessary here).

Further review comments on cpp/src/arrow/compute/exec/project_benchmark.cc (outdated; resolved).
@iChauster
Contributor Author

Hi @westonpace, I was able to debug everything -- InputFinished references output_[0], which was null for our ProjectNode, so I added a dummy sink to take care of that.

I did observe a speedup, but it does not exactly match my outputs for the ExpressionOverhead testing I have on my laptop. Let me know if I missed something in my code.

@iChauster iChauster requested a review from westonpace June 7, 2022 14:26
@westonpace
Member

> I did observe a speedup, but it does not exactly match my outputs for the ExpressionOverhead testing I have on my laptop. Let me know if I missed something in my code.

So I guess the problem is that if we call InputReceived manually then we do not get any parallelism (that is today handled by the source node). So I think we will need to do that manually as well.

We could manually schedule in the benchmark by creating a new TaskScheduler. Here is a rough example that could be cleaned up. It's a bit complex, but we could start to share this logic if we are going to test all the nodes.

@iChauster
Contributor Author

iChauster commented Jun 7, 2022

> > I did observe a speedup, but it does not exactly match my outputs for the ExpressionOverhead testing I have on my laptop. Let me know if I missed something in my code.
>
> So I guess the problem is that if we call InputReceived manually then we do not get any parallelism (that is today handled by the source node). So I think we will need to do that manually as well.
>
> We could manually schedule in the benchmark by creating a new TaskScheduler. Here is a rough example that could be cleaned up. It's a bit complex, but we could start to share this logic if we are going to test all the nodes.

Yes, I had an inkling it had to do with the batch delivery, especially with some of the metrics being slower than the source + projection + sink versions.

By the way, I found that the TaskScheduler is actually less performant than the for-loop implementation for some reason; does it compete with ExpressionOverhead on your setup?

Member

@westonpace westonpace left a comment


> By the way, I found that the TaskScheduler is actually less performant than the for-loop implementation for some reason; does it compete with ExpressionOverhead on your setup?

It does not compete with expression overhead on my system, but it is pretty similar to projection overhead. Either way, I don't think we need to understand all the performance issues to finish the benchmark. Can you add the task scheduler implementation to this PR? Then we can probably merge it without too much more investigation.

Review comment on cpp/src/arrow/compute/exec/project_benchmark.cc (outdated; resolved).
@iChauster iChauster requested a review from westonpace June 8, 2022 14:04
@iChauster
Contributor Author

@westonpace, excellent. I had some very minor cleanup on your task scheduler implementation, but I think this is ready to be merged.

@iChauster
Contributor Author

iChauster commented Jun 8, 2022

If you could also point me in the right direction for #13302, that would be extremely helpful -- thank you so much Weston!

EDIT: Got some help and was able to make substantial progress!

Member

@westonpace westonpace left a comment


Thanks for adding this. I think it will be very helpful for understanding our scheduling. Just a few minor style issues and we can merge this.

Review comments on cpp/src/arrow/compute/exec/project_benchmark.cc (outdated; resolved).
@iChauster
Contributor Author

Alright, if that is all, I think this is ready to go; benchmarks look good on my side.

Thanks @westonpace for all the help!

@iChauster iChauster requested a review from westonpace June 9, 2022 14:08
Member

@westonpace westonpace left a comment


Thanks for the cleanup. This looks good now. There were a few legitimate MSVC warnings (MSVC is rather pedantic about 64->32 bit conversions). I fixed them real quick and will merge on green.

CC @save-buffer this might be an interesting benchmark to take a look at when we work on scheduling. Our current scheduling appears to be introducing contention which has compounding effects with the contention in the kernel/function layer (just a theory).

@save-buffer
Contributor

Interesting, yes, I'll definitely take a look at this benchmark. I'd also be interested in comparing this to just using OpenMP; I may draw that up at some point as well. As of right now, I think ThreadIndexer may be a prime suspect for at least some of the overhead.

@iChauster
Contributor Author

iChauster commented Jun 10, 2022

Excellent! Once this is merged, I will open up another PR for benchmarks like these on filter_node. Thanks again!

@westonpace westonpace merged commit c563a00 into apache:master Jun 10, 2022
@iChauster iChauster deleted the projection_benchmark branch June 10, 2022 18:07
@iChauster
Contributor Author


```
Benchmark                                                                    Time      CPU  Iterations  UserCounters...
ProjectionOverheadIsolated/complex_expression/batch_size:1000/real_time      3295929 ns 507918 ns 229   batches_per_second=303.405k/s rows_per_second=303.405M/s
ProjectionOverheadIsolated/complex_expression/batch_size:10000/real_time     704499 ns  88055 ns  1010  batches_per_second=141.945k/s rows_per_second=1.41945G/s
ProjectionOverheadIsolated/complex_expression/batch_size:100000/real_time    795424 ns  80258 ns  1005  batches_per_second=12.5719k/s rows_per_second=1.25719G/s
ProjectionOverheadIsolated/complex_expression/batch_size:1000000/real_time   2359050 ns 61147 ns  341   batches_per_second=423.899/s rows_per_second=423.899M/s
ProjectionOverheadIsolated/simple_expression/batch_size:1000/real_time       1801604 ns 535378 ns 383   batches_per_second=555.061k/s rows_per_second=555.061M/s
ProjectionOverheadIsolated/simple_expression/batch_size:10000/real_time      988456 ns  164531 ns 723   batches_per_second=101.168k/s rows_per_second=1011.68M/s
ProjectionOverheadIsolated/simple_expression/batch_size:100000/real_time     914617 ns  91048 ns  772   batches_per_second=10.9335k/s rows_per_second=1093.35M/s
ProjectionOverheadIsolated/simple_expression/batch_size:1000000/real_time    1344189 ns 68767 ns  541   batches_per_second=743.943/s rows_per_second=743.943M/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:1000/real_time    959498 ns  181219 ns 733   batches_per_second=1042.21k/s rows_per_second=1042.21M/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:10000/real_time   112092 ns  36202 ns  6196  batches_per_second=892.123k/s rows_per_second=8.92123G/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:100000/real_time  29696 ns   21104 ns  23557 batches_per_second=336.748k/s rows_per_second=33.6748G/s
ProjectionOverheadIsolated/zero_copy_expression/batch_size:1000000/real_time 13717 ns   7065 ns   50857 batches_per_second=72.9028k/s rows_per_second=72.9028G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:1000/real_time     674057 ns  121116 ns 1103  batches_per_second=1.48355M/s rows_per_second=1.48355G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:10000/real_time    101333 ns  35459 ns  8739  batches_per_second=986.846k/s rows_per_second=9.86846G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:100000/real_time   42491 ns   32605 ns  16796 batches_per_second=235.343k/s rows_per_second=23.5343G/s
ProjectionOverheadIsolated/ref_only_expression/batch_size:1000000/real_time  13267 ns   7679 ns   49470 batches_per_second=75.3763k/s rows_per_second=75.3763G/s
ProjectionOverhead/complex_expression/batch_size:1000/real_time              9110565 ns 681856 ns 102   batches_per_second=109.763k/s rows_per_second=109.763M/s
ProjectionOverhead/complex_expression/batch_size:10000/real_time             1341879 ns 114700 ns 559   batches_per_second=74.5224k/s rows_per_second=745.224M/s
ProjectionOverhead/complex_expression/batch_size:100000/real_time            969770 ns  74182 ns  715   batches_per_second=10.3117k/s rows_per_second=1031.17M/s
ProjectionOverhead/complex_expression/batch_size:1000000/real_time           4024855 ns 103623 ns 262   batches_per_second=248.456/s rows_per_second=248.456M/s
ProjectionOverhead/simple_expression/batch_size:1000/real_time               6629139 ns 626455 ns 130   batches_per_second=150.849k/s rows_per_second=150.849M/s
ProjectionOverhead/simple_expression/batch_size:10000/real_time              1078706 ns 136185 ns 666   batches_per_second=92.7037k/s rows_per_second=927.037M/s
ProjectionOverhead/simple_expression/batch_size:100000/real_time             978334 ns  78697 ns  718   batches_per_second=10.2215k/s rows_per_second=1022.15M/s
ProjectionOverhead/simple_expression/batch_size:1000000/real_time            2060913 ns 106348 ns 585   batches_per_second=485.222/s rows_per_second=485.222M/s
ProjectionOverhead/zero_copy_expression/batch_size:1000/real_time            5082465 ns 174192 ns 100   batches_per_second=196.755k/s rows_per_second=196.755M/s
ProjectionOverhead/zero_copy_expression/batch_size:10000/real_time           771846 ns  43574 ns  1122  batches_per_second=129.56k/s rows_per_second=1.2956G/s
ProjectionOverhead/zero_copy_expression/batch_size:100000/real_time          118740 ns  23673 ns  5832  batches_per_second=84.2174k/s rows_per_second=8.42174G/s
ProjectionOverhead/zero_copy_expression/batch_size:1000000/real_time         40628 ns   17786 ns  16876 batches_per_second=24.6135k/s rows_per_second=24.6135G/s
ProjectionOverhead/ref_only_expression/batch_size:1000/real_time             6957773 ns 195333 ns 117   batches_per_second=143.724k/s rows_per_second=143.724M/s
ProjectionOverhead/ref_only_expression/batch_size:10000/real_time            733865 ns  32280 ns  992   batches_per_second=136.265k/s rows_per_second=1.36265G/s
ProjectionOverhead/ref_only_expression/batch_size:100000/real_time           104410 ns  20485 ns  6822  batches_per_second=95.7764k/s rows_per_second=9.57764G/s
ProjectionOverhead/ref_only_expression/batch_size:1000000/real_time          35838 ns   15927 ns  18883 batches_per_second=27.9036k/s rows_per_second=27.9036G/s
```
