ARROW-16716: [C++] Add Benchmarks for ProjectNode #13314
Conversation
Hi @westonpace, following up on our JIRA conversation with @icexelloss, here is the open PR for the projection benchmarks. I think it is ready for review (however, as a first-time contributor I cannot formally request one).
Thank you so much for creating this. I didn't expect to see much new information for a basic node but, as usual, I was wrong.

Two other general questions:

1. I think it's interesting to compare the results here with the results from the scalar expression overhead. In theory the performance should be more or less the same, but I am getting considerably worse results: on my system I can maybe get up to 2/3 billion rows per second, whereas in the baseline ExecuteScalarExpressionOverhead I am getting up to 8 billion rows per second.
2. I am also interested in letting the engine manage the threading instead of google benchmark, to make sure we are doing that well.

I agree the culprit seems to boil down to the source/sink nodes. Can you try setting up a harness to run the node in isolation? Also, if you do this, can you make it a separate benchmark instead of replacing what you have already? I think both would be useful.
->DenseThreadRange(1, std::thread::hardware_concurrency(),
                   std::thread::hardware_concurrency())
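For context, the two quoted lines are the tail of a google benchmark registration chain. A hypothetical full registration, sketched here for illustration (the benchmark name, capture arguments, and the other builder calls are assumptions, not taken from this diff), might look like:

#include <thread>
#include "benchmark/benchmark.h"

// DenseThreadRange(min, max, stride) registers thread counts min, min+stride, ...
// capped at max; with these arguments the benchmark runs twice, once with a
// single google-benchmark thread and once with one thread per hardware core.
BENCHMARK_CAPTURE(ProjectionOverhead, complex_expression, complex_expression)
    ->ArgNames({"batch_size"})
    ->UseRealTime()
    ->DenseThreadRange(1, std::thread::hardware_concurrency(),
                       std::thread::hardware_concurrency());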
I think it might be more interesting to actually not set threading here with google benchmark, but instead make sure the compute engine is configured to run in parallel on the CPU thread pool (which will use std::thread::hardware_concurrency() threads).
Is this something we set explicitly within the execution context?
If you do this...

ExecContext ctx(default_memory_pool(), arrow::internal::GetCpuThreadPool());

...then it will use one thread per core (i.e. std::thread::hardware_concurrency()). You can create your own thread pool if you need finer-grained control over the number of threads (but I don't think that is necessary here).
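In other words, a minimal sketch of threading that context through a plan (the headers, the helper function name, and the node-construction steps below are illustrative assumptions, not code from this PR) could look like:

#include "arrow/compute/exec.h"
#include "arrow/compute/exec/exec_plan.h"
#include "arrow/result.h"
#include "arrow/util/thread_pool.h"

// Sketch: back the plan with the shared CPU thread pool so the engine, rather
// than google benchmark, schedules work across one thread per core. In real
// benchmark code the ExecContext must outlive the running plan.
arrow::Status MakeParallelPlan() {
  arrow::compute::ExecContext ctx(arrow::default_memory_pool(),
                                  arrow::internal::GetCpuThreadPool());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::compute::ExecPlan> plan,
                        arrow::compute::ExecPlan::Make(&ctx));
  // ... add source -> project -> sink nodes here, then:
  return plan->StartProducing();
}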
Hi @westonpace, I was able to debug everything -- I did observe a speedup, but it does not exactly match my outputs for the ExecuteScalarExpressionOverhead baseline.
So I guess the problem is that if we call [...]. We could manually schedule in the benchmark by creating a new TaskScheduler. Here is a rough example that could be cleaned up. It's a bit complex, but we could start to share this logic if we are going to test all the nodes.
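A sketch of what that could look like, assuming the TaskScheduler and ThreadIndexer helpers under arrow/compute/exec/ (task_util.h and util.h); the node wiring, the batches vector, and the function name are placeholders rather than the actual example from this thread:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <vector>

#include "arrow/compute/exec/exec_plan.h"
#include "arrow/compute/exec/task_util.h"
#include "arrow/compute/exec/util.h"
#include "arrow/status.h"
#include "arrow/util/logging.h"
#include "arrow/util/thread_pool.h"

using arrow::Status;

// Sketch: deliver batches straight into an isolated ProjectNode on the CPU
// thread pool, bypassing the source node's own batch delivery.
Status RunProjectIsolated(arrow::compute::ExecNode* source_node,
                          arrow::compute::ExecNode* project_node,
                          const std::vector<arrow::compute::ExecBatch>& batches) {
  std::mutex mutex;
  std::condition_variable finished_cv;
  bool finished = false;

  auto scheduler = arrow::compute::TaskScheduler::Make();
  int task_group = scheduler->RegisterTaskGroup(
      // One task per batch: push it into the project node.
      [&](size_t /*thread_id*/, int64_t task_id) {
        project_node->InputReceived(source_node, batches[task_id]);
        return Status::OK();
      },
      // Continuation: runs once every task in the group has finished.
      [&](size_t /*thread_id*/) {
        project_node->InputFinished(source_node,
                                    static_cast<int>(batches.size()));
        std::lock_guard<std::mutex> lock(mutex);
        finished = true;
        finished_cv.notify_one();
        return Status::OK();
      });
  scheduler->RegisterEnd();

  arrow::compute::ThreadIndexer thread_indexer;
  arrow::internal::ThreadPool* pool = arrow::internal::GetCpuThreadPool();
  ARROW_RETURN_NOT_OK(scheduler->StartScheduling(
      thread_indexer(),
      // Each task the scheduler produces is spawned onto the CPU thread pool.
      [&](std::function<Status(size_t)> task) {
        return pool->Spawn([&, task] { ARROW_CHECK_OK(task(thread_indexer())); });
      },
      /*num_concurrent_tasks=*/pool->GetCapacity(),
      /*use_sync_execution=*/false));
  ARROW_RETURN_NOT_OK(scheduler->StartTaskGroup(
      thread_indexer(), task_group, static_cast<int64_t>(batches.size())));

  // Block until the continuation has signaled completion.
  std::unique_lock<std::mutex> lock(mutex);
  finished_cv.wait(lock, [&] { return finished; });
  return Status::OK();
}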
Yes, I had an inkling it had to do with the batch delivery, especially with some of the metrics being slower than the source + projection + sink versions. By the way, I found that the TaskScheduler is actually less performant than the for-loop implementation for some reason; does it compete with ExpressionOverhead on your setup?
It does not compete with expression overhead on my system, but it is pretty similar to projection overhead. Either way, I don't think we need to understand all the performance issues to finish the benchmark. Can you go ahead and add the task scheduler implementation to this PR? Then we can probably merge it without too much more investigation.
@westonpace, excellent. I had some very minor cleanup on your task scheduler implementation, but I think this is ready to be merged.

If you could also point me in the right direction for #13302, that would be extremely helpful -- thank you so much Weston!

EDIT: Got some help and was able to make substantial progress!
Thanks for adding this. I think it will be very helpful for understanding our scheduling. Just a few minor style issues and we can merge this.
Alright, if that is all, I think this is ready to go; benchmarks look good on my side. Thanks @westonpace for all the help!
Thanks for the cleanup. This looks good now. There were a few legitimate MSVC warnings (MSVC is rather pedantic about 64->32 bit conversions). I fixed them real quick and will merge on green.
CC @save-buffer this might be an interesting benchmark to take a look at when we work on scheduling. Our current scheduling appears to be introducing contention which has compounding effects with the contention in the kernel/function layer (just a theory).
Interesting, yes, I'll definitely take a look at this benchmark. I'd also be interested in comparing this to just using OpenMP; I may draw that up at some point as well. As of right now I think...
Excellent! Once this is merged, I will open up another PR for benchmarks like these on...
Benchmark                                                                       Time         CPU  Iterations  UserCounters...
ProjectionOverheadIsolated/complex_expression/batch_size:1000/real_time  3295929 ns   507918 ns         229  batches_per_second=303.405k/s rows_per_second=303.405M/s
Create project_benchmark.cc and add to CMakeLists
If anyone can shed some light on, or offer guidance about, the comments I made below, that would be extremely helpful!