-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16894: [C++] Add Benchmarks for Asof Join Node #13426
Conversation
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for moving the generation utilities into C++. I have some style suggestions but overall this looks like a good approach.
|
||
int num_rows = time_column_builder.length(); | ||
columns.push_back(time_column_builder.Finish().ValueOrDie()); | ||
columns.push_back(id_column_builder.Finish().ValueOrDie()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using ValueOrDie()
here okay? I tried using CHECK_OK_AND_ASSIGN
but I don't think that works in a non-void function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the function to return Result<std::shared_ptr<Table>>
(which you should) then you can use ARROW_ASSIGN_OR_RAISE
. CHECK_...
and ASSERT_...
should only be used in the test/benchmark files themselves. In helper functions (e.g. test_util.cc
) you should return a Status
or a Result<T>
Sorry I was out last week — very much agree with what @pitrou and @westonpace have said here. Having macro benchmarks in our continuous benchmarking suite would be fantastic. There are a few readmes in that repo, but I'm always happy to help out if you get stuck putting something together—feel free to tag me in a PR or issue or wherever. Thanks! |
@westonpace are these failed checks related? |
Hi @westonpace , friendly ping -- let me know when you get a chance to take a look at this! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience. Last week I think everyone was heads down on the release. Just a few nit-picky things and then we can get this merged.
size_t row_size = sizeof(double) * (table.get()->schema()->num_fields() - 2) + | ||
sizeof(int64_t) + sizeof(int32_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some utilities you can use in arrow/util/byte_size.h
too if you wanted a more accurate version of the size (e.g. will report size used by validity bitmaps).
However, this is fine too I think. It represents a more conceptual data size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some testing, it seems these numbers are identical.
|
||
int num_rows = time_column_builder.length(); | ||
columns.push_back(time_column_builder.Finish().ValueOrDie()); | ||
columns.push_back(id_column_builder.Finish().ValueOrDie()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the function to return Result<std::shared_ptr<Table>>
(which you should) then you can use ARROW_ASSIGN_OR_RAISE
. CHECK_...
and ASSERT_...
should only be used in the test/benchmark files themselves. In helper functions (e.g. test_util.cc
) you should return a Status
or a Result<T>
}; | ||
|
||
/// The table generated in accordance to the TableGenerationProperties has the following | ||
/// schema: time (int64) id (int32) [properties.column_prefix]0 (float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the 0
in [properties.column_prefix]0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think this one got caught in the linting / formatter and made it a bit unclear, but each column is numbered from 0 to n - 1 inclusive, so each column name is something like [properties.column_prefix][i] where i = {0...n-1}. Is there a way I can make this clearer through the comments?
This error from the Windows CI is probably legitimate:
To fix this add |
Can you update the description? This becomes the commit message when things merge and it still claims to be a draft. |
Previous description: Hi @westonpace, Here is a very primitive version of our Asof Join Benchmarks ( I think this needs some work before it goes into arrow. We currently run this benchmark by generating There are also quite a large number of |
@westonpace updated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's fix this one last change: https://github.com/apache/arrow/pull/13426/files#r931509310
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the persistence and the new benchmarks.
Benchmark runs are scheduled for baseline = a963392 and contender = 9667946. 9667946 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Add
asof_join_benchmark.cc
and editCMakeLists.txt
. Add utility to generate random tables intest_util.cc/h
with respect to width, frequency, and number of ids.