ARROW-16894: [C++] Add Benchmarks for Asof Join Node #13426

Merged
merged 91 commits into from
Jul 28, 2022

Conversation

iChauster
Contributor

@iChauster iChauster commented Jun 23, 2022

Add asof_join_benchmark.cc and edit CMakeLists.txt. Add utility to generate random tables in test_util.cc/h with respect to width, frequency, and number of ids.

icexelloss and others added 30 commits May 13, 2022 10:45
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Member

@westonpace westonpace left a comment

Thanks for moving the generation utilities into C++. I have some style suggestions but overall this looks like a good approach.

cpp/src/arrow/compute/exec/test_util.cc

int num_rows = time_column_builder.length();
columns.push_back(time_column_builder.Finish().ValueOrDie());
columns.push_back(id_column_builder.Finish().ValueOrDie());
Contributor Author

@iChauster iChauster Jul 18, 2022

Is using ValueOrDie() here okay? I tried using CHECK_OK_AND_ASSIGN but I don't think that works in a non-void function.

Member

If you change the function to return Result<std::shared_ptr<Table>> (which you should), then you can use ARROW_ASSIGN_OR_RAISE. CHECK_... and ASSERT_... should only be used in the test/benchmark files themselves. In helper functions (e.g. test_util.cc), you should return a Status or a Result<T>.
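A minimal, self-contained sketch of the pattern described above: the helper returns a result object and forwards failures to its caller instead of aborting. `Error`, `Result`, `SKETCH_ASSIGN_OR_RAISE`, `FinishTimeColumn`, `SketchTable`, and `MakeSketchTable` are simplified stand-ins invented for illustration. They are not Arrow's `arrow::Result<T>` or `ARROW_ASSIGN_OR_RAISE`, which are considerably more complete.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an error status (not arrow::Status).
struct Error {
  std::string message;
};

// Hypothetical, simplified stand-in for arrow::Result<T>: a value or an error.
template <typename T>
class Result {
 public:
  Result(T value) : ok_(true), value_(std::move(value)) {}
  Result(Error error) : ok_(false), error_(std::move(error)) {}
  bool ok() const { return ok_; }
  const Error& error() const { return error_; }
  T& value() { return value_; }

 private:
  bool ok_;
  T value_{};
  Error error_{};
};

// Stand-in for ARROW_ASSIGN_OR_RAISE: on failure, return the error to the
// caller; on success, bind the value to `lhs`.
#define SKETCH_ASSIGN_OR_RAISE(lhs, expr)              \
  auto lhs##_result = (expr);                          \
  if (!lhs##_result.ok()) return lhs##_result.error(); \
  auto lhs = std::move(lhs##_result.value())

// A fallible step, playing the role of builder.Finish().
Result<std::vector<std::int64_t>> FinishTimeColumn(std::int64_t start,
                                                   std::int64_t end,
                                                   std::int64_t step) {
  if (step <= 0) return Error{"step must be positive"};
  std::vector<std::int64_t> times;
  for (std::int64_t t = start; t < end; t += step) times.push_back(t);
  return times;
}

struct SketchTable {
  std::vector<std::int64_t> time;
  std::vector<std::int32_t> id;
};

// The helper returns Result<SketchTable>, so failures propagate to the caller
// rather than terminating the process as ValueOrDie() would.
Result<SketchTable> MakeSketchTable(std::int64_t start, std::int64_t end,
                                    std::int64_t step, std::int32_t num_ids) {
  SKETCH_ASSIGN_OR_RAISE(times, FinishTimeColumn(start, end, step));
  SketchTable table;
  for (std::int64_t t : times) {
    for (std::int32_t id = 0; id < num_ids; ++id) {
      table.time.push_back(t);
      table.id.push_back(id);
    }
  }
  return table;
}
```

With this shape, only the test or benchmark file decides whether a failure should abort; the helper stays usable from code that wants to handle the error.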

@jonkeane
Member

> I see.
>
> For the purpose of this PR, we will convert some simple data generation to C++ to merge this PR and will move the end-to-end benchmarks to Python/continuous benchmarking. Thanks!

Sorry I was out last week — very much agree with what @pitrou and @westonpace have said here. Having macro benchmarks in our continuous benchmarking suite would be fantastic. There are a few readmes in that repo, but I'm always happy to help out if you get stuck putting something together—feel free to tag me in a PR or issue or wherever. Thanks!

@iChauster
Contributor Author

@westonpace are these failed checks related?

@iChauster
Contributor Author

Hi @westonpace, friendly ping -- let me know when you get a chance to take a look at this!

Member

@westonpace westonpace left a comment

Thanks for your patience. Last week I think everyone was heads down on the release. Just a few nit-picky things and then we can get this merged.

Comment on lines +45 to +46
size_t row_size = sizeof(double) * (table.get()->schema()->num_fields() - 2) +
sizeof(int64_t) + sizeof(int32_t);
Member

There are some utilities you can use in arrow/util/byte_size.h too if you wanted a more accurate version of the size (e.g. will report size used by validity bitmaps).

However, this is fine too I think. It represents a more conceptual data size.

Contributor Author

After some testing, it seems these numbers are identical.
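As a worked example of the conceptual size discussed in this thread, the sketch below evaluates the same formula for a given field count. `ConceptualRowSize` and `ConceptualTableSize` are hypothetical helper names, and the layout (one int64 time column, one int32 id column, the rest float64) is taken from the schema described in this PR. The formula does not account for validity bitmaps.

```cpp
#include <cstddef>
#include <cstdint>

// Conceptual bytes per row: one int64 time column, one int32 id column, and
// (num_fields - 2) float64 value columns. Validity bitmaps are not counted.
std::size_t ConceptualRowSize(int num_fields) {
  return sizeof(double) * static_cast<std::size_t>(num_fields - 2) +
         sizeof(std::int64_t) + sizeof(std::int32_t);
}

// Conceptual bytes for a whole table with the given number of rows.
std::size_t ConceptualTableSize(int num_fields, std::size_t num_rows) {
  return ConceptualRowSize(num_fields) * num_rows;
}
```

For a table with 20 value columns (22 fields in total) and 8-byte doubles, each row counts as 8 * 20 + 8 + 4 = 172 bytes.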

};

/// The table generated in accordance to the TableGenerationProperties has the following
/// schema: time (int64) id (int32) [properties.column_prefix]0 (float64)
Member

What's the 0 in [properties.column_prefix]0?

Contributor Author

Hmm, I think this one got caught in the linting / formatter and made it a bit unclear, but each column is numbered from 0 to n - 1 inclusive, so each column name is something like [properties.column_prefix][i] where i = {0...n-1}. Is there a way I can make this clearer through the comments?
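The naming scheme described in this reply can be sketched as follows. `MakeFieldNames` is a hypothetical helper written for illustration, not a function from the PR; it assumes the fixed `time` and `id` columns come first, followed by `n` value columns named with the prefix and an index from 0 to n - 1.

```cpp
#include <string>
#include <vector>

// Builds the field names for a generated table: "time", "id", then
// column_prefix + "0" ... column_prefix + "<n-1>".
std::vector<std::string> MakeFieldNames(const std::string& column_prefix,
                                        int num_columns) {
  std::vector<std::string> names = {"time", "id"};
  for (int i = 0; i < num_columns; ++i) {
    names.push_back(column_prefix + std::to_string(i));
  }
  return names;
}
```

For example, a prefix of `col` with three value columns yields `time`, `id`, `col0`, `col1`, `col2`.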

@westonpace
Member

This error from the Windows CI is probably legitimate:

unity_0_cxx.obj : error LNK2019: unresolved external symbol "class std::shared_ptr<class arrow::Table> __cdecl arrow::compute::MakeRandomTimeSeriesTable(struct arrow::compute::TableGenerationProperties const &)" (?MakeRandomTimeSeriesTable@compute@arrow@@YA?AV?$shared_ptr@VTable@arrow@@@std@@AEBUTableGenerationProperties@12@@Z) referenced in function "struct arrow::compute::TableStats __cdecl arrow::compute::MakeTable(struct arrow::compute::TableGenerationProperties const &)" (?MakeTable@compute@arrow@@YA?AUTableStats@12@AEBUTableGenerationProperties@12@@Z) [D:\a\arrow\arrow\build\cpp\src\arrow\compute\exec\arrow-compute-asof-join-benchmark.vcxproj]

To fix this, add ARROW_TESTING_EXPORT to TableGenerationProperties and MakeRandomTimeSeriesTable (Windows requires you to specifically label which functions are "external" and can be called outside of a shared object).
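For readers unfamiliar with export macros, here is a rough sketch of how one is typically structured. `SKETCH_EXPORT`, `SKETCH_EXPORTING`, `GenerationProperties`, and `CountFields` are hypothetical names for illustration only. The macro is analogous in purpose to ARROW_TESTING_EXPORT, but this is not Arrow's actual definition.

```cpp
// Normally set by the build system only while compiling the library itself.
// Defined here so the sketch compiles as a single translation unit.
#define SKETCH_EXPORTING

#if defined(_WIN32)
#if defined(SKETCH_EXPORTING)
// Building the DLL: make the symbol visible to other binaries.
#define SKETCH_EXPORT __declspec(dllexport)
#else
// Consuming the DLL: import the symbol from it.
#define SKETCH_EXPORT __declspec(dllimport)
#endif
#else
// GCC/Clang: mark the symbol as visible outside the shared object.
#define SKETCH_EXPORT __attribute__((visibility("default")))
#endif

// Both the struct and the function are labeled, mirroring the advice to
// annotate the properties type and the table-generation function.
struct SKETCH_EXPORT GenerationProperties {
  int num_columns;
  int num_ids;
};

SKETCH_EXPORT int CountFields(const GenerationProperties& properties);

// Two fixed columns (time, id) plus the value columns.
int CountFields(const GenerationProperties& properties) {
  return properties.num_columns + 2;
}
```

Without the annotation, a Windows build keeps the symbol private to the DLL, and any other binary that calls it fails at link time with an unresolved external symbol error like the one quoted above.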

@westonpace
Member

Can you update the description? This becomes the commit message when things merge and it still claims to be a draft.

@iChauster
Contributor Author

Previous description:

Hi @westonpace,

Here is a very primitive version of our Asof Join Benchmarks (asof_join_benchmark.cc). Our main goal is to benchmark four qualities: the effect of table density (the frequency of rows, e.g., a row every 2s as opposed to every 1h over some time range), table width (# of columns), tids (# of keys), and multi-table joins. We also have a baseline comparison benchmark with hash joins (which is currently in this file).

I think this needs some work before it goes into arrow. We currently run this benchmark by generating .feather files with Python via bamboo-streaming's datagen.py to represent each table, and then reading them in through cpp (see make_arrow_ipc_reader_node). We perhaps want to write a utility that allows us to do this in cpp, while varying many of the metrics I've mentioned above, or finding a way to generate those files as part of the benchmark.

There are also quite a large number of BENCHMARK_CAPTURE statements, as an immediate workaround for some limitations in Google Benchmark. I haven't found a great non-verbose way to pass in the parameters needed (strings and vectors) while also having readable titles and details about the benchmark written to the output file. Let me know if you have any advice about this or know someone who does.
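To make the density and key-count parameters in the description above concrete, the sketch below estimates how many rows a generated table would contain. `ExpectedRowCount` is a hypothetical helper. It assumes a half-open time range `[start, end)` and one row per id at every tick, which is an illustrative assumption, not a statement of how the PR's generator behaves.

```cpp
#include <cstdint>

// Expected rows for one table, assuming one row per id at every tick in the
// half-open range [start_s, end_s), with ticks spaced frequency_s apart.
std::int64_t ExpectedRowCount(std::int64_t start_s, std::int64_t end_s,
                              std::int64_t frequency_s, std::int64_t num_ids) {
  if (frequency_s <= 0 || end_s <= start_s || num_ids <= 0) return 0;
  // Ceiling division: a partial final interval still contributes one tick.
  const std::int64_t ticks = (end_s - start_s + frequency_s - 1) / frequency_s;
  return ticks * num_ids;
}
```

Under these assumptions, one day of data for a single id has 43,200 rows at a 2 s frequency and 24 rows at a 1 h frequency, which shows how strongly density drives the input size.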

@iChauster
Contributor Author

@westonpace updated!

Member

@westonpace westonpace left a comment

I appreciate the persistence and the new benchmarks.

@westonpace westonpace merged commit 9667946 into apache:master Jul 28, 2022
@iChauster iChauster deleted the asof_join_benchmarks branch July 29, 2022 02:18
@ursabot

ursabot commented Jul 29, 2022

Benchmark runs are scheduled for baseline = a963392 and contender = 9667946. 9667946 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.24% ⬆️0.1%] test-mac-arm
[Finished ⬇️0.82% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.04% ⬆️0.25%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 96679463 ec2-t3-xlarge-us-east-2
[Finished] 96679463 test-mac-arm
[Finished] 96679463 ursa-i9-9960x
[Finished] 96679463 ursa-thinkcentre-m75q
[Failed] a963392c ec2-t3-xlarge-us-east-2
[Finished] a963392c test-mac-arm
[Finished] a963392c ursa-i9-9960x
[Finished] a963392c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


7 participants