ARROW-16894: [C++] Add Benchmarks for Asof Join Node #13426

iChauster · 2022-06-23T13:23:49Z

Add asof_join_benchmark.cc and edit CMakeLists.txt. Add utility to generate random tables in test_util.cc/h with respect to width, frequency, and number of ids.

Co-authored-by: Weston Pace <weston.pace@gmail.com>

westonpace

Thanks for moving the generation utilities into C++. I have some style suggestions but overall this looks like a good approach.

iChauster · 2022-07-18T15:24:16Z

cpp/src/arrow/compute/exec/test_util.cc

+
+  int num_rows = time_column_builder.length();
+  columns.push_back(time_column_builder.Finish().ValueOrDie());
+  columns.push_back(id_column_builder.Finish().ValueOrDie());


Is using ValueOrDie() here okay? I tried using CHECK_OK_AND_ASSIGN but I don't think that works in a non-void function.

If you change the function to return Result<std::shared_ptr<Table>> (which you should), then you can use ARROW_ASSIGN_OR_RAISE. CHECK_... and ASSERT_... should only be used in the test/benchmark files themselves. In helper functions (e.g. test_util.cc) you should return a Status or a Result<T>.
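
To illustrate the point about helpers returning a result instead of aborting, here is a minimal sketch that compiles without Arrow. `SimpleResult`, `FinishColumn`, and `MakeColumns` are hypothetical stand-ins, not Arrow APIs; in Arrow itself the corresponding pieces would be `arrow::Result<T>` and `ARROW_ASSIGN_OR_RAISE`, which performs the "check, return the error, otherwise unwrap" step shown explicitly below.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for arrow::Result<T>: either a value or an error message.
template <typename T>
struct SimpleResult {
  bool ok;
  T value;
  std::string error;

  static SimpleResult Ok(T v) { return {true, std::move(v), ""}; }
  static SimpleResult Err(std::string msg) { return {false, T{}, std::move(msg)}; }
};

using Column = std::vector<int64_t>;

// Hypothetical stand-in for a builder's Finish(): fails on an empty column.
SimpleResult<Column> FinishColumn(Column data) {
  if (data.empty()) return SimpleResult<Column>::Err("empty column");
  return SimpleResult<Column>::Ok(std::move(data));
}

// Helper in the style recommended for test_util.cc: errors are returned to the
// caller rather than aborting the process (as ValueOrDie() would).
SimpleResult<std::vector<Column>> MakeColumns(Column time, Column id) {
  using Out = SimpleResult<std::vector<Column>>;

  // Conceptually what an assign-or-raise macro does: check, propagate, unwrap.
  auto time_res = FinishColumn(std::move(time));
  if (!time_res.ok) return Out::Err(time_res.error);

  auto id_res = FinishColumn(std::move(id));
  if (!id_res.ok) return Out::Err(id_res.error);

  std::vector<Column> columns;
  columns.push_back(std::move(time_res.value));
  columns.push_back(std::move(id_res.value));
  return Out::Ok(std::move(columns));
}
```

The test or benchmark file that calls such a helper is then the one place where an assertion macro decides what to do with a failure.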

jonkeane · 2022-07-18T21:00:37Z

I see.

For this PR, we will convert some simple data generation to C++ so that it can be merged, and we will move the end-to-end benchmarks to Python/continuous benchmarking. Thanks!

Sorry, I was out last week. I very much agree with what @pitrou and @westonpace have said here. Having macro benchmarks in our continuous benchmarking suite would be fantastic. There are a few READMEs in that repo, but I'm always happy to help out if you get stuck putting something together; feel free to tag me in a PR or issue or wherever. Thanks!

iChauster · 2022-07-22T03:02:01Z

@westonpace are these failed checks related?

iChauster · 2022-07-25T19:07:11Z

Hi @westonpace, friendly ping -- let me know when you get a chance to take a look at this!

westonpace

Thanks for your patience. Last week I think everyone was heads down on the release. Just a few nit-picky things and then we can get this merged.

westonpace · 2022-07-25T21:40:49Z

cpp/src/arrow/compute/exec/asof_join_benchmark.cc

+  size_t row_size = sizeof(double) * (table.get()->schema()->num_fields() - 2) +
+                    sizeof(int64_t) + sizeof(int32_t);


There are also some utilities in arrow/util/byte_size.h that you could use if you wanted a more accurate version of the size (e.g., they also report the size used by validity bitmaps).

However, this is fine too I think. It represents a more conceptual data size.

After some testing, it seems these numbers are identical.
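
As a worked version of the "conceptual data size" above, the sketch below reproduces the same arithmetic for a schema of one int64 time column, one int32 id column, and float64 payload columns. The function names are hypothetical and only illustrate the formula from the diff.

```cpp
#include <cstddef>
#include <cstdint>

// Conceptual bytes per row: every field except time and id is a float64.
// num_fields counts all columns, including time and id.
constexpr std::size_t ConceptualRowSize(int num_fields) {
  return sizeof(double) * static_cast<std::size_t>(num_fields - 2) +
         sizeof(int64_t) + sizeof(int32_t);
}

// Conceptual bytes for the whole table: row size times number of rows.
constexpr std::size_t ConceptualTableSize(int num_fields, std::size_t num_rows) {
  return ConceptualRowSize(num_fields) * num_rows;
}
```

For example, a table with 10 payload columns (12 fields in total) comes to 10 * 8 + 8 + 4 = 92 bytes per row.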

westonpace · 2022-07-25T21:52:10Z

cpp/src/arrow/compute/exec/test_util.cc

+
+  int num_rows = time_column_builder.length();
+  columns.push_back(time_column_builder.Finish().ValueOrDie());
+  columns.push_back(id_column_builder.Finish().ValueOrDie());


If you change the function to return Result<std::shared_ptr<Table>> (which you should), then you can use ARROW_ASSIGN_OR_RAISE. CHECK_... and ASSERT_... should only be used in the test/benchmark files themselves. In helper functions (e.g. test_util.cc) you should return a Status or a Result<T>.

westonpace · 2022-07-25T21:53:59Z

cpp/src/arrow/compute/exec/test_util.h

+};
+
+/// The table generated in accordance to the TableGenerationProperties has the following
+/// schema: time (int64) id (int32) [properties.column_prefix]0 (float64)


What's the 0 in [properties.column_prefix]0?

Hmm, I think the linter/formatter reflowed this comment and made it a bit unclear. Each column is numbered from 0 to n - 1 inclusive, so each column name is something like [properties.column_prefix][i] where i = 0, ..., n - 1. Is there a way I can make this clearer in the comments?
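
The naming scheme described above can be sketched as follows. `MakeColumnNames` is a hypothetical helper written for illustration, not the function in test_util.cc; the leading "time" and "id" names are taken from the schema comment in the diff.

```cpp
#include <string>
#include <vector>

// Builds the field names for the generated table: "time", "id", then one
// payload column per index i in [0, num_columns), named <column_prefix><i>.
std::vector<std::string> MakeColumnNames(const std::string& column_prefix,
                                         int num_columns) {
  std::vector<std::string> names = {"time", "id"};
  for (int i = 0; i < num_columns; ++i) {
    names.push_back(column_prefix + std::to_string(i));
  }
  return names;
}
```

With a prefix of "col" and three payload columns, this yields time, id, col0, col1, col2.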

westonpace · 2022-07-25T22:01:01Z

This error from the Windows CI is probably legitimate:

unity_0_cxx.obj : error LNK2019: unresolved external symbol "class std::shared_ptr<class arrow::Table> __cdecl arrow::compute::MakeRandomTimeSeriesTable(struct arrow::compute::TableGenerationProperties const &)" (?MakeRandomTimeSeriesTable@compute@arrow@@YA?AV?$shared_ptr@VTable@arrow@@@std@@AEBUTableGenerationProperties@12@@Z) referenced in function "struct arrow::compute::TableStats __cdecl arrow::compute::MakeTable(struct arrow::compute::TableGenerationProperties const &)" (?MakeTable@compute@arrow@@YA?AUTableStats@12@AEBUTableGenerationProperties@12@@Z) [D:\a\arrow\arrow\build\cpp\src\arrow\compute\exec\arrow-compute-asof-join-benchmark.vcxproj]

To fix this, add ARROW_TESTING_EXPORT to TableGenerationProperties and MakeRandomTimeSeriesTable. (Windows requires you to explicitly label which functions are "external" and can be called from outside a shared object.)
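
For readers unfamiliar with export macros, the general pattern looks roughly like the sketch below. This is not the definition of ARROW_TESTING_EXPORT, which is not shown in this thread; `DEMO_EXPORT`, `DEMO_EXPORTING`, `DemoProperties`, and `CountColumns` are hypothetical names, and the non-Windows branch assumes a GCC- or Clang-compatible compiler.

```cpp
// Normally defined by the build system only while compiling the library itself.
#define DEMO_EXPORTING

#if defined(_WIN32)
#  if defined(DEMO_EXPORTING)
#    define DEMO_EXPORT __declspec(dllexport)  // building the DLL: export symbol
#  else
#    define DEMO_EXPORT __declspec(dllimport)  // using the DLL: import symbol
#  endif
#else
#  define DEMO_EXPORT __attribute__((visibility("default")))
#endif

// Both the type and the free function are labeled, mirroring the suggestion to
// annotate the properties struct and the table-generation function.
struct DEMO_EXPORT DemoProperties {
  int num_columns;
};

DEMO_EXPORT int CountColumns(const DemoProperties& properties);

// Total fields = payload columns plus the time and id columns.
int CountColumns(const DemoProperties& properties) {
  return properties.num_columns + 2;
}
```

Without such a label, a Windows build of a shared library does not expose the symbol, which is consistent with the unresolved external symbol error quoted above.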

westonpace · 2022-07-26T22:01:32Z

Can you update the description? This becomes the commit message when things merge and it still claims to be a draft.

iChauster · 2022-07-26T22:07:22Z

Previous description:

Hi @westonpace,

Here is a very primitive version of our Asof Join Benchmarks (asof_join_benchmark.cc). Our main goal is to benchmark four qualities: the effect of table density (the frequency of rows, e.g., a row every 2s as opposed to every 1h over some time range), table width (# of columns), tids (# of keys), and multi-table joins. We also have a baseline comparison benchmark with hash joins (which is currently in this file).
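
To make the density, width, and key-count knobs concrete, here is a small sketch of a time-series generator built on plain std::vector rather than Arrow arrays. All names (`SeriesProperties`, `SeriesTable`, `MakeSeries`) are hypothetical, and the choices of an inclusive end time and one row per id at every timestamp are assumptions for illustration, not a description of MakeRandomTimeSeriesTable.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical knobs: time range, row frequency (density), keys, and width.
struct SeriesProperties {
  int64_t start;
  int64_t end;             // assumed inclusive
  int64_t time_frequency;  // step between consecutive timestamps
  int num_ids;             // number of distinct keys per timestamp
  int num_columns;         // number of float64 payload columns (width)
  int seed;
};

struct SeriesTable {
  std::vector<int64_t> time;
  std::vector<int32_t> id;
  std::vector<std::vector<double>> payload;  // one vector per payload column
};

SeriesTable MakeSeries(const SeriesProperties& p) {
  SeriesTable table;
  if (p.time_frequency <= 0 || p.num_columns < 0) return table;  // nothing to do
  table.payload.resize(static_cast<std::size_t>(p.num_columns));

  std::mt19937 rng(static_cast<std::mt19937::result_type>(p.seed));
  std::uniform_real_distribution<double> dist(0.0, 1.0);

  // A smaller time_frequency means more timestamps, i.e. a denser table.
  for (int64_t ts = p.start; ts <= p.end; ts += p.time_frequency) {
    for (int32_t id = 0; id < p.num_ids; ++id) {
      table.time.push_back(ts);
      table.id.push_back(id);
      for (auto& column : table.payload) column.push_back(dist(rng));
    }
  }
  return table;
}
```

Under these assumptions the row count is the number of timestamps times the number of ids, so halving `time_frequency` roughly doubles the table size while leaving its width unchanged.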

I think this needs some work before it goes into arrow. We currently run this benchmark by generating .feather files with Python via bamboo-streaming's datagen.py to represent each table, and then reading them in through cpp (see make_arrow_ipc_reader_node). We perhaps want to write a utility that allows us to do this in cpp, while varying many of the metrics I've mentioned above, or finding a way to generate those files as part of the benchmark.

There are also quite a large number of BENCHMARK_CAPTURE statements, as an immediate workaround for some limitations in Google Benchmark. I haven't found a good non-verbose way to pass in the parameters needed (strings and vectors) while also having readable titles and details about the benchmark written to the output file. Let me know if you have any advice about this or know someone who does.

iChauster · 2022-07-26T22:09:34Z

@westonpace updated!

westonpace

Let's fix this one last change: https://github.com/apache/arrow/pull/13426/files#r931509310

westonpace

I appreciate the persistence and the new benchmarks.

ursabot · 2022-07-29T02:42:43Z

Benchmark runs are scheduled for baseline = a963392 and contender = 9667946. 9667946 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.24% ⬆️0.1%] test-mac-arm
[Finished ⬇️0.82% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.04% ⬆️0.25%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 96679463 ec2-t3-xlarge-us-east-2
[Finished] 96679463 test-mac-arm
[Finished] 96679463 ursa-i9-9960x
[Finished] 96679463 ursa-thinkcentre-m75q
[Failed] a963392c ec2-t3-xlarge-us-east-2
[Finished] a963392c test-mac-arm
[Finished] a963392c ursa-i9-9960x
[Finished] a963392c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

icexelloss and others added 30 commits May 13, 2022 10:45

wip

23b8c71

wip

f4b2106

wip

7ab446d

wip

138daee

wip: First test pass

94a8453

Fix code style and lint (partial)

6466b80

Add support for mutliple tables; Add more tests

4c33452

Clean up code style (Pass ninja lint now), switch to unbounded queue

c6c6093

Clean up some files

643e368

Clean up some files

0781a16

Minor clean up

fc75844

Fix nulls in test result

26bc862

Clean up includes

5a6afbd

Clean up error handling

4f7cac7

Error handling

8773317

Fix compiler warning

22c9941

Fix Wshorten-64-to-32 error

6b27e6b

Fix lint

775be1d

Fix lint

2dc5691

Fix compiler warning Wunused-result

a9dd980

Fix format

0f39fce

Remove debug statement

761e5de

Update cpp/src/arrow/compute/exec/asof_join_node.cc

0387e5c

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Update cpp/src/arrow/compute/exec/asof_join_node.cc

15ba43d

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Update cpp/src/arrow/compute/exec/asof_join_node.cc

b92a303

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Update cpp/src/arrow/compute/exec/asof_join.h

f0edd17

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Update cpp/src/arrow/compute/exec/asof_join_node.cc

58f229d

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Update cpp/src/arrow/compute/exec/asof_join_node.cc

15e2783

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Update cpp/src/arrow/compute/exec/asof_join_node.cc

7aa252a

Co-authored-by: Weston Pace <weston.pace@gmail.com>

Apply suggestions from code review

9c332eb

Co-authored-by: Weston Pace <weston.pace@gmail.com>

westonpace requested changes Jul 18, 2022

Ivan Chau added 2 commits July 18, 2022 11:02

refactor asof, other style changes from code rev

7512796

follow comment format

ff84700

iChauster commented Jul 18, 2022

make spacing consistent

0111994

iChauster requested a review from westonpace July 18, 2022 20:59

Ivan Chau added 3 commits July 19, 2022 15:23

Merge branch 'master' of https://github.com/apache/arrow into asof_jo…

df1a566

…in_benchmarks

change TableGenerationProperties.seed to int from uint

7816ade

adjust data types to silence build warnings

fd33711

westonpace reviewed Jul 25, 2022

Ivan Chau added 2 commits July 26, 2022 11:31

respond to code review from weston

cdbf5f2

remove table gen export

3658a3a

iChauster requested a review from westonpace July 26, 2022 20:36

westonpace reviewed Jul 27, 2022

cpp/src/arrow/compute/exec/test_util.cc (outdated)

westonpace reviewed Jul 27, 2022

cpp/src/arrow/compute/exec/test_util.h (outdated)

westonpace requested changes Jul 27, 2022

Ivan Chau added 2 commits July 27, 2022 16:01

add documentation change, add arrow check to randomtable

3a212e5

ninja lint format

bb3565a

iChauster requested a review from westonpace July 28, 2022 20:52

westonpace approved these changes Jul 28, 2022

westonpace merged commit 9667946 into apache:master Jul 28, 2022

iChauster deleted the asof_join_benchmarks branch July 29, 2022 02:18
