ARROW-10647: [Rust] [Parquet] Port benchmarks from parquet-rs to arrow repo #8708

alamb · 2020-11-18T16:48:58Z

This PR ports the parquet benchmarks from the original parquet-rs repo to help get #8698 merged.

The PR may be easier to review commit by commit, to see what I had to change to make the benchmarks work in this repo.

My one question is whether it is OK to add a 653KB binary file as part of this PR, or whether it should go into one of the other repos (like test data).

To run:

cd arrow/rust/parquet
cargo bench

Example output:

     Running /Users/alamb/Software/arrow2/rust/target/release/deps/encoding-a8e00763431f13f3

running 20 tests
test delta_bit_pack_i32_1k_10   ... bench:       9,252 ns/iter (+/- 2,121) = 442 MB/s
test delta_bit_pack_i32_1k_100  ... bench:       8,952 ns/iter (+/- 1,329) = 457 MB/s
test delta_bit_pack_i32_1k_1000 ... bench:       9,397 ns/iter (+/- 2,673) = 435 MB/s
test delta_bit_pack_i32_1m_10   ... bench:   9,329,524 ns/iter (+/- 770,014) = 449 MB/s
test delta_bit_pack_i32_1m_100  ... bench:   9,301,679 ns/iter (+/- 703,854) = 450 MB/s
test delta_bit_pack_i32_1m_1000 ... bench:   9,507,733 ns/iter (+/- 582,931) = 441 MB/s
test dict_i32_1k_10             ... bench:      11,565 ns/iter (+/- 3,881) = 354 MB/s
test dict_i32_1k_100            ... bench:      11,219 ns/iter (+/- 1,149) = 365 MB/s
test dict_i32_1k_1000           ... bench:      17,140 ns/iter (+/- 4,870) = 238 MB/s
test dict_i32_1m_10             ... bench:  12,270,883 ns/iter (+/- 1,498,457) = 341 MB/s
test dict_i32_1m_100            ... bench:  11,595,816 ns/iter (+/- 866,835) = 361 MB/s
test dict_i32_1m_1000           ... bench:  11,688,057 ns/iter (+/- 1,084,930) = 358 MB/s
test dict_str_1m                ... bench:  13,151,428 ns/iter (+/- 2,764,931) = 797 MB/s
test plain_i32_1k_10            ... bench:         165 ns/iter (+/- 42) = 24824 MB/s
test plain_i32_1k_100           ... bench:         164 ns/iter (+/- 11) = 24975 MB/s
test plain_i32_1k_1000          ... bench:         163 ns/iter (+/- 19) = 25128 MB/s
test plain_i32_1m_10            ... bench:     406,179 ns/iter (+/- 73,744) = 10326 MB/s
test plain_i32_1m_100           ... bench:     396,644 ns/iter (+/- 73,762) = 10574 MB/s
test plain_i32_1m_1000          ... bench:     412,808 ns/iter (+/- 45,920) = 10160 MB/s
test plain_str_1m               ... bench:  13,453,959 ns/iter (+/- 3,264,946) = 779 MB/s
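
The MB/s column in the output above appears to follow directly from the ns/iter column. A minimal sketch of that arithmetic, assuming each 1k benchmark processes 1024 i32 values (4096 bytes) per iteration and each 1m benchmark 1024 * 1024 values; these sizes are inferred from the reported numbers, not stated in the PR, and throughput_mb_per_s is a hypothetical helper:

```rust
/// Hypothetical helper: throughput in MB/s (1 MB = 10^6 bytes) from the
/// number of bytes processed per iteration and the time per iteration in ns.
/// Integer division truncates, giving whole-number MB/s values.
fn throughput_mb_per_s(bytes_per_iter: u64, ns_per_iter: u64) -> u64 {
    // 1 byte/ns equals 1000 MB/s; guard against division by zero.
    bytes_per_iter * 1000 / ns_per_iter.max(1)
}

fn main() {
    // Assumed sizes: 1024 (or 1024 * 1024) i32 values at 4 bytes each.
    let bytes_1k = 1024 * std::mem::size_of::<i32>() as u64;
    let bytes_1m = 1024 * 1024 * std::mem::size_of::<i32>() as u64;

    // plain_i32_1k_10 ran at 165 ns/iter (reported as 24824 MB/s).
    println!("{} MB/s", throughput_mb_per_s(bytes_1k, 165));
    // delta_bit_pack_i32_1m_10 ran at 9,329,524 ns/iter (reported as 449 MB/s).
    println!("{} MB/s", throughput_mb_per_s(bytes_1m, 9_329_524));
}
```

Under these assumptions the computed values match the reported ones, which suggests the 1k and 1m suffixes refer to the number of values per iteration.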

github-actions · 2020-11-18T16:55:34Z

https://issues.apache.org/jira/browse/ARROW-10647

nevi-me · 2020-11-18T16:59:26Z

This will also close https://issues.apache.org/jira/browse/ARROW-4063

GregBowyer

Super minor: The commit message possibly should be s/benche/bench

sunchao

Thanks @alamb. This is really great! Just one nit.

sunchao · 2020-11-18T19:41:54Z

rust/parquet/benches/codec.rs

+//   }
+//
+// filled with random values.
+const TEST_FILE: &str = "10k-v2.parquet";


I think we normally put test data in https://github.com/apache/parquet-testing so perhaps we should add this one there as well (or, if there is an existing file there, use that instead)?

alamb

I created apache/parquet-testing#15 -- if/when that gets merged in, I'll update this PR to pick up a later version of parquet-testing and remove the binary from this PR as well.

sunchao · 2020-11-18T19:44:13Z

Unrelated: there is also the fuzz module, which is quite useful for detecting bad crashes in the code. It is probably worth porting to arrow as well.

GregBowyer · 2020-11-18T19:57:35Z

I am going to suggest porting these to criterion (as it makes it easier to compare parameters and runs)

I have a PR in the works for this. PR-ception: I will PR on your repo to PR the PR :P

alamb · 2020-11-18T20:50:35Z

@GregBowyer -- sounds great!

alamb · 2020-11-19T18:44:33Z

@wesm suggests that rather than checking in files, we write / use a data generator, which makes sense to me. I'll try to work on such a thing -- though I am not sure when I will get time to do so.

wesm · 2020-11-19T23:53:51Z

I'm fine with checking in these files (or putting them in an S3 bucket, or anything really), but I just don't think that checking in binary files should be the project's benchmarking strategy =)

alamb · 2020-12-03T11:49:57Z

Update on this PR -- I plan to try to make a synthetic data generator rather than checking the data files in. I just haven't had the chance to do so yet.
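
A minimal sketch of what such a synthetic data generator could look like, assuming a small deterministic pseudo-random generator (SplitMix64 here) so that benchmark inputs are reproducible without a checked-in file. The names SplitMix64 and gen_i32_column are hypothetical and not part of the parquet crate, and a real generator would also need to write the values out as a Parquet file:

```rust
/// Hypothetical deterministic generator (SplitMix64): same seed, same data.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// Advance the state and return the next pseudo-random 64-bit value.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: `len` pseudo-random i32 values drawn from `0..distinct`.
fn gen_i32_column(len: usize, distinct: u32, seed: u64) -> Vec<i32> {
    assert!(distinct > 0 && distinct <= i32::MAX as u32);
    let mut rng = SplitMix64::new(seed);
    (0..len)
        .map(|_| (rng.next_u64() % distinct as u64) as i32)
        .collect()
}

fn main() {
    // Illustrative parameters only: 10,000 values, 100 distinct, seed 42.
    let column = gen_i32_column(10_000, 100, 42);
    println!("generated {} values, first = {}", column.len(), column[0]);
}
```

Fixing the seed keeps runs comparable over time, which is the main property a checked-in binary file would otherwise provide.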

sunchao · 2021-01-09T19:14:12Z

I'll spend some time on this. We can probably port the encoding/decoding benchmarks first, as they do not rely on the test file.
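
Benchmarks of this kind only need in-memory values. A rough std-only sketch, assuming Parquet's PLAIN encoding of INT32 (4-byte little-endian values); plain_encode_i32 and plain_decode_i32 are hypothetical stand-ins for the crate's encoder and decoder, and std::time::Instant is used here in place of a real benchmark harness:

```rust
use std::time::Instant;

/// Hypothetical stand-in for a PLAIN encoder: each i32 becomes 4 little-endian bytes.
fn plain_encode_i32(values: &[i32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Inverse of `plain_encode_i32`; ignores any trailing partial value.
fn plain_decode_i32(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // In-memory input, so no test file is needed.
    let values: Vec<i32> = (0..1024).collect();

    let start = Instant::now();
    let encoded = plain_encode_i32(&values);
    let elapsed = start.elapsed();

    // Round-trip check so the timed work is known to be correct.
    assert_eq!(plain_decode_i32(&encoded), values);
    println!("encoded {} bytes in {:?}", encoded.len(), elapsed);
}
```

A single timed call like this is noisy; a harness that repeats the work many times and reports a median is what the numbers in the PR description come from.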

alamb · 2021-01-10T15:32:51Z

Thank you @sunchao

alamb added 3 commits November 18, 2020 11:47

ARROW-10647: Copy benche marks from from parquet-rs to arrow repo

cc48974

Add dev dependency, fix compiler warnings and errors

5965e9e

Add test data file

894a096

github-actions bot added Component: Rust Component: Parquet labels Nov 18, 2020

alamb added 2 commits November 18, 2020 11:49

fix: cargo fmt

1de35e3

fix: clippy

be1a941

alamb requested a review from sunchao November 18, 2020 16:51

alamb mentioned this pull request Nov 18, 2020

ARROW-10636: [Rust][Parquet] Switch to Rust Stable by removing specialization in parquet #8698

Closed

4 tasks

nevi-me self-requested a review November 18, 2020 16:59

GregBowyer reviewed Nov 18, 2020

sunchao reviewed Nov 18, 2020

alamb mentioned this pull request Nov 18, 2020

ARROW-10647: Add in Rust performance testing file apache/parquet-testing#15

Closed

Add missing file

e79bc5a

alamb closed this Dec 7, 2020

alamb deleted the alamb/port-parquet-benches branch December 7, 2020 19:30
