
ARROW-10647: [Rust] [Parquet] Port benchmarks from parquet-rs to arrow repo #8708

Closed
wants to merge 6 commits

Conversation

alamb (Contributor) commented Nov 18, 2020:

This PR ports the parquet benchmarks from the original parquet-rs repo, in service of helping to get #8698 merged.

The PR may be easier to review commit by commit, to see what I had to change to make the benchmarks work in this repo.

My one question is whether it is OK to add a 653KB binary file as part of this PR, or whether it should go into one of the other repos (like the test-data repo).

To run:

cd arrow/rust/parquet
cargo bench

Example output:

     Running /Users/alamb/Software/arrow2/rust/target/release/deps/encoding-a8e00763431f13f3

running 20 tests
test delta_bit_pack_i32_1k_10   ... bench:       9,252 ns/iter (+/- 2,121) = 442 MB/s
test delta_bit_pack_i32_1k_100  ... bench:       8,952 ns/iter (+/- 1,329) = 457 MB/s
test delta_bit_pack_i32_1k_1000 ... bench:       9,397 ns/iter (+/- 2,673) = 435 MB/s
test delta_bit_pack_i32_1m_10   ... bench:   9,329,524 ns/iter (+/- 770,014) = 449 MB/s
test delta_bit_pack_i32_1m_100  ... bench:   9,301,679 ns/iter (+/- 703,854) = 450 MB/s
test delta_bit_pack_i32_1m_1000 ... bench:   9,507,733 ns/iter (+/- 582,931) = 441 MB/s
test dict_i32_1k_10             ... bench:      11,565 ns/iter (+/- 3,881) = 354 MB/s
test dict_i32_1k_100            ... bench:      11,219 ns/iter (+/- 1,149) = 365 MB/s
test dict_i32_1k_1000           ... bench:      17,140 ns/iter (+/- 4,870) = 238 MB/s
test dict_i32_1m_10             ... bench:  12,270,883 ns/iter (+/- 1,498,457) = 341 MB/s
test dict_i32_1m_100            ... bench:  11,595,816 ns/iter (+/- 866,835) = 361 MB/s
test dict_i32_1m_1000           ... bench:  11,688,057 ns/iter (+/- 1,084,930) = 358 MB/s
test dict_str_1m                ... bench:  13,151,428 ns/iter (+/- 2,764,931) = 797 MB/s
test plain_i32_1k_10            ... bench:         165 ns/iter (+/- 42) = 24824 MB/s
test plain_i32_1k_100           ... bench:         164 ns/iter (+/- 11) = 24975 MB/s
test plain_i32_1k_1000          ... bench:         163 ns/iter (+/- 19) = 25128 MB/s
test plain_i32_1m_10            ... bench:     406,179 ns/iter (+/- 73,744) = 10326 MB/s
test plain_i32_1m_100           ... bench:     396,644 ns/iter (+/- 73,762) = 10574 MB/s
test plain_i32_1m_1000          ... bench:     412,808 ns/iter (+/- 45,920) = 10160 MB/s
test plain_str_1m               ... bench:  13,453,959 ns/iter (+/- 3,264,946) = 779 MB/s
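For reference, the MB/s column is just bytes processed per iteration divided by the iteration time: the 1k i32 benches process 1024 × 4 = 4096 bytes, so 165 ns/iter yields the 24824 MB/s reported for plain_i32_1k_10. A quick sanity check (the helper name is mine, not part of the bench harness):

```rust
// MB/s = bytes / (ns * 1e-9) / 1e6 = bytes / ns * 1000
fn mb_per_sec(bytes: usize, ns_per_iter: f64) -> f64 {
    bytes as f64 / ns_per_iter * 1000.0
}

fn main() {
    // plain_i32_1k_10: 1024 i32 values = 4096 bytes at 165 ns/iter
    println!("{} MB/s", mb_per_sec(4096, 165.0) as u64); // 24824 MB/s, as reported
    // plain_i32_1m_10: 1024 * 1024 * 4 bytes at 406,179 ns/iter
    println!("{} MB/s", mb_per_sec(4_194_304, 406_179.0) as u64); // 10326 MB/s, as reported
}
```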


nevi-me (Contributor) commented Nov 18, 2020:

This will also close https://issues.apache.org/jira/browse/ARROW-4063

nevi-me self-requested a review November 18, 2020 16:59
GregBowyer (Contributor) left a comment:


Super minor: the commit message should probably read s/benche/bench.

sunchao (Member) left a comment:


Thanks @alamb. This is really great! Just one nit.

// }
//
// filled with random values.
const TEST_FILE: &str = "10k-v2.parquet";
Member:

I think we normally put test data in https://github.com/apache/parquet-testing, so perhaps we should add this one there as well (or use an existing file from there instead)?

alamb (Contributor, Author) commented Nov 18, 2020:

I created apache/parquet-testing#15 -- if/when that gets merged in, I'll update this PR to pick up a later version of parquet-testing and remove the binary from this PR as well.

sunchao (Member) commented Nov 18, 2020:

Unrelated: there is also the fuzz module, which is quite useful for detecting bad crashes in the code. It is probably worth porting to arrow as well.

GregBowyer (Contributor) commented:

I am going to suggest porting these to criterion (as it makes it easier to compare parameters and runs).

I have a PR in the works for this. PR-ception: I will PR on your repo to PR the PR :P
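As a std-only sketch of the parameterized-benchmark idea behind criterion (the harness suggested above): one benchmark body run once per (size, cardinality) pair, mirroring the shape of the bench names in the output. Everything here is illustrative, not code from the PR; the `bench` helper and the summing workload are stand-ins, and a real port would use criterion's own parameterized-benchmark API instead of a hand-rolled timing loop.

```rust
use std::hint::black_box;
use std::time::Instant;

// Time `body` over `iters` iterations and return mean ns per iteration.
fn bench<F: FnMut()>(iters: u32, mut body: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        body();
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    // Parameter grid mirroring the 1k/1m x 10/100/1000 bench names above.
    for &size in &[1024usize, 1024 * 1024] {
        for &cardinality in &[10i32, 100, 1000] {
            // Hypothetical workload standing in for the parquet encoders.
            let data: Vec<i32> = (0..size as i32).map(|i| i % cardinality).collect();
            let ns = bench(10, || {
                let sum: i64 = data.iter().map(|&v| v as i64).sum();
                black_box(sum);
            });
            println!("i32_{}_{}: {:.0} ns/iter", size, cardinality, ns);
        }
    }
}
```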

alamb (Contributor, Author) commented Nov 18, 2020:

@GregBowyer -- sounds great!

alamb (Contributor, Author) commented Nov 18, 2020:

(screenshot attachment, not captured in this transcript)

alamb (Contributor, Author) commented Nov 19, 2020:

@wesm suggests that, rather than checking in files, we write or use a data generator, which makes sense to me. I'll try to work on such a thing, though I am not sure when I will get time to do so.

wesm (Member) commented Nov 19, 2020:

I'm fine with checking in these files (or putting them in an S3 bucket, or anything really), but I just don't think that checking in binary files should be the project's benchmarking strategy =)

alamb (Contributor, Author) commented Dec 3, 2020:

Update on this PR -- I plan to try and make a synthetic data generator rather than checking the data files in. I just haven't had the chance to do so yet
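A minimal sketch of what such a synthetic generator might look like: deterministic pseudo-random i32 columns parameterized by length and cardinality, matching the 1k/1m x 10/100/1000 bench shapes. The function name, LCG constants, and seed are all mine, not from the PR; a real generator would also write the values out through the parquet crate's file writer rather than just returning a Vec.

```rust
// Generate `len` pseudo-random i32 values drawn from `cardinality` distinct
// values, using a fixed-seed LCG so benchmark runs are reproducible.
fn gen_i32_column(len: usize, cardinality: i32, seed: u64) -> Vec<i32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            // LCG step (constants from common 64-bit LCG usage; illustrative).
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 33) as i32).rem_euclid(cardinality)
        })
        .collect()
}

fn main() {
    // Mirror e.g. dict_i32_1k_10: 1k values drawn from 10 distinct ones.
    let col = gen_i32_column(1024, 10, 42);
    assert_eq!(col.len(), 1024);
    assert!(col.iter().all(|&v| (0..10).contains(&v)));
    println!("first values: {:?}", &col[..5]);
}
```

Because the seed is fixed, two runs of the benchmark see identical input, which is the property the checked-in binary file currently provides.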

alamb closed this Dec 7, 2020
alamb deleted the alamb/port-parquet-benches branch December 7, 2020 19:30
sunchao (Member) commented Jan 9, 2021:

I'll spend some time on this. We can probably port the encoding/decoding benchmarks first, as they do not rely on the test file.

alamb (Contributor, Author) commented Jan 10, 2021:

Thank you @sunchao

5 participants