-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10647: [Rust] [Parquet] Port benchmarks from from parquet-rs to arrow repo #8708
Conversation
This will also close https://issues.apache.org/jira/browse/ARROW-4063 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super minor: The commit message possibly should be s/benche/bench
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb . This is really great! just one nit.
// } | ||
// | ||
// filled with random values. | ||
const TEST_FILE: &str = "10k-v2.parquet"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we normally put test data in https://github.com/apache/parquet-testing so perhaps we should add this one there as well (or if there any existing file there that we can use instead)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created apache/parquet-testing#15 -- if/when that gets merged in, I'll update this PR to pick up a later version of parquet-testing and remove the binary from this PR as well.
Unrelated: there is also the fuzz module which is quite useful for detecting bad crashes in the code. It probably worth porting to arrow as well. |
I am going to suggest porting these to criterion (as it makes it easier to compare parameters and runs) I have a PR in the works for this, PR-ception I will PR on your repo to PR the PR :P |
@GregBowyer -- sounds great! |
@wesm suggests that rather than checking in files, we write / use a data generator, which makes sense to me. I'll try and work on such a thing -- though I am not sure when I will get time to do so |
I'm fine with checking in these files (or putting them in an S3 bucket, or anything really), but just don't think that checking in binary files should be the project's benchmarking strategy =) |
Update on this PR -- I plan to try and make a synthetic data generator rather than checking the data files in. I just haven't had the chance to do so yet |
I'll spend some time on this. We can probably port encoding/decoding benchmark first as they do not rely on the test file. |
Thank you @sunchao |
This PR ports the parquet benchmarks from the original parquet-rs repo in service of helping to get #8698 merged.
The PR may be easier to review commit by commit to see what I had to change to make the benchmarks work in this Repo
My one question is if it is ok to add a 653KB binary file as part of this PR, or if that should be put into one of the other repos (like test data)
To run:
Example output: