Parquet Fuzz Tests #1053

Closed
tustvold opened this issue Dec 17, 2021 · 2 comments · Fixed by #1110
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Whilst working on #1037, I introduced bugs that were then caught by the arrow array benchmarks.

It would therefore appear that these tests are exercising code paths not found in the other tests, and we could therefore increase the test coverage by including some variant of them.

Describe the solution you'd like

A set of fuzz tests that create various types of PageIterator, each with multiple column chunks and multiple pages per column chunk. These can likely reuse much of the fuzz plumbing found in the arrow_array_reader benchmarks.

The tests would then use the ArrayReader abstractions to read this data and verify it is what was written.
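As a rough illustration of the shape such a test could take, here is a minimal, std-only sketch. It does not use the parquet crate at all: `Rng`, `Page`, `ColumnChunk`, `generate`, `read_batches` and `round_trip` are hypothetical stand-ins for the real PageIterator and ArrayReader plumbing, and the size ranges are arbitrary assumptions. It only shows the pattern of generating random chunks and pages (with nulls), reading them back in batches that straddle page and chunk boundaries, and checking the result against what was written.

```rust
// Toy model only: none of these types come from the parquet crate.

/// Small xorshift64 PRNG so the sketch needs no external crates.
struct Rng(u64);

impl Rng {
    fn new(seed: u64) -> Self {
        // xorshift must not start from a zero state
        Rng(seed.max(1))
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform value in lo..=hi
    fn range(&mut self, lo: usize, hi: usize) -> usize {
        lo + (self.next_u64() % (hi - lo + 1) as u64) as usize
    }
}

/// Hypothetical stand-ins: a page is a run of nullable values,
/// a column chunk is a list of pages.
type Page = Vec<Option<i32>>;
type ColumnChunk = Vec<Page>;

/// Generate chunks with random page lengths and roughly 25% nulls.
fn generate(rng: &mut Rng, num_chunks: usize, pages_per_chunk: usize) -> Vec<ColumnChunk> {
    let mut chunks = Vec::with_capacity(num_chunks);
    for _ in 0..num_chunks {
        let mut chunk = Vec::with_capacity(pages_per_chunk);
        for _ in 0..pages_per_chunk {
            let len = rng.range(1, 20);
            let mut page = Vec::with_capacity(len);
            for _ in 0..len {
                if rng.range(0, 3) == 0 {
                    page.push(None);
                } else {
                    page.push(Some(rng.next_u64() as i32));
                }
            }
            chunk.push(page);
        }
        chunks.push(chunk);
    }
    chunks
}

/// Read everything back in fixed-size batches that may straddle
/// page and chunk boundaries; only the last batch may be short.
fn read_batches(chunks: &[ColumnChunk], batch_size: usize) -> Vec<Vec<Option<i32>>> {
    let mut batches = Vec::new();
    let mut current = Vec::with_capacity(batch_size);
    for chunk in chunks {
        for page in chunk {
            for value in page {
                current.push(*value);
                if current.len() == batch_size {
                    batches.push(std::mem::replace(
                        &mut current,
                        Vec::with_capacity(batch_size),
                    ));
                }
            }
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

/// One fuzz iteration: random layout, read back, compare with what was written.
fn round_trip(seed: u64) -> bool {
    let mut rng = Rng::new(seed);
    let num_chunks = rng.range(1, 4);
    let pages_per_chunk = rng.range(1, 5);
    let batch_size = rng.range(1, 16);
    let chunks = generate(&mut rng, num_chunks, pages_per_chunk);

    let expected: Vec<Option<i32>> = chunks.iter().flatten().flatten().copied().collect();
    let batches = read_batches(&chunks, batch_size);
    let actual: Vec<Option<i32>> = batches.iter().flatten().copied().collect();

    // Every batch except the last must be exactly `batch_size` long.
    let sizes_ok = batches.iter().rev().skip(1).all(|b| b.len() == batch_size);
    actual == expected && sizes_ok
}

fn main() {
    for seed in 1..=100 {
        assert!(round_trip(seed), "round trip failed for seed {}", seed);
    }
}
```

Seeding the generator means any failing layout can be reproduced from its seed alone. A real test would swap the toy reader for the ArrayReader under test and vary types, encodings and null density as well.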

Describe alternatives you've considered

We could choose not to add fuzz tests, but that would increase the likelihood of regressions.

tustvold added the enhancement Any new improvement worthy of an entry in the changelog label Dec 17, 2021
@chadbrewbaker

chadbrewbaker commented Dec 19, 2021

After thinking about this for a week, I'm inclined to start driving with Arrow Python/Hypothesis and Python Parquet tests, then gradually add proptest. AWS Labs has the best proptest examples.

Zooming out a bit more, DataFusion needs to be integrated into squirrel - sqlancer cross-SQL-engine tests. We can use sqlsmith for reductions of large queries.

We also want to be like AWS Redshift, where you write a query in Python/SQL and it emits Rust code that gets compiled and sent to worker nodes.

It seems we might need thin LTO even on dev builds to reduce false positives: https://github.com/awslabs/rust-smt-ir/blob/551565ea5e97f502269d74d189e2e2c1e6b52f40/Cargo.toml#L11

@tustvold
Contributor Author

FYI, I'm experimenting with extending the existing fuzz tests to support nulls, dictionaries, etc.

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 29, 2021
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 29, 2021
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 29, 2021
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 29, 2021
alamb pushed a commit that referenced this issue Jan 11, 2022
…groups with multiple pages (#1053) (#1110)

* Parquet fuzz tests (#1053)

* Test multiple WriterVersions

* Revert array_reader change