
Extends parquet fuzz tests to also test nulls, dictionaries and row groups with multiple pages (#1053) #1110

Merged Jan 11, 2022 (4 commits)

Conversation

@tustvold (Contributor) commented Dec 29, 2021

Which issue does this PR close?

Closes #1053.

Rationale for this change

See ticket

What changes are included in this PR?

This extends the parquet fuzz tests to also test nulls, dictionaries, and row groups with multiple pages. The new tests initially hit what appears to be a bug in the null handling of ArrowArrayReader, likely the same issue as apache/datafusion#1441; ComplexObjectArrayReader was used temporarily to keep the tests passing, and the underlying bug has since been fixed by #1130.

Are there any user-facing changes?

No, this only adds tests.

```rust
/// Total number of batches to attempt to read.
/// `record_batch_size` * `num_iterations` should be greater
/// than `num_rows` to ensure the data can be read back completely
num_iterations: usize,
```
@tustvold (author):

This didn't seem to serve a purpose: it was always set so that all of the data would be read, so I removed it.

A reviewer (Contributor):

I agree that it is redundant when record_batch_size is provided (which means the data is not all read in one big chunk, but is read in record_batch_size chunks)

codecov-commenter commented Dec 29, 2021

Codecov Report

Merging #1110 (ec79c43) into master (719096b) will increase coverage by 0.01%.
The diff coverage is 94.16%.

❗ Current head ec79c43 differs from pull request most recent head 87ea9a1. Consider uploading reports for the commit 87ea9a1 to get more accurate results.

@@            Coverage Diff             @@
##           master    #1110      +/-   ##
==========================================
+ Coverage   82.55%   82.56%   +0.01%     
==========================================
  Files         169      169              
  Lines       50456    50535      +79     
==========================================
+ Hits        41655    41726      +71     
- Misses       8801     8809       +8     
Impacted Files Coverage Δ
parquet/src/arrow/arrow_reader.rs 89.93% <94.06%> (+0.61%) ⬆️
parquet/src/util/test_common/rand_gen.rs 82.69% <100.00%> (+0.87%) ⬆️
arrow/src/datatypes/field.rs 53.79% <0.00%> (-0.31%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs 85.56% <0.00%> (-0.14%) ⬇️
parquet/src/arrow/converter.rs 69.56% <0.00%> (+0.86%) ⬆️

Last update 719096b...87ea9a1.

@tustvold (author):

Thanks to @yordan-pavlov's work on #1130, this now passes on master 🎉

@tustvold tustvold marked this pull request as ready for review January 10, 2022 20:37
@alamb alamb changed the title Parquet fuzz tests (#1053) Extends parquet fuzz tests to also test nulls, dictionaries and row groups with multiple pages (#1053) Jan 10, 2022
@alamb (Contributor) left a comment:

LGTM

Nice work @tustvold


@alamb alamb merged commit f8ff7fe into apache:master Jan 11, 2022
Labels: parquet (Changes to the parquet crate)
Projects: None yet
Successfully merging this pull request may close these issues: Parquet Fuzz Tests
3 participants