Parquet: Ensure page statistics are written only when configured from the Arrow Writer #5181

Merged
merged 2 commits into from Dec 7, 2023

Contributor

@AdamGS AdamGS commented Dec 7, 2023

Which issue does this PR close?

Closes #5162.

Rationale for this change

Currently, the ArrowWriter writes page-level statistics for byte-array-encoded types whether or not it is configured to do so.

What changes are included in this PR?

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 7, 2023
@AdamGS AdamGS changed the title from "fix: Ensure page statistics are written only when configured from the Arrow Writer" to "Parquet: Ensure page statistics are written only when configured from the Arrow Writer" Dec 7, 2023
Contributor

@tustvold tustvold left a comment

This looks good to me, thank you.

))
}
_ => None,
let page_statistics = if let (Some(min), Some(max)) =
Contributor

@tustvold tustvold Dec 7, 2023

Looking at the code, it appears the intended design is for the encoder itself not to compute page statistics unless EnabledStatistics::Page is set. With EnabledStatistics::Chunk, the column stats are computed at a higher level, in write_batch_internal. What do you think about pushing this condition into ByteArrayEncoder, as is done for ColumnValueEncoderImpl?

Edit: I'll have a play to see if I can simplify this as a follow-up; the current behaviour is rather confusing, imo.
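To make the suggestion concrete, here is a minimal sketch of an encoder that owns the "only when Page" condition. It is not the parquet crate's code: the `EnabledStatistics` enum below is a simplified stand-in for the crate's setting, and `PageStatistics`, `SketchByteArrayEncoder`, `write`, and `flush_page` are hypothetical names chosen for illustration.

```rust
/// Simplified stand-in for the crate's `EnabledStatistics` setting (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EnabledStatistics {
    None,
    Chunk,
    Page,
}

/// Min/max of the values written to a single page (hypothetical type).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PageStatistics {
    pub min: Vec<u8>,
    pub max: Vec<u8>,
}

/// Hypothetical byte-array encoder; not the real `ByteArrayEncoder`.
pub struct SketchByteArrayEncoder {
    statistics_enabled: EnabledStatistics,
    min: Option<Vec<u8>>,
    max: Option<Vec<u8>>,
}

impl SketchByteArrayEncoder {
    pub fn new(statistics_enabled: EnabledStatistics) -> Self {
        Self { statistics_enabled, min: None, max: None }
    }

    /// Accepts one value. A real encoder would also append `value` to its
    /// page buffer; that part is omitted from this sketch.
    pub fn write(&mut self, value: &[u8]) {
        // The condition lives inside the encoder: min/max are tracked only
        // when page-level statistics were requested.
        if self.statistics_enabled == EnabledStatistics::Page {
            if self.min.as_deref().map_or(true, |m| value < m) {
                self.min = Some(value.to_vec());
            }
            if self.max.as_deref().map_or(true, |m| value > m) {
                self.max = Some(value.to_vec());
            }
        }
    }

    /// Ends the current page, returning its statistics if any were tracked
    /// and resetting the min/max state for the next page.
    pub fn flush_page(&mut self) -> Option<PageStatistics> {
        match (self.min.take(), self.max.take()) {
            (Some(min), Some(max)) => Some(PageStatistics { min, max }),
            _ => None,
        }
    }
}
```

With this shape, a caller configured with `EnabledStatistics::Chunk` or `EnabledStatistics::None` always gets `None` back from `flush_page`, so the page writer needs no extra check of its own.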

Contributor Author

There's this branch to calculate chunk statistics directly when EnabledStatistics::Chunk is set. I think making it the default (for both Page and Chunk) would probably simplify the code, since you wouldn't have to keep track of the chunk-level metadata when adding pages. It might require a bit more work, though, which is why I didn't end up going that way.
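The trade-off described here can be sketched with two small functions: one computes chunk min/max directly from the values, the other folds each page's statistics into running chunk state. Both are illustrative assumptions rather than the crate's implementation; `MinMax`, `min_max_direct`, and `merge_page` are hypothetical names.

```rust
/// Min/max pair used for both page- and chunk-level statistics in this sketch.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MinMax {
    pub min: Vec<u8>,
    pub max: Vec<u8>,
}

/// Direct approach: compute min/max over the given values in one pass each.
/// Returns `None` when there are no values.
pub fn min_max_direct(values: &[&[u8]]) -> Option<MinMax> {
    let min = values.iter().copied().min()?;
    let max = values.iter().copied().max()?;
    Some(MinMax { min: min.to_vec(), max: max.to_vec() })
}

/// Page-folding approach: merge one page's statistics into the running
/// chunk statistics, which the writer must keep alive across pages.
pub fn merge_page(chunk: &mut Option<MinMax>, page: &MinMax) {
    match chunk.as_mut() {
        Some(c) => {
            if page.min < c.min {
                c.min = page.min.clone();
            }
            if page.max > c.max {
                c.max = page.max.clone();
            }
        }
        // First page seen: its statistics become the chunk statistics.
        None => *chunk = Some(page.clone()),
    }
}
```

Both routes give the same chunk-level result. The direct one avoids carrying chunk state through every page flush, which is the simplification being discussed.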

Contributor

Yeah, that's what I was thinking too. I tried this out in #5183; perhaps you could take a look.

@tustvold tustvold merged commit 490c080 into apache:master Dec 7, 2023
19 checks passed
Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

EnabledStatistics::Page does not take effect on ByteArrayEncoder
2 participants