-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Parquet] Provide only encrypted column stats in plaintext footer #8305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
241c532 to
a5a44f4
Compare
3a7de46 to
e2ebc45
Compare
24576dd to
381c69b
Compare
03e52aa to
5459ffc
Compare
6a9315a to
c765177
Compare
078b622 to
686b38f
Compare
adamreeve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good thanks Rok, I think this can just be tidied up a bit to reduce some duplication
| serialize_column_meta_data(self, writer)?; | ||
| last_field_id = 3; | ||
| } | ||
| // Always write the ColumnMetaData struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now duplicates the block below so these can be merged and the #[cfg... removed
| // Only write statistics to plaintext footer if column is not encrypted | ||
|
|
||
| #[cfg(feature = "encryption")] | ||
| if column_chunk.crypto_metadata().is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contents of these two blocks is the same, so we could tidy this up by adding a new function to decide whether to write stats and remove the inline #[cfg()] here, eg:
#[cfg(feature = "encryption")]
fn should_write_column_stats(column_chunk: &ColumnChunkMetaData) -> bool {
// If there is encrypted column metadata present,
// the column is encrypted with a different key to the footer or a plaintext footer is used,
// so the statistics are sensitive and shouldn't be written.
column_chunk.encrypted_column_metadata.is_none()
}
#[cfg(not(feature = "encryption"))]
fn should_write_column_stats(column_chunk: &ColumnChunkMetaData) -> bool {
true
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After refactoring the tests and adding some extra test cases, I realised this logic isn't correct in the case of uniform encryption with an encrypted footer. In that scenario, the column chunk has crypto_metadata present but no encrypted_column_metadata, and we do need to write the statistics here.
I think the fix is just to check for encrypted_column_metadata instead of crypto_metadata, like the code was doing previously. I've edited my comment above to address that.
|
|
||
| if !is_footer_encrypted { | ||
| // Temporarily clear crypto_metadata so statistics get included in encrypted blob | ||
| let crypto_metadata = column_chunk.column_crypto_metadata.take(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nearly identical to the block below except for the different encryptors used. This should be able to be refactored to avoid this, eg. by changing the match to return an Option<Box<dyn BlockEncryptor>> and then have a separate match on that result that does the encryption.
| } | ||
|
|
||
| #[test] | ||
| pub fn test_non_uniform_plaintext_encryption_behaviour() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this test a bit hard to follow, I'll see if I can suggest a better structure and make a PR to your branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR here: rok#6
| let is_footer_encrypted = file_encryptor.properties().encrypt_footer(); | ||
|
|
||
| if !is_footer_encrypted { | ||
| // Temporarily clear crypto_metadata so statistics get included in encrypted blob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this step if we check for encrypted_column_metadata instead of crypto_metadata in serialize_column_meta_data (https://github.com/apache/arrow-rs/pull/8305/files#r2543948275)
Closes #8304.
Rationale for this change
See #8304 and especially #8304 (comment)