Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Sep 10, 2025

Closes #8304.

Rationale for this change

See #8304 and especially #8304 (comment)

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 10, 2025
@rok rok force-pushed the plaintext_statistics branch from 241c532 to a5a44f4 Compare September 23, 2025 19:50
@rok rok force-pushed the plaintext_statistics branch 2 times, most recently from 3a7de46 to e2ebc45 Compare October 28, 2025 09:14
@rok rok force-pushed the plaintext_statistics branch from 24576dd to 381c69b Compare October 29, 2025 00:21
@rok rok force-pushed the plaintext_statistics branch from 03e52aa to 5459ffc Compare November 4, 2025 19:54
@rok rok force-pushed the plaintext_statistics branch from 6a9315a to c765177 Compare November 11, 2025 21:25
@rok rok force-pushed the plaintext_statistics branch from 078b622 to 686b38f Compare November 12, 2025 00:44
@rok rok marked this pull request as ready for review November 19, 2025 14:11
@rok rok requested a review from adamreeve November 19, 2025 16:37
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks Rok, I think this can just be tidied up a bit to reduce some duplication

serialize_column_meta_data(self, writer)?;
last_field_id = 3;
}
// Always write the ColumnMetaData struct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now duplicates the block below so these can be merged and the #[cfg... removed

// Only write statistics to plaintext footer if column is not encrypted

#[cfg(feature = "encryption")]
if column_chunk.crypto_metadata().is_none() {
Copy link
Contributor

@adamreeve adamreeve Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contents of these two blocks is the same, so we could tidy this up by adding a new function to decide whether to write stats and remove the inline #[cfg()] here, eg:

#[cfg(feature = "encryption")]
fn should_write_column_stats(column_chunk: &ColumnChunkMetaData) -> bool {
    // If there is encrypted column metadata present,
    // the column is encrypted with a different key to the footer or a plaintext footer is used,
    // so the statistics are sensitive and shouldn't be written.
    column_chunk.encrypted_column_metadata.is_none()
}

#[cfg(not(feature = "encryption"))]
fn should_write_column_stats(column_chunk: &ColumnChunkMetaData) -> bool {
    true
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After refactoring the tests and adding some extra test cases, I realised this logic isn't correct in the case of uniform encryption with an encrypted footer. In that scenario, the column chunk has crypto_metadata present but no encrypted_column_metadata, and we do need to write the statistics here.

I think the fix is just to check for encrypted_column_metadata instead of crypto_metadata, like the code was doing previously. I've edited my comment above to address that.


if !is_footer_encrypted {
// Temporarily clear crypto_metadata so statistics get included in encrypted blob
let crypto_metadata = column_chunk.column_crypto_metadata.take();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nearly identical to the block below except for the different encryptors used. This should be able to be refactored to avoid this, eg. by changing the match to return an Option<Box<dyn BlockEncryptor>> and then have a separate match on that result that does the encryption.

}

#[test]
pub fn test_non_uniform_plaintext_encryption_behaviour() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this test a bit hard to follow, I'll see if I can suggest a better structure and make a PR to your branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR here: rok#6

let is_footer_encrypted = file_encryptor.properties().encrypt_footer();

if !is_footer_encrypted {
// Temporarily clear crypto_metadata so statistics get included in encrypted blob
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this step if we check for encrypted_column_metadata instead of crypto_metadata in serialize_column_meta_data (https://github.com/apache/arrow-rs/pull/8305/files#r2543948275)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Provide only encrypted column stats in plaintext footer

2 participants