GH-41431: [C++][Parquet][Dataset] Fix repeated scan on encrypted dataset #41550
@pitrou @jorisvandenbossche Would you mind taking a look at this?
cpp/src/parquet/file_reader.cc
void set_metadata(std::shared_ptr<FileMetaData> metadata) {
  file_metadata_ = std::move(metadata);
  file_decryptor_ = file_metadata_->file_decryptor();
This is confusing, here we set file_decryptor_ from file_metadata_, but in ParseUnencryptedFileMetadata and ParseMetaDataOfEncryptedFileWithPlaintextFooter we set file_metadata_ from file_decryptor_. Can we please make this consistent?
https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L775-L780
I think these are in two different directions while opening a parquet reader:
- For the `ParseXXX` functions, where we parse the footer to create `file_decryptor_` and `file_metadata_`, we need to set `file_decryptor_` on `file_metadata_` so that both `SerializedFile` and `FileMetaData` have a copy of the decryptor.
- For the `set_metadata` function, we already have the cached `FileMetaData` but need to create a `SerializedFile` whose `file_decryptor_` is null. Therefore we need to get `file_decryptor_` from `file_metadata_`.
Does that make sense?
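To make the two directions concrete, here is a minimal sketch. `Decryptor`, `MetaData`, `Reader`, and `ParseFooter` are hypothetical stand-ins invented for illustration; they are not the real `InternalFileDecryptor`, `FileMetaData`, or `SerializedFile` classes, and the real parsing logic is omitted.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the internal file decryptor.
struct Decryptor {};

// Hypothetical stand-in for the file metadata, which carries a decryptor copy.
class MetaData {
 public:
  void set_file_decryptor(std::shared_ptr<Decryptor> decryptor) {
    decryptor_ = std::move(decryptor);
  }
  const std::shared_ptr<Decryptor>& file_decryptor() const { return decryptor_; }

 private:
  std::shared_ptr<Decryptor> decryptor_;
};

// Hypothetical stand-in for the serialized file reader.
class Reader {
 public:
  // Direction 1: the footer is parsed, both objects are created here,
  // and the decryptor is handed to the metadata.
  void ParseFooter() {
    file_decryptor_ = std::make_shared<Decryptor>();
    file_metadata_ = std::make_shared<MetaData>();
    file_metadata_->set_file_decryptor(file_decryptor_);
  }

  // Direction 2: the metadata is already cached, so the reader's
  // decryptor is taken from it instead of being created again.
  void set_metadata(std::shared_ptr<MetaData> metadata) {
    assert(metadata != nullptr);
    file_metadata_ = std::move(metadata);
    file_decryptor_ = file_metadata_->file_decryptor();
  }

  const std::shared_ptr<MetaData>& metadata() const { return file_metadata_; }
  const std::shared_ptr<Decryptor>& file_decryptor() const { return file_decryptor_; }

 private:
  std::shared_ptr<MetaData> file_metadata_;
  std::shared_ptr<Decryptor> file_decryptor_;
};

// Simulates a repeated scan: the second reader reuses the cached metadata
// and should end up with the same decryptor as the first reader.
bool SecondReaderSharesDecryptor() {
  Reader first;
  first.ParseFooter();
  Reader second;
  second.set_metadata(first.metadata());
  return second.file_decryptor() != nullptr &&
         second.file_decryptor() == first.file_decryptor();
}
```

In this sketch, a second `Reader` built from cached metadata only has a usable decryptor because `set_metadata` copies it back out of the metadata.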
I think this begs the question: do we need a file_decryptor_ field here? We could just get it from the metadata every time.
Good question! Let me consolidate them.
I have removed file_decryptor_ from SerializedFile and SerializedRowGroup. Let me know WDYT. @pitrou
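As a rough illustration of the consolidated design, the sketch below keeps the decryptor only on the metadata and has the reader look it up on demand. `KeyDecryptor`, `CachedMetaData`, and `ConsolidatedReader` are hypothetical names made up for this example and do not mirror the actual Arrow classes or their signatures.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the internal file decryptor.
struct KeyDecryptor {};

// Hypothetical stand-in for cached file metadata that owns the decryptor.
class CachedMetaData {
 public:
  explicit CachedMetaData(std::shared_ptr<KeyDecryptor> decryptor)
      : decryptor_(std::move(decryptor)) {}
  const std::shared_ptr<KeyDecryptor>& file_decryptor() const { return decryptor_; }

 private:
  std::shared_ptr<KeyDecryptor> decryptor_;
};

// Hypothetical reader with no decryptor field of its own.
class ConsolidatedReader {
 public:
  void set_metadata(std::shared_ptr<CachedMetaData> metadata) {
    assert(metadata != nullptr);
    file_metadata_ = std::move(metadata);
  }

  // The decryptor is always fetched from the metadata, so there is a
  // single source of truth and nothing to keep in sync.
  std::shared_ptr<KeyDecryptor> file_decryptor() const {
    if (file_metadata_ == nullptr) return nullptr;
    return file_metadata_->file_decryptor();
  }

 private:
  std::shared_ptr<CachedMetaData> file_metadata_;
};

// Two readers built from the same cached metadata see the same decryptor.
bool ReadersAgreeOnDecryptor() {
  auto metadata =
      std::make_shared<CachedMetaData>(std::make_shared<KeyDecryptor>());
  ConsolidatedReader a;
  ConsolidatedReader b;
  a.set_metadata(metadata);
  b.set_metadata(metadata);
  return a.file_decryptor() != nullptr && a.file_decryptor() == b.file_decryptor();
}
```

The trade-off in this sketch is one extra pointer hop per lookup in exchange for removing the duplicated field.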
mapleFU left a comment
+1 for this. Subsetting or caching file metadata may also be a normal use case for Parquet, and such metadata can be built in a different way from metadata read directly from the file.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5385926. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
@raulcd Should we port this to 16.1.0?
I'll try to add it
…set (#41550) ### Rationale for this change When parquet dataset is reused to create multiple scanners, `FileMetaData` objects are cached to avoid parsing them again. However, these caused issues on encrypted files since internal file decryptors were no longer created by cached `FileMetaData` objects. ### What changes are included in this PR? Expose file_decryptor from FileMetaData and set it properly. ### Are these changes tested? Yes, modify the test to reproduce the issue and assure fixed. ### Are there any user-facing changes? No. * GitHub Issue: #41431 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Gang Wu <ustcwg@gmail.com>
Rationale for this change
When a Parquet dataset is reused to create multiple scanners, `FileMetaData` objects are cached to avoid parsing them again. However, this caused issues on encrypted files, since internal file decryptors were no longer created when cached `FileMetaData` objects were reused.
What changes are included in this PR?
Expose file_decryptor from FileMetaData and set it properly.
Are these changes tested?
Yes, the test was modified to reproduce the issue and confirm that it is fixed.
Are there any user-facing changes?
No.