fix: support reading compressed metadata #1802
base: main
Conversation
force-pushed from 654de6b to cd16381
```diff
  let metadata_content = input_file.read().await?;
- let metadata = serde_json::from_slice::<TableMetadata>(&metadata_content)?;
+ let metadata = if metadata_location.as_ref().ends_with(".gz.metadata.json") {
```
Do we want to optionally support the Java Iceberg alternative?
The Java reference implementation can additionally read GZIP-compressed files with the suffix `metadata.json.gz`.
Seems better to have one convention, to me, but happy either way.
Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it. The wording of the spec ("some implementations require") seems to suggest it would be better to have no naming requirement at all.
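Sniffing the gzip magic number rather than relying on the filename could look roughly like this (a minimal stdlib-only sketch; `is_gzip` is a hypothetical helper name, not code from this PR):

```rust
/// Hypothetical helper: true if `buf` starts with the gzip magic
/// bytes 0x1F 0x8B (RFC 1952), regardless of the file's name.
fn is_gzip(buf: &[u8]) -> bool {
    buf.len() >= 2 && buf[0] == 0x1F && buf[1] == 0x8B
}

fn main() {
    // A gzip stream always begins with the two magic bytes,
    // followed by the compression method (0x08 = deflate).
    let gzipped_header = [0x1Fu8, 0x8B, 0x08, 0x00];
    // Plain (uncompressed) JSON metadata starts with '{' instead.
    let plain_json = br#"{"format-version":2}"#;

    assert!(is_gzip(&gzipped_header));
    assert!(!is_gzip(plain_json));
    println!("ok");
}
```

This avoids imposing any naming requirement at all, at the cost of always buffering at least the first two bytes before deciding how to parse.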
> Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it.
That would be a really elegant solution, I think.
Ok, done!
force-pushed from 9892bae to 011512a
Minor performance nit.
```rust
use _serde::TableMetadataEnum;
use chrono::{DateTime, Utc};
use flate2::read::GzDecoder;
```
When you go to read `metadata_content` it's already in memory as a `&[u8]`, so I think we should use `flate2::bufread::GzDecoder` here. It might be an imperceptible performance difference, but you never know how big metadata might get :)
Hm, should be the opposite, no? With `bufread` we'll pay for an extra copy, but the "syscalls" (`read`) are free.
Yeah you're right, I had it backwards in my head, sorry about that!
Thanks @colinmarc!
force-pushed from 011512a to 2d87efe
The spec mentions that metadata files "may be compressed with GZIP",
here:
https://iceberg.apache.org/spec/#table-metadata-and-snapshots
force-pushed from 2d87efe to 453dadc
Just found one case (
Thanks @colinmarc for this PR!
```rust
let mut decompressed_data = Vec::new();
decoder
    .read_to_end(&mut decompressed_data)
    .map_err(|e| Error::new(ErrorKind::DataInvalid, e.to_string()))?;
```
Suggested change:

```diff
-    .map_err(|e| Error::new(ErrorKind::DataInvalid, e.to_string()))?;
+    .map_err(|e| Error::new(ErrorKind::DataInvalid, "Trying to read compressed metadata file")
+        .with_context("file_path", metadata_location)
+        .with_source(e))?;
```
To make error reporting better.
```rust
let metadata = if metadata_content.len() > 2
    && metadata_content[0] == 0x1F
    && metadata_content[1] == 0x8B
{
```
Add a debug log here to explain why we chose to try to decompress it?
The spec mentions this naming convention here:
https://iceberg.apache.org/spec/#naming-for-gzip-compressed-metadata-json-files
Which issue does this PR close?
What changes are included in this PR?
Support for reading compressed metadata.
Are these changes tested?
Yes.