-
Notifications
You must be signed in to change notification settings - Fork 358
feat: Honor compression settings for metadata.json on write #1876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Previously these properties where not honored on tabel properties. - Adds table properties for these values. - Plumbs them through for writers.
| key: &str, | ||
| default: T, | ||
| ) -> Result<T, anyhow::Error> | ||
| ) -> crate::error::Result<T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ) -> crate::error::Result<T> | |
| ) -> Result<T> |
Unnecessary fully qualified name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
| /// The target file size for files. | ||
| pub write_target_file_size_bytes: usize, | ||
| /// Compression codec for metadata files (JSON) | ||
| pub metadata_compression_codec: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be sth like Option<String>, none should be None in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
|
||
| // Modify filename to add .gz before .metadata.json | ||
| let location = metadata_location.as_ref(); | ||
| let new_location = if location.ends_with(".metadata.json") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should simply replace the suffix with this approach. Currently all metadata location generated logic are in MetadataLocation, we should add a new method in like following and deprecate old method as following:
impl MetadataLocation {
#[deprecate("Use new_with_table instead.")]
pub fn new_with_table_location(table_location: impl ToString) -> Self {
}
pub fn new_with_table(table: &Table) -> Self {
....
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table option seemed a little awkward with the control flow on how it is used istead I added a method that takes table properties. Also added a similar method for incrementing the number.
I modified this logic a bit and kept. My concern here is if we only deprecate the new_with_table_location then users can write out metadata.json files in a format that would be unreadable by other readers.
| .map_err(from_aws_sdk_error)?; | ||
| let warehouse_location = get_resp.warehouse_location().to_string(); | ||
| MetadataLocation::new_with_table_location(warehouse_location).to_string() | ||
| get_resp.warehouse_location().to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a semantic change because up the update to creation.location below, need to investigate further what the correct thing to do here is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, from the comment above (L451-452) it seems like the "s3tables catalog" generates the metadata file location and does not allow any modification
so maybe we should just use
let metadata_location =
MetadataLocation::new_with_properties(metadata_location, None)
.to_string();
so as to not do any path modification before writing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original code was wrong. As best I can test warehouseLocation is actually the table location.
I've updated it to assign to table_location and then assign it below (as far as I can tell, the .location field is meant to be table location and not the metadata.json file).
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, a few nit comments.
would be good to get a second pair of 👀
|
|
||
| /// Creates a completely new metadata location starting at version 0, | ||
| /// with compression settings from the table's properties. | ||
| /// Only used for creating a new table. For updates, see `with_next_version`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// Only used for creating a new table. For updates, see `with_next_version`. | |
| /// Only used for creating a new table. For updates, see `with_next_version_and_properties`. |
since with_next_version is also deprecated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll wait till I understand @liurenjie1024 concerns about with_next_version_and_properties to update this.
| .map_err(from_aws_sdk_error)?; | ||
| let warehouse_location = get_resp.warehouse_location().to_string(); | ||
| MetadataLocation::new_with_table_location(warehouse_location).to_string() | ||
| get_resp.warehouse_location().to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, from the comment above (L451-452) it seems like the "s3tables catalog" generates the metadata file location and does not allow any modification
so maybe we should just use
let metadata_location =
MetadataLocation::new_with_properties(metadata_location, None)
.to_string();
so as to not do any path modification before writing
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @emkornfield for this pr, looks much better now, still need some refinements.
| /// Creates a completely new metadata location starting at version 0, | ||
| /// with compression settings from the table's properties. | ||
| /// Only used for creating a new table. For updates, see `with_next_version`. | ||
| pub fn new_with_properties( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pub fn new_with_properties( | |
| pub fn new_table(table: &Table) |
We may do refactoring to table properties, and we may need other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue I have with taking a table here is that metadata location is an input to building a table, how about taking TableMetadataRef as input?
Or are you suggesting changing the return type of Table, so that this method updates it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @emkornfield changing the input param to take a &TableMetadata sounds good to me.
| since = "0.8.0", | ||
| note = "Use with_next_version_and_properties instead to properly handle compression settings changes" | ||
| )] | ||
| pub fn with_next_version(&self) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need change this method, it's not a constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as the comment below, lets consolidate conversation there.
| /// Creates a new metadata location for an updated metadata file. | ||
| /// Takes table properties to determine compression settings, which may have changed | ||
| /// from the previous version. | ||
| pub fn with_next_version_and_properties(&self, properties: &HashMap<String, String>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what you mean by we don't need it? I updated crates/iceberg/src/catalog/mod.rs to use it, if new properties change compression settings, I'm not sure how this would be reflected otherwise? I might have missed it but I don't think there is anything in the spec that prevents users from changing compression properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If properties has been changes, we should build a new instance of this MetadataLocation struct.
| properties | ||
| .get(TableProperties::PROPERTY_METADATA_COMPRESSION_CODEC) | ||
| .and_then(|codec| match codec.to_lowercase().as_str() { | ||
| "gzip" => Some(".gz".to_string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make ".gz" a constant rather a magic literal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
…nfield/iceberg-rust into feat/metadata-compression
Which issue does this PR close?
Split off from #1851
What changes are included in this PR?
This change honors the compression setting for metadata.json file (
write.metadata.compression-codec).Are these changes tested?
Add unit test to verify files are gzipped when the flag is enabled.