Conversation

@colinmarc
Contributor

The spec mentions this naming convention here:

https://iceberg.apache.org/spec/#naming-for-gzip-compressed-metadata-json-files

Which issue does this PR close?

What changes are included in this PR?

Support for reading compressed metadata.

Are these changes tested?

Yes.

@colinmarc colinmarc force-pushed the metadata-compressed branch 2 times, most recently from 654de6b to cd16381 Compare October 29, 2025 21:26
let metadata_content = input_file.read().await?;
let metadata = serde_json::from_slice::<TableMetadata>(&metadata_content)?;

let metadata = if metadata_location.as_ref().ends_with(".gz.metadata.json") {
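The suffix check above can be sketched as a small standalone helper, following the spec's naming convention for gzip-compressed metadata files (the function name is illustrative, not from the PR):

```rust
/// Returns true if the location follows the Iceberg spec's naming
/// convention for gzip-compressed metadata files (".gz.metadata.json").
fn is_compressed_metadata_location(location: &str) -> bool {
    location.ends_with(".gz.metadata.json")
}

fn main() {
    assert!(is_compressed_metadata_location("v3.gz.metadata.json"));
    assert!(!is_compressed_metadata_location("v3.metadata.json"));
    // The Java-style suffix mentioned in this thread is NOT matched:
    assert!(!is_compressed_metadata_location("v3.metadata.json.gz"));
}
```

Note that this only covers the spec's convention; the Java-style `metadata.json.gz` suffix discussed in this thread would need a separate check.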
Contributor

Do we want to optionally support the Java Iceberg alternative?

The Java reference implementation can additionally read GZIP compressed files with the suffix metadata.json.gz.

Contributor Author

Seems better to have one convention, to me, but happy either way.

Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it. The wording of the spec ("some implementations require") seems to suggest it would be better to have no naming requirement at all.
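The magic-number approach mentioned here relies on gzip streams always beginning with the two bytes 0x1F 0x8B (per RFC 1952), so the reader can detect compression with no naming convention at all. A minimal sketch (function name is illustrative only):

```rust
/// Returns true if the byte slice starts with the gzip magic number
/// (0x1F 0x8B), i.e. it is plausibly a gzip stream.
fn looks_like_gzip(content: &[u8]) -> bool {
    content.len() >= 2 && content[0] == 0x1F && content[1] == 0x8B
}

fn main() {
    // A gzip stream's first four bytes: magic, DEFLATE method, no flags.
    let gzip_header = [0x1Fu8, 0x8B, 0x08, 0x00];
    assert!(looks_like_gzip(&gzip_header));
    // Plain JSON metadata does not start with the magic number.
    assert!(!looks_like_gzip(b"{\"format-version\": 2}"));
    // Too short to carry the magic number.
    assert!(!looks_like_gzip(&[0x1F]));
}
```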

Contributor

Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it.

That would be a really elegant solution, I think.

Contributor Author

Ok, done!

@colinmarc colinmarc force-pushed the metadata-compressed branch 2 times, most recently from 9892bae to 011512a Compare October 30, 2025 07:37
Contributor

@mbutrovich mbutrovich left a comment

Minor performance nit.


use _serde::TableMetadataEnum;
use chrono::{DateTime, Utc};
use flate2::read::GzDecoder;
Contributor

When you go to read metadata_content, it's already in memory as a &[u8], so I think we should use flate2::bufread::GzDecoder here. It might be an imperceptible performance difference, but you never know how big metadata might get :)

Contributor Author

Hm, should be the opposite, no? With bufread we'll pay for an extra copy, but the "syscalls" (read) are free.

Contributor

Yeah you're right, I had it backwards in my head, sorry about that!

Contributor

@mbutrovich mbutrovich left a comment

Thanks @colinmarc!

@colinmarc colinmarc force-pushed the metadata-compressed branch from 011512a to 2d87efe Compare October 30, 2025 19:24
The spec mentions that metadata files "may be compressed with GZIP",
here:

    https://iceberg.apache.org/spec/#table-metadata-and-snapshots
@colinmarc colinmarc force-pushed the metadata-compressed branch from 2d87efe to 453dadc Compare October 30, 2025 19:25
@colinmarc
Contributor Author

Just found one case (StaticTable) that wasn't using TableMetadata::read_from. Fixed now.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @colinmarc for this PR!

let mut decompressed_data = Vec::new();
decoder
    .read_to_end(&mut decompressed_data)
    .map_err(|e| Error::new(ErrorKind::DataInvalid, e.to_string()))?;
Contributor

Suggested change
.map_err(|e| Error::new(ErrorKind::DataInvalid, e.to_string()))?;
.map_err(|e| {
    Error::new(ErrorKind::DataInvalid, "Trying to read compressed metadata file")
        .with_context("file_path", metadata_location)
        .with_source(e)
})?;

Contributor

To make error reporting better.

let metadata = if metadata_content.len() > 2
    && metadata_content[0] == 0x1F
    && metadata_content[1] == 0x8B
{
Contributor

Add a debug log here to explain why we chose to try to decompress it?


Development

Successfully merging this pull request may close these issues.

FR: support compressed metadata