fix: support reading compressed metadata #1802
base: main
Conversation
force-pushed from 654de6b to cd16381
```diff
  let metadata_content = input_file.read().await?;
- let metadata = serde_json::from_slice::<TableMetadata>(&metadata_content)?;
+ let metadata = if metadata_location.as_ref().ends_with(".gz.metadata.json") {
```
Do we want to optionally support the Java Iceberg alternative?
The Java reference implementation can additionally read GZIP-compressed files with the suffix `metadata.json.gz`.
Seems better to have one convention, to me, but happy either way.
Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it. The wording of the spec ("some implementations require") seems to suggest it would be better to have no naming requirement at all.
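Sniffing the gzip magic number rather than relying on the filename could look roughly like this (a minimal stdlib-only sketch; `is_gzip` is a hypothetical helper name, not code from this PR):

```rust
/// Hypothetical helper: true if `buf` starts with the gzip magic
/// bytes 0x1F 0x8B (RFC 1952), regardless of the file's name.
fn is_gzip(buf: &[u8]) -> bool {
    buf.len() >= 2 && buf[0] == 0x1F && buf[1] == 0x8B
}

fn main() {
    // A gzip stream always begins with the two magic bytes,
    // followed by the compression method (0x08 = deflate).
    let gzipped_header = [0x1Fu8, 0x8B, 0x08, 0x00];
    // Plain (uncompressed) JSON metadata starts with '{' instead.
    let plain_json = br#"{"format-version":2}"#;

    assert!(is_gzip(&gzipped_header));
    assert!(!is_gzip(plain_json));
    println!("ok");
}
```

This avoids imposing any naming requirement at all, at the cost of always buffering at least the first two bytes before deciding how to parse.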
> Even better would be peeking at the file and looking for the gzip magic number. If there's interest in that I can implement it.
That would be a really elegant solution, I think.
Ok, done!
force-pushed from 9892bae to 011512a
Minor performance nit.
```rust
use _serde::TableMetadataEnum;
use chrono::{DateTime, Utc};
use flate2::read::GzDecoder;
```
When you go to read `metadata_content` it's already in memory as a `&[u8]`, so I think we should use `flate2::bufread::GzDecoder` here. It might be an imperceptible performance difference, but you never know how big metadata might get :)
Hm, should be the opposite, no? With `bufread` we'll pay for an extra copy, but the "syscalls" (`read`) are free.
Yeah you're right, I had it backwards in my head, sorry about that!
Thanks @colinmarc!
force-pushed from 011512a to 2d87efe
The spec mentions that metadata files "may be compressed with GZIP",
here:
https://iceberg.apache.org/spec/#table-metadata-and-snapshots
force-pushed from 2d87efe to 453dadc
Just found one case (
Thanks @colinmarc for this PR!
```rust
let mut decompressed_data = Vec::new();
decoder
    .read_to_end(&mut decompressed_data)
    .map_err(|e| Error::new(ErrorKind::DataInvalid, e.to_string()))?;
```
Suggested change:

```diff
-    .map_err(|e| Error::new(ErrorKind::DataInvalid, e.to_string()))?;
+    .map_err(|e| Error::new(ErrorKind::DataInvalid, "Trying to read compressed metadata file")
+        .with_context("file_path", metadata_location)
+        .with_source(e))?;
```
To make error reporting better.
```rust
let metadata = if metadata_content.len() > 2
    && metadata_content[0] == 0x1F
    && metadata_content[1] == 0x8B
{
```
Add a debug log here to explain why we chose to try to decompress it?
The spec mentions this naming convention here:
https://iceberg.apache.org/spec/#naming-for-gzip-compressed-metadata-json-files
Which issue does this PR close?
What changes are included in this PR?
Support for reading compressed metadata.
Are these changes tested?
Yes.