Tables with files imported to Iceberg using add_files contain inconsistent format-version manifest fields when using format-version 2 #13667

@mihirsamdarshi

Apache Iceberg version

1.9.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

Hi, I have noticed a potential issue when using the add_files procedure.

When files are imported with this procedure, the resulting manifest files have their format-version header set to 1, even though the table was created with format-version 2 and the metadata JSON file also states format-version 2.

I've been documenting this issue in duckdb/duckdb-iceberg#374, since DuckDB is unable to read manifest files whose format-version differs from the one in the table's metadata JSON file. DuckDB throws an error because the content field of the data_file struct is missing from every record in the manifest.
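To make the mismatch concrete, here is a minimal Java sketch of the kind of consistency check a strict reader could apply. It is not DuckDB's or Iceberg's actual code: the class `ManifestVersionCheck` and the method `describeMismatch` are hypothetical names, and the manifest header is modeled as a plain `Map<String, String>`. The only detail taken from this report is that the manifest header carries a `format-version` entry.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Iceberg or DuckDB.
final class ManifestVersionCheck {
    private ManifestVersionCheck() {}

    // Returns null when the manifest header agrees with the table metadata,
    // otherwise a short description of the mismatch.
    static String describeMismatch(int tableFormatVersion, Map<String, String> manifestHeader) {
        String raw = manifestHeader.get("format-version");
        if (raw == null) {
            return "manifest header has no format-version entry";
        }
        int manifestVersion = Integer.parseInt(raw.trim());
        if (manifestVersion != tableFormatVersion) {
            return "table metadata says format-version " + tableFormatVersion
                + " but manifest header says " + manifestVersion;
        }
        return null;
    }

    public static void main(String[] args) {
        // The situation described in this issue: a v2 table with a v1 manifest.
        System.out.println(describeMismatch(2, Map.of("format-version", "1")));
    }
}
```

With the values from this issue (table at 2, manifest header at 1) the check reports a mismatch, which is the condition a strict reader would refuse to proceed on.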

I believe this originates in the add_files procedure. While tracing the serialization of the output Avro manifest file, I noticed that the content field is optional in the DataFile interface:

Types.NestedField CONTENT =
    optional(
        134,
        "content",
        IntegerType.get(),
        "Contents of the file: 0=data, 1=position deletes, 2=equality deletes");
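As an illustration of why an optional content field matters to readers, the sketch below decodes the integer codes listed in the field's doc string (0=data, 1=position deletes, 2=equality deletes). The class `ContentFieldSketch` and its `decode` method are hypothetical, and treating a missing value as a data file is an assumption about how a lenient reader might behave, not something confirmed in this report. A strict reader would instead reject the missing field, which matches the error seen from DuckDB.

```java
// Hypothetical sketch for illustration only; not Iceberg's own enum or decoding logic.
final class ContentFieldSketch {
    private ContentFieldSketch() {}

    enum Kind { DATA, POSITION_DELETES, EQUALITY_DELETES }

    // Maps the integer stored in the "content" field to a file kind.
    static Kind decode(Integer content) {
        if (content == null) {
            // Assumption: a lenient reader falls back to "data" when the field is absent.
            return Kind.DATA;
        }
        switch (content) {
            case 0: return Kind.DATA;
            case 1: return Kind.POSITION_DELETES;
            case 2: return Kind.EQUALITY_DELETES;
            default: throw new IllegalArgumentException("unknown content id: " + content);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(null)); // field missing from the record
        System.out.println(decode(2));    // explicit equality-delete code
    }
}
```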

I also noticed that there is no format-version dependent logic (to account for the required content field) in constructing the DataFiles:

buildDataFile(fileStatus.get(index), partitionValues, spec, metrics, "parquet");

Or serializing them:

.map(
    (MapFunction<DataFile, Tuple2<String, DataFile>>)
        file -> Tuple2.apply(file.location(), file),
    Encoders.tuple(Encoders.STRING(), Encoders.javaSerialization(DataFile.class)))

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
