Description
Apache Iceberg version
1.9.2 (latest release)
Query engine
Spark
Please describe the bug 🐞
Hi, I have noticed a potential issue when using the add_files procedure.
When using this procedure, the manifest files it writes set the format-version header to 1, even though the table was created with format-version 2 and the metadata JSON file also states format-version 2.
I've been documenting this issue in duckdb/duckdb-iceberg#374, since DuckDB cannot read manifest files whose format-version differs from the one in the table's metadata JSON file. DuckDB throws an error because the content field of the data_file struct is missing from every record in the manifest.
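For anyone who wants to confirm the mismatch without a query engine, here is a minimal sketch that reads the key/value metadata from the header of an Avro object container file using only the JDK. The class name `ManifestHeaderCheck` and its methods are hypothetical helpers written for this illustration, not part of Iceberg. It assumes the manifest stores its version under the `format-version` metadata key, as the Iceberg spec describes for manifest files.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Iceberg code base.
class ManifestHeaderCheck {
  private static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

  /** Reads the key/value metadata from the header of an Avro object container file. */
  static Map<String, String> readAvroMetadata(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    byte[] magic = new byte[4];
    data.readFully(magic);
    if (!Arrays.equals(magic, AVRO_MAGIC)) {
      throw new IOException("Not an Avro object container file");
    }
    Map<String, String> metadata = new LinkedHashMap<>();
    // The metadata is an Avro map: blocks of (count, entries...), ended by a zero count.
    for (long count = readLong(data); count != 0; count = readLong(data)) {
      if (count < 0) {
        count = -count;
        readLong(data); // a negative count is followed by the block size in bytes
      }
      for (long i = 0; i < count; i++) {
        String key = new String(readBytes(data), StandardCharsets.UTF_8);
        String value = new String(readBytes(data), StandardCharsets.UTF_8);
        metadata.put(key, value);
      }
    }
    return metadata;
  }

  /** Returns the value of the "format-version" header key, or null if it is absent. */
  static String formatVersion(Path manifest) throws IOException {
    try (InputStream in = Files.newInputStream(manifest)) {
      return readAvroMetadata(in).get("format-version");
    }
  }

  // Decodes an Avro long: variable-length, zigzag encoded.
  private static long readLong(DataInputStream data) throws IOException {
    long raw = 0;
    int shift = 0;
    int b;
    do {
      b = data.readUnsignedByte();
      raw |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return (raw >>> 1) ^ -(raw & 1);
  }

  // Decodes Avro bytes or string: a long length followed by that many bytes.
  private static byte[] readBytes(DataInputStream data) throws IOException {
    byte[] bytes = new byte[(int) readLong(data)];
    data.readFully(bytes);
    return bytes;
  }
}
```

Calling `ManifestHeaderCheck.formatVersion(path)` on a manifest written by add_files and comparing the result with the format-version in the table's metadata JSON should show whether the two disagree.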
I believe that this occurs in the add_files procedure. While tracing the serialization of the output Avro manifest file, I noticed that the content field is optional in the DataFile interface:
iceberg/api/src/main/java/org/apache/iceberg/DataFile.java
Lines 37 to 42 in f25e07d
    Types.NestedField CONTENT =
        optional(
            134,
            "content",
            IntegerType.get(),
            "Contents of the file: 0=data, 1=position deletes, 2=equality deletes");
I also noticed that there is no format-version dependent logic (to account for the required content field) in constructing the DataFiles:
    buildDataFile(fileStatus.get(index), partitionValues, spec, metrics, "parquet");
Or serializing them:
iceberg/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
Lines 804 to 807 in 54a62ae
    .map(
        (MapFunction<DataFile, Tuple2<String, DataFile>>)
            file -> Tuple2.apply(file.location(), file),
        Encoders.tuple(Encoders.STRING(), Encoders.javaSerialization(DataFile.class)))
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time