Apache Iceberg version
main (development)
Query engine
Spark
Please describe the bug 🐞
Description
When executing Spark queries on Iceberg tables stored in cloud storage (e.g., GCS), SerializableFileIOWithSize fails to override the newInputFile(String path, long length) method, causing the file length to be dropped during Spark execution.
Root Cause
When an Iceberg reader on an executor attempts to open a file using a DataFile object or an explicit length, the Java runtime falls back to the default implementation in the FileIO interface:
```java
// api/src/main/java/org/apache/iceberg/io/FileIO.java
default InputFile newInputFile(String path, long length) {
  return newInputFile(path); // <--- Length property is discarded here!
}
```
Because SerializableFileIOWithSize does not override this method, the length is lost before the call reaches the underlying IO implementation (e.g., GCSFileIO).
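The fallback can be reproduced with a minimal, self-contained sketch. The `FileIO`/`InputFile` types below are simplified stand-ins for the real Iceberg interfaces, and `DelegatingIO` is a hypothetical wrapper that mirrors the pattern of SerializableFileIOWithSize (all names here are illustrative, not the actual Iceberg classes):

```java
import java.util.Objects;

// Simplified stand-ins for the Iceberg types (illustrative only).
record InputFile(String path, Long length) {}

interface FileIO {
  InputFile newInputFile(String path);

  // Mirrors FileIO's default: the caller-supplied length is silently dropped.
  default InputFile newInputFile(String path, long length) {
    return newInputFile(path);
  }
}

// A concrete IO that preserves the length when given one (like GCSFileIO).
class CloudIO implements FileIO {
  @Override public InputFile newInputFile(String path) {
    return new InputFile(path, null); // size unknown: a metadata call would be needed
  }
  @Override public InputFile newInputFile(String path, long length) {
    return new InputFile(path, length); // size known: no metadata call
  }
}

// A delegating wrapper that, like SerializableFileIOWithSize, overrides only
// the single-arg method. The two-arg call therefore falls back to the
// interface default above, and the length never reaches the delegate.
class DelegatingIO implements FileIO {
  private final FileIO delegate;

  DelegatingIO(FileIO delegate) {
    this.delegate = Objects.requireNonNull(delegate);
  }

  @Override public InputFile newInputFile(String path) {
    return delegate.newInputFile(path);
  }

  // The missing override, sketched here commented out; forwarding both
  // arguments keeps the known length:
  // @Override public InputFile newInputFile(String path, long length) {
  //   return delegate.newInputFile(path, length);
  // }
}

public class Demo {
  public static void main(String[] args) {
    FileIO wrapped = new DelegatingIO(new CloudIO());
    // Length 1024 is passed in but comes back null through the wrapper.
    System.out.println(wrapped.newInputFile("gs://bucket/data.parquet", 1024L).length());
  }
}
```

Uncommenting the two-arg override in `DelegatingIO` makes the wrapper preserve the length, which is the shape of the fix needed in SerializableFileIOWithSize.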
Impact
When the length is dropped, IO implementations like GCSFileIO instantiate GCSInputStream with a null file size.
When reading columnar formats (Parquet/ORC), Iceberg needs the file size to locate the footer (readTail). If the size is unknown, the IO driver is forced to execute a synchronous, blocking metadata API call (e.g., storage.get() in GCS) to determine the size.
This results in tens of thousands of unnecessary GetObjectMetadata calls, significantly degrading query performance and increasing cost. In local testing, adding the override closed the performance gap.
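To illustrate why the reader needs the size up front: Parquet stores its footer at the end of the file (footer bytes, then a 4-byte little-endian footer length, then the "PAR1" magic), so locating the footer means seeking to size - 8. The sketch below builds a fake tail in memory to show the arithmetic; it is a simplified illustration, not Iceberg's readTail implementation:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FooterDemo {
  // Given the full file bytes, find where the footer starts: read the 4-byte
  // footer length stored just before the trailing 4-byte magic.
  static long footerOffset(byte[] file) {
    ByteBuffer tail = ByteBuffer.wrap(file, file.length - 8, 8)
        .order(ByteOrder.LITTLE_ENDIAN);
    int footerLen = tail.getInt();
    return file.length - 8 - footerLen;
  }

  public static void main(String[] args) {
    // Build a 100-byte "file" whose tail is: FOOTER | len=6 | PAR1
    byte[] footer = "FOOTER".getBytes();
    ByteBuffer buf = ByteBuffer.allocate(100).order(ByteOrder.LITTLE_ENDIAN);
    buf.position(100 - 8 - footer.length);
    buf.put(footer).putInt(footer.length).put("PAR1".getBytes());
    System.out.println(footerOffset(buf.array())); // prints 86
  }
}
```

Without a known file size, the seek target cannot be computed, which is exactly what forces the extra metadata round trip described above.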
Willingness to contribute