
Spark: SerializableFileIOWithSize drops file length causing performance regression in Cloud Storage #16283

@ajayky-os

Apache Iceberg version

main (development)

Query engine

Spark

Please describe the bug 🐞

Description
When Spark queries run against Iceberg tables stored in cloud storage (e.g., GCS), SerializableFileIOWithSize does not override the newInputFile(String path, long length) method, so the file length is dropped during Spark execution.

Root Cause
When an Iceberg reader on an executor attempts to open a file using a DataFile object or an explicit length, the Java runtime falls back to the default implementation in the FileIO interface:

// api/src/main/java/org/apache/iceberg/io/FileIO.java
default InputFile newInputFile(String path, long length) {
   return newInputFile(path); // <--- Length property is discarded here!
}

Because SerializableFileIOWithSize does not override this, the length is lost before the call reaches the underlying IO implementation (e.g., GCSFileIO).
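To make the fallback concrete, here is a minimal, self-contained sketch of the delegation problem and one possible fix. The `InputFile` and `FileIO` interfaces below are simplified stand-ins, not the real Iceberg types, and `knownLength()`, `LengthAwareIO`, `WrapperWithoutOverride`, and `WrapperWithOverride` are hypothetical names introduced only for illustration. The actual change to SerializableFileIOWithSize may look different.

```java
// Simplified stand-ins for the Iceberg interfaces (assumption: not the real API).
interface InputFile {
  String location();

  // Hypothetical accessor: null means the caller never supplied a length.
  Long knownLength();
}

interface FileIO {
  InputFile newInputFile(String path);

  // Mirrors the default quoted above: the length argument is discarded.
  default InputFile newInputFile(String path, long length) {
    return newInputFile(path);
  }
}

class SimpleInputFile implements InputFile {
  private final String location;
  private final Long length;

  SimpleInputFile(String location, Long length) {
    this.location = location;
    this.length = length;
  }

  @Override
  public String location() {
    return location;
  }

  @Override
  public Long knownLength() {
    return length;
  }
}

// Hypothetical underlying IO that honors a supplied length, as a cloud FileIO would.
class LengthAwareIO implements FileIO {
  @Override
  public InputFile newInputFile(String path) {
    return new SimpleInputFile(path, null);
  }

  @Override
  public InputFile newInputFile(String path, long length) {
    return new SimpleInputFile(path, length);
  }
}

// Wrapper that only forwards the one-argument method, so it inherits the lossy default.
class WrapperWithoutOverride implements FileIO {
  private final FileIO delegate;

  WrapperWithoutOverride(FileIO delegate) {
    this.delegate = delegate;
  }

  @Override
  public InputFile newInputFile(String path) {
    return delegate.newInputFile(path);
  }
}

// Wrapper that also forwards the two-argument method, preserving the length.
class WrapperWithOverride extends WrapperWithoutOverride {
  private final FileIO delegate;

  WrapperWithOverride(FileIO delegate) {
    super(delegate);
    this.delegate = delegate;
  }

  @Override
  public InputFile newInputFile(String path, long length) {
    return delegate.newInputFile(path, length);
  }
}

class LengthForwardingDemo {
  public static void main(String[] args) {
    FileIO io = new LengthAwareIO();
    String path = "gs://bucket/data.parquet"; // illustrative path
    System.out.println(new WrapperWithoutOverride(io).newInputFile(path, 1024L).knownLength());
    System.out.println(new WrapperWithOverride(io).newInputFile(path, 1024L).knownLength());
  }
}
```

In this sketch the only difference between the two wrappers is the extra override, which suggests the fix could be a small forwarding method on the wrapper rather than a change to the underlying IO implementations.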

Impact
When the length is dropped, IO implementations like GCSFileIO instantiate GCSInputStream with a null file size.

When reading columnar formats (Parquet/ORC), Iceberg needs the file size to locate the footer (readTail). If the size is unknown, the IO driver is forced to execute a synchronous, blocking metadata API call (e.g., storage.get() in GCS) to determine the size.
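The cost of an unknown size can be sketched with a toy model. Everything here is hypothetical: `FakeObjectStore` stands in for a cloud storage client, `LazyLengthFile` for an input file that resolves its length on demand, and the file count and size are arbitrary. It does not reproduce GCSFileIO internals; it only shows that one metadata lookup is paid per file whenever the length was not forwarded.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a cloud storage client; it only counts metadata lookups.
class FakeObjectStore {
  final AtomicInteger metadataCalls = new AtomicInteger();

  long fetchSize(String path) {
    metadataCalls.incrementAndGet(); // in a real store this would be a blocking remote call
    return 4096L; // arbitrary size for the sketch
  }
}

// Hypothetical input file that looks up its length only when none was supplied.
class LazyLengthFile {
  private final FakeObjectStore store;
  private final String path;
  private Long length; // null means "unknown"

  LazyLengthFile(FakeObjectStore store, String path, Long length) {
    this.store = store;
    this.path = path;
    this.length = length;
  }

  long getLength() {
    if (length == null) {
      length = store.fetchSize(path); // extra round trip caused by the missing length
    }
    return length;
  }
}

class MetadataCallDemo {
  // Opens `files` files and asks each for its length, as a footer read would need to.
  static int metadataCallsFor(int files, boolean lengthKnown) {
    FakeObjectStore store = new FakeObjectStore();
    for (int i = 0; i < files; i++) {
      Long length = lengthKnown ? Long.valueOf(4096L) : null;
      new LazyLengthFile(store, "gs://bucket/data-" + i + ".parquet", length).getLength();
    }
    return store.metadataCalls.get();
  }

  public static void main(String[] args) {
    System.out.println("length dropped:   " + metadataCallsFor(100, false) + " metadata calls");
    System.out.println("length forwarded: " + metadataCallsFor(100, true) + " metadata calls");
  }
}
```

Under these assumptions the number of metadata calls grows linearly with the number of files opened when the length is dropped, and stays at zero when it is forwarded.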

This results in tens of thousands of unnecessary GetObjectMetadata calls, which significantly degrades query performance and increases cost. In local testing, fixing this closed the performance gap.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels: bug
