Apache Iceberg version
main (development)
Query engine
Spark
Please describe the bug 🐞
Description
When executing Spark queries on Iceberg tables stored in cloud storage (e.g., GCS), SerializableFileIOWithSize fails to override the newInputFile(String path, long length) method, causing the file length to be dropped during Spark execution.
Root Cause
When an Iceberg reader on an executor attempts to open a file using a DataFile object or an explicit length, the Java runtime falls back to the default implementation in the FileIO interface:
```java
// api/src/main/java/org/apache/iceberg/io/FileIO.java
default InputFile newInputFile(String path, long length) {
  return newInputFile(path); // <--- Length property is discarded here!
}
```
Because SerializableFileIOWithSize does not override this method, the length is lost before the call reaches the underlying IO implementation (e.g., GCSFileIO).
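The fallback can be reproduced with a minimal, self-contained sketch. The `FileIO`/`InputFile` types below are simplified stand-ins for the real Iceberg interfaces, and `DelegatingIO` is a hypothetical wrapper that mirrors the pattern of SerializableFileIOWithSize (all names here are illustrative, not the actual Iceberg classes):

```java
import java.util.Objects;

// Simplified stand-ins for the Iceberg types (illustrative only).
record InputFile(String path, Long length) {}

interface FileIO {
  InputFile newInputFile(String path);

  // Mirrors FileIO's default: the caller-supplied length is silently dropped.
  default InputFile newInputFile(String path, long length) {
    return newInputFile(path);
  }
}

// A concrete IO that preserves the length when given one (like GCSFileIO).
class CloudIO implements FileIO {
  @Override public InputFile newInputFile(String path) {
    return new InputFile(path, null); // size unknown: a metadata call would be needed
  }
  @Override public InputFile newInputFile(String path, long length) {
    return new InputFile(path, length); // size known: no metadata call
  }
}

// A delegating wrapper that, like SerializableFileIOWithSize, overrides only
// the single-arg method. The two-arg call therefore falls back to the
// interface default above, and the length never reaches the delegate.
class DelegatingIO implements FileIO {
  private final FileIO delegate;

  DelegatingIO(FileIO delegate) {
    this.delegate = Objects.requireNonNull(delegate);
  }

  @Override public InputFile newInputFile(String path) {
    return delegate.newInputFile(path);
  }

  // The missing override, sketched here commented out; forwarding both
  // arguments keeps the known length:
  // @Override public InputFile newInputFile(String path, long length) {
  //   return delegate.newInputFile(path, length);
  // }
}

public class Demo {
  public static void main(String[] args) {
    FileIO wrapped = new DelegatingIO(new CloudIO());
    // Length 1024 is passed in but comes back null through the wrapper.
    System.out.println(wrapped.newInputFile("gs://bucket/data.parquet", 1024L).length());
  }
}
```

Uncommenting the two-arg override in `DelegatingIO` makes the wrapper preserve the length, which is the shape of the fix needed in SerializableFileIOWithSize.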
Impact
When the length is dropped, IO implementations like GCSFileIO instantiate GCSInputStream with a null file size.
When reading columnar formats (Parquet/ORC), Iceberg needs the file size to locate the footer (readTail). If the size is unknown, the IO driver is forced to execute a synchronous, blocking metadata API call (e.g., storage.get() in GCS) to determine the size.
This results in tens of thousands of unnecessary GetObjectMetadata calls, significantly degrading query performance and increasing cost. In local testing, adding the override closed the performance gap.
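To illustrate why the reader needs the size up front: Parquet stores its footer at the end of the file (footer bytes, then a 4-byte little-endian footer length, then the "PAR1" magic), so locating the footer means seeking to size - 8. The sketch below builds a fake tail in memory to show the arithmetic; it is a simplified illustration, not Iceberg's readTail implementation:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FooterDemo {
  // Given the full file bytes, find where the footer starts: read the 4-byte
  // footer length stored just before the trailing 4-byte magic.
  static long footerOffset(byte[] file) {
    ByteBuffer tail = ByteBuffer.wrap(file, file.length - 8, 8)
        .order(ByteOrder.LITTLE_ENDIAN);
    int footerLen = tail.getInt();
    return file.length - 8 - footerLen;
  }

  public static void main(String[] args) {
    // Build a 100-byte "file" whose tail is: FOOTER | len=6 | PAR1
    byte[] footer = "FOOTER".getBytes();
    ByteBuffer buf = ByteBuffer.allocate(100).order(ByteOrder.LITTLE_ENDIAN);
    buf.position(100 - 8 - footer.length);
    buf.put(footer).putInt(footer.length).put("PAR1".getBytes());
    System.out.println(footerOffset(buf.array())); // prints 86
  }
}
```

Without a known file size, the seek target cannot be computed, which is exactly what forces the extra metadata round trip described above.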
Willingness to contribute