Spark: Pass path/length to underlying FileIO in SerializableFileIOWithSize #16284
Conversation
@ajayky-os Thanks for the PR! Could you please add a test?
When executing queries on cloud storage, SerializableFileIOWithSize dropped the file length when intercepting FileIO.newInputFile(path, length) requests. This caused underlying IO modules like GCSFileIO to execute expensive and synchronous object metadata API calls to determine file sizes when reading columnar footers. This PR adds the missing newInputFile(String path, long length) override to SerializableFileIOWithSize across all affected Spark modules, preserving the length parameter and eliminating the unnecessary metadata lookups.
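A minimal sketch of what the missing override looks like, assuming the wrapper simply delegates to a wrapped FileIO field; the class and field names below are illustrative, not the exact Iceberg code:

```java
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Illustrative stand-in for SerializableFileIOWithSize; the real class has
// additional serialization concerns that are omitted here.
class SerializableFileIOWithSizeSketch implements FileIO {
  private final FileIO io;

  SerializableFileIOWithSizeSketch(FileIO io) {
    this.io = io;
  }

  @Override
  public InputFile newInputFile(String path) {
    return io.newInputFile(path);
  }

  // Without this override, FileIO's default newInputFile(path, length) drops
  // the known length and falls back to newInputFile(path), forcing cloud IO
  // implementations such as GCSFileIO to fetch object metadata for the size.
  @Override
  public InputFile newInputFile(String path, long length) {
    return io.newInputFile(path, length);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return io.newOutputFile(path);
  }

  @Override
  public void deleteFile(String path) {
    io.deleteFile(path);
  }
}
```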
Done.
This is where some minimal benchmarking against cloud storage would be good. Though if the clients and streams collected basic stats on the operations executed, the checks could be done with assertions rather than benchmarks; see the sketch below.
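A rough illustration of that idea (not the reviewer's actual proposal), reusing the illustrative wrapper from the sketch above and assuming iceberg-core's InMemoryFileIO plus JUnit 5/AssertJ for test scaffolding; the counter records how often the length-less overload is reached, which on GCS would mean one GetObjectMetadata call per file:

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.concurrent.atomic.AtomicLong;
import org.apache.iceberg.inmemory.InMemoryFileIO;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.junit.jupiter.api.Test;

class SerializableFileIOWithSizeStatsTest {

  @Test
  void knownLengthShouldNotTriggerLengthlessLookups() {
    AtomicLong lengthlessLookups = new AtomicLong();

    InMemoryFileIO backing = new InMemoryFileIO();
    backing.addFile("mem://warehouse/data/file.parquet", new byte[1024]);

    // Counting delegate: every call to the length-less overload is recorded,
    // since that is the path that would cost a metadata round trip on GCS.
    FileIO counting = new FileIO() {
      @Override
      public InputFile newInputFile(String path) {
        lengthlessLookups.incrementAndGet();
        return backing.newInputFile(path);
      }

      @Override
      public InputFile newInputFile(String path, long length) {
        // Length already known: no lookup is recorded.
        return backing.newInputFile(path);
      }

      @Override
      public OutputFile newOutputFile(String path) {
        return backing.newOutputFile(path);
      }

      @Override
      public void deleteFile(String path) {
        backing.deleteFile(path);
      }
    };

    // Wrapper from the sketch above; with the fix it forwards the length.
    SerializableFileIOWithSizeSketch wrapper = new SerializableFileIOWithSizeSketch(counting);
    wrapper.newInputFile("mem://warehouse/data/file.parquet", 1024L);

    assertThat(lengthlessLookups.get()).isZero();
  }
}
```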
singhpk234 left a comment:
LGTM. Do you know if there is a visible TPC-DS impact, and if so, how much?
Asking whether we should include this in the 1.11 release. @nastra
This makes sense to me to include in 1.11.0.
Can I flag that without #15470 in, position delete files on GCS may not be read as reliably.
btw calling out that this is not added to spark 3.4 since |
Backport of #15683 (and length fix #16284) to spark/v3.4. Note: BaseReader required an adaptation; v3.4 still used the legacy table.encryption().decrypt(...) path. Switched it to fileIO.bulkDecrypt(...) to match v3.5/4.0/4.1, since the broadcast FileIO is now an EncryptingFileIO (combined in the constructor). All other files match the v3.5 patch byte-for-byte (with paths translated).
Fix for issue: #16283
Testing Done:
Ran a suite of TPC-DS benchmarks using GCSFileIO and verified that GetObjectMetadata calls are not made after this fix.