Skip to content

Spark: Pass path/length to underlying FileIO in SerializableFileIOWithSize#16284

Merged
nastra merged 1 commit into
apache:mainfrom
ajayky-os:fix/extra_metadata_api_calls
May 12, 2026
Merged

Spark: Pass path/length to underlying FileIO in SerializableFileIOWithSize#16284
nastra merged 1 commit into
apache:mainfrom
ajayky-os:fix/extra_metadata_api_calls

Conversation

@ajayky-os
Copy link
Copy Markdown
Contributor

Fix for issue : #16283

Testing Done:
Ran a suite of TPCDS benchmark using GCSFileIO and verified that GetObjectMetadata calls are not made after this fix.

@github-actions github-actions Bot added the spark label May 11, 2026
@huaxingao
Copy link
Copy Markdown
Contributor

@ajayky-os Thanks for the PR! Could you please add a test?

@ajayky-os ajayky-os force-pushed the fix/extra_metadata_api_calls branch 2 times, most recently from 26a51e8 to cc4bad5 Compare May 12, 2026 04:56
When executing queries on cloud storage, SerializableFileIOWithSize
dropped the file length when intercepting FileIO.newInputFile(path, length)
requests. This caused underlying IO modules like GCSFileIO to execute
expensive and synchronous object metadata API calls to determine file sizes
when reading columnar footers.

This PR adds the missing newInputFile(String path, long length) override
to SerializableFileIOWithSize across all affected Spark modules, preserving
the length parameter and eliminating the unnecessary metadata lookups.
@ajayky-os ajayky-os force-pushed the fix/extra_metadata_api_calls branch from cc4bad5 to d9f6735 Compare May 12, 2026 04:58
@ajayky-os
Copy link
Copy Markdown
Contributor Author

@ajayky-os Thanks for the PR! Could you please add a test?

Done.

@nastra nastra changed the title Spark: Fix dropped file length in SerializableFileIOWithSize Spark: Override newInputFile(String path, long length) in SerializableFileIOWithSize May 12, 2026
@nastra nastra changed the title Spark: Override newInputFile(String path, long length) in SerializableFileIOWithSize Spark: Pass path/length to underlying FileIO in SerializableFileIOWithSize May 12, 2026
@steveloughran
Copy link
Copy Markdown
Contributor

This is where some minimal benchmarking against cloud storage would be good, though if the clients and streams collected basic stats on operations executed the checks could be done by assertions rather than benchmarks

Copy link
Copy Markdown
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, do you know is there is a visible TPC-DS impact if yes how much ?

asking if we should include this in 1.11 release ? @nastra

@nastra
Copy link
Copy Markdown
Contributor

nastra commented May 12, 2026

LGTM, do you know is there is a visible TPC-DS impact if yes how much ?

asking if we should include this in 1.11 release ? @nastra

this makes sense to me to include this in 1.11.0

@nastra nastra added this to the Iceberg 1.11.0 milestone May 12, 2026
@nastra nastra merged commit 575446e into apache:main May 12, 2026
28 checks passed
@steveloughran
Copy link
Copy Markdown
Contributor

Can I flag that without #15470 in, this means that position delete files on GCS may not be read so reliably.

@kevinjqliu
Copy link
Copy Markdown
Contributor

btw calling out that this is not added to spark 3.4 since SerializableFileIOWithSize was not added there to begin with

kevinjqliu added a commit that referenced this pull request May 13, 2026
Backport of #15683 (and length fix #16284) to spark/v3.4.

Note: BaseReader required an adaptation \u2014 v3.4 still used the legacy
table.encryption().decrypt(...) path. Switched it to fileIO.bulkDecrypt(...)
to match v3.5/4.0/4.1, since the broadcast FileIO is now an
EncryptingFileIO (combined in the constructor). All other files match the
v3.5 patch byte-for-byte (with paths translated).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants