Spark: Pass path/length to underlying FileIO in SerializableFileIOWithSize #16284
Conversation
@ajayky-os Thanks for the PR! Could you please add a test?
When executing queries on cloud storage, SerializableFileIOWithSize dropped the file length when intercepting FileIO.newInputFile(path, length) requests. This caused underlying IO modules like GCSFileIO to execute expensive and synchronous object metadata API calls to determine file sizes when reading columnar footers. This PR adds the missing newInputFile(String path, long length) override to SerializableFileIOWithSize across all affected Spark modules, preserving the length parameter and eliminating the unnecessary metadata lookups.
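A minimal sketch of what the missing override looks like, assuming the wrapper simply delegates to a wrapped FileIO field; the class and field names below are illustrative, not the exact Iceberg code:

```java
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Illustrative stand-in for SerializableFileIOWithSize; the real class has
// additional serialization concerns that are omitted here.
class SerializableFileIOWithSizeSketch implements FileIO {
  private final FileIO io;

  SerializableFileIOWithSizeSketch(FileIO io) {
    this.io = io;
  }

  @Override
  public InputFile newInputFile(String path) {
    return io.newInputFile(path);
  }

  // Without this override, FileIO's default newInputFile(path, length) drops
  // the known length and falls back to newInputFile(path), forcing cloud IO
  // implementations such as GCSFileIO to fetch object metadata for the size.
  @Override
  public InputFile newInputFile(String path, long length) {
    return io.newInputFile(path, length);
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return io.newOutputFile(path);
  }

  @Override
  public void deleteFile(String path) {
    io.deleteFile(path);
  }
}
```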
Done.
This is where some minimal benchmarking against cloud storage would be good. Though if the clients and streams collected basic stats on the operations executed, the checks could be done with assertions rather than benchmarks; see the sketch below.
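A rough illustration of that idea (not the reviewer's actual proposal), reusing the illustrative wrapper from the sketch above and assuming iceberg-core's InMemoryFileIO plus JUnit 5/AssertJ for test scaffolding; the counter records how often the length-less overload is reached, which on GCS would mean one GetObjectMetadata call per file:

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.concurrent.atomic.AtomicLong;
import org.apache.iceberg.inmemory.InMemoryFileIO;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.junit.jupiter.api.Test;

class SerializableFileIOWithSizeStatsTest {

  @Test
  void knownLengthShouldNotTriggerLengthlessLookups() {
    AtomicLong lengthlessLookups = new AtomicLong();

    InMemoryFileIO backing = new InMemoryFileIO();
    backing.addFile("mem://warehouse/data/file.parquet", new byte[1024]);

    // Counting delegate: every call to the length-less overload is recorded,
    // since that is the path that would cost a metadata round trip on GCS.
    FileIO counting = new FileIO() {
      @Override
      public InputFile newInputFile(String path) {
        lengthlessLookups.incrementAndGet();
        return backing.newInputFile(path);
      }

      @Override
      public InputFile newInputFile(String path, long length) {
        // Length already known: no lookup is recorded.
        return backing.newInputFile(path);
      }

      @Override
      public OutputFile newOutputFile(String path) {
        return backing.newOutputFile(path);
      }

      @Override
      public void deleteFile(String path) {
        backing.deleteFile(path);
      }
    };

    // Wrapper from the sketch above; with the fix it forwards the length.
    SerializableFileIOWithSizeSketch wrapper = new SerializableFileIOWithSizeSketch(counting);
    wrapper.newInputFile("mem://warehouse/data/file.parquet", 1024L);

    assertThat(lengthlessLookups.get()).isZero();
  }
}
```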
singhpk234 left a comment:
LGTM. Do you know if there is a visible TPC-DS impact, and if so, how much?
Asking whether we should include this in the 1.11 release. @nastra
This makes sense to me to include in 1.11.0.
Can I flag that without #15470 in, position delete files on GCS may not be read as reliably.
btw calling out that this is not added to spark 3.4 since |
Backport of #15683 (and length fix #16284) to spark/v3.4. Note: BaseReader required an adaptation; v3.4 still used the legacy table.encryption().decrypt(...) path. Switched it to fileIO.bulkDecrypt(...) to match v3.5/4.0/4.1, since the broadcast FileIO is now an EncryptingFileIO (combined in the constructor). All other files match the v3.5 patch byte-for-byte (with paths translated).
Fix for issue: #16283
Testing Done:
Ran a suite of TPC-DS benchmarks using GCSFileIO and verified that GetObjectMetadata calls are not made after this fix.