Core: Combine 3 GET requests for parquet reads #16729

Open
varun-lakhyani wants to merge 2 commits into apache:main from varun-lakhyani:combine-getfooter-getdata

@varun-lakhyani

Contributor

For small-file workloads (compaction, manifests, small data files), each Parquet read issues separate GET requests for the footer and the data, so footer request latency grows linearly with file count and becomes a scaling bottleneck.

Solution

  • Introduce SingleFetchInputFile, an InputFile decorator in org.apache.iceberg.io. When a file's size is at or below read.single-fetch-threshold-bytes, newStream() fetches the entire file in one GET and returns a SingleFetchInputStream backed by the downloaded bytes. All subsequent parquet-mr reads - footer size, footer body, row group data - are served from memory with no additional remote calls.
  • Files above the threshold, and all files when the property is unset (0, the default), pass through unchanged with no behavioral difference from today.
  • SingleFetchInputStream implements both SeekableInputStream and RangeReadable, preserving parquet-mr's vectorized read path exactly.
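
The decorator behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SimpleInputFile`, `SimpleSeekableStream`, `InMemorySeekableStream`, `SingleFetchFileSketch`, and `CountingFakeFile` are hypothetical stand-ins for `InputFile`, `SeekableInputStream`/`RangeReadable`, and a remote store, and they use unchecked exceptions for brevity. The request counting is a toy model that treats opening a stream and each positional read as one GET.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-ins (assumptions for illustration, not Iceberg's real interfaces).
interface SimpleInputFile {
  long getLength();
  SimpleSeekableStream newStream();
}

abstract class SimpleSeekableStream extends InputStream {
  abstract long getPos();
  abstract void seek(long newPos);
  // Positional read that leaves the stream position untouched.
  abstract void readFully(long position, byte[] buffer, int offset, int length);
}

// Serves sequential and positional reads from bytes that are already in memory.
class InMemorySeekableStream extends SimpleSeekableStream {
  private final byte[] data;
  private int pos = 0;

  InMemorySeekableStream(byte[] data) { this.data = data; }

  @Override long getPos() { return pos; }

  @Override void seek(long newPos) {
    if (newPos < 0 || newPos > data.length) {
      throw new IllegalArgumentException("Seek out of range: " + newPos);
    }
    pos = (int) newPos;
  }

  @Override public int read() {
    return pos < data.length ? (data[pos++] & 0xFF) : -1;
  }

  @Override public int read(byte[] b, int off, int len) {
    if (len == 0) return 0;
    if (pos >= data.length) return -1;
    int n = Math.min(len, data.length - pos);
    System.arraycopy(data, pos, b, off, n);
    pos += n;
    return n;
  }

  @Override void readFully(long position, byte[] buffer, int offset, int length) {
    if (position < 0 || position + length > data.length) {
      throw new IndexOutOfBoundsException("Range exceeds file length");
    }
    System.arraycopy(data, (int) position, buffer, offset, length);
  }
}

// Decorator: small files are downloaded once; everything else passes through.
class SingleFetchFileSketch implements SimpleInputFile {
  private final SimpleInputFile delegate;
  private final long thresholdBytes; // 0 (or less) disables single fetch

  SingleFetchFileSketch(SimpleInputFile delegate, long thresholdBytes) {
    this.delegate = delegate;
    this.thresholdBytes = thresholdBytes;
  }

  @Override public long getLength() { return delegate.getLength(); }

  @Override public SimpleSeekableStream newStream() {
    if (thresholdBytes <= 0 || delegate.getLength() > thresholdBytes) {
      return delegate.newStream(); // unchanged behavior
    }
    try (SimpleSeekableStream remote = delegate.newStream()) {
      return new InMemorySeekableStream(remote.readAllBytes()); // one sequential read
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

// Toy remote file: opening a stream and each positional read count as one request.
class CountingFakeFile implements SimpleInputFile {
  final AtomicInteger requests = new AtomicInteger();
  private final byte[] data;

  CountingFakeFile(byte[] data) { this.data = data; }

  @Override public long getLength() { return data.length; }

  @Override public SimpleSeekableStream newStream() {
    requests.incrementAndGet();
    return new InMemorySeekableStream(data) {
      @Override void readFully(long position, byte[] buffer, int offset, int length) {
        requests.incrementAndGet();
        super.readFully(position, buffer, offset, length);
      }
    };
  }
}
```

In this toy model, wrapping a `CountingFakeFile` with a threshold larger than the file and then doing two positional reads (a tail read and a head read) records a single request, while a threshold of 0 records three, since the reads go to the delegate's stream.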

@varun-lakhyani

Contributor Author

Combining the S3 GET requests alone gives a 60-65% improvement, with further gains possible by parallelising them.
image

@varun-lakhyani varun-lakhyani force-pushed the combine-getfooter-getdata branch from b85a11d to 9a67fc3 Compare June 8, 2026 21:14