Use OffsetIndex to Prune IO in ParquetRecordBatchStream #2426
Labels
enhancement
Any new improvement worthy of an entry in the changelog
parquet
Changes to the parquet crate
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Even when there is an active row selection, async_reader::ReaderFactory::read_row_group will still fetch an entire column chunk from object storage.
Describe the solution you'd like
If we have an OffsetIndex, we can identify the pages that overlap with the row selection and fetch only the corresponding byte ranges.
Describe alternatives you've considered
We could choose not to do this.
Additional context
This will likely benefit from ObjectStore::get_ranges, added in #2336, being integrated into DataFusion, so that the more granular ranges don't cause a regression by making lots of small get requests.
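For illustration, the proposed pruning could look roughly like the sketch below. It is not the parquet crate's implementation. PageLocation here is a local stand-in that mirrors the fields of the Parquet format's page location entries, the selection is simplified to a list of selected row ranges, and byte_ranges_for_selection and max_gap are hypothetical names. The sketch assumes pages are sorted by offset and ignores dictionary pages, which a real reader would presumably also need to fetch.

```rust
use std::ops::Range;

/// Local stand-in for one OffsetIndex entry (assumption: not the crate's type).
#[derive(Debug, Clone, Copy)]
struct PageLocation {
    /// Byte offset of the page within the file.
    offset: u64,
    /// Size of the page in bytes, including its header.
    compressed_page_size: u64,
    /// Index of the first row of the page within the row group.
    first_row_index: u64,
}

/// Hypothetical helper: returns the byte ranges of every page that overlaps
/// at least one selected row range. Ranges separated by at most `max_gap`
/// bytes are merged, so fewer (but larger) requests are issued.
fn byte_ranges_for_selection(
    pages: &[PageLocation],
    total_rows: u64,
    selected_rows: &[Range<u64>],
    max_gap: u64,
) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        // A page ends where the next one starts; the last page ends at total_rows.
        let row_end = pages
            .get(i + 1)
            .map_or(total_rows, |next| next.first_row_index);
        let row_start = page.first_row_index;

        // Skip pages that no non-empty selected range touches.
        let overlaps = selected_rows
            .iter()
            .any(|s| s.start < s.end && s.start < row_end && row_start < s.end);
        if !overlaps {
            continue;
        }

        let bytes = page.offset..page.offset + page.compressed_page_size;
        // Extend the previous range when this page is adjacent or close enough.
        if let Some(last) = out.last_mut() {
            if bytes.start <= last.end + max_gap {
                last.end = last.end.max(bytes.end);
                continue;
            }
        }
        out.push(bytes);
    }
    out
}
```

With max_gap set to 0 only directly adjacent pages are merged. A larger value trades some over-fetching for fewer requests, which is the same concern the note about lots of small get requests raises.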