Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want to make many parallel data fetch requests to the underlying object store when fetching data with many small row groups.
This is relevant for few-column queries of parquet files with modest-sized row groups using high-latency object storage like S3 and R2.
Do people think this is a problem worth solving? Any suggestions on what a good API or implementation would look like? I'm going to take a crack at making something work, just to explore the space, but would appreciate any input.
Describe the solution you'd like
At a super high level, the ideal interface would be `ParquetRecordBatchStream` or similar, but where I can configure the number of parallel read requests to issue.
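To make the shape of the knob concrete, here is a minimal sketch. All names here (`ReaderOptions`, `fetch_parallelism`, `fetch_row_group`) are hypothetical, and it uses plain std threads with a stub fetch instead of the async object-store machinery the real reader would use:

```rust
use std::thread;

// Hypothetical builder-style option mirroring ParquetRecordBatchStreamBuilder:
// `fetch_parallelism` caps how many range GETs are in flight at once.
struct ReaderOptions {
    fetch_parallelism: usize,
}

// Stand-in for fetching one row group's byte ranges; a real implementation
// would issue object-store GET requests instead of returning this stub.
fn fetch_row_group(idx: usize) -> Vec<u8> {
    vec![idx as u8; 4] // pretend these are the column chunk bytes
}

// Fetch row groups in fixed-size parallel batches while preserving order,
// so downstream decoding can still consume row groups sequentially.
fn fetch_all(num_row_groups: usize, opts: &ReaderOptions) -> Vec<Vec<u8>> {
    let mut out = Vec::with_capacity(num_row_groups);
    let indices: Vec<usize> = (0..num_row_groups).collect();
    for chunk in indices.chunks(opts.fetch_parallelism) {
        // Launch up to `fetch_parallelism` fetches concurrently...
        let handles: Vec<_> = chunk
            .iter()
            .map(|&i| thread::spawn(move || fetch_row_group(i)))
            .collect();
        // ...then collect them in order before starting the next batch.
        for h in handles {
            out.push(h.join().unwrap());
        }
    }
    out
}

fn main() {
    let opts = ReaderOptions { fetch_parallelism: 6 };
    let data = fetch_all(10, &opts);
    println!("fetched {} row groups", data.len());
}
```

The key design point is that only the IO fans out; results are still yielded in row-group order, so decode stays sequential and the existing state-machine consumer would not need to change.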
Describe alternatives you've considered
I don't have any good ideas for how to get IO parallelism with the current types. The sequential nature of row group processing is fairly deeply baked into the state-machine architecture.
There are some related issues that touch on this, but having IO for multiple row groups in flight at the same time still appears to be unsupported:
- #5522
- Prefetch Row Groups using `next_reader` API in parquet-rs (datafusion#18391)
- Decouple IO and CPU operations in the Parquet Reader (push decoder) #7983
- Support customizing row group reading process in async reader #5141
- feat(parquet): Add next_row_group API for ParquetRecordBatchStream #6907
Additional context
For example, I have a parquet file where I need to make ~1k reads of 250kB each to read a particular column. If we assume the per-request latency of the object store is 70ms (as observed for R2 in various benchmarks) and we get 25MB/s of throughput, then making serial requests will take 1k * 70ms + 1k * 250kB/(25MB/s) = 70s (latency) + 10s (data transfer) = 80s total. S3 and R2 scale to many parallel GET requests, letting us hide much of the per-request latency if we can parallelize the requests. In a browser I can make 6 parallel requests, so we'd expect the total time to come down to ~70s/6 + 10s ≈ 21s for my particular use case of in-browser parquet viz.
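The back-of-envelope model above can be written out explicitly (same numbers as in the text: 1k requests, 250kB each, 70ms latency, 25MB/s throughput, 6-way browser parallelism):

```rust
fn main() {
    let requests = 1_000.0_f64;
    let latency_s = 0.070; // 70 ms per GET
    let bytes_per_req = 250e3; // 250 kB
    let throughput = 25e6; // 25 MB/s
    let parallelism = 6.0; // browser connection limit

    // Transfer time is the same either way: total bytes / throughput.
    let transfer_s = requests * bytes_per_req / throughput; // 10 s

    // Serial: pay full latency on every request.
    let serial_s = requests * latency_s + transfer_s; // 80 s

    // Parallel: latency is amortized across concurrent requests.
    let parallel_s = requests * latency_s / parallelism + transfer_s; // ~21.7 s

    println!("serial: {serial_s:.0} s, 6-way parallel: {parallel_s:.1} s");
}
```

Note this treats throughput as fixed; in practice parallel GETs often also improve aggregate throughput, so the estimate is conservative.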