Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want to make many parallel data fetch requests to the underlying object store when fetching data with many small row groups.
This is relevant for few-column queries of parquet files with modest-sized row groups using high-latency object storage like S3 and R2.
Do people think this is a problem worth solving? Any suggestions on what a good API or implementation would look like? I'm going to take a crack at making something work, just to explore the space, but would appreciate any input.
Describe the solution you'd like
At a super high level, the ideal interface would be `ParquetRecordBatchStream` or similar, but where I can configure the number of parallel read requests to issue.
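To make the shape of the knob concrete, here is a minimal sketch. All names here (`ReaderOptions`, `fetch_parallelism`, `fetch_row_group`) are hypothetical, and it uses plain std threads with a stub fetch instead of the async object-store machinery the real reader would use:

```rust
use std::thread;

// Hypothetical builder-style option mirroring ParquetRecordBatchStreamBuilder:
// `fetch_parallelism` caps how many range GETs are in flight at once.
struct ReaderOptions {
    fetch_parallelism: usize,
}

// Stand-in for fetching one row group's byte ranges; a real implementation
// would issue object-store GET requests instead of returning this stub.
fn fetch_row_group(idx: usize) -> Vec<u8> {
    vec![idx as u8; 4] // pretend these are the column chunk bytes
}

// Fetch row groups in fixed-size parallel batches while preserving order,
// so downstream decoding can still consume row groups sequentially.
fn fetch_all(num_row_groups: usize, opts: &ReaderOptions) -> Vec<Vec<u8>> {
    let mut out = Vec::with_capacity(num_row_groups);
    let indices: Vec<usize> = (0..num_row_groups).collect();
    for chunk in indices.chunks(opts.fetch_parallelism) {
        // Launch up to `fetch_parallelism` fetches concurrently...
        let handles: Vec<_> = chunk
            .iter()
            .map(|&i| thread::spawn(move || fetch_row_group(i)))
            .collect();
        // ...then collect them in order before starting the next batch.
        for h in handles {
            out.push(h.join().unwrap());
        }
    }
    out
}

fn main() {
    let opts = ReaderOptions { fetch_parallelism: 6 };
    let data = fetch_all(10, &opts);
    println!("fetched {} row groups", data.len());
}
```

The key design point is that only the IO fans out; results are still yielded in row-group order, so decode stays sequential and the existing state-machine consumer would not need to change.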
Describe alternatives you've considered
I don't have any good ideas for how to get IO parallelism with the current types. The sequential nature of row group processing is fairly deeply baked into the state-machine architecture.
There are some related issues that touch on this, but having IO for multiple row groups in flight at the same time still appears to be unsupported:
- #5522
- Prefetch Row Groups using `next_reader` API in parquet-rs (datafusion#18391)
- Decouple IO and CPU operations in the Parquet Reader (push decoder) #7983
- Support customizing row group reading process in async reader #5141
- feat(parquet): Add next_row_group API for ParquetRecordBatchStream #6907
Additional context
For example, I have a parquet file where I need to make ~1k reads of 250kB each to read a particular column. If we assume the per-request latency of the object store is 70ms (as observed for R2 in various benchmarks) and we get 25MB/s of throughput, then making serial requests will take 1k * 70ms + 1k * 250kB/(25MB/s) = 70s (latency) + 10s (data transfer) = 80s total. S3 and R2 scale to many parallel GET requests, letting us hide much of the per-request latency if we can parallelize the requests. In a browser I can make 6 parallel requests, so we'd expect the total time to come down to ~70s/6 + 10s ≈ 21s for my particular use case of in-browser parquet viz.
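The back-of-envelope model above can be written out explicitly (same numbers as in the text: 1k requests, 250kB each, 70ms latency, 25MB/s throughput, 6-way browser parallelism):

```rust
fn main() {
    let requests = 1_000.0_f64;
    let latency_s = 0.070; // 70 ms per GET
    let bytes_per_req = 250e3; // 250 kB
    let throughput = 25e6; // 25 MB/s
    let parallelism = 6.0; // browser connection limit

    // Transfer time is the same either way: total bytes / throughput.
    let transfer_s = requests * bytes_per_req / throughput; // 10 s

    // Serial: pay full latency on every request.
    let serial_s = requests * latency_s + transfer_s; // 80 s

    // Parallel: latency is amortized across concurrent requests.
    let parallel_s = requests * latency_s / parallelism + transfer_s; // ~21.7 s

    println!("serial: {serial_s:.0} s, 6-way parallel: {parallel_s:.1} s");
}
```

Note this treats throughput as fixed; in practice parallel GETs often also improve aggregate throughput, so the estimate is conservative.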