Parallel Parquet Reading #9381

@pmarks

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want to issue many data fetch requests to the underlying object store in parallel when reading a parquet file that has many small row groups.

This is relevant for few-column queries of parquet files with modest-sized row groups using high-latency object storage like S3 and R2.

Do people think this is a problem worth solving? Any suggestions on what a good API or implementation would look like? I’m going to take a crack at making something work, just to explore the space, but would appreciate any input.

Describe the solution you'd like
At a super high level the ideal interface would be ParquetRecordBatchStream or similar, but where I can configure the number of parallel read requests to generate.
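To make the "configurable number of parallel read requests" idea concrete, here is a minimal std-only sketch of bounded-concurrency range fetching. It is not the parquet crate's API: `fetch_ranges_bounded`, `max_in_flight`, and the `fetch` closure are hypothetical names, and it uses OS threads purely for illustration. An async reader would more likely drive futures with a combinator such as `buffered` from the futures crate, and a browser build could not use threads this way.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper: run `fetch` on every byte range with at most
/// `max_in_flight` calls running at once. Results are returned in the
/// same order as `ranges`.
fn fetch_ranges_bounded<T, F>(ranges: &[Range<u64>], max_in_flight: usize, fetch: F) -> Vec<T>
where
    T: Send,
    F: Fn(Range<u64>) -> T + Sync,
{
    // Shared cursor: each worker claims the next unfetched range index.
    let next = AtomicUsize::new(0);
    // One slot per range so results keep their original order.
    let results: Mutex<Vec<Option<T>>> = Mutex::new((0..ranges.len()).map(|_| None).collect());
    // Never spawn more workers than ranges, and always at least one.
    let workers = max_in_flight.clamp(1, ranges.len().max(1));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= ranges.len() {
                    break;
                }
                // The (slow) request runs outside the lock.
                let value = fetch(ranges[i].clone());
                results.lock().unwrap()[i] = Some(value);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every range is fetched exactly once"))
        .collect()
}

fn main() {
    // Eight 250kB column chunks, fetched with up to 6 requests in flight.
    let ranges: Vec<Range<u64>> = (0..8).map(|i| i * 250_000..(i + 1) * 250_000).collect();
    // Stand-in for a real object store GET: just report the range length.
    let sizes = fetch_ranges_bounded(&ranges, 6, |r| r.end - r.start);
    println!("{sizes:?}");
}
```

The point of the sketch is only the shape of the knob: the caller picks how many requests may be outstanding, and decoding can still consume the results in row group order.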

Describe alternatives you've considered
I don't have any good ideas for how to get IO parallelism with the current types. The sequential nature of row group processing is fairly deeply baked into the state-machine architecture.

There are some related issues that touch on this, but the capability of having IO for multiple row groups in flight at the same time still appears to be unsupported:
#5522

Additional context
For example, I have a parquet file where reading a particular column takes ~1k reads of 250kB each. If we assume a per-request latency of 70ms (as observed for R2 in various benchmarks) and 25MB/s of throughput, then making the requests serially takes 1k * 70ms + 1k * 250kB / (25MB/s) = 70s (latency) + 10s (data transfer) = 80s. S3 and R2 scale to many parallel GET requests, so parallelizing the requests would hide much of the per-request latency. In a browser I can make 6 parallel requests, so I'd expect the total time to come down to ~70s/6 + 10s ≈ 22s for my particular use case of in-browser parquet viz.
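The estimate above can be written as a small model, which makes it easy to try other concurrency levels. This is a rough sketch under the same simplifying assumptions as the text: latency divides evenly across parallel requests, while the 25MB/s of bandwidth is a shared cap that parallelism does not raise. The function name `estimated_seconds` and its parameters are made up for illustration.

```rust
/// Rough wall-clock estimate in seconds for `requests` GETs of
/// `bytes_per_request` bytes each. Assumes per-request latency is fully
/// overlapped across `parallel` requests and that bandwidth is a shared cap.
fn estimated_seconds(
    requests: u32,
    latency_s: f64,
    bytes_per_request: f64,
    bandwidth_bytes_per_s: f64,
    parallel: u32,
) -> f64 {
    // Latency term: total latency divided across the in-flight requests.
    let latency = f64::from(requests) * latency_s / f64::from(parallel.max(1));
    // Transfer term: total bytes over the shared bandwidth.
    let transfer = f64::from(requests) * bytes_per_request / bandwidth_bytes_per_s;
    latency + transfer
}

fn main() {
    // Numbers from the example: 1k reads of 250kB, 70ms latency, 25MB/s.
    for parallel in [1, 6, 32] {
        let t = estimated_seconds(1_000, 0.070, 250e3, 25e6, parallel);
        println!("{parallel:>2} parallel requests: ~{t:.1}s");
    }
}
```

Under this model the transfer term becomes the floor once enough requests are in flight, so raising parallelism beyond a few dozen mostly stops helping for this file.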

Metadata

Labels: enhancement (any new improvement worthy of an entry in the changelog)
