I’ve been looking into the readers in the parquet crate, and I think there’s an architectural problem worth discussing.
Today we effectively have three different byte-fetch/control-flow shapes:
- the sync path, via ParquetRecordBatchReaderBuilder::build()
- the push decoder / DataRequestBuilder path
- the async path, which wraps the push-decoder path
That makes it hard to do any of the following cleanly:
- share fixes across reader paths
- reason about performance regressions/wins
- evaluate backend changes like pread / batched range fetch / mmap
Do we want a shared internal artifact for “what bytes should this row group read next”, and then separate executors for sync / push / async?
Very roughly:
- planner: metadata + projection + current selection + available chunks + optional offset index -> planned byte ranges
- executor: fetch those ranges
- assembly: map fetched bytes back into InMemoryRowGroup / column chunks
- existing decode logic stays mostly where it is
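To make the shape of the proposal concrete, here is a minimal Rust sketch of what the shared artifact and one executor could look like. All names here (ByteRange, RangePlan, RangeExecutor, SyncExecutor) are hypothetical, not existing parquet crate APIs; the point is only that the plan is a plain data structure the planner produces, and sync / push / async differ solely in how they fetch it.

```rust
use std::io::{Read, Seek, SeekFrom};

/// A contiguous byte range the planner has decided must be fetched.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteRange {
    pub offset: u64,
    pub len: u64,
}

/// The shared artifact: "what bytes should this row group read next".
/// Produced by the planner from metadata + projection + selection.
#[derive(Debug, Default)]
pub struct RangePlan {
    pub ranges: Vec<ByteRange>,
}

impl RangePlan {
    /// Coalesce ranges that are adjacent or closer than `gap` bytes,
    /// so a backend can issue fewer, larger reads (relevant for
    /// pread / batched range fetch / mmap comparisons).
    pub fn coalesced(mut self, gap: u64) -> Self {
        self.ranges.sort_by_key(|r| r.offset);
        let mut out: Vec<ByteRange> = Vec::new();
        for r in self.ranges {
            match out.last_mut() {
                Some(last) if r.offset <= last.offset + last.len + gap => {
                    let end = (last.offset + last.len).max(r.offset + r.len);
                    last.len = end - last.offset;
                }
                _ => out.push(r),
            }
        }
        RangePlan { ranges: out }
    }
}

/// Executors differ only in how they move bytes; the plan is shared.
pub trait RangeExecutor {
    fn fetch(&mut self, plan: &RangePlan) -> std::io::Result<Vec<Vec<u8>>>;
}

/// Sync executor over anything seekable (File, Cursor, ...).
pub struct SyncExecutor<R: Read + Seek>(pub R);

impl<R: Read + Seek> RangeExecutor for SyncExecutor<R> {
    fn fetch(&mut self, plan: &RangePlan) -> std::io::Result<Vec<Vec<u8>>> {
        plan.ranges
            .iter()
            .map(|r| {
                self.0.seek(SeekFrom::Start(r.offset))?;
                let mut buf = vec![0u8; r.len as usize];
                self.0.read_exact(&mut buf)?;
                Ok(buf)
            })
            .collect()
    }
}
```

An async or push-based executor would implement the same fetch contract over its own I/O, and the assembly step would map the returned buffers back into InMemoryRowGroup unchanged, which is what would let a backend change (e.g. batched range fetch) be benchmarked against all three paths at once.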