parquet/arrow: should sync/async readers converge on a shared physical read planner #9764

@lyang24

I’ve been looking into the readers in the parquet crate, and I think there’s an architectural problem worth discussing.

Today we effectively have three different byte-fetch/control-flow shapes:

  • sync ParquetRecordBatchReaderBuilder::build()
  • push decoder / DataRequestBuilder
  • async wrapping the push-decoder path

That makes it hard to do any of the following cleanly:

  • share fixes across reader paths
  • reason about performance regressions/wins
  • evaluate backend changes like pread / batched range fetch / mmap

Do we want a shared internal artifact for “what bytes should this row group read next”, and then separate executors for sync / push / async?

Very roughly:

  • planner: metadata + projection + current selection + available chunks + optional offset index -> planned byte ranges
  • executor: fetch those ranges
  • assembly: map fetched bytes back into InMemoryRowGroup / column chunks
  • existing decode logic stays mostly where it is
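To make the split concrete, here is a minimal sketch of what a shared planner could look like. All names here (`ReadPlan`, `ColumnChunkMeta`, `plan_row_group`, `execute`) are hypothetical and do not exist in arrow-rs today; the point is only that the planner emits plain byte ranges, so any executor backend (sync pread, batched range fetch, async, mmap) just has to satisfy the plan:

```rust
use std::ops::Range;

/// Hypothetical planned byte ranges for one row group.
#[derive(Debug, PartialEq)]
struct ReadPlan {
    ranges: Vec<Range<u64>>,
}

/// Minimal stand-in for the metadata inputs the planner consumes.
struct ColumnChunkMeta {
    offset: u64,
    length: u64,
}

/// planner: metadata + projection -> planned byte ranges.
/// Adjacent/overlapping ranges are coalesced so executors issue fewer fetches.
fn plan_row_group(chunks: &[ColumnChunkMeta], projection: &[usize]) -> ReadPlan {
    let mut ranges: Vec<Range<u64>> = projection
        .iter()
        .map(|&i| {
            let c = &chunks[i];
            c.offset..c.offset + c.length
        })
        .collect();
    ranges.sort_by_key(|r| r.start);
    let mut coalesced: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match coalesced.last_mut() {
            Some(last) if last.end >= r.start => last.end = last.end.max(r.end),
            _ => coalesced.push(r),
        }
    }
    ReadPlan { ranges: coalesced }
}

/// executor: fetch the planned ranges; here a trivial in-memory backend.
/// A sync, push, or async executor would differ only in how bytes arrive.
fn execute(plan: &ReadPlan, file: &[u8]) -> Vec<Vec<u8>> {
    plan.ranges
        .iter()
        .map(|r| file[r.start as usize..r.end as usize].to_vec())
        .collect()
}

fn main() {
    // Three column chunks laid out back to back; project columns 0 and 1.
    let chunks = [
        ColumnChunkMeta { offset: 0, length: 4 },
        ColumnChunkMeta { offset: 4, length: 4 },
        ColumnChunkMeta { offset: 8, length: 4 },
    ];
    let plan = plan_row_group(&chunks, &[0, 1]);
    // The two adjacent chunks coalesce into a single range 0..8.
    assert_eq!(plan.ranges, vec![0..8]);

    let file: Vec<u8> = (0..12).collect();
    let fetched = execute(&plan, &file);
    assert_eq!(fetched.len(), 1);
    assert_eq!(fetched[0], (0..8).collect::<Vec<u8>>());
    println!("plan: {:?}", plan.ranges);
}
```

Assembly (mapping fetched bytes back into `InMemoryRowGroup` / column chunks) would then consume the fetched buffers alongside the plan, and the existing decode logic is untouched.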
