I’ve been looking into the readers in the parquet crate, and I think there’s an architectural problem worth discussing.
Today we effectively have three different byte-fetch/control-flow shapes:
- the sync path, via ParquetRecordBatchReaderBuilder::build()
- the push decoder / DataRequestBuilder path
- the async path, which wraps the push-decoder path
That makes it hard to do any of the following cleanly:
- share fixes across reader paths
- reason about performance regressions/wins
- evaluate backend changes like pread / batched range fetch / mmap
Do we want a shared internal artifact for “what bytes should this row group read next”, and then separate executors for sync / push / async?
Very roughly:
- planner: metadata + projection + current selection + available chunks + optional offset index -> planned byte ranges
- executor: fetch those ranges
- assembly: map fetched bytes back into InMemoryRowGroup / column chunks
- existing decode logic stays mostly where it is
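To make the shape of the proposal concrete, here is a minimal Rust sketch of what the shared artifact and one executor could look like. All names here (ByteRange, RangePlan, RangeExecutor, SyncExecutor) are hypothetical, not existing parquet crate APIs; the point is only that the plan is a plain data structure the planner produces, and sync / push / async differ solely in how they fetch it.

```rust
use std::io::{Read, Seek, SeekFrom};

/// A contiguous byte range the planner has decided must be fetched.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteRange {
    pub offset: u64,
    pub len: u64,
}

/// The shared artifact: "what bytes should this row group read next".
/// Produced by the planner from metadata + projection + selection.
#[derive(Debug, Default)]
pub struct RangePlan {
    pub ranges: Vec<ByteRange>,
}

impl RangePlan {
    /// Coalesce ranges that are adjacent or closer than `gap` bytes,
    /// so a backend can issue fewer, larger reads (relevant for
    /// pread / batched range fetch / mmap comparisons).
    pub fn coalesced(mut self, gap: u64) -> Self {
        self.ranges.sort_by_key(|r| r.offset);
        let mut out: Vec<ByteRange> = Vec::new();
        for r in self.ranges {
            match out.last_mut() {
                Some(last) if r.offset <= last.offset + last.len + gap => {
                    let end = (last.offset + last.len).max(r.offset + r.len);
                    last.len = end - last.offset;
                }
                _ => out.push(r),
            }
        }
        RangePlan { ranges: out }
    }
}

/// Executors differ only in how they move bytes; the plan is shared.
pub trait RangeExecutor {
    fn fetch(&mut self, plan: &RangePlan) -> std::io::Result<Vec<Vec<u8>>>;
}

/// Sync executor over anything seekable (File, Cursor, ...).
pub struct SyncExecutor<R: Read + Seek>(pub R);

impl<R: Read + Seek> RangeExecutor for SyncExecutor<R> {
    fn fetch(&mut self, plan: &RangePlan) -> std::io::Result<Vec<Vec<u8>>> {
        plan.ranges
            .iter()
            .map(|r| {
                self.0.seek(SeekFrom::Start(r.offset))?;
                let mut buf = vec![0u8; r.len as usize];
                self.0.read_exact(&mut buf)?;
                Ok(buf)
            })
            .collect()
    }
}
```

An async or push-based executor would implement the same fetch contract over its own I/O, and the assembly step would map the returned buffers back into InMemoryRowGroup unchanged, which is what would let a backend change (e.g. batched range fetch) be benchmarked against all three paths at once.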