Background
#21828 implements OFFSET pushdown for single-file parquet queries. Multi-file queries still use `GlobalLimitExec` for offset handling.
Problem
For multi-file queries like `SELECT * FROM directory/ LIMIT 5 OFFSET 1000000`, the offset is handled by `GlobalLimitExec`, which reads all rows and then discards the first 1,000,000. With multiple files, we could instead skip entire files whose rows fall entirely within the offset, based on cumulative row counts.
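A minimal sketch of the cumulative-row-count arithmetic (plain Rust, no DataFusion types; the function name is hypothetical):

```rust
// Given per-file row counts in read order and a query offset, return
// (number of leading files to skip entirely, residual row offset to
// apply inside the first file that is actually read).
fn split_offset(file_row_counts: &[usize], offset: usize) -> (usize, usize) {
    let mut remaining = offset;
    for (i, &rows) in file_row_counts.iter().enumerate() {
        if remaining < rows {
            // The offset lands inside file i: read it, skipping `remaining` rows.
            return (i, remaining);
        }
        remaining -= rows;
    }
    // Offset is >= the total row count: every file can be skipped.
    (file_row_counts.len(), remaining)
}

fn main() {
    // Three files of 400k rows each; OFFSET 1_000_000 skips the first
    // two files entirely and 200k rows inside the third.
    let files = [400_000, 400_000, 400_000];
    assert_eq!(split_offset(&files, 1_000_000), (2, 200_000));
    assert_eq!(split_offset(&files, 0), (0, 0));
}
```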
Challenge
File read order is non-deterministic with `target_partitions > 1` and dynamic scheduling (#21351). A shared counter (`Arc<AtomicUsize>`) across file openers could work for single-partition sequential reads, but multi-partition ordering is undefined.
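The shared-counter idea could look something like this (a sketch only; `OffsetTracker` and `consume` are hypothetical names, not DataFusion API, and this is only sound when files are opened sequentially in a deterministic order):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Each file opener claims rows against a shared remaining-offset counter.
struct OffsetTracker {
    remaining: AtomicUsize,
}

impl OffsetTracker {
    fn new(offset: usize) -> Arc<Self> {
        Arc::new(Self { remaining: AtomicUsize::new(offset) })
    }

    /// Returns how many of this file's `file_rows` rows fall inside the
    /// offset. If the result equals `file_rows`, the whole file is skipped.
    fn consume(&self, file_rows: usize) -> usize {
        // Atomically subtract up to `file_rows` from the remaining offset
        // and report how much was actually claimed.
        let prev = self
            .remaining
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |r| {
                Some(r.saturating_sub(file_rows))
            })
            .unwrap();
        prev.min(file_rows)
    }
}

fn main() {
    let tracker = OffsetTracker::new(1_000_000);
    assert_eq!(tracker.consume(400_000), 400_000); // file 1: skip entirely
    assert_eq!(tracker.consume(400_000), 400_000); // file 2: skip entirely
    assert_eq!(tracker.consume(400_000), 200_000); // file 3: skip 200k rows
    assert_eq!(tracker.consume(400_000), 0);       // file 4: read fully
}
```

With out-of-order opens under multiple partitions, the counter would still converge to zero, but rows would be skipped from the wrong files, which is why this path is restricted to the single-partition case below.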
Proposed approach
- Single partition (`preserve_order=true`): files read in deterministic order → shared counter tracks consumed offset across files → skip entire files + RGs
- Multi-partition: keep `GlobalLimitExec` (order undefined without ORDER BY)
- Use file-level statistics (`PartitionedFile.statistics.num_rows`) to skip entire files before opening
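The statistics-based pruning in the last bullet has to stop as soon as a file's row count is missing or inexact, because the cumulative position becomes unknown. A sketch of that decision logic (`FileStat` is a hypothetical stand-in for `PartitionedFile.statistics.num_rows`):

```rust
// A file's row-count statistic; None models a missing/inexact statistic.
struct FileStat {
    num_rows: Option<usize>,
}

/// How many leading files (in read order) can be skipped without opening
/// them, for the given offset. Pruning stops at the first file whose row
/// count is unknown or whose rows extend past the offset.
fn prunable_prefix(files: &[FileStat], offset: usize) -> usize {
    let mut remaining = offset;
    let mut skipped = 0;
    for f in files {
        match f.num_rows {
            Some(rows) if rows <= remaining => {
                remaining -= rows;
                skipped += 1;
            }
            // Missing stats, or the offset lands inside this file: stop.
            _ => break,
        }
    }
    skipped
}

fn main() {
    let files = [
        FileStat { num_rows: Some(400_000) },
        FileStat { num_rows: Some(400_000) },
        FileStat { num_rows: None }, // unknown: pruning must stop here
        FileStat { num_rows: Some(400_000) },
    ];
    assert_eq!(prunable_prefix(&files, 1_000_000), 2);
}
```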
Related