feat(parquet): row-group and row-range sampling on ParquetSource #22024

Open
adriangb wants to merge 2 commits into apache:main from pydantic:worktree-parquet-sampling-base

Contributor

adriangb commented May 5, 2026

Which issue does this PR close?

This PR extracts the first layer of #22000 — the opt-in ParquetSource sampling primitives — as a self-contained change. The TABLESAMPLE SQL surface, Sample logical/physical nodes, and SamplePushdown rule are deliberately not included; they will land in follow-ups.

Rationale for this change

DataFusion has the machinery for fine-grained parquet sampling (ParquetAccessPlan with Skip / Scan / Selection(RowSelection)) but no public way to ask for a sample without constructing the access plan by hand and stuffing it into PartitionedFile.extensions. That works for one-off code but is awkward for ad-hoc data exploration, layered helpers that want to compute approximate stats over a bounded slice, and EXPLAIN ANALYZE-driven debug runs against a representative slice.

This PR adds the lowest layer: opt-in builders on ParquetSource that translate fractions into a ParquetAccessPlan lazily inside the opener (after the footer is loaded, so we sample by real row-group index). It is additive and has no behavior change for existing scans. The SQL surface in #22000 is built on top of these primitives.

What changes are included in this PR?

ParquetSource::new(schema)
    .with_row_group_sampling(0.1)   // keep ~10% of row groups per file
    .with_row_fraction(0.05)        // within each kept row group, keep ~5% of rows
    .with_row_cluster_size(8192);   // controls window granularity (default 32 768)

with_row_group_sampling(fraction):

  • Selection is deferred until the opener has loaded the parquet footer, so we sample by real row-group index.
  • Deterministic per (file_index, row_group_count, fraction) — re-runs match. The opener passes the execution partition_index as the stable file_index, so sampling is reproducible across environments without depending on object-store paths.
  • Always keeps at least one row group (target = max(1, ceil(N * fraction))).
  • No-op when fraction >= 1.0.
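The selection rule in the bullets above can be sketched with the standard library alone. This is a rough illustration, not the PR's code: `sample_row_groups` is a hypothetical name, `SplitMix64` stands in for the seeded `SmallRng`, and the way the seed is mixed from `(file_index, row_group_count, fraction)` is an assumption.

```rust
/// SplitMix64: a tiny deterministic PRNG, used here only as a stand-in
/// for a seeded `SmallRng` (assumption: the PR's actual RNG usage differs).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: returns the sorted indices of the row groups to keep.
fn sample_row_groups(file_index: usize, row_group_count: usize, fraction: f64) -> Vec<usize> {
    let mut all: Vec<usize> = (0..row_group_count).collect();
    // No-op when the fraction covers everything (or there is nothing to sample).
    if fraction >= 1.0 || row_group_count == 0 {
        return all;
    }
    // target = max(1, ceil(N * fraction)), clamped to N.
    let target = ((row_group_count as f64 * fraction).ceil() as usize).clamp(1, row_group_count);
    // Illustrative seed mixed from the key (file_index, row_group_count, fraction).
    let seed = (file_index as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
        ^ (row_group_count as u64).rotate_left(32)
        ^ fraction.to_bits();
    let mut rng = SplitMix64(seed);
    // Partial Fisher-Yates shuffle: the first `target` slots become the sample.
    for i in 0..target {
        let remaining = (row_group_count - i) as u64;
        let j = i + (rng.next_u64() % remaining) as usize;
        all.swap(i, j);
    }
    all.truncate(target);
    all.sort_unstable();
    all
}
```

Because the seed depends only on the key, calling `sample_row_groups(0, 8, 0.25)` twice yields the same two indices, which is the re-run property the bullets describe.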

with_row_fraction(fraction):

  • Translates the per-row-group target into K contiguous windows spread evenly across the row group, each placed at a random offset within its stride. Window count = ceil(target / cluster_size).
  • Materializes a RowSelection per kept row group; the parquet reader uses the page index to read only the data pages covering the selected rows. This gives "page-level" IO savings without requiring per-column page alignment (which doesn't exist in parquet).
  • Falls back gracefully when the page index is missing — the reader still returns the right rows, the IO win just disappears.
  • Deterministic per (file_index, row_group_index, fraction, cluster_size).
  • Window-layout math is extracted into a dedicated build_row_window_selectors function and fuzz-tested across thousands of configurations to guarantee no overlap, in-bounds positions, and full coverage.
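A minimal sketch of the window layout described above, assuming the formulas quoted in this PR (window count = ceil(target / cluster_size), one window per stride, window size capped at the stride). `layout_windows` is a hypothetical name and is not the signature of `build_row_window_selectors`; `SplitMix64` again stands in for the seeded RNG, and the PR's exact rounding may differ.

```rust
/// Tiny deterministic PRNG, a stand-in for a seeded `SmallRng` (assumption).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical sketch: lay out disjoint sample windows inside one row group.
/// Returns `(start, len)` pairs in ascending order, or `None` for degenerate input.
fn layout_windows(
    total_rows: usize,
    target_rows: usize,
    cluster_size: usize,
    seed: u64,
) -> Option<Vec<(usize, usize)>> {
    if total_rows == 0 || target_rows == 0 || cluster_size == 0 {
        return None;
    }
    let target = target_rows.min(total_rows);
    // Window count = ceil(target / cluster_size).
    let n_windows = target.div_ceil(cluster_size);
    // Each window owns one stride of the row group.
    let stride = total_rows / n_windows;
    // Capping at `stride` keeps every window inside its own stride, hence disjoint.
    let window_size = target.div_ceil(n_windows).min(stride);
    let mut rng = SplitMix64(seed);
    let windows: Vec<(usize, usize)> = (0..n_windows)
        .map(|i| {
            // Random offset within the slack left in this stride.
            let slack = (stride - window_size) as u64;
            let offset = (rng.next_u64() % (slack + 1)) as usize;
            (i * stride + offset, window_size)
        })
        .collect();
    Some(windows)
}
```

With 100 rows, a target of 10 and a cluster size of 4, this sketch produces 3 windows of 4 rows each, one per 33-row stride, so the selected count is close to the target rather than exact.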

The two layers compose: row_group_fraction = 0.1 with row_fraction = 0.1 reads ~10% of the rows in each of ~10% of the row groups, i.e. ~1% of the file's rows, with windows spread out so the sample isn't clustered at one end of each row group.

Internals

  • New ParquetSampling struct re-exported at the crate root.
  • Plumbed through ParquetMorselizer → PreparedParquetOpen.
  • Two free functions (apply_row_group_sampling and apply_row_fraction_sampling) invoked from prune_row_groups right after create_initial_plan.
  • New dep: rand with the small_rng feature (already in workspace Cargo.toml).

Differences vs. the original commit in #22000

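Two pieces of review feedback on the parent PR are folded in here:

  • Sampling is keyed on a stable file_index (the execution partition_index) instead of file_name, so results do not depend on object-store paths.
  • Row-window selection is extracted into build_row_window_selectors, which caps window_size at stride to rule out overlapping windows, and is fuzz-tested.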

Are these changes tested?

13 tests in datafusion-datasource-parquet:

Row-group sampling (sampling::tests):

  • row_group_sampling_keeps_target_count: ceil(N * fraction) math.
  • row_group_sampling_is_deterministic — same inputs → same selection.
  • row_group_sampling_differs_per_file_index — different file_index → different sample.
  • row_group_sampling_no_op_when_fraction_is_one — fraction ≥ 1.0 keeps everything.
  • row_group_sampling_target_at_least_one: fraction = 0.001 over 100 row groups still keeps 1.
  • row_group_sampling_no_op_when_unset: None is a no-op.

Row-window selection (sampling::tests):

  • row_window_selection_basic_layout — hand-checked anchor case.
  • row_window_selection_returns_none_on_invalid_input — degenerate inputs (zero row group, zero target, zero cluster) return None.
  • row_window_selection_full_target_no_overlap — the previously-buggy target_rows == total_rows case.
  • row_window_selection_fuzz_invariants — 5 000 randomized (total_rows, target_rows, cluster_size, seed) configurations, asserting full coverage, in-bounds positions, and no overlap.
  • row_window_selection_fuzz_determinism — 1 000 iterations verifying identical seeds produce identical layouts.

End-to-end (opener::test):

  • row_group_sampling_end_to_end — writes a 4-row-group parquet to InMemory, scans with fraction = 0.5, asserts exactly 6 rows out (2 row groups × 3 rows).
  • row_fraction_end_to_end — writes a 100-row single-row-group parquet, scans with row_fraction = 0.1 and cluster_size = 4, asserts the result is in the expected range.

cargo build, cargo fmt --all, and cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings are clean.

Are there any user-facing changes?

🤖 Generated with Claude Code

adriangb and others added 2 commits May 5, 2026 07:24
Adds two opt-in sampling primitives to parquet scans, both built on
the existing `ParquetAccessPlan` infrastructure:

* `ParquetSource::with_row_group_sampling(fraction)` — keep `fraction`
  of row groups in each scanned file. Selection is deferred until the
  opener has loaded the parquet footer (so we sample by real row-group
  index, not guess) and is deterministic per `(file_name,
  row_group_count, fraction)` via a seeded `SmallRng`.

* `ParquetSource::with_row_fraction(fraction)` — within each kept row
  group, keep `fraction` of rows by translating to a `RowSelection` of
  K small contiguous windows (size controlled by
  `with_row_cluster_size`, default 32 768 rows). The parquet reader
  uses the page index to read only the data pages covering the
  selected rows, so this gives "page-level" IO savings without
  requiring per-column page alignment. Falls back gracefully (no
  IO win, still correct) when the page index is missing.

The two layers compose: scanning with both `row_group_fraction=0.1`
and `row_fraction=0.1` reads ~10% of the rows in each of ~10% of the
row groups (~1% of the file's rows), with windows spread out so the
sample isn't clustered at one end of each row group.

Selection within a row group is deterministic-but-random per
`(file_name, row_group_index, fraction, cluster_size)` — same inputs
yield the same windows, so re-runs are repeatable.

## Why this lives on `ParquetSource`

The natural entry-point for "I want a sample" is at config time,
before any metadata IO. The actual *which* row groups / *which* rows
selection still has to be deferred to the opener (after the footer is
parsed) — that's why `ParquetSampling` carries fractions plus a cluster
size, and the opener pulls them through to its lazy decision points.

This is intentionally orthogonal to file-level sampling: `ParquetSource`
doesn't own the file list (`FileScanConfig.file_groups` does), so a
file-fraction setter here would have been a confusing no-op. Callers
that want to drop files should rebuild the `FileScanConfig` directly.

## Use cases

* `TABLESAMPLE` SQL syntax (any future implementation can lower to
  these primitives).
* Ad-hoc data exploration.
* Mini-query-style stats sampling (a layered helper can call these
  to bound the cost of computing approximate min/max/NDV/histograms
  for the optimizer — out of scope here, see the linked POC in the
  PR description).
* `EXPLAIN ANALYZE`-driven debug runs against a representative slice.

## Tests

5 unit tests on `apply_row_group_sampling` (target count, determinism,
file-name dependence, no-op at fraction=1.0, target floor of 1) plus
2 end-to-end tests that build a real parquet file in `InMemory` object
store and confirm the row counts emitted are what the sampling implies.

`cargo build --workspace`, `cargo fmt --all`, and
`cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`
are clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes responding to review on the parent commit:

1. Key sampling on a stable `file_index` instead of `file_name`
   (apache#22000 (comment)).

   Both `apply_row_group_sampling` and `apply_row_fraction_sampling`
   now take `file_index: usize` rather than `file_name: &str`. The
   parquet opener passes the execution `partition_index`. This makes
   sampling reproducible across environments (no dependency on the
   on-disk path), while still decorrelating files assigned to
   different partitions.

2. Extract the row-window selection into `build_row_window_selectors`
   and add fuzz coverage
   (apache#22000 (comment)).

   The previous inline arithmetic could produce overlapping windows
   when `target_rows` was close to `total_rows`: `window_size =
   ceil(target / n_windows)` could exceed `stride = total / n_windows`,
   so adjacent strides' windows would intersect. The extracted
   function caps `window_size` at `stride` (the construction that
   guarantees disjointness) and is covered by:

   * `row_window_selection_basic_layout` — hand-checked anchor case.
   * `row_window_selection_returns_none_on_invalid_input` — degenerate
     inputs return `None` cleanly.
   * `row_window_selection_full_target_no_overlap` — the previously
     buggy `target_rows == total_rows` case.
   * `row_window_selection_fuzz_invariants` — 5 000 randomized
     `(total_rows, target_rows, cluster_size, seed)` configurations,
     asserting full coverage, in-bounds positions, and no overlap.
   * `row_window_selection_fuzz_determinism` — 1 000 iterations
     verifying identical seeds produce identical layouts.
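To make the overlap concrete, the arithmetic quoted in item 2 can be checked on a small case. The numbers (10 rows, target 10, cluster size 3) are illustrative rather than taken from the PR, and `window_math` is a hypothetical helper, not part of the change.

```rust
/// Hypothetical helper returning
/// (n_windows, stride, uncapped window size, capped window size).
fn window_math(
    total_rows: usize,
    target_rows: usize,
    cluster_size: usize,
) -> (usize, usize, usize, usize) {
    // n_windows = ceil(target / cluster_size)
    let n_windows = target_rows.div_ceil(cluster_size);
    // stride = total / n_windows (integer division)
    let stride = total_rows / n_windows;
    // The previous inline arithmetic: window_size = ceil(target / n_windows)
    let uncapped = target_rows.div_ceil(n_windows);
    // The fix: cap window_size at stride so windows cannot cross into the next stride.
    let capped = uncapped.min(stride);
    (n_windows, stride, uncapped, capped)
}
```

For `window_math(10, 10, 3)` the result is `(4, 2, 3, 2)`: four strides of 2 rows, where an uncapped window of 3 rows would spill into its neighbour, while the capped window of 2 rows fits exactly.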

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

datasource Changes to the datasource crate