fix: generate multi-file benchmark data with scrambled RG order #21711
zhuqi-lucas wants to merge 1 commit into apache:main
Conversation
Pull request overview
Updates the sort pushdown Inexact overlap benchmark data generator so it actually produces parquet row groups that are both (a) overlapping and (b) out of order, making reorder_by_statistics do meaningful work (matching the streaming / delayed-chunk scenario described in #21580 and its follow-ups).
Changes:
- Changes overlap dataset generation to order by a deterministic chunk permutation key plus a jittered l_orderkey, producing scrambled + overlapping RGs.
- Updates benchmark script messaging/comments to reflect the new generation strategy.
Both Copilot comments were about the old ORDER BY + jitter/scramble approach, which has been completely replaced. The new approach uses pyarrow to redistribute row groups from a sorted temp file into multiple output files with scrambled RG order.

The single-file ORDER BY approach was fundamentally flawed in two ways: the parquet writer merged rows from adjacent chunks at RG boundaries (widening RG ranges and making reorder_by_statistics a no-op), and the follow-up one-RG-per-file split left nothing to reorder within a file. With the new approach, each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file, which properly exercises reorder_by_statistics.
Both the inexact and overlap benchmark data generation had problems:

1. The original single-file ORDER BY approach caused the parquet writer to merge rows from adjacent chunks at RG boundaries, widening RG ranges to ~6M and making reorder_by_statistics a no-op.
2. The per-file split fix (one RG per file) meant reorder_by_statistics had nothing to reorder within each file, since each had only 1 RG. RG reorder is an intra-file optimization.

Fix by generating multiple files where each file has MULTIPLE row groups with scrambled order:
- inexact: 3 files x ~20 RGs each (scrambled within each file)
- overlap: 5 files x ~12 RGs each (different permutation)

Each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file. This properly tests:
- Row-group-level reorder (reorder_by_statistics within each file)
- TopK threshold initialization from RG statistics
- File-level ordering effects

Uses pyarrow to read RGs from a sorted temp file and redistribute them into multiple output files with scrambled RG order.
Which issue does this PR close?
Related to #21580
Rationale for this change
The benchmark data generation for inexact and overlap had two problems:

1. Single-file ORDER BY widened RG ranges: the parquet writer merges rows from adjacent chunks at RG boundaries, widening RG ranges to ~6M (instead of ~100K). This made reorder_by_statistics a no-op.
2. Per-file split (1 RG per file): the previous fix split the data into individual files, but then each file had only 1 RG, so reorder_by_statistics had nothing to reorder within each file.

What changes are included in this PR?
Generate multiple files where each file has multiple row groups with scrambled order. Result:
- inexact: 3 files x ~20 RGs each (scrambled within each file)
- overlap: 5 files x ~12 RGs each (different permutation)

Each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file. This tests:
- Row-group-level reorder (reorder_by_statistics within each file)
- TopK threshold initialization from RG statistics
- File-level ordering effects

Are these changes tested?
Benchmark data generation change only. Verified locally by inspecting parquet metadata (per-file RG min/max ranges are narrow ~100K but scrambled).
Are there any user-facing changes?
No. Adds pyarrow as a dependency for generating these benchmark datasets (
pip install pyarrow).