fix: generate multi-file benchmark data with scrambled RG order #21711
zhuqi-lucas wants to merge 1 commit into apache:main
Conversation
Pull request overview
Updates the sort pushdown Inexact overlap benchmark data generator so it actually produces parquet row groups that are both (a) overlapping and (b) out of order, making reorder_by_statistics do meaningful work (matching the streaming / delayed-chunk scenario described in #21580 and its follow-ups).
Changes:
- Changes overlap dataset generation to order by a deterministic chunk permutation key plus a jittered l_orderkey, producing scrambled + overlapping RGs.
- Updates benchmark script messaging/comments to reflect the new generation strategy.
Both Copilot comments were about the old ORDER BY + jitter/scramble approach, which has been completely replaced. The new approach uses pyarrow to redistribute row groups from a sorted temp file into multiple output files with scrambled RG order.

The single-file ORDER BY approach was fundamentally flawed in two ways: the parquet writer merged rows from adjacent chunks at RG boundaries (widening RG ranges and making reorder_by_statistics a no-op), and the follow-up one-RG-per-file split left nothing to reorder within a file. With the new approach, each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file, which properly exercises reorder_by_statistics.
Both the inexact and overlap benchmark data generation had problems:

1. The original single-file ORDER BY approach caused the parquet writer to merge rows from adjacent chunks at RG boundaries, widening RG ranges to ~6M and making reorder_by_statistics a no-op.
2. The per-file split fix (one RG per file) meant reorder_by_statistics had nothing to reorder within each file, since each had only 1 RG. RG reorder is an intra-file optimization.

Fix by generating multiple files where each file has MULTIPLE row groups with scrambled order:
- inexact: 3 files x ~20 RGs each (scrambled within each file)
- overlap: 5 files x ~12 RGs each (different permutation)

Each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file. This properly tests:
- Row-group-level reorder (reorder_by_statistics within each file)
- TopK threshold initialization from RG statistics
- File-level ordering effects

Uses pyarrow to read RGs from a sorted temp file and redistribute them into multiple output files with scrambled RG order.
Which issue does this PR close?
Related to #21580
Rationale for this change
The benchmark data generation for inexact and overlap had two problems:

1. Single-file ORDER BY widened RG ranges: the parquet writer merges rows from adjacent chunks at RG boundaries, widening RG ranges to ~6M (instead of ~100K). This made reorder_by_statistics a no-op.
2. Per-file split (1 RG per file): the previous fix split the data into individual files, but then each file had only 1 RG, so reorder_by_statistics had nothing to reorder within each file.

What changes are included in this PR?
Generate multiple files where each file has multiple row groups with scrambled order. Result:
- inexact: 3 files x ~20 RGs each (scrambled within each file)
- overlap: 5 files x ~12 RGs each (different permutation)

Each RG has a narrow l_orderkey range (~100K) but appears in scrambled order within its file. This tests:
- Row-group-level reorder (reorder_by_statistics within each file)
- TopK threshold initialization from RG statistics
- File-level ordering effects

Are these changes tested?
Benchmark data generation change only. Verified locally by inspecting parquet metadata (per-file RG min/max ranges are narrow ~100K but scrambled).
Are there any user-facing changes?
No. Adds pyarrow as a dependency for generating these benchmark datasets (
pip install pyarrow).