
Optimize Parquet metadata row-group level statistics collection#22462

Open
AdamGS wants to merge 4 commits into
apache:mainfrom
AdamGS:adamg/parquet-stats-perf

Conversation

@AdamGS
Contributor

@AdamGS AdamGS commented May 22, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

The current stats aggregation does a lot of unnecessary work. This PR tries to do the minimal amount of work at every step.

What changes are included in this PR?

In addition to splitting up the summarization logic into some clearer functions and a reusable function for min/max, I've tried to do the minimal amount of work at each step:

  1. Only allocate boolean masks if there's a mix of exact/inexact stats between row groups.
  2. No need to allocate an Arrow array for null count.
  3. No need to re-calculate the parquet column index - it's already in stats_converter, and as far as I can tell it's exactly the same code path.
  4. No need to recalculate the number of rows - we already know it.
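
As a rough illustration of item 1, here is a minimal, self-contained sketch of the "only allocate a mask when exactness is mixed" idea. The enum and variant names match the `ExactnessSummary` discussed in this PR, but `summarize_exactness` and `exactness_mask` are hypothetical helpers written for this example; they are not the PR's actual code, and treating an empty input as `AllExact` is an assumption of the sketch.

```rust
/// Whether the per-row-group statistics are exact.
/// Variant names mirror the `ExactnessSummary` enum from this PR;
/// everything else here is a hypothetical illustration.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ExactnessSummary {
    AllExact,
    NoneExact,
    Mixed,
}

/// Classify the flags in a single pass, without allocating.
/// An empty input is reported as `AllExact` (an assumption of this sketch).
fn summarize_exactness(flags: impl IntoIterator<Item = bool>) -> ExactnessSummary {
    let (mut any_exact, mut any_inexact) = (false, false);
    for is_exact in flags {
        if is_exact {
            any_exact = true;
        } else {
            any_inexact = true;
        }
        // Stop early: once both kinds were seen, the answer cannot change.
        if any_exact && any_inexact {
            return ExactnessSummary::Mixed;
        }
    }
    if any_inexact {
        ExactnessSummary::NoneExact
    } else {
        ExactnessSummary::AllExact
    }
}

/// Build a per-row-group mask only in the mixed case; the uniform
/// cases are fully described by the summary, so no mask is allocated.
fn exactness_mask(flags: &[bool]) -> Option<Vec<bool>> {
    match summarize_exactness(flags.iter().copied()) {
        ExactnessSummary::Mixed => Some(flags.to_vec()),
        ExactnessSummary::AllExact | ExactnessSummary::NoneExact => None,
    }
}
```

With this shape, callers in the common all-exact or none-exact cases never pay for a `Vec<bool>`; only the mixed case does.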

I've also included a benchmark. On my laptop, the effect is:

parquet_metadata_statistics/wide_one_row_group
                        time:   [2.9945 ms 3.0313 ms 3.0487 ms]
                        change: [−44.473% −43.790% −43.044%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking parquet_metadata_statistics/moderate_width_many_row_groups: Collecting 10 samples in estimated 5
parquet_metadata_statistics/moderate_width_many_row_groups
                        time:   [236.75 µs 237.37 µs 238.48 µs]
                        change: [−22.330% −21.550% −20.794%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking parquet_metadata_statistics/wide_many_row_groups: Collecting 10 samples in estimated 5.0127 s (7
parquet_metadata_statistics/wide_many_row_groups
                        time:   [628.67 µs 636.88 µs 645.79 µs]
                        change: [−29.409% −28.225% −26.999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Are these changes tested?

Existing tests and a few additional small unit tests.

Are there any user-facing changes?

None

AdamGS added 2 commits May 22, 2026 14:58
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
@github-actions github-actions Bot added the datasource Changes to the datasource crate label May 22, 2026
@AdamGS
Contributor Author

AdamGS commented May 22, 2026

@asolimando here is my change

},
};

// This is the same logic as parquet_column but we start from arrow schema index
Contributor Author

this value is already stats_converter.parquet_column_index()

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Member

@asolimando asolimando left a comment

Really impressive numbers (22-44% improvement). The refactoring improves readability via focused helpers, and the ExactnessSummary enum is a nice touch.

I could only find two concerns (see inline comments), but I should caveat that I'm not deeply familiar yet with this area of the codebase so my review shouldn't be considered comprehensive.

.columns()
.get(parquet_idx)
.and_then(|column| column.statistics())
.map(|statistics| statistics.null_count_opt().unwrap_or(0))
Member

Here it seems that we aren't honoring StatisticsConverter::missing_null_counts_as_zero when it is set to false, because of the .unwrap_or(0). The old code explicitly checked the config property, and I think it's worth restoring that behavior. WDYT?
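
To make the concern concrete, here is a minimal sketch of null-count aggregation that respects such a flag. It works on plain `Option<u64>` values rather than the parquet types, `total_null_count` is a hypothetical helper, and the choice to report the total as unknown (`None`) when a count is missing and the flag is false is an assumption about the intended semantics, not a statement of what the old code did.

```rust
/// Sum per-row-group null counts without building an Arrow array.
///
/// `counts[i]` is `None` when row group `i` carries no null-count statistic.
/// `missing_as_zero` stands in for `StatisticsConverter::missing_null_counts_as_zero`.
/// Returns `None` when the total is unknown (assumed semantics for this sketch),
/// or when the sum would overflow `u64`.
fn total_null_count(counts: &[Option<u64>], missing_as_zero: bool) -> Option<u64> {
    let mut total = 0u64;
    for count in counts {
        match count {
            // Known count: accumulate, bailing out on overflow.
            Some(n) => total = total.checked_add(*n)?,
            // Missing count, but the caller asked to treat it as zero.
            None if missing_as_zero => {}
            // Missing count and the flag is off: the total is not known.
            None => return None,
        }
    }
    Some(total)
}
```

For example, `total_null_count(&[Some(1), None, Some(2)], true)` yields `Some(3)`, while the same input with `false` yields `None` instead of silently counting the missing statistic as zero.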

Contributor Author

good catch, I'll fix this.

{
ExactnessSummary::AllExact => Some(true),
ExactnessSummary::NoneExact => Some(false),
ExactnessSummary::Mixed => {
Member

I might have missed it, but the Mixed case doesn't seem to be covered in the benchmarks, and IMO it's the interesting one, since it's the edge case where things aren't "obvious". If it's confirmed to be uncovered, would you mind adding it to the benchmarks?

@AdamGS
Contributor Author

AdamGS commented May 23, 2026

Reworked the benchmarks; they are much nicer now and cover more cases. They also include some logic that can be reused for other benchmarks around Parquet stats in the future.

parquet_metadata_statistics/metadata_full_col_8_rg_1
                        time:   [7.8150 µs 7.8257 µs 7.8375 µs]
                        change: [−49.844% −49.709% −49.575%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
parquet_metadata_statistics/metadata_full_col_8_rg_32
                        time:   [19.557 µs 19.635 µs 19.719 µs]
                        change: [−31.058% −30.756% −30.435%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
parquet_metadata_statistics/metadata_full_col_8_rg_128
                        time:   [53.199 µs 53.319 µs 53.456 µs]
                        change: [−20.035% −19.700% −19.330%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
parquet_metadata_statistics/metadata_full_col_64_rg_1
                        time:   [66.350 µs 66.445 µs 66.557 µs]
                        change: [−50.876% −50.713% −50.556%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
parquet_metadata_statistics/metadata_full_col_64_rg_32
                        time:   [152.25 µs 152.65 µs 153.15 µs]
                        change: [−34.144% −33.914% −33.690%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking parquet_metadata_statistics/metadata_full_col_64_rg_128: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
parquet_metadata_statistics/metadata_full_col_64_rg_128
                        time:   [426.46 µs 431.80 µs 438.79 µs]
                        change: [−21.906% −21.086% −20.271%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
parquet_metadata_statistics/metadata_full_col_256_rg_1
                        time:   [364.88 µs 365.58 µs 366.31 µs]
                        change: [−49.293% −48.963% −48.647%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
Benchmarking parquet_metadata_statistics/metadata_full_col_256_rg_32: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s, enable flat sampling, or reduce sample count to 50.
parquet_metadata_statistics/metadata_full_col_256_rg_32
                        time:   [793.94 µs 796.79 µs 799.76 µs]
                        change: [−35.183% −34.000% −32.948%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe
parquet_metadata_statistics/metadata_full_col_256_rg_128
                        time:   [2.7749 ms 2.8016 ms 2.8311 ms]
                        change: [−18.346% −17.368% −16.405%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_8_rg_1
                        time:   [3.0218 µs 3.0287 µs 3.0354 µs]
                        change: [−1.5245% −1.0760% −0.6204%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_8_rg_32
                        time:   [37.171 µs 37.259 µs 37.346 µs]
                        change: [−6.8862% −6.4140% −5.9716%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_8_rg_128
                        time:   [71.247 µs 71.441 µs 71.641 µs]
                        change: [−4.4617% −3.9458% −3.4404%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_64_rg_1
                        time:   [24.902 µs 24.941 µs 24.983 µs]
                        change: [+0.0278% +0.3191% +0.6074%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_64_rg_32
                        time:   [306.45 µs 307.22 µs 308.04 µs]
                        change: [−5.5007% −4.9854% −4.4696%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
Benchmarking parquet_metadata_statistics/metadata_mixed_col_64_rg_128: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.5s, enable flat sampling, or reduce sample count to 60.
parquet_metadata_statistics/metadata_mixed_col_64_rg_128
                        time:   [583.24 µs 584.52 µs 585.83 µs]
                        change: [−4.6866% −4.0490% −3.5136%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_256_rg_1
                        time:   [173.84 µs 174.25 µs 174.70 µs]
                        change: [−1.4822% −0.7702% −0.0522%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_256_rg_32
                        time:   [1.4408 ms 1.4542 ms 1.4687 ms]
                        change: [−8.1275% −6.9286% −5.7943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe
parquet_metadata_statistics/metadata_mixed_col_256_rg_128
                        time:   [3.1768 ms 3.1939 ms 3.2129 ms]
                        change: [−15.720% −14.510% −13.283%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
parquet_metadata_statistics/metadata_none_col_8_rg_1
                        time:   [3.0249 µs 3.0304 µs 3.0362 µs]
                        change: [−2.6646% −2.3106% −1.9615%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
parquet_metadata_statistics/metadata_none_col_8_rg_32
                        time:   [4.6349 µs 4.6467 µs 4.6587 µs]
                        change: [−2.9848% −2.5834% −2.1817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
parquet_metadata_statistics/metadata_none_col_8_rg_128
                        time:   [9.7857 µs 9.8335 µs 9.8789 µs]
                        change: [−1.6055% −1.0128% −0.3802%] (p = 0.00 < 0.05)
                        Change within noise threshold.
parquet_metadata_statistics/metadata_none_col_64_rg_1
                        time:   [24.831 µs 24.863 µs 24.898 µs]
                        change: [−2.2535% −1.9447% −1.6384%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
parquet_metadata_statistics/metadata_none_col_64_rg_32
                        time:   [34.315 µs 34.422 µs 34.527 µs]
                        change: [+0.3927% +0.6664% +0.9535%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
parquet_metadata_statistics/metadata_none_col_64_rg_128
                        time:   [65.464 µs 65.898 µs 66.312 µs]
                        change: [+0.8826% +1.4562% +2.0139%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
parquet_metadata_statistics/metadata_none_col_256_rg_1
                        time:   [167.74 µs 170.48 µs 173.04 µs]
                        change: [−5.4287% −3.9660% −2.3609%] (p = 0.00 < 0.05)
                        Performance has improved.
parquet_metadata_statistics/metadata_none_col_256_rg_32
                        time:   [247.99 µs 251.69 µs 255.70 µs]
                        change: [−0.3098% +1.9875% +4.3010%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Benchmarking parquet_metadata_statistics/metadata_none_col_256_rg_128: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, enable flat sampling, or reduce sample count to 50.
parquet_metadata_statistics/metadata_none_col_256_rg_128
                        time:   [569.18 µs 573.20 µs 577.52 µs]
                        change: [−1.3984% +1.0214% +3.5434%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
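
The case names in the output above appear to follow a `metadata_{layout}_col_{columns}_rg_{row_groups}` pattern over three layouts, three column counts, and three row-group counts. A minimal sketch of enumerating such a grid is shown below; `benchmark_case_names` is a hypothetical helper, the parameter values are read off the output rather than taken from the benchmark source, and what each layout (`full`, `mixed`, `none`) configures is not shown here.

```rust
/// Enumerate benchmark case names in the
/// `metadata_{layout}_col_{columns}_rg_{row_groups}` pattern seen in the output.
/// The parameter values are inferred from the printed names (an assumption),
/// not copied from the benchmark source.
fn benchmark_case_names() -> Vec<String> {
    let layouts = ["full", "mixed", "none"];
    let column_counts = [8, 64, 256];
    let row_group_counts = [1, 32, 128];

    let mut names = Vec::new();
    for layout in layouts {
        for columns in column_counts {
            for row_groups in row_group_counts {
                names.push(format!(
                    "metadata_{}_col_{}_rg_{}",
                    layout, columns, row_groups
                ));
            }
        }
    }
    names
}
```

A grid like this produces 27 cases, which matches the number of results listed above.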

@AdamGS AdamGS force-pushed the adamg/parquet-stats-perf branch from c65243a to c03ed65 Compare May 23, 2026 15:45