feat(parquet): add wide-schema writer overhead benchmark #9723
Merged
alamb merged 1 commit into apache:main on Apr 15, 2026
Conversation
Existing writer benchmarks use narrow schemas (5–10 columns) and primarily measure data encoding throughput. They don't capture the per-column structural overhead that dominates at high column cardinality (thousands to hundreds of thousands of columns), such as allocation and metadata assembly.

This commit adds benchmarks to fill that gap by writing a single-row batch through `ArrowWriter` with 1k/5k/10k flat `Float32` columns and per-column `WriterProperties` entries, isolating the cost of the writer infrastructure itself.

Baseline results (Apple M1 Max):

```
writer_overhead/1000_cols/per_column_props     3.72 ms
writer_overhead/5000_cols/per_column_props    54.96 ms
writer_overhead/10000_cols/per_column_props  220.73 ms
```

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
etseidl approved these changes on Apr 15, 2026
etseidl (Contributor) left a comment:
Looks good. The metadata bench will do wide tables (10k columns), but only measures decoding the footer. Nice to have something similar on the write side.
Contributor
Thank you @HippoBaro and @etseidl for the review
Which issue does this PR close?
Rationale for this change
Existing writer benchmarks use narrow schemas (5–10 columns) and primarily measure data encoding throughput. They don't capture the per-column structural overhead that dominates at high column cardinality (thousands to hundreds of thousands of columns), such as allocation and metadata assembly.
What changes are included in this PR?
This commit adds benchmarks to fill that gap by writing a single-row batch through `ArrowWriter` with 1k/5k/10k flat `Float32` columns and per-column `WriterProperties` entries, isolating the cost of the writer infrastructure itself.

Baseline results (Apple M1 Max):

```
writer_overhead/1000_cols/per_column_props     3.72 ms
writer_overhead/5000_cols/per_column_props    54.96 ms
writer_overhead/10000_cols/per_column_props  220.73 ms
```
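For concreteness, the sketch below shows the general shape of such a setup. It is an illustration under assumptions, not the exact benchmark code in this PR: the `wide_float32_batch` and `per_column_props` helpers, the `NUM_COLS` constant, and the choice of `Encoding::PLAIN` are all hypothetical.

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Float32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema, SchemaRef};
use parquet::arrow::ArrowWriter;
use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

/// Hypothetical constant; the benchmark in this PR sweeps 1k/5k/10k columns.
const NUM_COLS: usize = 1_000;

/// Build a flat schema of `NUM_COLS` Float32 columns and a single-row batch,
/// so the measured time is dominated by per-column writer setup and metadata
/// assembly rather than data encoding.
fn wide_float32_batch() -> (SchemaRef, RecordBatch) {
    let fields: Vec<Field> = (0..NUM_COLS)
        .map(|i| Field::new(format!("col_{i}"), DataType::Float32, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));
    let columns: Vec<ArrayRef> = (0..NUM_COLS)
        .map(|_| Arc::new(Float32Array::from(vec![1.0_f32])) as ArrayRef)
        .collect();
    let batch = RecordBatch::try_new(schema.clone(), columns).unwrap();
    (schema, batch)
}

/// One `WriterProperties` entry per column, so the per-column property
/// lookup path is exercised as well.
fn per_column_props() -> WriterProperties {
    let mut builder = WriterProperties::builder();
    for i in 0..NUM_COLS {
        let path = ColumnPath::new(vec![format!("col_{i}")]);
        builder = builder.set_column_encoding(path, Encoding::PLAIN);
    }
    builder.build()
}

fn main() {
    let (schema, batch) = wide_float32_batch();
    // The cost being measured is constructing the writer, writing the single
    // row, and closing the file (which assembles the footer metadata).
    let sink: Vec<u8> = Vec::new();
    let mut writer = ArrowWriter::try_new(sink, schema, Some(per_column_props())).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}
```

In the benchmark itself this body would sit inside a Criterion `bench_function` closure; writing only a single row keeps encoding throughput out of the measurement, so the timings above reflect per-column writer infrastructure costs.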
Are these changes tested?
N/A
Are there any user-facing changes?
N/A