feat(parquet): precompute offset_index_disabled at build-time#9724
feat(parquet): precompute offset_index_disabled at build-time#9724HippoBaro wants to merge 2 commits intoapache:mainfrom
offset_index_disabled at build-time#9724Conversation
`WriterProperties::offset_index_disabled()` checked whether any column in the `column_properties` HashMap has page-level statistics enabled, scanning the entire map on every call. This method is called from `GenericColumnWriter::new` — once per column per row group. With N columns each having per-column properties, this resulted in quadratic HashMap iterations during row group construction. Move the scan into `WriterPropertiesBuilder::build()` so it runs once at construction time. Benchmark results (vs baseline): writer_overhead/1000_cols/per_column_props 2.44 ms (was 3.25 ms, −25%) writer_overhead/5000_cols/per_column_props 13.28 ms (was 47.45 ms, −72%) writer_overhead/10000_cols/per_column_props 27.97 ms (was 197.97 ms, −86%) Scaling now linear. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
offset_index_disabled at build-time
etseidl
left a comment
There was a problem hiding this comment.
Thanks @HippoBaro, this looks correct to me (and faster 😄🚀). It just took some time for me to get my head around the need for the DisabledOverridden variant. 😅
|
run benchmark writer_overhead |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing cheap_offset_index_disabled (2853ffc) to 7a089ad (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Those are some nice benchmark numbers |
Which issue does this PR close?
Rationale for this change
WriterProperties::offset_index_disabled()checked whether any column in thecolumn_propertiesHashMap has page-level statistics enabled, scanning the entire map on every call. This method is called fromGenericColumnWriter::new— once per column per row group. With N columns each having per-column properties, this resulted in quadratic HashMap iterations during row group construction.What changes are included in this PR?
Move the scan into
WriterPropertiesBuilder::build()so it runs once at construction time.Benchmark results (vs baseline in #9723):
Scaling now linear.
Are these changes tested?
All tests passing.
Are there any user-facing changes?
None.