perf(arrow-select): gc dictionaries in interleave_fallback_dictionary by asubiotto · Pull Request #9940 · apache/arrow-rs

asubiotto · 2026-05-07T12:40:30Z

Which issue does this PR close?

Closes perf: gc dictionaries in interleave_fallback_dictionary #9939

Rationale for this change

`interleave_dictionaries` checks whether input dictionaries should be merged/GCed, but `should_merge_dictionary_values` supports only a limited set of types because of the bytes interning it does. For other types, and in the general fallback case where the heuristic fails, `interleave_fallback_dictionary` concatenates all the values slices together, which bloats the output considerably when the interleave selects only a few rows. This happens a lot, e.g. in DataFusion on a multi-partition sort on a dictionary column.
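To make the bloat concrete, here is a minimal sketch of the concatenating fallback on a toy model. `ToyDict`, `interleave_concat`, and `unreferenced_values` are hypothetical names for illustration only; they are not arrow-rs APIs, and the real code operates on Arrow arrays rather than `Vec<String>`.

```rust
/// Toy stand-in for a dictionary array: `keys[i]` indexes into `values`.
/// Hypothetical type for illustration; not the arrow-rs `DictionaryArray`.
pub struct ToyDict {
    pub keys: Vec<usize>,
    pub values: Vec<String>,
}

/// Interleave by concatenating every input's values and offsetting the keys,
/// which is roughly the fallback strategy described above.
/// Each entry of `indices` is an (input index, row index) pair.
pub fn interleave_concat(inputs: &[ToyDict], indices: &[(usize, usize)]) -> ToyDict {
    let mut offsets = Vec::with_capacity(inputs.len());
    let mut values = Vec::new();
    for d in inputs {
        // Remember where this input's values start in the concatenated output.
        offsets.push(values.len());
        values.extend(d.values.iter().cloned());
    }
    let keys = indices
        .iter()
        .map(|&(input, row)| offsets[input] + inputs[input].keys[row])
        .collect();
    ToyDict { keys, values }
}

/// Count the values that no key refers to, i.e. the dead weight in the output.
pub fn unreferenced_values(d: &ToyDict) -> usize {
    let mut used = vec![false; d.values.len()];
    for &k in &d.keys {
        used[k] = true;
    }
    used.iter().filter(|u| !**u).count()
}
```

In this model, interleaving two rows out of two inputs that each hold 1000 values yields an output carrying all 2000 values, of which 1998 are never referenced.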

What changes are included in this PR?

This commit improves these cases by introducing a heuristic that estimates interleave index coverage, assuming uniform value selection. When coverage is less than the number of values, `interleave_fallback_dictionary` now performs a `take` on the value slices to reduce output size bloat. On the real-world DataFusion sort that motivated this change, I saw runtime drop from 20 minutes to 7 minutes. The heuristic allows us to avoid a microbenchmark regression.
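The idea can be sketched on a toy model as follows. This is an illustration under stated assumptions, not the PR's implementation: `should_compact` is a hypothetical stand-in that treats the number of interleave indices as an upper bound on coverage (the PR's actual heuristic may differ), and `interleave_compact` mimics a `take` by gathering only the referenced values from plain `Vec`s.

```rust
use std::collections::HashMap;

/// Hypothetical gate: the number of selected rows is an upper bound on how
/// many distinct values can be referenced, so compact only when it is smaller
/// than the total number of dictionary values. Assumed form, not the PR's.
pub fn should_compact(num_indices: usize, total_values: usize) -> bool {
    num_indices < total_values
}

/// Take-style compaction on a toy model: each input is a (keys, values) pair,
/// and each entry of `indices` is an (input index, row index) pair.
/// Only referenced values are copied, and keys are remapped to the new slots.
pub fn interleave_compact(
    inputs: &[(Vec<usize>, Vec<String>)],
    indices: &[(usize, usize)],
) -> (Vec<usize>, Vec<String>) {
    // Maps (input index, old key) to the value's position in the output.
    let mut remap: HashMap<(usize, usize), usize> = HashMap::new();
    let mut values = Vec::new();
    let mut keys = Vec::with_capacity(indices.len());
    for &(input, row) in indices {
        let old = inputs[input].0[row];
        let new = *remap.entry((input, old)).or_insert_with(|| {
            values.push(inputs[input].1[old].clone());
            values.len() - 1
        });
        keys.push(new);
    }
    (keys, values)
}
```

Note that this sketch deduplicates by slot, not by content: equal values coming from different inputs are still stored twice, which keeps it independent of the type-specific interning that the merge path relies on.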

Are these changes tested?

Yes, by existing tests and some new tests exercising this path specifically.

Are there any user-facing changes?

Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>

github-actions bot added the arrow (Changes to the arrow crate) label on May 7, 2026

asubiotto wants to merge 1 commit into apache:main from polarsignals:asubiotto/idfperf
