perf(arrow-select): gc dictionaries in interleave_fallback_dictionary#9940

Open
asubiotto wants to merge 1 commit into apache:main from polarsignals:asubiotto/idfperf
@asubiotto
Contributor

Which issue does this PR close?

Rationale for this change

`interleave_dictionaries` checks whether the input dictionaries should be merged/GCed, but `should_merge_dictionary_values` supports only a limited set of value types because of the bytes interning it does. For other types, and in the general fallback case where that heuristic fails, `interleave_fallback_dictionary` concatenates all of the input values slices, which bloats the output considerably when the interleave selects only a small fraction of the rows. This happens frequently, e.g. in DataFusion on a multi-partition sort over a dictionary column.
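To illustrate the bloat, here is a minimal std-only sketch of a concatenating fallback. It is not the arrow-rs code: `Dict` and `interleave_concat` are hypothetical stand-ins that model a dictionary array as plain vectors and ignore nulls.

```rust
/// Hypothetical stand-in for a dictionary array: `keys[i]` indexes into `values`.
/// Nulls are ignored to keep the illustration small.
pub struct Dict {
    pub keys: Vec<usize>,
    pub values: Vec<String>,
}

/// Interleave by concatenating every input's values and offsetting the keys.
/// `indices` holds (dictionary index, row index) pairs.
pub fn interleave_concat(dicts: &[Dict], indices: &[(usize, usize)]) -> Dict {
    let mut offsets = Vec::with_capacity(dicts.len());
    let mut values = Vec::new();
    for d in dicts {
        // Remember where this input's values start in the concatenated buffer.
        offsets.push(values.len());
        values.extend(d.values.iter().cloned());
    }
    // Each output key is the input key shifted by its dictionary's offset.
    let keys = indices
        .iter()
        .map(|&(d, row)| offsets[d] + dicts[d].keys[row])
        .collect();
    Dict { keys, values }
}
```

In this model, selecting a single row from two three-value dictionaries still yields an output with six values, of which only one is referenced.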

What changes are included in this PR?

This commit improves these cases by introducing a heuristic that estimates how many dictionary values the interleave indices cover, assuming uniform value selection. When the estimated coverage is less than the number of values, `interleave_fallback_dictionary` now performs a `take` on the value slices to reduce output size bloat. On the real-world DataFusion sort that motivated this change, I saw runtime drop from 20 minutes to 7 minutes. Gating the `take` behind the heuristic avoids a microbenchmark regression.
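The description above does not spell out the exact heuristic, so the following is only one possible reading, sketched with std-only Rust. `expected_coverage`, `should_compact`, the `max_fraction` threshold, and `interleave_compact` are all hypothetical names, and the compaction step stands in for the `take` on the value slices rather than reproducing it.

```rust
use std::collections::HashMap;

/// Expected number of distinct values hit by `num_selected` uniform draws
/// (with replacement) from `num_values` values: n * (1 - (1 - 1/n)^k).
pub fn expected_coverage(num_values: usize, num_selected: usize) -> f64 {
    if num_values == 0 {
        return 0.0;
    }
    let n = num_values as f64;
    n * (1.0 - (1.0 - 1.0 / n).powf(num_selected as f64))
}

/// Assumed gate: compact only when the expected coverage is below
/// `max_fraction` of the values. The threshold is a hypothetical parameter.
pub fn should_compact(num_values: usize, num_selected: usize, max_fraction: f64) -> bool {
    expected_coverage(num_values, num_selected) < max_fraction * num_values as f64
}

/// Build an output dictionary that holds only the referenced values.
/// Each input is a (keys, values) pair; `indices` holds (dictionary, row) pairs.
/// Equal values coming from different inputs are not deduplicated.
pub fn interleave_compact(
    dicts: &[(Vec<usize>, Vec<String>)],
    indices: &[(usize, usize)],
) -> (Vec<usize>, Vec<String>) {
    // Maps (dictionary, old key) to the key in the compacted output.
    let mut remap: HashMap<(usize, usize), usize> = HashMap::new();
    let mut values = Vec::new();
    let mut keys = Vec::with_capacity(indices.len());
    for &(d, row) in indices {
        let (in_keys, in_values) = &dicts[d];
        let old = in_keys[row];
        let new = *remap.entry((d, old)).or_insert_with(|| {
            // First time this value is referenced: copy it over.
            values.push(in_values[old].clone());
            values.len() - 1
        });
        keys.push(new);
    }
    (keys, values)
}
```

Under these assumptions, a selection of 10 rows over 1000 values is compacted, while a selection of 100 rows over 4 values keeps the cheaper concatenation path.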

Are these changes tested?

Yes, by existing tests and some new tests exercising this path specifically.

Are there any user-facing changes?

Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>
@github-actions github-actions Bot added the arrow Changes to the arrow crate label May 7, 2026

Development

Successfully merging this pull request may close these issues.

perf: gc dictionaries in interleave_fallback_dictionary
