perf: Improve performance of CaseExpr with many branches and non-literal THEN expressions [WIP]
#19588
+201
−3
Which issue does this PR close?
Rationale for this change
I ran some microbenchmarks comparing DataFusion and DuckDB (see apache/datafusion-benchmarks#28) and found that CASE WHEN expressions were much slower in DataFusion, so I asked Claude to make it go faster. Note that this particular optimization doesn't help with the specific benchmark that I was running. I will create another PR for that, but this optimization seems valid too.
What changes are included in this PR?
Optimize CASE expr WHEN literal THEN non-literal-expression by using O(1) HashMap lookup for branch selection instead of O(n) sequential comparisons.
Problem
When a CASE expression has many branches whose WHEN values are literals but whose THEN expressions are not, the existing code falls back to sequential evaluation because the THEN expressions aren't literals, even though the WHEN values are. This results in O(branches × rows) comparisons.
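To make the cost difference concrete, here is a minimal, self-contained sketch that contrasts sequential branch selection with a hash lookup keyed on the literal WHEN values. It is not the DataFusion implementation: `Branch`, `select_sequential`, `build_lookup`, and `eval_case` are hypothetical names invented for illustration, the values are plain `i64` rather than Arrow arrays, and a missing match is modeled as `None` instead of an ELSE branch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one `WHEN <literal> THEN <expr>` branch:
/// the WHEN side is a literal, the THEN side is computed from the row value.
struct Branch {
    when: i64,
    then: fn(i64) -> i64,
}

/// Sequential selection: compare the row value against each WHEN literal
/// in order. Doing this for every row costs O(branches x rows) comparisons.
fn select_sequential(branches: &[Branch], value: i64) -> Option<usize> {
    branches.iter().position(|b| b.when == value)
}

/// Build a literal -> branch-index map once per expression.
/// `or_insert` keeps the first occurrence, so a duplicated WHEN literal
/// still resolves to the earliest branch, as sequential evaluation would.
fn build_lookup(branches: &[Branch]) -> HashMap<i64, usize> {
    let mut map = HashMap::with_capacity(branches.len());
    for (i, b) in branches.iter().enumerate() {
        map.entry(b.when).or_insert(i);
    }
    map
}

/// Evaluate every row with one expected-O(1) lookup to pick the branch,
/// then run only that branch's THEN expression. `None` means no WHEN matched.
fn eval_case(
    branches: &[Branch],
    lookup: &HashMap<i64, usize>,
    rows: &[i64],
) -> Vec<Option<i64>> {
    rows.iter()
        .map(|&v| lookup.get(&v).map(|&i| (branches[i].then)(v)))
        .collect()
}
```

In this sketch the map is built once and reused for all rows, so branch selection no longer scales with the number of branches; only the chosen THEN expression is evaluated for each row.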
Solution
Added a new EvalMethod::WithExprLookupTable that uses the WhenLiteralIndexMap HashMap infrastructure for O(1) branch lookup per row.
Are these changes tested?
Are there any user-facing changes?