perf: Reduce Box and Arc allocation churn during tree rewriting#21749
perf: Reduce Box and Arc allocation churn during tree rewriting#21749Dandandan merged 3 commits intoapache:mainfrom
Box and Arc allocation churn during tree rewriting#21749Conversation
|
run benchmark sql_planner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-planner-box-arc-take (561de69) to a311d14 (merge-base) diff File an issue against this benchmark runner |
Box and Arc allocation churn in the plannerBox and Arc allocation churn in the planner
Box and Arc allocation churn in the plannerBox and Arc allocation churn in the optimizer
Box and Arc allocation churn in the optimizerBox and Arc allocation churn in logical optimizer
Box and Arc allocation churn in logical optimizerBox and Arc allocation churn during tree rewriting
|
run benchmark tpch |
|
run benchmark sql_planner |
|
@neilconway which tool are you using for profiling, I wonder? |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-planner-box-arc-take (71924ef) to a311d14 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-planner-box-arc-take (71924ef) to a311d14 (merge-base) diff File an issue against this benchmark runner |
| fn map_elements<F: FnMut(T) -> Result<Transformed<T>>>( | ||
| self, | ||
| mut self, | ||
| f: F, |
There was a problem hiding this comment.
Optimizing this API is great, but I wonder two things:
- Do we call this API too much in some passes? i.e. should we make some optimizer pass more efficient by avoiding / reducing the need to call this API at all in some places?
- Are we using the "wrong" abstractions that inherently lead to inneficient code? (I.e. lot's of closures, branches, nesting -> hard to optimize for compiler)
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-planner-box-arc-take (71924ef) to a311d14 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
| }) | ||
| .ok() | ||
| })); | ||
| assert!(result.is_err()); |
There was a problem hiding this comment.
It doesn't verify the actual panic-safety invariant — that dropping an unwound Box doesn't double-free or access garbage
There was a problem hiding this comment.
True, although I'm not sure there's a good way to validate that in a standard unit test (unless we run under miri or similar). wdyt?
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Benchmark for this request hit the 7200s job deadline before finishing. Benchmarks requested: Kubernetes messageFile an issue against this benchmark runner |
I've been using |
Which issue does this PR close?
Rationale for this change
Profiling the planner suggests that a surprising amount of time was being spent doing tree rewriting in the logical optimizer. One culprit is
TreeNodeContainer::map_elements()forBox<C>andArc<C>, which do the following:Cvalue from theBox/ArcBox/Arc, respectivelyThis allocates a fresh
BoxorArcfor every node visited while walking an expression or logical plan, even if the tree rewrite we're doing didn't modify the expression/plan node.Instead, we can reuse the current
Box<C>orArc<C>: usestd::mem::take()to swap the inner value withC::default(), pass the inner value to the closure, and put the result back in the original container. Swapping the inner value withC::default()means the container always has a valid value, which is important if the closure panics.For
Arc<C>, we need to useArc::make_mut(), which only clones if theArcis not unique.This reduces the bytes allocated to plan TPC-H Q13 by ~22% (988 kB -> 765 kB), and reduces allocated blocks by 8.5% (210k -> 192k).
What changes are included in this PR?
Box<C>::map_elements()andArc<C>::map_elements()as described abovemap_children()forExpr::Aliasto usemap_elements(), rather than invokingf(*expr)directly; this ensures that it can take advantage of this optimizationLogicalPlan::default()use a sharedDFSchema, rather than allocating a freshDFSchemafor every call. Becausedefault()is not in the hot path for tree rewriting, it is important that it is cheapmap_elements()behaviorAre these changes tested?
Yes, plus new unit tests added.
Are there any user-facing changes?
Yes:
TreeNodeContainerimpls forBox<C>andArc<C>now requireC: Default. This is a breaking API change for third-party code that implementsTreeNodeContainerfor a custom type. The fix is usually straightforward.