perf: hybrid hot-path evaluator — up to 40% faster dispatch #785

Merged

stephenamar-db merged 1 commit into databricks:master on Apr 21, 2026
Conversation
Motivation

The NewEvaluator used a pure tag + @switch (tableswitch) dispatch, which suffered from invokeinterface overhead (~5-8 ns) on every visitExpr call because Expr is a trait. This made it 0-16% slower than the old instanceof-chain evaluator in JMH benchmarks.

Modification

- Profiled all 66 benchmark files across 5 suites (cpp, go, jrsonnet, sjsonnet, bug) to identify ExprTag frequencies. The top 7 types cover 96.1% of all visitExpr calls: ValidId (30%), BinaryOp (21%), Val.Literal (18%), Select (13%), Apply1 (5%), ObjExtend (4%), IfElse (4%).
- Split NewEvaluator.visitExpr into a hot path (~120 bytecodes, 7 instanceof checks) and a cold path (a private visitExprCold using tag + @switch for the remaining types).
- The hot path fits within the JIT's FreqInlineSize=325 bytecodes, enabling C2 to inline visitExpr into callers (visitBinaryOp, visitSelect, etc.). The old evaluator's ~700-bytecode method body never gets inlined.
- Added a --new-evaluator CLI flag to Config/SjsonnetMainBase for A/B testing via hyperfine.
- Added EvaluatorBenchmark (JMH) covering all suites, and an ExprTagProfile profiling tool.

Result

JMH steady-state performance (1 fork, 8 warmup, 10 measurement iterations):

| Benchmark      | Old (ms) | New (ms) | Delta |
|----------------|----------|----------|-------|
| bench.01       | 0.026    | 0.018    | -31%  |
| bench.02       | 32.58    | 25.73    | -21%  |
| bench.03       | 9.39     | 5.64     | -40%  |
| gen_big_object | 0.928    | 0.715    | -23%  |
| string_render  | 0.768    | 0.496    | -35%  |
| base64_mega    | 3.462    | 3.106    | -10%  |
| realistic1     | 1.850    | 1.764    | -5%   |
| heavy_string   | 34.80    | 33.09    | -5%   |
| realistic2     | 47.32    | 47.78    | tied  |
| bench.04-09    | tied     | tied     | tied  |
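The hot/cold split can be sketched as follows. This is an illustrative Java model, not the actual Scala source: the node types (Literal, BinaryOp, Negate), the integer-valued evaluation, and the tag numbering are all made up for the example; only the shape (short instanceof chain falling through to a tag-switch cold path) mirrors the PR.

```java
// Illustrative model of the hybrid dispatch described above.
abstract class Expr {
    final int tag;          // dense tag used by the cold-path tableswitch
    Expr(int tag) { this.tag = tag; }
}

final class Literal extends Expr {
    final int value;
    Literal(int v) { super(0); value = v; }
}

final class BinaryOp extends Expr {
    final Expr lhs, rhs;
    BinaryOp(Expr l, Expr r) { super(1); lhs = l; rhs = r; }
}

final class Negate extends Expr {   // stands in for a "rare" node type
    final Expr inner;
    Negate(Expr e) { super(2); inner = e; }
}

class Evaluator {
    // Hot path: a short instanceof chain covering the most frequent node
    // types, kept small so the JIT can inline it into callers.
    int visitExpr(Expr e) {
        if (e instanceof Literal) return ((Literal) e).value;
        if (e instanceof BinaryOp) {
            BinaryOp b = (BinaryOp) e;
            return visitExpr(b.lhs) + visitExpr(b.rhs); // '+' only, for brevity
        }
        return visitExprCold(e);    // everything else takes the cold path
    }

    // Cold path: O(1) switch on the dense tag for the remaining types,
    // which compiles to a tableswitch when the tags are contiguous.
    private int visitExprCold(Expr e) {
        switch (e.tag) {
            case 2:  return -visitExpr(((Negate) e).inner);
            default: throw new IllegalStateException("unknown tag " + e.tag);
        }
    }
}
```

The real hot path covers seven frequent types; this sketch keeps two to stay short.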
stephenamar-db (Collaborator) approved these changes on Apr 21, 2026 and left a comment:
what would it take to just move to the new Evaluator once and for all?
Can you benchmark them?
Author (Contributor):

I actually think the new one is better now. I will take a benchmark.
Summary

- Profiled ExprTag frequencies; the top 7 types cover 96.1% of visitExpr calls: ValidId (30%), BinaryOp (21%), Val.Literal (18%), Select (13%), Apply1 (5%), ObjExtend (4%), IfElse (4%).
- Split NewEvaluator.visitExpr into a hot path (~120 bytecodes, 7 instanceof checks) and a cold path (a private visitExprCold using tag + @switch for the remaining 30 types).
- The hot path fits within FreqInlineSize=325 bytecodes, enabling C2 to inline visitExpr into callers (visitBinaryOp, visitSelect, etc.). The old evaluator's ~700-bytecode method body never gets inlined.
- Added a --new-evaluator CLI flag for A/B testing.
- Added EvaluatorBenchmark (JMH) and an ExprTagProfile profiling tool.

JMH Results
Steady-state performance (1 fork, 8 warmup, 10 measurement iterations):
Evaluator-heavy benchmarks (bench.01–03, gen_big_object, string_render_perf) show 21–40% improvement. Builtin-dominated benchmarks (bench.04, foldl, comparison) are unaffected — the evaluator dispatch is not their bottleneck.
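This split between improved and unchanged benchmarks is what Amdahl's law predicts. A back-of-envelope sketch, with fractions that are assumed for illustration rather than measured in this PR:

```java
// Amdahl's-law sketch: only workloads that spend a large fraction of
// their time in evaluator dispatch can benefit from faster dispatch.
// The fractions below are assumptions for illustration, not measurements.
class AmdahlSketch {
    // If dispatch takes fraction f of runtime and becomes s times faster,
    // the whole-program speedup is 1 / ((1 - f) + f / s).
    static double speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    public static void main(String[] args) {
        // Evaluator-heavy workload: dispatch ~50% of runtime, ~3x faster.
        System.out.println(speedup(0.50, 3.0)); // 1.5x overall, ~33% faster
        // Builtin-dominated workload: dispatch only ~5% of runtime.
        System.out.println(speedup(0.05, 3.0)); // ~1.03x, effectively "tied"
    }
}
```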
Why it works
The old evaluator's visitExpr compiles to a ~700-bytecode instanceof chain. This exceeds the JIT's FreqInlineSize=325, so C2 never inlines it into callers. Every recursive visitExpr call from within visitBinaryOp, visitSelect, etc. pays full virtual dispatch overhead.

The hybrid approach splits dispatch into:

- instanceof checks for 96% of calls — small enough for C2 to inline
- tag + @switch tableswitch for the remaining 4% — O(1) dispatch instead of scanning 30+ instanceof checks

ExprTag frequency data (global across all 66 benchmark files)
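The frequency data came from the ExprTagProfile tool. A minimal sketch of that kind of tag-frequency counter follows; the class name, API, and structure here are illustrative (the real tool walks parsed Jsonnet ASTs), and only the idea of counting per-tag visits and measuring top-k coverage is taken from the PR:

```java
import java.util.Arrays;

// Illustrative tag-frequency profiler: count how often each tag is
// visited, then report what fraction of all visits the k most frequent
// tags cover. This is how a "top 7 types cover 96.1% of calls" number
// can be derived from raw per-tag counts.
class TagProfile {
    private final long[] counts;

    TagProfile(int numTags) { counts = new long[numTags]; }

    void record(int tag) { counts[tag]++; }

    // Fraction of all recorded visits covered by the k most frequent tags.
    double topKCoverage(int k) {
        long[] sorted = counts.clone();
        Arrays.sort(sorted);                       // ascending order
        long total = 0, top = 0;
        for (long c : sorted) total += c;
        for (int i = 0; i < k && i < sorted.length; i++)
            top += sorted[sorted.length - 1 - i];  // take the largest k
        return total == 0 ? 0.0 : (double) top / total;
    }
}
```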
Test plan
- `./mill 'sjsonnet.jvm[3.3.7]'.test` — all JVM tests pass (both old and new evaluator)
- `./mill __.reformat` — scalafmt clean