perf: comprehension fuse scope+eval and inline BinaryOp(ValidId,ValidId) fast path#686

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/comprehension-binop-inline

Conversation

@He-Pin (Contributor) commented Apr 5, 2026

Motivation

Array comprehensions like [x+y for x in arr for y in arr if x==y] previously collected all valid scopes into an intermediate Array[ValScope], then mapped body evaluation over them. For nested comprehensions, this allocates O(n²) intermediate scopes even when only O(n) results survive filtering. Additionally, the body evaluation dispatches through visitExpr for every element, which has significant overhead for simple expressions like BinaryOp(ValidId, ValidId).

Key Design Decisions

  1. Fused scope+eval: Instead of two passes (collect scopes → map body), visitCompFused directly appends body results during scope traversal, eliminating intermediate array allocation.

  2. BinaryOp(ValidId,ValidId) fast path: For the innermost ForSpec with a binary-op body of two variable references, inline scope lookups and numeric dispatch to avoid 3× visitExpr overhead per iteration. Falls back to general visitExpr for non-numeric types — no code duplication.

  3. Eager evaluation: The fast path evaluates eagerly (not lazy). Both go-jsonnet and jrsonnet evaluate comprehensions eagerly, and eagerness is required for safe mutable scope reuse.

  4. Lean implementation: Unlike the original version, this eliminates the visitBinaryOpValues fallback method (~60 lines), reducing code addition from ~225 to ~200 lines. This avoids native binary size inflation that caused instruction cache regression.
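The fused traversal described above can be sketched in isolation. This is a minimal Scala model, not the actual sjsonnet Evaluator code: `twoPass` mirrors the old collect-scopes-then-map shape, while `fused` appends body results during the nested loop so the intermediate scope array never exists. All names here are illustrative.

```scala
import scala.collection.mutable.ArrayBuilder

// Toy model of [x+y for x in arr for y in arr if x==y].
object FusedComp {
  // Old shape: materialize the array of surviving (x, y) scopes, then map the body.
  def twoPass(arr: Array[Int]): Array[Int] = {
    val scopes = for { x <- arr; y <- arr; if x == y } yield (x, y) // intermediate array
    scopes.map { case (x, y) => x + y }
  }

  // Fused shape: evaluate the body the moment a scope survives the filter,
  // appending straight into an ArrayBuilder. No intermediate scope array.
  def fused(arr: Array[Int]): Array[Int] = {
    val out = ArrayBuilder.make[Int]
    var i = 0
    while (i < arr.length) {
      val x = arr(i)
      var j = 0
      while (j < arr.length) {
        val y = arr(j)
        if (x == y) out += x + y // body evaluated inline during traversal
        j += 1
      }
      i += 1
    }
    out.result()
  }
}
```

The real evaluator walks `ForSpec`/`IfSpec` nodes and extends a `ValScope` per binding, but the control-flow shape is the same: the append happens inside the innermost loop rather than after a full scope-collection pass.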

Modification

sjsonnet/src/sjsonnet/Evaluator.scala:

  • Replace visitComp(Comp) with fused version using ArrayBuilder
  • Add visitCompFused — recursive fused scope+eval loop
  • Add evalBinaryOpNumNum — @switch-dispatched Num×Num fast path covering arithmetic, comparison, bitwise, and shift ops
  • Op guard filters out OP_in, OP_&&, OP_|| from numeric fast path (these need type dispatch / short-circuit semantics)
  • Overflow checks match existing evaluator: none for OP_+, isInfinite check for OP_-, OP_*, OP_/
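A `@switch`-dispatched Num×Num fast path of this kind might look like the following sketch. The opcode constants and method shape are assumptions for illustration, not sjsonnet's actual `Expr.BinaryOp` encoding; the real evaluator falls back to the general `visitExpr` path for anything outside the guarded numeric ops.

```scala
import scala.annotation.switch

object NumFastPath {
  // Hypothetical opcode constants; sjsonnet defines its own binary-op ids.
  final val OP_ADD = 0; final val OP_SUB = 1; final val OP_MUL = 2; final val OP_DIV = 3
  final val OP_LT  = 4; final val OP_EQ  = 5

  // @switch asks the compiler to emit a tableswitch over the constant patterns,
  // avoiding the megamorphic dispatch of the general expression visitor.
  def evalBinaryOpNumNum(op: Int, l: Double, r: Double): Any =
    (op: @switch) match {
      case OP_ADD => l + r              // no overflow check for +, matching the evaluator
      case OP_SUB => checkFinite(l - r) // isInfinite check for -, *, /
      case OP_MUL => checkFinite(l * r)
      case OP_DIV => checkFinite(l / r)
      case OP_LT  => l < r
      case OP_EQ  => l == r
      case _      => throw new UnsupportedOperationException(s"op $op needs the general path")
    }

  private def checkFinite(d: Double): Double =
    if (d.isInfinite) throw new ArithmeticException("numeric overflow") else d
}
```

Ops like `in`, `&&`, and `||` stay out of this table because they need type dispatch or short-circuit semantics, as the op guard above describes.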

sjsonnet/test/resources/new_test_suite/comprehension_binop_types.jsonnet:

  • Regression test covering all binary operators in comprehensions: string concat, numeric arithmetic, comparison, bitwise, string formatting, array concat, in operator

Benchmark Results

JMH (35 benchmarks, 0 regressions)

| Benchmark | Master (ms/op) | This PR (ms/op) | Change |
| --- | --- | --- | --- |
| comparison2 | 74.373 | 37.018 | -50.2% |
| realistic1 | 3.014 | 2.794 | -7.3% |
| realistic2 | 71.861 | 67.872 | -5.5% |
| large_string_template | 2.765 | 2.538 | -8.2% |
| large_string_join | 2.408 | 2.310 | -4.1% |
| bench.03 | 13.634 | 13.526 | -0.8% |
| bench.04 | 33.975 | 33.500 | -1.4% |
| reverse | 11.032 | 10.654 | -3.4% |
| bench.02 | 45.790 | 45.754 | ~0% |

No significant regressions across all 35 benchmarks.

Scala Native Hyperfine (vs jrsonnet 0.4.2)

| Benchmark | Master | This PR | jrsonnet | vs Master | vs jrsonnet |
| --- | --- | --- | --- | --- | --- |
| comparison2 | 173.1 ms | 81.0 ms | 250.1 ms | 2.14× faster | 3.09× faster |
| realistic1 (wall) | 14.4 ms | 17.7 ms | 16.1 ms | +23%¹ | +10% |
| realistic1 (user) | 10.5 ms | 10.3 ms | 13.4 ms | -2% | 1.30× faster |
| realistic2 | 312.3 ms | 318.4 ms | 629.6 ms | +2%² | 1.97× faster |

¹ realistic1 wall time increase is startup/binary-loading overhead, not computation — user time is unchanged (10.3 vs 10.5 ms)
² realistic2 +2% is within noise range

Analysis

The 50% improvement on comparison2 comes from two complementary optimizations:

  • Structural (fused scope+eval): eliminates O(n²) intermediate scope array for [x+y for x in arr for y in arr if x==y] with n=5000 — contributes ~6%
  • BinaryOp inlining: for the 5001 body evaluations that survive the if x==y filter, inline scope lookups + @switch numeric dispatch avoids 3× visitExpr overhead — contributes ~44%

The lean implementation avoids the instruction cache regression seen with the original version (which added ~225 lines including a duplicated visitBinaryOpValues method), keeping realistic1 user time flat.

References

  • Upstream: jit branch commits 3466461a (fuse scope+eval) + 71545ba8 (inline BinaryOp)

Result

Array comprehension evaluation is 2-3× faster for comprehension-heavy workloads, with no regressions on other benchmarks. sjsonnet native now beats jrsonnet by 3.09× on comparison2.

He-Pin marked this pull request as ready for review April 5, 2026 09:44
Fuse comprehension scope building with body evaluation, eliminating
intermediate scope array allocation. For nested comprehensions like
[x+y for x in arr for y in arr if x==y], this avoids allocating O(n²)
intermediate scopes — only the O(n) matching results are materialized.

When the innermost body is BinaryOp(ValidId,ValidId), inline scope
lookups and numeric binary-op dispatch to avoid 3× visitExpr overhead
per iteration. Falls back to general visitExpr for non-numeric types.

Key changes:
- visitCompFused: recursive fused scope+eval loop with ArrayBuilder
- evalBinaryOpNumNum: @switch-dispatched Num×Num fast path
- Non-numeric fallback uses existing visitExpr (no code duplication)

Upstream: jit branch commits 3466461 (fuse) + 71545ba (inline)
He-Pin force-pushed the perf/comprehension-binop-inline branch from 9b5caef to 62c6ef6 on April 6, 2026 05:30