
perf: renderer throughput optimization (indent cache + bulk copy + direct long rendering)#730

Merged
stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/renderer-throughput
Apr 10, 2026
Conversation

@He-Pin
Contributor

@He-Pin He-Pin commented Apr 10, 2026

Motivation

The materialization/rendering pipeline is the primary bottleneck for large-output workloads. For realistic2 (28.6 MB output, 568K lines, 125K objects, 380K strings), --debug-stats shows 99.8% of wall time is spent in materialization. The previous implementation used per-character loops for indent rendering and intermediate String allocation for number formatting, leaving significant throughput on the table.

Key Design Decisions

  1. Indent cache scope: Lives in BaseCharRenderer (not Renderer) so all renderer subclasses (Renderer, MaterializeJsonRenderer, PythonRenderer) benefit automatically.
  2. MaxCachedDepth = 32: Covers virtually all real-world Jsonnet (realistic2 max depth ~5). Beyond this, falls back to the original per-character loop.
  3. Negative accumulator in appendLong: Handles Long.MinValue correctly without overflow (negating Long.MinValue overflows Long).
  4. Zero-allocation number rendering: For integer-valued doubles (the common case in Jsonnet), digits are written directly into CharBuilder instead of going through Long.toString followed by a char-by-char copy of the resulting String.
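The negative-accumulator idea in point 3 can be sketched as below. This is a minimal standalone version, not the actual RenderUtils.appendLong: a StringBuilder stands in for sjsonnet's CharBuilder, and names are illustrative.

```scala
// Sketch: render a Long without allocating an intermediate String.
// The accumulator is kept non-positive, because -Long.MinValue overflows
// but Long.MinValue itself is representable, so no special case is needed.
def appendLong(sb: StringBuilder, value: Long): Unit = {
  if (value < 0) sb.append('-')
  var n = if (value > 0) -value else value // non-positive accumulator
  val start = sb.length
  while (n != 0) {                         // emit digits least-significant first
    sb.append(('0' - (n % 10)).toChar)     // n % 10 is in -9..0
    n /= 10
  }
  if (sb.length == start) sb.append('0')   // value == 0
  // reverse the digit run in place
  var i = start
  var j = sb.length - 1
  while (i < j) {
    val t = sb.charAt(i); sb.setCharAt(i, sb.charAt(j)); sb.setCharAt(j, t)
    i += 1; j -= 1
  }
}
```

Because the sign is written up front and only digit characters are reversed, Long.MinValue falls out of the same code path as every other value.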

Modifications

BaseCharRenderer.scala

  • Added companion object with MaxCachedDepth = 32
  • Added indentCache field: pre-computed Array[Array[Char]] with newline + indent*d spaces for each depth level, constructed once at renderer creation
  • Updated renderIndent() to use cached arrays via appendAll (single System.arraycopy) for depths < 32
  • Updated appendString() to use String.getChars bulk copy instead of char-by-char loop
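The indent cache described above can be sketched roughly as follows. Class and method names here are illustrative stand-ins (a StringBuilder in place of CharBuilder), not the actual BaseCharRenderer API.

```scala
// Sketch: pre-compute one char array per depth, each holding '\n'
// followed by indent*depth spaces, built once at construction.
final class IndentCache(indent: Int, maxCachedDepth: Int = 32) {
  private val cache: Array[Array[Char]] =
    Array.tabulate(maxCachedDepth) { d =>
      val a = new Array[Char](1 + indent * d)
      a(0) = '\n'
      java.util.Arrays.fill(a, 1, a.length, ' ')
      a
    }

  // Cached depths take a single bulk append (one arraycopy under the
  // hood); deeper nesting falls back to the per-character loop.
  def renderIndent(sb: StringBuilder, depth: Int): Unit =
    if (depth < maxCachedDepth) sb.appendAll(cache(depth))
    else {
      sb.append('\n')
      var i = 0
      while (i < indent * depth) { sb.append(' '); i += 1 }
    }
}
```

Since real documents rarely nest past a few levels (realistic2 tops out around depth 5), the cached path covers essentially every indent written.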

Renderer.scala

  • Updated visitFloat64() to render integers directly via RenderUtils.appendLong()
  • Updated flushBuffer() to use indentCache for bulk indent rendering
  • Added RenderUtils.appendLong(): renders Long directly into CharBuilder using negative accumulator + reverse-in-place algorithm
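The integer fast path behind the visitFloat64() change can be sketched like this. It is a simplified illustration, not the actual Renderer code: a Double that round-trips through Long holds an exact integer and can take the direct-long route; everything else falls back to standard Double formatting.

```scala
// Sketch: detect integer-valued doubles and render them as longs.
// The round-trip check is false for NaN, ±Infinity, fractional values,
// and magnitudes beyond Long range, so they all take the fallback.
def renderNumber(d: Double): String =
  if (d == d.toLong.toDouble) d.toLong.toString // exact integer in Long range
  else d.toString                               // fractional, huge, NaN, ±Inf
```

One caveat of this simplified check: -0.0 takes the fast path and renders as 0; whether that matches sjsonnet's exact behavior is not verified here.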

RendererTests.scala

  • Added appendLong edge case tests: 0, positive, negative, large, Long.MaxValue, Long.MinValue
  • Added visitFloat64Integers tests for end-to-end integer rendering
  • Added indentZero test for indent=0 edge case

Benchmark Results

JMH (JVM, isolated runs, lower is better)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
| --- | --- | --- | --- |
| realistic2 | 68.749 | 58.001 | -15.6% |
| reverse | 10.494 | 8.436 | -19.6% |
| gen_big_object | 1.066 | 1.000 | -6.2% ✅ |
| bench.02 | 39.832 | 39.322 | -1.3% ≈ |
| comparison | 20.216 | 21.060 | +4.2% (noise; eval-only, the output is just `true`) |
| realistic1 | 2.015 | 2.133 | within noise |

No regressions across the full 35-benchmark JMH suite.

Hyperfine (Scala Native, --warmup 3 --min-runs 10)

realistic2 (28.6 MB output):

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 264.9 ± 4.2 | 2.48x slower |
| sjsonnet-native (this PR) | 262.2 ± 2.9 | 2.45x slower |
| jrsonnet 0.5.0-pre98 | 106.8 ± 16.3 | baseline |

reverse (large array output):

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 53.1 ± 2.8 | 2.22x slower |
| sjsonnet-native (this PR) | 38.0 ± 2.3 | 1.59x slower |
| jrsonnet 0.5.0-pre98 | 24.0 ± 1.7 | baseline |

Gap closed from 2.22x to 1.59x (a 28.4% reduction in wall time).

gen_big_object:

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 12.1 ± 1.5 | 1.16x slower |
| sjsonnet-native (this PR) | 10.4 ± 1.1 | 1.01x (tied!) |
| jrsonnet 0.5.0-pre98 | 10.5 ± 1.3 | baseline |

realistic1:

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 12.9 ± 1.4 | |
| sjsonnet-native (this PR) | 12.0 ± 1.4 | 1.15x faster |
| jrsonnet 0.5.0-pre98 | 13.9 ± 2.1 | baseline |

sjsonnet already beats jrsonnet on realistic1 (1.15x faster).

Analysis

The JVM improvement is larger (15.6% on realistic2) because the JIT compiler was still leaving performance on the table with the char-by-char loops. On Scala Native, LLVM already partially optimizes these loops, so the native improvement is smaller for realistic2 but significant for reverse (28.4%), where the output contains many integer-valued doubles that benefit from the zero-allocation appendLong path.

The gen_big_object benchmark is now tied with jrsonnet (10.4ms vs 10.5ms), and realistic1 beats jrsonnet by 1.15x.

Result

  • ✅ All 141 test suites pass (JVM, Scala 3.3.7)
  • ✅ Compiles on all platforms (JVM, JS, Native)
  • ✅ No regressions across the full benchmark suite
  • ✅ Comprehensive new test coverage for edge cases

This PR supersedes #676 (renderer-indent-cache), #681 (renderer-bulk-append), and #685 (direct-long-rendering), which each implemented a subset of these optimizations individually.

…and direct long rendering

Optimize the materialization/rendering pipeline which is the primary bottleneck
for large-output workloads (e.g. realistic2: 28.6 MB output, 99.8% materialization).

Three complementary optimizations:

1. **Indent cache** (BaseCharRenderer): Pre-compute newline+spaces arrays for depths
   0..31. renderIndent() and Renderer.flushBuffer() now do a single
   System.arraycopy via appendAll instead of per-character space loops.
   Particularly impactful on Scala Native where there is no JIT to unroll loops.

2. **Bulk string copy** (BaseCharRenderer.appendString): Use String.getChars for
   O(1) bulk copy instead of character-by-character loop.

3. **Direct long rendering** (RenderUtils.appendLong): Render integer-valued doubles
   directly into the CharBuilder without intermediate String allocation. Uses
   negative accumulator algorithm to correctly handle Long.MinValue.

Also adds comprehensive tests for appendLong edge cases (0, negatives, Long.MinValue,
Long.MaxValue) and indent=0 rendering.
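The bulk string copy in optimization 2 can be sketched as below. The target buffer here is a StringBuilder standing in for sjsonnet's CharBuilder, so the shape is illustrative rather than the actual appendString implementation.

```scala
// Sketch: copy a whole String into a char buffer with String.getChars,
// which performs one System.arraycopy under the hood, instead of a
// per-character append loop.
def appendString(out: StringBuilder, s: String): Unit = {
  val buf = new Array[Char](s.length)
  s.getChars(0, s.length, buf, 0) // single bulk copy of all characters
  out.appendAll(buf)              // append the run in one call
}
```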