
perf: renderer throughput optimization (indent cache + bulk copy + direct long rendering)#730

Merged
stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/renderer-throughput
Apr 10, 2026
Conversation

@He-Pin
Contributor

@He-Pin He-Pin commented Apr 10, 2026

Motivation

The materialization/rendering pipeline is the primary bottleneck for large-output workloads. For realistic2 (28.6 MB output, 568K lines, 125K objects, 380K strings), --debug-stats shows 99.8% of wall time is spent in materialization. The previous implementation used per-character loops for indent rendering and intermediate String allocation for number formatting, leaving significant throughput on the table.

Key Design Decisions

  1. Indent cache scope: Lives in BaseCharRenderer (not Renderer) so all renderer subclasses (Renderer, MaterializeJsonRenderer, PythonRenderer) benefit automatically.
  2. MaxCachedDepth = 32: Covers virtually all real-world Jsonnet (realistic2 max depth ~5). Beyond this, falls back to the original per-character loop.
  3. Negative accumulator in appendLong: Handles Long.MinValue correctly without overflow (negating Long.MinValue overflows Long).
  4. Zero-allocation number rendering: For integer-valued doubles (the common case in Jsonnet), digits are written directly into CharBuilder instead of going through Long.toString followed by a char-by-char copy of the resulting String.
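The negative-accumulator idea in point 3 can be sketched as below. This is a minimal standalone version, not the actual RenderUtils.appendLong: a StringBuilder stands in for sjsonnet's CharBuilder, and names are illustrative.

```scala
// Sketch: render a Long without allocating an intermediate String.
// The accumulator is kept non-positive, because -Long.MinValue overflows
// but Long.MinValue itself is representable, so no special case is needed.
def appendLong(sb: StringBuilder, value: Long): Unit = {
  if (value < 0) sb.append('-')
  var n = if (value > 0) -value else value // non-positive accumulator
  val start = sb.length
  while (n != 0) {                         // emit digits least-significant first
    sb.append(('0' - (n % 10)).toChar)     // n % 10 is in -9..0
    n /= 10
  }
  if (sb.length == start) sb.append('0')   // value == 0
  // reverse the digit run in place
  var i = start
  var j = sb.length - 1
  while (i < j) {
    val t = sb.charAt(i); sb.setCharAt(i, sb.charAt(j)); sb.setCharAt(j, t)
    i += 1; j -= 1
  }
}
```

Because the sign is written up front and only digit characters are reversed, Long.MinValue falls out of the same code path as every other value.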

Modifications

BaseCharRenderer.scala

  • Added companion object with MaxCachedDepth = 32
  • Added indentCache field: pre-computed Array[Array[Char]] with newline + indent*d spaces for each depth level, constructed once at renderer creation
  • Updated renderIndent() to use cached arrays via appendAll (single System.arraycopy) for depths < 32
  • Updated appendString() to use String.getChars bulk copy instead of char-by-char loop
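The indent cache described above can be sketched roughly as follows. Class and method names here are illustrative stand-ins (a StringBuilder in place of CharBuilder), not the actual BaseCharRenderer API.

```scala
// Sketch: pre-compute one char array per depth, each holding '\n'
// followed by indent*depth spaces, built once at construction.
final class IndentCache(indent: Int, maxCachedDepth: Int = 32) {
  private val cache: Array[Array[Char]] =
    Array.tabulate(maxCachedDepth) { d =>
      val a = new Array[Char](1 + indent * d)
      a(0) = '\n'
      java.util.Arrays.fill(a, 1, a.length, ' ')
      a
    }

  // Cached depths take a single bulk append (one arraycopy under the
  // hood); deeper nesting falls back to the per-character loop.
  def renderIndent(sb: StringBuilder, depth: Int): Unit =
    if (depth < maxCachedDepth) sb.appendAll(cache(depth))
    else {
      sb.append('\n')
      var i = 0
      while (i < indent * depth) { sb.append(' '); i += 1 }
    }
}
```

Since real documents rarely nest past a few levels (realistic2 tops out around depth 5), the cached path covers essentially every indent written.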

Renderer.scala

  • Updated visitFloat64() to render integers directly via RenderUtils.appendLong()
  • Updated flushBuffer() to use indentCache for bulk indent rendering
  • Added RenderUtils.appendLong(): renders Long directly into CharBuilder using negative accumulator + reverse-in-place algorithm
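The integer fast path behind the visitFloat64() change can be sketched like this. It is a simplified illustration, not the actual Renderer code: a Double that round-trips through Long holds an exact integer and can take the direct-long route; everything else falls back to standard Double formatting.

```scala
// Sketch: detect integer-valued doubles and render them as longs.
// The round-trip check is false for NaN, ±Infinity, fractional values,
// and magnitudes beyond Long range, so they all take the fallback.
def renderNumber(d: Double): String =
  if (d == d.toLong.toDouble) d.toLong.toString // exact integer in Long range
  else d.toString                               // fractional, huge, NaN, ±Inf
```

One caveat of this simplified check: -0.0 takes the fast path and renders as 0; whether that matches sjsonnet's exact behavior is not verified here.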

RendererTests.scala

  • Added appendLong edge case tests: 0, positive, negative, large, Long.MaxValue, Long.MinValue
  • Added visitFloat64Integers tests for end-to-end integer rendering
  • Added indentZero test for indent=0 edge case

Benchmark Results

JMH (JVM, isolated runs, lower is better)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
| --- | --- | --- | --- |
| realistic2 | 68.749 | 58.001 | -15.6% |
| reverse | 10.494 | 8.436 | -19.6% |
| gen_big_object | 1.066 | 1.000 | -6.2% ✅ |
| bench.02 | 39.832 | 39.322 | -1.3% ≈ |
| comparison | 20.216 | 21.060 | +4.2% (noise; eval-only, the output is just `true`) |
| realistic1 | 2.015 | 2.133 | within noise |

No regressions across the full 35-benchmark JMH suite.

Hyperfine (Scala Native, --warmup 3 --min-runs 10)

realistic2 (28.6 MB output):

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 264.9 ± 4.2 | 2.48x slower |
| sjsonnet-native (this PR) | 262.2 ± 2.9 | 2.45x slower |
| jrsonnet 0.5.0-pre98 | 106.8 ± 16.3 | baseline |

reverse (large array output):

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 53.1 ± 2.8 | 2.22x slower |
| sjsonnet-native (this PR) | 38.0 ± 2.3 | 1.59x slower |
| jrsonnet 0.5.0-pre98 | 24.0 ± 1.7 | baseline |

Gap closed from 2.22x to 1.59x (a 28.4% reduction in wall time).

gen_big_object:

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 12.1 ± 1.5 | 1.16x slower |
| sjsonnet-native (this PR) | 10.4 ± 1.1 | 1.01x (tied!) |
| jrsonnet 0.5.0-pre98 | 10.5 ± 1.3 | baseline |

realistic1:

| Implementation | Time (ms) | vs jrsonnet |
| --- | --- | --- |
| sjsonnet-native (master) | 12.9 ± 1.4 | |
| sjsonnet-native (this PR) | 12.0 ± 1.4 | 1.15x faster |
| jrsonnet 0.5.0-pre98 | 13.9 ± 2.1 | baseline |

sjsonnet already beats jrsonnet on realistic1 (1.15x faster).

Analysis

The JVM improvement is larger (15.6% on realistic2) because the JIT compiler was still leaving performance on the table with the char-by-char loops. On Scala Native, LLVM already partially optimizes these loops, so the native improvement is smaller for realistic2 but significant for reverse (28.4%), where the output contains many integer-valued doubles that benefit from the zero-allocation appendLong path.

The gen_big_object benchmark is now tied with jrsonnet (10.4ms vs 10.5ms), and realistic1 beats jrsonnet by 1.15x.

Result

  • ✅ All 141 test suites pass (JVM, Scala 3.3.7)
  • ✅ Compiles on all platforms (JVM, JS, Native)
  • ✅ No regressions across the full benchmark suite
  • ✅ Comprehensive new test coverage for edge cases

This PR supersedes #676 (renderer-indent-cache), #681 (renderer-bulk-append), and #685 (direct-long-rendering), which each implemented a subset of these optimizations individually.

…and direct long rendering

Optimize the materialization/rendering pipeline which is the primary bottleneck
for large-output workloads (e.g. realistic2: 28.6 MB output, 99.8% materialization).

Three complementary optimizations:

1. **Indent cache** (BaseCharRenderer): Pre-compute newline+spaces arrays for depths
   0..31. renderIndent() and Renderer.flushBuffer() now do a single
   System.arraycopy via appendAll instead of per-character space loops.
   Particularly impactful on Scala Native where there is no JIT to unroll loops.

2. **Bulk string copy** (BaseCharRenderer.appendString): Use String.getChars for
   O(1) bulk copy instead of character-by-character loop.

3. **Direct long rendering** (RenderUtils.appendLong): Render integer-valued doubles
   directly into the CharBuilder without intermediate String allocation. Uses
   negative accumulator algorithm to correctly handle Long.MinValue.

Also adds comprehensive tests for appendLong edge cases (0, negatives, Long.MinValue,
Long.MaxValue) and indent=0 rendering.
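The bulk string copy in optimization 2 can be sketched as below. The target buffer here is a StringBuilder standing in for sjsonnet's CharBuilder, so the shape is illustrative rather than the actual appendString implementation.

```scala
// Sketch: copy a whole String into a char buffer with String.getChars,
// which performs one System.arraycopy under the hood, instead of a
// per-character append loop.
def appendString(out: StringBuilder, s: String): Unit = {
  val buf = new Array[Char](s.length)
  s.getChars(0, s.length, buf, 0) // single bulk copy of all characters
  out.appendAll(buf)              // append the run in one call
}
```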