perf: render escaped byte strings in one pass #845

Closed
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/byte-renderer-single-pass-escape

Conversation


@He-Pin He-Pin commented May 12, 2026

Motivation:
large_string_template renders a ~550 KB JSON string with many escaped newlines. ByteRenderer's long-string path already finds the first escaped byte, but the escaped branch then rescans the rest of the byte array to compute the exact output length before doing the actual copy/escape pass.

Key Design Decision:
Avoid the second escape scan and keep the existing byte-oriented renderer path. The renderer now reserves a conservative initial buffer size, updates ByteBuilder.length before capacity checks, and refreshes the backing byte array after each possible growth.

Modification:

  • Remove the escaped-length pre-scan from BaseByteRenderer.visitLongString.
  • Copy clean chunks and inline escaped bytes in one pass after findFirstEscapeChar.
  • Add a long escaped string regression test comparing ByteRenderer output against the existing char Renderer, covering newline, quote, backslash, tab, \u00XX control escaping, and trailing plain text.
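
The one-pass shape described above can be sketched as follows. This is a simplified stand-in, not sjsonnet's actual code: `SinglePassEscape`, its scalar `findFirstEscape` (in place of the real SWAR `findFirstEscapeChar`), and the use of `ByteArrayOutputStream` in place of `ByteBuilder` are all illustrative, and `\b`/`\f` fall through to the generic `\u00XX` branch for brevity.

```scala
// Simplified sketch of a one-pass copy/escape loop over a UTF-8 byte array.
// A growable ByteArrayOutputStream stands in for sjsonnet's ByteBuilder.
object SinglePassEscape {
  // Scalar stand-in for the SWAR findFirstEscapeChar: index of the first
  // byte that needs a JSON escape, or -1 if the rest is clean.
  def findFirstEscape(bytes: Array[Byte], from: Int): Int = {
    var i = from
    while (i < bytes.length) {
      val b = bytes(i) & 0xff
      if (b == '"' || b == '\\' || b < 0x20) return i
      i += 1
    }
    -1
  }

  def escapeBytes(bytes: Array[Byte]): Array[Byte] = {
    // Conservative initial reserve, mirroring bLen + 2 + (bLen >>> 5);
    // the buffer grows on demand instead of pre-scanning for exact length.
    val out = new java.io.ByteArrayOutputStream(
      bytes.length + 2 + (bytes.length >>> 5))
    var pos = 0
    while (pos < bytes.length) {
      val esc = findFirstEscape(bytes, pos)
      if (esc < 0) {                               // clean tail: one bulk copy
        out.write(bytes, pos, bytes.length - pos)
        pos = bytes.length
      } else {
        out.write(bytes, pos, esc - pos)           // copy the clean chunk
        val b = bytes(esc) & 0xff
        val escaped =                              // inline the escape bytes
          if (b == '"') "\\\""
          else if (b == '\\') "\\\\"
          else if (b == '\n') "\\n"
          else if (b == '\t') "\\t"
          else if (b == '\r') "\\r"
          else f"\\u$b%04x"                        // remaining control bytes
        out.write(escaped.getBytes("UTF-8"))
        pos = esc + 1
      }
    }
    out.toByteArray
  }
}
```

The key property is that each byte of input is visited once: either as part of a bulk clean-chunk copy or as a single inlined escape, with no second length-computing scan.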

Benchmark Results:
JMH, JVM 21, single-threaded bench.runRegressions on bench/resources/cpp_suite/large_string_template.jsonnet:

| Case | Before | After | Result |
| --- | --- | --- | --- |
| large_string_template | 0.701–0.722 ms/op | 0.578–0.655 ms/op | positive |
| repeat_format guard | 0.135 ms/op | 0.135–0.141 ms/op | neutral/noisy |
| large_string_join guard | 0.257 ms/op | 0.255–0.281 ms/op | neutral/noisy |

JMH GC profile on large_string_template:

| Case | Score | Alloc/op |
| --- | --- | --- |
| Before | 0.701 ms/op | 7,199,720 B/op |
| After | 0.584 ms/op | 7,199,111 B/op |

Scala Native hyperfine, 10 runs, large_string_template:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
| --- | --- | --- | --- | --- |
| sjsonnet native baseline ca61a7a3 | 13.1 ± 1.6 | 10.7 | 15.9 | 2.17 ± 0.58 |
| sjsonnet native single-pass escape | 12.9 ± 1.1 | 11.3 | 14.5 | 2.14 ± 0.54 |
| jrsonnet rust | 6.0 ± 1.4 | 4.6 | 9.6 | 1.00 |

The jrsonnet upstream benchmark document lists this same case at 2.1 ms for Rust and 14.5 ms for Scala Native on its machine; local absolute numbers differ, but the remaining local gap is still about 2.1× in favor of jrsonnet.

Analysis:
Stack profiling showed BaseByteRenderer.visitLongString and escape scanning on the hot path, while lazy format concatenation itself did not carry enough top-stack weight to matter. A rejected side experiment that marked formatted strings as ASCII-safe was incorrect for newline-containing JSON strings, and a correct ASCII-only char loop was much slower. The retained change preserves the existing UTF-8 byte path and only removes the duplicated escape scan.

Validation:

  • ./mill --no-server --ticker false --color false -j 1 'sjsonnet.jvm[3.3.7]'.test.testOnly sjsonnet.RendererTests passed.
  • ./mill --no-server --ticker false --color false -j 1 'sjsonnet.jvm[3.3.7]'.test passed.
  • ./mill --no-server --ticker false --color false -j 1 __.test passed.
  • ./mill --no-server --ticker false --color false -j 1 __.checkFormat passed.
  • ./mill --no-server --ticker false --color false -j 1 __.reformat was run before commit.
  • ./mill --no-server --ticker false --color false -j 1 'sjsonnet.native[3.3.7]'.nativeLink passed for both baseline and this branch.

References:

  • Target benchmark: bench/resources/cpp_suite/large_string_template.jsonnet.
  • jrsonnet benchmark source: jrsonnet/docs/benchmarks.adoc, section "Large string template".
  • Local commit: 55ff93581ab223cc7c10e6a7945097ccce2dc35d.

Result:
Long escaped JSON strings now render with a single escape/copy pass in ByteRenderer. JVM JMH improves on the target workload; Scala Native is slightly positive but within hyperfine noise, so this remains a draft for review.

Motivation:
Long JSON strings that contain escape characters used to scan the UTF-8 byte array twice in ByteRenderer: once to find escapes and once to pre-compute the exact escaped length. The large_string_template benchmark spends visible time in this path.

Modification:
Render escaped long strings with one copy/escape pass, growing ByteBuilder incrementally and refreshing its backing array after capacity checks. Add a regression test that compares ByteRenderer output with the char Renderer for long escaped strings including two-byte escapes, six-byte control escapes, and a trailing plain tail.

Result:
The large_string_template JMH target improves from roughly 0.70-0.72 ms/op to roughly 0.58-0.65 ms/op in local runs, while full tests and formatting checks remain green.

References:
bench/resources/cpp_suite/large_string_template.jsonnet
@He-Pin He-Pin marked this pull request as ready for review May 12, 2026 09:38
@He-Pin He-Pin marked this pull request as draft May 12, 2026 09:38
@He-Pin He-Pin closed this May 12, 2026

He-Pin commented May 12, 2026

Closing after review — the change reverses a deliberate design decision from #809 and the benefit is too narrow to justify it.

Why this contradicts #809

PR #809 explicitly stated:

Precompute the exact escaped output length, reserve ByteBuilder once, then write directly to the backing byte array. This removes repeated ensureLength/appendUnsafeC calls from the dirty long-string loop.

This PR puts those per-iteration ensureLength calls (and the matching arr = elemBuilder.arr reloads) back into the hot loop — once before each chunk copy, once before each escape, once before the tail copy, once before the closing quote. The trade is "one SWAR rescan" for "2×N virtual ensureLength calls + array re-fetches", where N is the number of escapes.
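
The pre-scan that #809 relied on can be sketched as a single counting loop; `EscapedLength` and its byte accounting here are a simplified illustration, not sjsonnet's actual code. It is this one extra scan that lets `ByteBuilder` reserve exactly once and keep `ensureLength` out of the write loop:

```scala
// Simplified sketch of #809's pre-scan: count the extra bytes every escape
// will add, so a single allocation can hold the exact output and the write
// loop needs no per-iteration capacity checks or array re-fetches.
object EscapedLength {
  def escapedLength(bytes: Array[Byte]): Int = {
    var total = bytes.length + 2 // payload plus the surrounding quotes
    var i = 0
    while (i < bytes.length) {
      val b = bytes(i) & 0xff
      if (b == '"' || b == '\\' || b == '\n' || b == '\t' ||
          b == '\r' || b == '\b' || b == '\f') total += 1 // "\n" etc.: 2 bytes out
      else if (b < 0x20) total += 5                       // "\u00XX": 6 bytes out
      i += 1
    }
    total
  }
}
```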

Benchmark numbers don't hold up

| Metric | Before | After | Verdict |
| --- | --- | --- | --- |
| JMH `large_string_template` | 0.701–0.722 ms/op | 0.578–0.655 ms/op | ~17% faster on JVM |
| JMH variance (range) | 0.021 ms | 0.077 ms | 3.7× noisier |
| GC alloc/op | 7,199,720 B | 7,199,111 B | 0.008%, effectively unchanged |
| Scala Native hyperfine | 13.1 ± 1.6 ms | 12.9 ± 1.1 ms | within noise |

The only positive signal is a single JVM workload, and even there the variance widens significantly. Native shows no movement and allocation is flat — both consistent with "one fewer SWAR pass over warm cache" rather than a structural improvement.

Untested regression risk: dense-escape workloads

`large_string_template.jsonnet` is a sparse-escape workload (one `\n` per ~70 bytes of plain text). The conservative initial buffer `bLen + 2 + (bLen >>> 5)` (~3% slack) is fine for that shape, but for dense `\u00XX` control-character escaping (6 bytes out per byte in):

  • Initial allocation is short by ~6×, forcing multiple `ByteBuilder` doublings.
  • Per-escape `ensureLength` overhead and `arr` reloads dominate the loop.
  • The pre-#845 path needed only one allocation and one SWAR pre-scan.

No benchmark in this PR exercises that case, so the reported win could turn into a regression on real-world strings with many control characters.
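
The shortfall is easy to work through for the all-control-bytes extreme. The numbers below are plain arithmetic over the reserve formula quoted above, not measurements; the 1 MiB input size is an arbitrary example:

```scala
// Worked example of the dense-escape shortfall: if every input byte is a
// control byte, each expands to a six-byte \u00XX escape, while the
// conservative reserve bLen + 2 + (bLen >>> 5) adds only ~3% slack.
object DenseEscapeMath {
  val bLen     = 1 << 20                     // 1 MiB of control bytes
  val reserved = bLen + 2 + (bLen >>> 5)     // ≈ 1.03 × bLen
  val needed   = 6 * bLen + 2                // every byte becomes \u00XX
  val doublings: Int = {                     // growths a doubling buffer performs
    var cap = reserved; var n = 0
    while (cap < needed) { cap *= 2; n += 1 }
    n
  }
}
```

For this shape the initial reserve covers barely a sixth of the output, so the builder must double several times, and each doubling re-copies everything written so far.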

Better direction if we revisit this

If we want to keep one pass without giving up on the precompute, the right shape is to make `findFirstEscapeChar` return both the position and an extra-bytes counter in a single SWAR pass — preserving #809's "reserve once, no per-iter ensureLength" property without needing a second scan. That's strictly better than either current master or this PR.
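
A scalar sketch of that return shape, with the SWAR machinery omitted; `ScanAndCount`, the `scanEscapes` name, and the packed-`Long` encoding are all illustrative assumptions, not sjsonnet's API:

```scala
// Sketch of the suggested direction: one scan reports both the first
// escape position and the total extra bytes escaping will add, so the
// caller can still reserve once with no second pass. A scalar loop
// stands in for the SWAR implementation.
object ScanAndCount {
  /** Packs (firstEscapeIndex, extraBytes) into one Long:
    * high 32 bits = first escape index (input length if none),
    * low 32 bits  = extra output bytes all escapes will need. */
  def scanEscapes(bytes: Array[Byte]): Long = {
    var first = bytes.length
    var extra = 0
    var i = 0
    while (i < bytes.length) {
      val b = bytes(i) & 0xff
      val add =
        if (b == '"' || b == '\\' || b == '\n' || b == '\t' ||
            b == '\r' || b == '\b' || b == '\f') 1 // two-byte escape
        else if (b < 0x20) 5                       // six-byte \u00XX escape
        else 0
      if (add != 0 && first == bytes.length) first = i
      extra += add
      i += 1
    }
    (first.toLong << 32) | (extra.toLong & 0xffffffffL)
  }
}
```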

Closing for now. Reopen-worthy with that shape, or with a benchmark suite that includes a dense-escape case.
