perf: render escaped byte strings in one pass #845

Closed
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/byte-renderer-single-pass-escape

Conversation


@He-Pin He-Pin commented May 12, 2026

Motivation:
large_string_template renders a ~550 KB JSON string with many escaped newlines. ByteRenderer's long-string path already finds the first escaped byte, but the escaped branch then rescans the rest of the byte array to compute the exact output length before doing the actual copy/escape pass.

Key Design Decision:
Avoid the second escape scan and keep the existing byte-oriented renderer path. The renderer now reserves a conservative initial buffer size, updates ByteBuilder.length before capacity checks, and refreshes the backing byte array after each possible growth.

Modification:

  • Remove the escaped-length pre-scan from BaseByteRenderer.visitLongString.
  • Copy clean chunks and inline escaped bytes in one pass after findFirstEscapeChar.
  • Add a long escaped string regression test comparing ByteRenderer output against the existing char Renderer, covering newline, quote, backslash, tab, \u00XX control escaping, and trailing plain text.
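
The one-pass shape described above can be sketched as follows. This is a simplified stand-in, not sjsonnet's actual code: `SinglePassEscape`, its scalar `findFirstEscape` (in place of the real SWAR `findFirstEscapeChar`), and the use of `ByteArrayOutputStream` in place of `ByteBuilder` are all illustrative, and `\b`/`\f` fall through to the generic `\u00XX` branch for brevity.

```scala
// Simplified sketch of a one-pass copy/escape loop over a UTF-8 byte array.
// A growable ByteArrayOutputStream stands in for sjsonnet's ByteBuilder.
object SinglePassEscape {
  // Scalar stand-in for the SWAR findFirstEscapeChar: index of the first
  // byte that needs a JSON escape, or -1 if the rest is clean.
  def findFirstEscape(bytes: Array[Byte], from: Int): Int = {
    var i = from
    while (i < bytes.length) {
      val b = bytes(i) & 0xff
      if (b == '"' || b == '\\' || b < 0x20) return i
      i += 1
    }
    -1
  }

  def escapeBytes(bytes: Array[Byte]): Array[Byte] = {
    // Conservative initial reserve, mirroring bLen + 2 + (bLen >>> 5);
    // the buffer grows on demand instead of pre-scanning for exact length.
    val out = new java.io.ByteArrayOutputStream(
      bytes.length + 2 + (bytes.length >>> 5))
    var pos = 0
    while (pos < bytes.length) {
      val esc = findFirstEscape(bytes, pos)
      if (esc < 0) {                               // clean tail: one bulk copy
        out.write(bytes, pos, bytes.length - pos)
        pos = bytes.length
      } else {
        out.write(bytes, pos, esc - pos)           // copy the clean chunk
        val b = bytes(esc) & 0xff
        val escaped =                              // inline the escape bytes
          if (b == '"') "\\\""
          else if (b == '\\') "\\\\"
          else if (b == '\n') "\\n"
          else if (b == '\t') "\\t"
          else if (b == '\r') "\\r"
          else f"\\u$b%04x"                        // remaining control bytes
        out.write(escaped.getBytes("UTF-8"))
        pos = esc + 1
      }
    }
    out.toByteArray
  }
}
```

The key property is that each byte of input is visited once: either as part of a bulk clean-chunk copy or as a single inlined escape, with no second length-computing scan.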

Benchmark Results:
JMH, JVM 21, single-threaded bench.runRegressions on bench/resources/cpp_suite/large_string_template.jsonnet:

| Case | Before | After | Result |
| --- | --- | --- | --- |
| large_string_template | 0.701–0.722 ms/op | 0.578–0.655 ms/op | positive |
| repeat_format guard | 0.135 ms/op | 0.135–0.141 ms/op | neutral/noisy |
| large_string_join guard | 0.257 ms/op | 0.255–0.281 ms/op | neutral/noisy |

JMH GC profile on large_string_template:

| Case | Score | Alloc/op |
| --- | --- | --- |
| Before | 0.701 ms/op | 7,199,720 B/op |
| After | 0.584 ms/op | 7,199,111 B/op |

Scala Native hyperfine, 10 runs, large_string_template:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
| --- | --- | --- | --- | --- |
| sjsonnet native baseline ca61a7a3 | 13.1 ± 1.6 | 10.7 | 15.9 | 2.17 ± 0.58 |
| sjsonnet native single-pass escape | 12.9 ± 1.1 | 11.3 | 14.5 | 2.14 ± 0.54 |
| jrsonnet rust | 6.0 ± 1.4 | 4.6 | 9.6 | 1.00 |

The jrsonnet upstream benchmark document lists this same case at 2.1 ms for Rust and 14.5 ms for Scala Native on its machine; local absolute numbers differ, but the remaining local gap is still about 2.1× in favor of jrsonnet.

Analysis:
Stack profiling showed BaseByteRenderer.visitLongString and escape scanning on the hot path, while lazy format concatenation itself did not carry enough top-stack weight to matter. A rejected side experiment that marked formatted strings as ASCII-safe was incorrect for newline-containing JSON strings, and a correct ASCII-only char loop was much slower. The retained change preserves the existing UTF-8 byte path and only removes the duplicated escape scan.

Validation:

  • ./mill --no-server --ticker false --color false -j 1 'sjsonnet.jvm[3.3.7]'.test.testOnly sjsonnet.RendererTests passed.
  • ./mill --no-server --ticker false --color false -j 1 'sjsonnet.jvm[3.3.7]'.test passed.
  • ./mill --no-server --ticker false --color false -j 1 __.test passed.
  • ./mill --no-server --ticker false --color false -j 1 __.checkFormat passed.
  • ./mill --no-server --ticker false --color false -j 1 __.reformat was run before commit.
  • ./mill --no-server --ticker false --color false -j 1 'sjsonnet.native[3.3.7]'.nativeLink passed for both baseline and this branch.

References:

  • Target benchmark: bench/resources/cpp_suite/large_string_template.jsonnet.
  • jrsonnet benchmark source: jrsonnet/docs/benchmarks.adoc, section "Large string template".
  • Local commit: 55ff93581ab223cc7c10e6a7945097ccce2dc35d.

Result:
Long escaped JSON strings now render with a single escape/copy pass in ByteRenderer. JVM JMH improves on the target workload; Scala Native is slightly positive but within hyperfine noise, so this remains a draft for review.

Motivation:
Long JSON strings that contain escape characters used to scan the UTF-8 byte array twice in ByteRenderer: once to find escapes and once to pre-compute the exact escaped length. The large_string_template benchmark spends visible time in this path.

Modification:
Render escaped long strings with one copy/escape pass, growing ByteBuilder incrementally and refreshing its backing array after capacity checks. Add a regression test that compares ByteRenderer output with the char Renderer for long escaped strings including two-byte escapes, six-byte control escapes, and a trailing plain tail.

Result:
The large_string_template JMH target improves from roughly 0.70-0.72 ms/op to roughly 0.58-0.65 ms/op in local runs, while full tests and formatting checks remain green.

References:
bench/resources/cpp_suite/large_string_template.jsonnet
@He-Pin He-Pin marked this pull request as ready for review May 12, 2026 09:38
@He-Pin He-Pin marked this pull request as draft May 12, 2026 09:38
@He-Pin He-Pin closed this May 12, 2026

He-Pin commented May 12, 2026

Closing after review — the change reverses a deliberate design decision from #809 and the benefit is too narrow to justify it.

Why this contradicts #809

PR #809 explicitly stated:

Precompute the exact escaped output length, reserve ByteBuilder once, then write directly to the backing byte array. This removes repeated ensureLength/appendUnsafeC calls from the dirty long-string loop.

This PR puts those per-iteration ensureLength calls (and the matching arr = elemBuilder.arr reloads) back into the hot loop — once before each chunk copy, once before each escape, once before the tail copy, once before the closing quote. The trade is "one SWAR rescan" for "2×N virtual ensureLength calls + array re-fetches", where N is the number of escapes.
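
The pre-scan that #809 relied on can be sketched as a single counting loop; `EscapedLength` and its byte accounting here are a simplified illustration, not sjsonnet's actual code. It is this one extra scan that lets `ByteBuilder` reserve exactly once and keep `ensureLength` out of the write loop:

```scala
// Simplified sketch of #809's pre-scan: count the extra bytes every escape
// will add, so a single allocation can hold the exact output and the write
// loop needs no per-iteration capacity checks or array re-fetches.
object EscapedLength {
  def escapedLength(bytes: Array[Byte]): Int = {
    var total = bytes.length + 2 // payload plus the surrounding quotes
    var i = 0
    while (i < bytes.length) {
      val b = bytes(i) & 0xff
      if (b == '"' || b == '\\' || b == '\n' || b == '\t' ||
          b == '\r' || b == '\b' || b == '\f') total += 1 // "\n" etc.: 2 bytes out
      else if (b < 0x20) total += 5                       // "\u00XX": 6 bytes out
      i += 1
    }
    total
  }
}
```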

Benchmark numbers don't hold up

| Metric | Before | After | Verdict |
| --- | --- | --- | --- |
| JMH `large_string_template` | 0.701–0.722 ms/op | 0.578–0.655 ms/op | ~17% faster on JVM |
| JMH variance (range) | 0.021 ms | 0.077 ms | 3.7× noisier |
| GC alloc/op | 7,199,720 B | 7,199,111 B | 0.008%, effectively unchanged |
| Scala Native hyperfine | 13.1 ± 1.6 ms | 12.9 ± 1.1 ms | within noise |

The only positive signal is a single JVM workload, and even there the variance widens significantly. Native shows no movement and allocation is flat — both consistent with "one fewer SWAR pass over warm cache" rather than a structural improvement.

Untested regression risk: dense-escape workloads

`large_string_template.jsonnet` is a sparse-escape workload (one `\n` per ~70 bytes of plain text). The conservative initial buffer `bLen + 2 + (bLen >>> 5)` (~3% slack) is fine for that shape, but for dense `\u00XX` control-character escaping (6 bytes out per byte in):

  • Initial allocation is short by ~6×, forcing multiple `ByteBuilder` doublings.
  • Per-escape `ensureLength` overhead and `arr` reloads dominate the loop.
  • The pre-#845 path needed only one allocation and one SWAR pre-scan.

No benchmark in this PR exercises that case, so the reported win could turn into a regression on real-world strings with many control characters.
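
The shortfall is easy to work through for the all-control-bytes extreme. The numbers below are plain arithmetic over the reserve formula quoted above, not measurements; the 1 MiB input size is an arbitrary example:

```scala
// Worked example of the dense-escape shortfall: if every input byte is a
// control byte, each expands to a six-byte \u00XX escape, while the
// conservative reserve bLen + 2 + (bLen >>> 5) adds only ~3% slack.
object DenseEscapeMath {
  val bLen     = 1 << 20                     // 1 MiB of control bytes
  val reserved = bLen + 2 + (bLen >>> 5)     // ≈ 1.03 × bLen
  val needed   = 6 * bLen + 2                // every byte becomes \u00XX
  val doublings: Int = {                     // growths a doubling buffer performs
    var cap = reserved; var n = 0
    while (cap < needed) { cap *= 2; n += 1 }
    n
  }
}
```

For this shape the initial reserve covers barely a sixth of the output, so the builder must double several times, and each doubling re-copies everything written so far.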

Better direction if we revisit this

If we want to keep one pass without giving up on the precompute, the right shape is to make `findFirstEscapeChar` return both the position and an extra-bytes counter in a single SWAR pass — preserving #809's "reserve once, no per-iter ensureLength" property without needing a second scan. That's strictly better than either current master or this PR.
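
A scalar sketch of that return shape, with the SWAR machinery omitted; `ScanAndCount`, the `scanEscapes` name, and the packed-`Long` encoding are all illustrative assumptions, not sjsonnet's API:

```scala
// Sketch of the suggested direction: one scan reports both the first
// escape position and the total extra bytes escaping will add, so the
// caller can still reserve once with no second pass. A scalar loop
// stands in for the SWAR implementation.
object ScanAndCount {
  /** Packs (firstEscapeIndex, extraBytes) into one Long:
    * high 32 bits = first escape index (input length if none),
    * low 32 bits  = extra output bytes all escapes will need. */
  def scanEscapes(bytes: Array[Byte]): Long = {
    var first = bytes.length
    var extra = 0
    var i = 0
    while (i < bytes.length) {
      val b = bytes(i) & 0xff
      val add =
        if (b == '"' || b == '\\' || b == '\n' || b == '\t' ||
            b == '\r' || b == '\b' || b == '\f') 1 // two-byte escape
        else if (b < 0x20) 5                       // six-byte \u00XX escape
        else 0
      if (add != 0 && first == bytes.length) first = i
      extra += add
      i += 1
    }
    (first.toLong << 32) | (extra.toLong & 0xffffffffL)
  }
}
```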

Closing for now. Reopen-worthy with that shape, or with a benchmark suite that includes a dense-escape case.
