perf: renderer throughput optimization (indent cache + bulk copy + direct long rendering) #730
Merged
stephenamar-db merged 1 commit into databricks:master on Apr 10, 2026
Conversation
…and direct long rendering

Optimize the materialization/rendering pipeline, which is the primary bottleneck for large-output workloads (e.g. realistic2: 28.6 MB output, 99.8% materialization). Three complementary optimizations:

1. **Indent cache** (`BaseCharRenderer`): Pre-compute newline+spaces arrays for depths 0..31. `renderIndent()` and `Renderer.flushBuffer()` now do a single `System.arraycopy` via `appendAll` instead of per-character space loops. Particularly impactful on Scala Native, where there is no JIT to unroll loops.
2. **Bulk string copy** (`BaseCharRenderer.appendString`): Use `String.getChars` for a single bulk copy instead of a character-by-character loop.
3. **Direct long rendering** (`RenderUtils.appendLong`): Render integer-valued doubles directly into the `CharBuilder` without intermediate `String` allocation. Uses a negative-accumulator algorithm to correctly handle `Long.MinValue`.

Also adds comprehensive tests for `appendLong` edge cases (0, negatives, `Long.MinValue`, `Long.MaxValue`) and `indent=0` rendering.
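For illustration, optimizations 1 and 2 can be sketched in plain Java (the actual code is Scala in `BaseCharRenderer.scala`; the class and buffer handling here are hypothetical stand-ins): a per-depth indent cache emitted with one `System.arraycopy`, and `String.getChars` replacing the char-by-char append loop.

```java
import java.util.Arrays;

// Hypothetical sketch, not the real BaseCharRenderer.
final class IndentSketch {
    static final int MaxCachedDepth = 32;
    final char[][] indentCache; // indentCache[d] = '\n' followed by indent*d spaces

    IndentSketch(int indent) {
        indentCache = new char[MaxCachedDepth][];
        for (int d = 0; d < MaxCachedDepth; d++) {
            char[] a = new char[1 + indent * d];
            a[0] = '\n';
            Arrays.fill(a, 1, a.length, ' '); // spaces after the newline
            indentCache[d] = a;
        }
    }

    // One bulk copy instead of a per-character space loop (depths >= 32
    // would fall back to a plain loop, omitted here).
    int renderIndent(char[] out, int pos, int depth) {
        char[] src = indentCache[depth];
        System.arraycopy(src, 0, out, pos, src.length);
        return pos + src.length;
    }

    // String.getChars performs the copy in bulk internally, replacing a
    // char-by-char append loop.
    static int appendString(char[] out, int pos, String s) {
        s.getChars(0, s.length(), out, pos);
        return pos + s.length();
    }
}
```

The cache is built once per renderer, so the per-indent cost collapses to one array copy regardless of depth.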
Motivation
The materialization/rendering pipeline is the primary bottleneck for large-output workloads. For
realistic2 (28.6 MB output, 568K lines, 125K objects, 380K strings), `--debug-stats` shows 99.8% of wall time is spent in materialization. The previous implementation used per-character loops for indent rendering and intermediate `String` allocation for number formatting, leaving significant throughput on the table.

Key Design Decisions
- The cache lives in `BaseCharRenderer` (not `Renderer`) so all renderer subclasses (`Renderer`, `MaterializeJsonRenderer`, `PythonRenderer`) benefit automatically.
- `appendLong` handles `Long.MinValue` correctly without overflow (negating `Long.MinValue` overflows `Long`), and renders digits directly into the `CharBuilder` instead of going through `Long.toString` → `String` → char-by-char copy.

Modifications
BaseCharRenderer.scala
- Add `MaxCachedDepth = 32` and an `indentCache` field: a pre-computed `Array[Array[Char]]` with newline + `indent*d` spaces for each depth level, constructed once at renderer creation
- Rewrite `renderIndent()` to use the cached arrays via `appendAll` (a single `System.arraycopy`) for depths < 32
- Rewrite `appendString()` to use `String.getChars` bulk copy instead of a char-by-char loop

Renderer.scala
- Change `visitFloat64()` to render integer-valued doubles directly via `RenderUtils.appendLong()`
- Change `flushBuffer()` to use `indentCache` for bulk indent rendering
- Add `RenderUtils.appendLong()`: renders a `Long` directly into the `CharBuilder` using a negative-accumulator + reverse-in-place algorithm

RendererTests.scala
- Add `appendLong` edge-case tests: 0, positive, negative, large, `Long.MaxValue`, `Long.MinValue`
- Add `visitFloat64Integers` tests for end-to-end integer rendering
- Add an `indentZero` test for the `indent=0` edge case

Benchmark Results
JMH (JVM, isolated runs, lower is better)
No regressions across the full 35-benchmark JMH suite.
Hyperfine (Scala Native, `--warmup 3 --min-runs 10`)

realistic2 (28.6 MB output):
reverse (large array output):
Gap closed from 2.22x → 1.59x (a 28.4% improvement).
gen_big_object:
realistic1:
sjsonnet already beats jrsonnet on realistic1 (1.15x faster).
Analysis
The JVM improvement is larger (15.6% on realistic2) because the JIT compiler was still leaving performance on the table with the char-by-char loops. On Scala Native, LLVM already partially optimizes these loops, so the native improvement is smaller for realistic2 but significant for reverse (28.4%), where the output contains many integer-valued doubles that benefit from the zero-allocation `appendLong` path.

The `gen_big_object` benchmark is now tied with jrsonnet (10.4 ms vs 10.5 ms), and `realistic1` beats jrsonnet by 1.15x.

Result
This PR supersedes #676 (renderer-indent-cache), #681 (renderer-bulk-append), and #685 (direct-long-rendering), which implemented subsets of these optimizations individually.
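The negative-accumulator + reverse-in-place algorithm named in the Modifications section can be sketched as follows (a hypothetical Java sketch; the real code is Scala and writes into sjsonnet's `CharBuilder`, for which a `StringBuilder` stands in here). Accumulating in the non-positive range means `Long.MinValue` is never negated, which would overflow.

```java
// Hypothetical sketch of negative-accumulator long rendering.
final class AppendLongSketch {
    static void appendLong(StringBuilder out, long v) {
        if (v < 0) out.append('-');
        long n = (v < 0) ? v : -v;               // n <= 0, always representable
        int start = out.length();
        do {
            out.append((char) ('0' - (n % 10))); // n % 10 is in [-9, 0]
            n /= 10;
        } while (n != 0);
        // Digits were emitted least-significant first; reverse them in place.
        for (int i = start, j = out.length() - 1; i < j; i++, j--) {
            char t = out.charAt(i);
            out.setCharAt(i, out.charAt(j));
            out.setCharAt(j, t);
        }
    }
}
```

Compared with `Long.toString`, this skips both the intermediate `String` allocation and the subsequent copy into the output buffer.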