perf: pre-cached indent arrays for bulk newline+spaces #676
Closed
He-Pin wants to merge 1 commit into databricks:master
Conversation
stephenamar-db requested changes on Apr 8, 2026

He-Pin force-pushed from f2e7618 to 9db668d
He-Pin (Contributor, Author):

Good catch — extracted the magic number to a companion object constant. Note that the indent cache content is instance-specific (it depends on the configured indent width).
He-Pin force-pushed from 7ec85ce to abfe59a, then from f336323 to 75e9d8e
Extract MaxCachedDepth=16 to Renderer companion object constant per review. Pre-compute indentCache arrays for depths 0..15 to replace per-character space emission with a single bulk appendAll in flushBuffer.
He-Pin force-pushed from 75e9d8e to 47fe1e6
He-Pin (Contributor, Author):

Superseded by #730, which combines this optimization with the other renderer throughput improvements (indent cache + bulk copy + direct long rendering) into a single coherent PR with comprehensive benchmarks.
stephenamar-db pushed a commit that referenced this pull request on Apr 10, 2026
…rect long rendering) (#730)

## Motivation

The materialization/rendering pipeline is the primary bottleneck for large-output workloads. For `realistic2` (28.6 MB output, 568K lines, 125K objects, 380K strings), `--debug-stats` shows 99.8% of wall time is spent in materialization. The previous implementation used per-character loops for indent rendering and intermediate `String` allocation for number formatting, leaving significant throughput on the table.

## Key Design Decisions

1. **Indent cache scope**: Lives in `BaseCharRenderer` (not `Renderer`) so all renderer subclasses (`Renderer`, `MaterializeJsonRenderer`, `PythonRenderer`) benefit automatically.
2. **MaxCachedDepth = 32**: Covers virtually all real-world Jsonnet (realistic2 max depth ~5). Beyond this, falls back to the original per-character loop.
3. **Negative accumulator** in `appendLong`: Handles `Long.MinValue` correctly without overflow (negating `Long.MinValue` overflows `Long`).
4. **Zero-allocation number rendering**: For integer-valued doubles (the common case in Jsonnet), digits are written directly into `CharBuilder` instead of going through `Long.toString` → `String` → char-by-char copy.
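The negative-accumulator trick in decision 3 can be sketched as below. This is an illustrative reimplementation against `scala.StringBuilder`, not the PR's actual `RenderUtils.appendLong` (which writes into upickle's `CharBuilder`); the point is that accumulating in negative space sidesteps the `-Long.MinValue` overflow:

```scala
// Hypothetical sketch of direct long rendering with a negative accumulator
// and reverse-in-place digit ordering (names are illustrative).
object AppendLongSketch {
  def appendLong(sb: StringBuilder, value: Long): Unit = {
    if (value < 0) sb.append('-')
    // Work in negative space: every positive Long negates safely, but
    // -Long.MinValue overflows, so we flip positives instead of negatives.
    var n = if (value > 0) -value else value
    val start = sb.length
    // Emit digits least-significant first; (n % 10) is in -9..0 here.
    while ({
      sb.append(('0' - (n % 10)).toChar)
      n /= 10
      n != 0
    }) ()
    // Reverse the digit run in place to restore normal digit order.
    var i = start
    var j = sb.length - 1
    while (i < j) {
      val t = sb.charAt(i); sb.setCharAt(i, sb.charAt(j)); sb.setCharAt(j, t)
      i += 1; j -= 1
    }
  }
}
```

The same edge cases the PR tests (0, `Long.MaxValue`, `Long.MinValue`) all fall out of this shape without any special-casing.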
## Modifications

### `BaseCharRenderer.scala`

- Added companion object with `MaxCachedDepth = 32`
- Added `indentCache` field: pre-computed `Array[Array[Char]]` with `newline + indent*d spaces` for each depth level, constructed once at renderer creation
- Updated `renderIndent()` to use cached arrays via `appendAll` (single `System.arraycopy`) for depths < 32
- Updated `appendString()` to use `String.getChars` bulk copy instead of char-by-char loop

### `Renderer.scala`

- Updated `visitFloat64()` to render integers directly via `RenderUtils.appendLong()`
- Updated `flushBuffer()` to use `indentCache` for bulk indent rendering
- Added `RenderUtils.appendLong()`: renders `Long` directly into `CharBuilder` using negative accumulator + reverse-in-place algorithm

### `RendererTests.scala`

- Added `appendLong` edge case tests: 0, positive, negative, large, `Long.MaxValue`, `Long.MinValue`
- Added `visitFloat64Integers` tests for end-to-end integer rendering
- Added `indentZero` test for `indent=0` edge case

## Benchmark Results

### JMH (JVM, isolated runs, lower is better)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|-----------|----------------|---------------|--------|
| **realistic2** | 68.749 | 58.001 | **-15.6%** ✅ |
| **reverse** | 10.494 | 8.436 | **-19.6%** ✅ |
| gen_big_object | 1.066 | 1.000 | -6.2% ✅ |
| bench.02 | 39.832 | 39.322 | -1.3% ≈ |
| comparison | 20.216 | 21.060 | +4.2% (noise — eval-only, output is `true`) |
| realistic1 | 2.015 | 2.133 | within noise |

No regressions across the full 35-benchmark JMH suite.
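The bulk-copy idea behind the `appendString()` change above can be illustrated in isolation. The `CharBuf` class below is a hypothetical stand-in for upickle's `CharBuilder`, showing how `String.getChars` and `System.arraycopy` replace per-character loops with single bulk writes:

```scala
// Illustrative growable char buffer (not the PR's code): both append paths
// end in one bulk copy instead of a char-by-char loop.
final class CharBuf(initial: Int = 32) {
  private var arr = new Array[Char](initial)
  private var len = 0

  private def ensure(extra: Int): Unit =
    if (len + extra > arr.length)
      arr = java.util.Arrays.copyOf(arr, math.max(arr.length * 2, len + extra))

  // Bulk string path: String.getChars copies the whole string at once.
  def appendString(s: String): Unit = {
    ensure(s.length)
    s.getChars(0, s.length, arr, len)
    len += s.length
  }

  // Bulk char-array path: a single System.arraycopy, as used for the
  // pre-computed indent arrays.
  def appendAll(chars: Array[Char], n: Int): Unit = {
    ensure(n)
    System.arraycopy(chars, 0, arr, len, n)
    len += n
  }

  def result(): String = new String(arr, 0, len)
}
```

Both JVM intrinsify `System.arraycopy` and LLVM recognizes the equivalent memcpy pattern, which is consistent with the JVM seeing the larger win in the benchmarks below.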
### Hyperfine (Scala Native, `--warmup 3 --min-runs 10`)

**realistic2** (28.6 MB output):

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 264.9 ± 4.2 | 2.48x slower |
| sjsonnet-native (this PR) | 262.2 ± 2.9 | 2.45x slower |
| jrsonnet 0.5.0-pre98 | 106.8 ± 16.3 | baseline |

**reverse** (large array output):

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 53.1 ± 2.8 | 2.22x slower |
| sjsonnet-native (this PR) | 38.0 ± 2.3 | **1.59x slower** |
| jrsonnet 0.5.0-pre98 | 24.0 ± 1.7 | baseline |

Gap closed from 2.22x → 1.59x (**-28.4%** improvement).

**gen_big_object**:

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.1 ± 1.5 | 1.16x slower |
| sjsonnet-native (this PR) | 10.4 ± 1.1 | **1.01x — tied!** |
| jrsonnet 0.5.0-pre98 | 10.5 ± 1.3 | baseline |

**realistic1**:

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.9 ± 1.4 | — |
| sjsonnet-native (this PR) | 12.0 ± 1.4 | **1.15x faster** |
| jrsonnet 0.5.0-pre98 | 13.9 ± 2.1 | baseline |

sjsonnet already **beats** jrsonnet on realistic1 (1.15x faster).

## Analysis

The JVM improvement is larger (15.6% on realistic2) because the JIT compiler was still leaving performance on the table with the char-by-char loops. On Scala Native, LLVM already partially optimizes these loops, so the native improvement is smaller for realistic2 but significant for reverse (28.4%), where the output contains many integer-valued doubles that benefit from the zero-allocation `appendLong` path. The `gen_big_object` benchmark is now **tied with jrsonnet** (10.4 ms vs 10.5 ms), and `realistic1` beats jrsonnet by 1.15x.
## Result

- ✅ All 141 test suites pass (JVM 3.3.7)
- ✅ Compiles on all platforms (JVM, JS, Native)
- ✅ No regressions across the full benchmark suite
- ✅ Comprehensive new test coverage for edge cases

This PR supersedes #676 (renderer-indent-cache), #681 (renderer-bulk-append), and #685 (direct-long-rendering), which implemented subsets of these optimizations individually.
## Motivation

The JSON renderer generates indentation strings (newline + spaces) on every nested element. For deeply nested Jsonnet output, `Renderer.visitKey` and `visitEnd` repeatedly construct identical indent strings. The current implementation calls `elemBuilder.append('\n')` followed by a `while` loop appending spaces — this is O(depth) per indent operation.

## Key Design Decision

Pre-cache indent strings (newline + spaces) in a companion object array up to depth 64 (`MaxCachedDepth`). For depths ≤ 64, indent operations become a single array lookup + bulk write. For depths > 64 (rare in practice), fall through to the original loop.

## Modification

`sjsonnet/src/sjsonnet/Renderer.scala`:

- Added `MaxCachedDepth = 64` constant and `indentCache: Array[Array[Char]]`, pre-computing `"\n" + " " * (depth * indent)` as char arrays for depths 0–64
- Added a `flushBuffer()` fast path: when depth ≤ `MaxCachedDepth`, uses `elemBuilder.appendAll(cachedArray, len)` instead of a character-by-character loop
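The cache construction and fast path described above can be sketched roughly as follows. This is a simplified stand-alone illustration, not the PR's `Renderer` code; `IndentRenderer` and its `StringBuilder` output are hypothetical stand-ins for the real class and its `elemBuilder`:

```scala
// Hypothetical sketch: one "\n" + (depth * indent) spaces char array per
// depth, built once, with the original per-character loop as fallback.
object IndentCache {
  final val MaxCachedDepth = 64
  def build(indent: Int): Array[Array[Char]] =
    Array.tabulate(MaxCachedDepth + 1) { depth =>
      ("\n" + " " * (depth * indent)).toCharArray
    }
}

final class IndentRenderer(indent: Int) {
  private val cache = IndentCache.build(indent)
  private val sb = new StringBuilder

  def renderIndent(depth: Int): Unit =
    if (depth <= IndentCache.MaxCachedDepth) {
      sb.appendAll(cache(depth)) // single bulk write from the cache
    } else {
      // Fallback for very deep nesting: original O(depth) loop.
      sb.append('\n')
      var i = 0
      while (i < depth * indent) { sb.append(' '); i += 1 }
    }

  def result(): String = sb.toString
}
```

Note the cache must be per-instance (or keyed by `indent`), since its contents depend on the configured indent width, which is the review point raised earlier in this thread.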
### JMH — Full Suite (35 benchmarks, 1+1 warmup)

No regressions detected. All benchmarks within noise margin.

> **Note**: The indentation cache optimization primarily benefits:
>
> - `std.manifestJsonEx` — uses indentation for pretty-printing
> - Deeply nested output, where each indent becomes a single `System.arraycopy` bulk write

## Analysis
## References

- `upickle.core.CharBuilder.appendAll(char[], int)` for bulk writes
- `Renderer.flushBuffer`

## Result
Pre-cached indent arrays eliminate per-character overhead for nested JSON rendering. No regressions. Benefits deeply nested output and Scala Native.