
perf: pre-cached indent arrays for bulk newline+spaces #676

Closed
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/renderer-indent-cache

Conversation

@He-Pin
Contributor

@He-Pin He-Pin commented Apr 4, 2026

Motivation

The JSON renderer generates indentation strings (newline + spaces) for every nested element. For deeply nested Jsonnet output, Renderer.visitKey and visitEnd repeatedly construct identical indent strings. The current implementation calls elemBuilder.append('\n') followed by a while loop appending spaces one at a time — O(depth) appends per indent operation.

Key Design Decision

Pre-cache indent strings (newline + spaces) in a companion object array up to depth 64 (MaxCachedDepth). For depths ≤64, indent operations become a single array lookup + bulk write. For depths >64 (rare in practice), fall through to the original loop.

Modification

sjsonnet/src/sjsonnet/Renderer.scala:

  • Added companion object with MaxCachedDepth = 64 constant and indentCache: Array[Array[Char]]
  • Cache stores pre-computed "\n" + " " * (depth * indent) as char arrays for depths 0–64
  • flushBuffer() fast path: when depth ≤ MaxCachedDepth, uses elemBuilder.appendAll(cachedArray, len) instead of character-by-character loop
  • Original loop preserved as fallback for depths > 64
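The cache-plus-fallback shape described above can be sketched as follows. This is a simplified, hypothetical illustration — `IndentWriter` and its `StringBuilder` stand in for the real `Renderer`/`CharBuilder`; only the names `MaxCachedDepth` and `indentCache` come from the PR.

```scala
// Hypothetical sketch of the cached-indent approach; names MaxCachedDepth
// and indentCache mirror the PR, the surrounding class is simplified.
object IndentCache {
  final val MaxCachedDepth = 64
  // indentCache(d) holds "\n" followed by d * indentWidth spaces.
  def build(indentWidth: Int): Array[Array[Char]] =
    Array.tabulate(MaxCachedDepth + 1) { d =>
      ("\n" + " " * (d * indentWidth)).toCharArray
    }
}

final class IndentWriter(indentWidth: Int) {
  private val cache = IndentCache.build(indentWidth)
  private val sb = new StringBuilder

  def writeIndent(depth: Int): Unit =
    if (depth <= IndentCache.MaxCachedDepth) {
      // Fast path: one bulk append of the pre-computed array.
      sb.appendAll(cache(depth))
    } else {
      // Fallback: the original per-character loop for very deep nesting.
      sb.append('\n')
      var i = 0
      while (i < depth * indentWidth) { sb.append(' '); i += 1 }
    }

  def result(): String = sb.toString
}
```

The arrays are built once per indent width, so each indent operation on the fast path is a bounds check plus a bulk copy.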

Benchmark Results

JMH — Full Suite (35 benchmarks, 1+1 warmup)

No regressions detected. All benchmarks within noise margin.

Note

The indentation cache optimization primarily benefits:

  1. Deeply nested JSON output — common in Jsonnet configurations (Kubernetes manifests, CI configs)
  2. std.manifestJsonEx — uses indentation for pretty-printing
  3. Scala Native — no JIT to optimize the loop; pre-cached arrays enable System.arraycopy

Analysis

  • Memory: One-time allocation of 64 char arrays (total ~2KB) — negligible.
  • Thread safety: Cache is in a companion object, initialized once. Arrays are read-only after initialization.
  • Threshold: 64 levels covers virtually all real-world Jsonnet output (even deeply nested Kubernetes manifests rarely exceed 20 levels).

References

  • upickle.core.CharBuilder.appendAll(char[], int) for bulk writes
  • Original character-by-character indent loop in Renderer.flushBuffer

Result

Pre-cached indent arrays eliminate per-character overhead for nested JSON rendering. No regressions. Benefits deeply nested output and Scala Native.

@He-Pin He-Pin marked this pull request as ready for review April 5, 2026 00:28
@He-Pin He-Pin force-pushed the perf/renderer-indent-cache branch 5 times, most recently from f2e7618 to 9db668d Compare April 9, 2026 04:46
@He-Pin
Contributor Author

He-Pin commented Apr 9, 2026

Good catch — extracted the magic 16 into a named constant Renderer.MaxCachedDepth in a new companion object. The comparison now reads depth < MaxCachedDepth instead of depth < indentCache.length.

Note that the indent cache content is instance-specific (depends on the indent constructor parameter — commonly 2, 3, or 4), but the size (16 depth levels) is a fixed constant shared across all instances.

@He-Pin He-Pin force-pushed the perf/renderer-indent-cache branch 2 times, most recently from 7ec85ce to abfe59a Compare April 9, 2026 15:18
@He-Pin He-Pin force-pushed the perf/renderer-indent-cache branch 2 times, most recently from f336323 to 75e9d8e Compare April 10, 2026 03:44
Extract MaxCachedDepth=16 to Renderer companion object constant per review.
Pre-compute indentCache arrays for depths 0..15 to replace per-character
space emission with a single bulk appendAll in flushBuffer.
@He-Pin
Contributor Author

He-Pin commented Apr 10, 2026

Superseded by #730 which combines this optimization with the other renderer throughput improvements (indent cache + bulk copy + direct long rendering) into a single coherent PR with comprehensive benchmarks.

@He-Pin He-Pin closed this Apr 10, 2026
stephenamar-db pushed a commit that referenced this pull request Apr 10, 2026
…rect long rendering) (#730)

## Motivation

The materialization/rendering pipeline is the primary bottleneck for
large-output workloads. For `realistic2` (28.6 MB output, 568K lines,
125K objects, 380K strings), `--debug-stats` shows 99.8% of wall time is
spent in materialization. The previous implementation used per-character
loops for indent rendering and intermediate `String` allocation for
number formatting, leaving significant throughput on the table.

## Key Design Decisions

1. **Indent cache scope**: Lives in `BaseCharRenderer` (not `Renderer`)
so all renderer subclasses (`Renderer`, `MaterializeJsonRenderer`,
`PythonRenderer`) benefit automatically.
2. **MaxCachedDepth = 32**: Covers virtually all real-world Jsonnet
(realistic2 max depth ~5). Beyond this, falls back to the original
per-character loop.
3. **Negative accumulator** in `appendLong`: Handles `Long.MinValue`
correctly without overflow (negating `Long.MinValue` overflows `Long`).
4. **Zero-allocation number rendering**: For integer-valued doubles (the
common case in Jsonnet), digits are written directly into `CharBuilder`
instead of going through `Long.toString` → `String` → char-by-char copy.
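The negative-accumulator trick from points 3 and 4 can be illustrated with a small sketch. This writes into a plain `StringBuilder` rather than upickle's `CharBuilder`, and the helper name `appendLong` is taken from the PR while the body here is my own illustration: the accumulator stays non-positive throughout, so `Long.MinValue` never has to be negated (negating it would overflow).

```scala
// Sketch: render a Long digit-by-digit with a non-positive accumulator,
// then reverse the emitted digits in place.
def appendLong(sb: StringBuilder, value: Long): Unit = {
  if (value < 0) sb.append('-')
  // Keep n <= 0 so Long.MinValue is handled without negation/overflow.
  var n = if (value > 0) -value else value
  val start = sb.length
  while (n <= -10) {
    sb.append(('0' - (n % 10)).toChar) // n % 10 is in -9..0
    n /= 10
  }
  sb.append(('0' - n).toChar)
  // Digits were emitted least-significant first; reverse them in place.
  var i = start
  var j = sb.length - 1
  while (i < j) {
    val t = sb.charAt(i); sb.setCharAt(i, sb.charAt(j)); sb.setCharAt(j, t)
    i += 1; j -= 1
  }
}
```

The reversal only covers the digit span (`start` onward), so a leading `-` sign is left untouched.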

## Modifications

### `BaseCharRenderer.scala`
- Added companion object with `MaxCachedDepth = 32`
- Added `indentCache` field: pre-computed `Array[Array[Char]]` with
`newline + indent*d spaces` for each depth level, constructed once at
renderer creation
- Updated `renderIndent()` to use cached arrays via `appendAll` (single
`System.arraycopy`) for depths < 32
- Updated `appendString()` to use `String.getChars` bulk copy instead of
char-by-char loop
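The `String.getChars` bulk-copy change can be sketched like this. `CharBuf` is a hypothetical stand-in for upickle's `CharBuilder`; the point is that one `getChars` call (backed by `System.arraycopy`) replaces a per-character loop.

```scala
// Minimal growable char buffer illustrating bulk string append.
final class CharBuf(initial: Int = 32) {
  private var buf = new Array[Char](initial)
  private var len = 0

  private def ensure(extra: Int): Unit =
    if (len + extra > buf.length)
      buf = java.util.Arrays.copyOf(buf, math.max(buf.length * 2, len + extra))

  def appendString(s: String): Unit = {
    ensure(s.length)
    // One bulk copy instead of appending each character in a loop.
    s.getChars(0, s.length, buf, len)
    len += s.length
  }

  def result(): String = new String(buf, 0, len)
}
```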

### `Renderer.scala`
- Updated `visitFloat64()` to render integers directly via
`RenderUtils.appendLong()`
- Updated `flushBuffer()` to use `indentCache` for bulk indent rendering
- Added `RenderUtils.appendLong()`: renders `Long` directly into
`CharBuilder` using negative accumulator + reverse-in-place algorithm

### `RendererTests.scala`
- Added `appendLong` edge case tests: 0, positive, negative, large,
`Long.MaxValue`, `Long.MinValue`
- Added `visitFloat64Integers` tests for end-to-end integer rendering
- Added `indentZero` test for `indent=0` edge case

## Benchmark Results

### JMH (JVM, isolated runs, lower is better)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|-----------|----------------|---------------|--------|
| **realistic2** | 68.749 | 58.001 | **-15.6%** ✅ |
| **reverse** | 10.494 | 8.436 | **-19.6%** ✅ |
| gen_big_object | 1.066 | 1.000 | -6.2% ✅ |
| bench.02 | 39.832 | 39.322 | -1.3% ≈ |
| comparison | 20.216 | 21.060 | +4.2% (noise — eval-only, output is `true`) |
| realistic1 | 2.015 | 2.133 | within noise |

No regressions across the full 35-benchmark JMH suite.

### Hyperfine (Scala Native, `--warmup 3 --min-runs 10`)

**realistic2** (28.6 MB output):
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 264.9 ± 4.2 | 2.48x slower |
| sjsonnet-native (this PR) | 262.2 ± 2.9 | 2.45x slower |
| jrsonnet 0.5.0-pre98 | 106.8 ± 16.3 | baseline |

**reverse** (large array output):
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 53.1 ± 2.8 | 2.22x slower |
| sjsonnet-native (this PR) | 38.0 ± 2.3 | **1.59x slower** |
| jrsonnet 0.5.0-pre98 | 24.0 ± 1.7 | baseline |

Gap closed from 2.22x → 1.59x (**-28.4%** improvement).

**gen_big_object**:
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.1 ± 1.5 | 1.16x slower |
| sjsonnet-native (this PR) | 10.4 ± 1.1 | **1.01x — tied!** |
| jrsonnet 0.5.0-pre98 | 10.5 ± 1.3 | baseline |

**realistic1**:
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.9 ± 1.4 | — |
| sjsonnet-native (this PR) | 12.0 ± 1.4 | **1.15x faster** |
| jrsonnet 0.5.0-pre98 | 13.9 ± 2.1 | baseline |

sjsonnet already **beats** jrsonnet on realistic1 (1.15x faster).

## Analysis

The JVM improvement is larger (15.6% on realistic2) because the JIT
compiler was still leaving performance on the table with the
char-by-char loops. On Scala Native, LLVM already partially optimizes
these loops, so the native improvement is smaller for realistic2 but
significant for reverse (28.4%), where the output contains many
integer-valued doubles that benefit from the zero-allocation
`appendLong` path.

The `gen_big_object` benchmark is now **tied with jrsonnet** (10.4ms vs
10.5ms), and `realistic1` beats jrsonnet by 1.15x.

## Result

- ✅ All 141 test suites pass (JVM 3.3.7)
- ✅ Compiles on all platforms (JVM, JS, Native)
- ✅ No regressions across the full benchmark suite
- ✅ Comprehensive new test coverage for edge cases

This PR supersedes #676 (renderer-indent-cache), #681
(renderer-bulk-append), and #685 (direct-long-rendering) which
implemented subsets of these optimizations individually.