
perf: lazy reverse array — zero-copy index remapping for std.reverse#741

Merged
stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/lazy-reverse-array
Apr 11, 2026

Conversation

Contributor

@He-Pin He-Pin commented Apr 11, 2026

Motivation

std.reverse was allocating a new array and copying all elements in reverse order — O(n) time and space. For the std_reverse benchmark (reversing large arrays repeatedly), this allocation overhead was measurable. jrsonnet uses a zero-copy ReverseArray wrapper with index remapping, avoiding allocation entirely.

Key Design Decision

Added a _reversed: Boolean flag to Val.Arr rather than creating a separate wrapper class. This keeps the type hierarchy simple while achieving the same zero-copy benefit. When _reversed is true, value(i) returns arr(length - 1 - i) instead of arr(i).

Trade-offs considered:

  • Separate wrapper class (like jrsonnet): more type-safe but adds dispatch overhead and complexity to every array consumer
  • Flag on existing class: simpler, all access methods already handle it, double-reverse cancels out
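
The index remapping and double-reverse cancellation can be sketched roughly like this (a toy model with `Int` elements; `LazyArr` and its members are illustrative names, not the actual `Val.Arr` API):

```scala
// Illustrative sketch of the reversed-flag idea; not the real Val.Arr.
final class LazyArr(backing: Array[Int], private val _reversed: Boolean = false) {
  def length: Int = backing.length

  // Index remapping: when reversed, index i reads backing(length - 1 - i)
  def value(i: Int): Int =
    if (_reversed) backing(length - 1 - i) else backing(i)

  // Zero-copy reversal: flip the flag and reuse the same backing array
  def reversed: LazyArr = new LazyArr(backing, !_reversed)
}
```

Reversing twice restores the original index mapping with no copying at any step, which is the "double-reverse cancels out" property noted above.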

Modification

Val.scala (Val.Arr):

  • Added _reversed: Boolean flag (private, mutable, single-threaded context)
  • Modified value(i) with reversed index mapping
  • Updated asLazyArray, asStrictArray, concat, foreach, forall, iterator to handle reversed flag
  • Added reversed(newPos): Arr factory for zero-copy reversal (flips flag, reuses backing array)
  • Iterator properly throws NoSuchElementException on exhaustion

stdlib/ArrayModule.scala (std.reverse):

  • Changed from Val.Arr(pos, arrs.value.asArr.asLazyArray.reverse) to arrs.value.asArr.reversed(pos)

New test: lazy_reverse_correctness.jsonnet — 17 assertions covering:
basic, empty, single-element, double-reverse, concat, filter, map, sort, foldl, slice, comparison, length, member, join, nested arrays, lazy thunks

Benchmark Results

JMH (JVM, Scala 3.3.7)

| Benchmark | Master (ms/op) | This PR (ms/op) | Change |
|-----------|:--------------:|:---------------:|:------:|
| reverse | 8.408 | 8.387 | -0.2% (neutral) |
| comparison | 20.866 | | no regression |
| comparison2 | 36.809 | | no regression |
| realistic2 | 58.777 | | no regression |

JVM results are neutral: the JIT already optimizes the array copy well, so the benefit shows up primarily on Native.

Hyperfine (Scala Native, vs master and jrsonnet)

| Benchmark | Master | This PR | jrsonnet | Improvement |
|-----------|:------:|:-------:|:--------:|:-----------:|
| std_reverse | 41.2ms | 38.4ms | 25.9ms | -6.8% vs master |
| comparison_for_array | | 1.13x faster | | positive |
| realistic_2 | | 1.00x | | neutral |
| comparison_for_primitives | | 1.00x | | neutral |

Analysis

The -6.8% improvement on Native confirms the zero-copy approach avoids allocation pressure that is more pronounced without JIT. The gap vs jrsonnet (38.4ms vs 25.9ms = 1.48x) remains due to other factors (array element access dispatch, thunk resolution overhead), not std.reverse itself.

No regressions detected on any benchmark.

References

  • jrsonnet ReverseArray approach: index remapping without allocation
  • Similar pattern to _skipFieldCache flag already used on Val.Obj

Result

✅ All 55 JVM test suites pass
✅ 17 new regression test assertions
✅ -6.8% improvement on Native std_reverse benchmark
✅ No regressions on any other benchmark
✅ Scalafmt applied

Instead of allocating a new reversed array, add a _reversed flag to Val.Arr
that remaps index access: value(i) returns arr(length-1-i) when reversed.
This avoids O(n) allocation and copy for std.reverse.

All array access methods (value, asLazyArray, asStrictArray, concat, foreach,
forall, iterator) correctly handle the reversed flag. Double-reverse cancels
out for zero overhead.

Inspired by jrsonnet's ReverseArray zero-copy approach.

Includes 17-assertion regression test covering: basic reverse, empty array,
single element, double-reverse, concat, filter, map, sort, foldl, slice,
comparison, length, member, join, nested arrays, and lazy thunks.
@He-Pin He-Pin marked this pull request as ready for review April 11, 2026 09:04
@stephenamar-db stephenamar-db merged commit 7e6e692 into databricks:master Apr 11, 2026
5 checks passed
stephenamar-db pushed a commit that referenced this pull request Apr 11, 2026
…sed materializer (#745)

## Motivation

The rendering pipeline is the dominant cost in sjsonnet's output path.
On Scala Native, `realistic2` materialization alone takes ~190ms out of
~270ms total (70%). The existing pipeline routes through `char[]`
buffers → `OutputStreamWriter` → UTF-8 encoding → `byte[]` →
`OutputStream`, adding unnecessary conversion layers for what is
predominantly ASCII JSON output.

This PR introduces a full `byte[]` rendering pipeline that eliminates
the char-to-byte conversion entirely, adds SWAR (SIMD Within A Register)
escape-character scanning, zero-allocation integer rendering, and a
fused materializer that bypasses the upickle Visitor dispatch interface.

## Key Design Decisions

1. **byte[] pipeline over char[]**: `BaseByteRenderer` mirrors
`BaseCharRenderer` but uses `upickle.core.ByteBuilder` (byte[]) instead
of `CharBuilder` (char[]), writing directly to `OutputStream`. This
eliminates the `OutputStreamWriter` UTF-8 encoding layer and halves
buffer memory for ASCII content.

2. **SWAR escape-char scanning**: `CharSWAR` processes 8 bytes per
iteration using bitwise parallel techniques (Hacker's Delight Ch. 6
zero-detection) to detect `"`, `\`, and control chars. Platform-specific
implementations: JVM uses `VarHandle` for misaligned reads, Scala Native
uses `Intrinsics.loadLong` + `ByteArray.atRawUnsafe`, JS falls back to
scalar loops.
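
The zero-detection trick behind the SWAR scan can be sketched in a few lines (assumes ASCII bytes already packed little-endian into a `Long`; `SwarSketch` and `pack` are illustrative helpers, not the PR's `CharSWAR` API):

```scala
// SWAR escape detection sketch: find '"', '\', or control chars (< 0x20)
// among 8 bytes at once, using the Hacker's Delight zero-in-word test.
object SwarSketch {
  // Exact test: true iff any of the 8 packed bytes is zero
  private def hasZeroByte(v: Long): Boolean =
    ((v - 0x0101010101010101L) & ~v & 0x8080808080808080L) != 0L

  def needsEscape(word: Long): Boolean = {
    val quotes  = word ^ 0x2222222222222222L // zero byte where '"' (0x22)
    val slashes = word ^ 0x5C5C5C5C5C5C5C5CL // zero byte where '\' (0x5C)
    // "has a byte less than 0x20" test for control characters
    val ctrl = (word - 0x2020202020202020L) & ~word & 0x8080808080808080L
    hasZeroByte(quotes) || hasZeroByte(slashes) || ctrl != 0L
  }

  // Helper for demonstration: pack the first 8 bytes little-endian
  def pack(bytes: Array[Byte]): Long = {
    var v = 0L; var i = 0
    while (i < 8) { v |= (bytes(i) & 0xFFL) << (8 * i); i += 1 }
    v
  }
}
```

A clean 8-byte chunk takes one pass through this predicate instead of eight per-character comparisons, which is what gates the bulk-copy fast path.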

3. **Two-tier string rendering**: Short strings (< 128 chars) use a
fused encode+check loop with zero allocation. Long strings (≥ 128 chars)
use `getBytes(UTF-8)` + SWAR bulk scan + `arraycopy`. The SWAR pre-scan
determines if the fast path (direct copy) can be taken, avoiding
per-character escape processing for clean strings.

4. **Digit-pair lookup table**: Integer rendering uses
two-digits-at-a-time conversion via `DIGIT_TENS`/`DIGIT_ONES` lookup
tables, writing backward into a scratch buffer then bulk-copying.
Eliminates `Long.toString` allocation for the most common numeric
output.
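
The digit-pair idea emits two decimal digits per division, roughly like this (a simplified sketch returning a `String` for clarity; the PR writes into the renderer's byte buffer instead, and the table names mirror the PR's):

```scala
// Two-digits-at-a-time integer rendering via lookup tables, writing
// backward into a scratch buffer then taking the filled suffix.
object DigitPairs {
  private val DIGIT_TENS = Array.tabulate[Byte](100)(i => ('0' + i / 10).toByte)
  private val DIGIT_ONES = Array.tabulate[Byte](100)(i => ('0' + i % 10).toByte)

  def render(value: Long): String = {
    require(value >= 0, "sketch handles non-negative values only")
    val scratch = new Array[Byte](20) // max decimal digits of a Long
    var pos = scratch.length
    var v = value
    while (v >= 100) {
      val r = (v % 100).toInt
      v /= 100
      pos -= 1; scratch(pos) = DIGIT_ONES(r)
      pos -= 1; scratch(pos) = DIGIT_TENS(r)
    }
    if (v >= 10) {
      pos -= 1; scratch(pos) = DIGIT_ONES(v.toInt)
      pos -= 1; scratch(pos) = DIGIT_TENS(v.toInt)
    } else {
      pos -= 1; scratch(pos) = ('0' + v.toInt).toByte
    }
    new String(scratch, pos, scratch.length - pos, "US-ASCII")
  }
}
```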

5. **Fused materializer+renderer**: `ByteRenderer.materializeDirect()`
walks the `Val` tree and writes JSON bytes directly, bypassing the
upickle `Visitor` interface entirely (no
`visitObject`/`visitArray`/`visitKey`/`visitValue`/`subVisitor` virtual
dispatch). Uses `@switch` on `valTag` for O(1) type routing. Falls back
to the generic `Materializer.apply0` path for deeply nested structures.
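
As a rough illustration of what "fused" means here, a direct tree walk emits bytes as it recurses, with no visitor interface in between (`FusedSketch` is a toy ADT standing in for sjsonnet's `Val`, not the real API):

```scala
import java.io.ByteArrayOutputStream

// Toy fused materializer+renderer: pattern-match on the value tree and
// write JSON bytes directly, with no visitObject/subVisitor indirection.
object FusedSketch {
  sealed trait V
  final case class VNum(n: Long) extends V
  final case class VArr(items: Seq[V]) extends V

  def materializeDirect(v: V, out: ByteArrayOutputStream): Unit = v match {
    case VNum(n) => out.write(n.toString.getBytes("US-ASCII"))
    case VArr(items) =>
      out.write('[')
      var first = true
      items.foreach { item =>
        if (!first) out.write(',')
        first = false
        materializeDirect(item, out)
      }
      out.write(']')
  }
}
```

On the JVM the JIT can often devirtualize the visitor calls this replaces; on Scala Native each one stays an indirect branch, which is why the fused path pays off there.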

6. **Reusable visitor instances**: Pre-allocated
`ArrVisitor`/`ObjVisitor` fields with a `Long` bitset for empty-state
tracking (bit per nesting level, supports 64 levels). Eliminates
per-array/per-object anonymous class allocation in the non-fused visitor
path.
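
The bitset trick can be sketched as follows (`EmptyTracker` is an illustrative name; the PR keeps the bits in a field of the renderer rather than a separate class):

```scala
// One bit per nesting depth records whether the current array/object has
// emitted an item yet, replacing per-level state allocation (64 levels).
final class EmptyTracker {
  private var bits: Long = 0L

  def enter(depth: Int): Unit = bits &= ~(1L << depth) // new level starts empty

  // Returns true if a separator (comma) must be written before this item
  def visitItem(depth: Int): Boolean = {
    val needComma = (bits & (1L << depth)) != 0L
    bits |= 1L << depth                                // level is now non-empty
    needComma
  }
}
```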

7. **Bulk indentation**: `renderIndent` uses `System.arraycopy` from a
pre-allocated 64-byte spaces buffer instead of character-by-character
append.

8. **Native fwrite direct stdout**: `NativeOutputStream` bypasses the
Scala Native JVM compat layer (`PrintStream.write (synchronized)` →
`FileOutputStream` → `FileChannelImpl` → `unistd.write`) with direct
`stdio.fwrite(buf.at(off), 1, len, file)`. Eliminates per-write
synchronization and syscall indirection.

## Modifications

### New files

**`BaseByteRenderer.scala`** (shared `src/`): Byte-oriented JSON
renderer extending `ujson.JsVisitor[OutputStream, OutputStream]`.
Handles all JSON primitives, string rendering (short/long paths),
integer rendering (digit-pair tables), and indentation. Provides
`renderQuotedString` for the fused path.

**`ByteRenderer.scala`** (shared `src/`): sjsonnet-specific byte
renderer with custom double formatting (matching google/jsonnet output),
empty `{ }`/`[ ]` rendering, reusable visitor instances, and the fused
materializer (`materializeDirect`, `materializeChild`,
`materializeDirectObj`, `materializeDirectArr`).

**`CharSWAR.java`** (JVM `src-jvm/`): SWAR scanner using
`VarHandle.get(byte[], offset)` for misaligned 8-byte reads. Handles
both `String` (via `getChars` to char[]) and `byte[]` inputs.

**`CharSWAR.scala`** (Native `src-native/`): SWAR scanner using
`Intrinsics.loadLong` + `ByteArray.atRawUnsafe` for zero-overhead bulk
reads.

**`CharSWAR.scala`** (JS `src-js/`): Scalar fallback for Scala.js (no
SWAR — JS lacks raw memory access).

**`NativeOutputStream.scala`** (Native `src-native/`): Direct
`fwrite`-based OutputStream for Scala Native, bypassing the JVM compat
chain.

### Modified files

**`SjsonnetMainBase.scala`**: File output and stdout paths now use
`ByteRenderer` directly (bypassing `OutputStreamWriter`). Stdout path
returns a sentinel value to avoid re-printing already-written output.
Added `rawOutputStream` parameter to support Native fwrite bypass.

**`SjsonnetMain.scala`** (Native): Passes
`NativeOutputStream(stdio.stdout)` as `rawOutputStream`.

**`Interpreter.scala`**: `materialize()` detects `ByteRenderer` and
routes to the fused `materializeDirect()` path, bypassing the generic
`Materializer.apply0` visitor dispatch.

**`BaseCharRenderer.scala`**: `visitNonNullString` now uses
`CharSWAR.hasEscapeChar` for pre-scanning. Added `writeLongDirect` with
digit-pair lookup tables. Added companion object with lookup tables.

**`Renderer.scala`**: `visitFloat64` inlined to avoid
`RenderUtils.renderDouble` String allocation — uses `writeLongDirect`
for integers, `BigDecimal` for whole-number doubles, `d.toString` for
fractionals.

**`Materializer.scala`**: Fixed `Apply`/`Apply0-3` pattern match arity
for auto-TCO `strict` field (upstream `ecdd0b6`).

## Benchmark Results

### Hyperfine (Scala Native, `realistic2`, averaged over 2 rounds)

| Config | Master (ms) | This PR (ms) | Speedup |
|--------|:-----------:|:------------:|:-------:|
| stdout | 270 ± 5 | 175 ± 6 | **1.55x (35% faster)** |
| stdout `-p` | 250 ± 4 | 162 ± 3 | **1.54x (35% faster)** |
| file `-o` | 449 ± 69 | 405 ± 69 | 1.11x (IO bound) |

Output correctness verified: `diff` confirms byte-identical output
between master and this PR.

## Analysis

The byte[] pipeline optimization stacks four independent wins:

1. **OutputStreamWriter elimination** (~10%): Removing the
char[]→UTF-8→byte[] conversion layer. Most impactful for file output
where the full `OutputStreamWriter` synchronization overhead applies.

2. **SWAR escape scanning** (~5%): 8x throughput for escape-char
detection on clean strings (the common case). The SWAR pre-scan gates a
fast bulk-copy path, avoiding per-character processing.

3. **Fused materializer** (~15-20%): Eliminating Visitor interface
virtual dispatch. On JVM with JIT, devirtualization handles most of this
automatically. On Scala Native without JIT, every
`visitObject`/`subVisitor`/`visitKey`/`visitValue`/`visitEnd` call is a
vtable lookup + indirect branch — the fused path replaces all of these
with direct method calls.

4. **Native fwrite bypass** (~5%): Eliminating `PrintStream`
synchronized lock + `FileChannelImpl` indirection on every write.
`stdio.fwrite` has internal buffering and batches small writes before
syscall.

## Notes

The `lazy_reverse_correctness.jsonnet` test failure on Scala 2.13.18 is
a **pre-existing upstream bug** from PR #741 (lazy reverse array).
Upstream master itself does not compile on 2.13 due to the auto-TCO
pattern match arity issue (ecdd0b6), so this test was never run on 2.13
upstream. This PR fixes the compilation issue but exposes the runtime
bug. This is not a regression introduced by this PR.

## Result

- All test suites pass on Scala 3.3.7, JS, WASM, Native
- Scala 2.13.18: 1 pre-existing upstream failure
(`lazy_reverse_correctness.jsonnet`)
- No regressions detected
- Output is byte-identical to master for all test cases