perf: lazy reverse array — zero-copy index remapping for std.reverse#741
Merged
stephenamar-db merged 1 commit intodatabricks:masterfrom Apr 11, 2026
Merged
Conversation
Instead of allocating a new reversed array, add a _reversed flag to Val.Arr that remaps index access: value(i) returns arr(length-1-i) when reversed. This avoids O(n) allocation and copy for std.reverse. All array access methods (value, asLazyArray, asStrictArray, concat, foreach, forall, iterator) correctly handle the reversed flag. Double-reverse cancels out for zero overhead. Inspired by jrsonnet's ReverseArray zero-copy approach. Includes 17-assertion regression test covering: basic reverse, empty array, single element, double-reverse, concat, filter, map, sort, foldl, slice, comparison, length, member, join, nested arrays, and lazy thunks.
stephenamar-db
pushed a commit
that referenced
this pull request
Apr 11, 2026
…sed materializer (#745) ## Motivation The rendering pipeline is the dominant cost in sjsonnet's output path. On Scala Native, `realistic2` materialization alone takes ~190ms out of ~270ms total (70%). The existing pipeline routes through `char[]` buffers → `OutputStreamWriter` → UTF-8 encoding → `byte[]` → `OutputStream`, adding unnecessary conversion layers for what is predominantly ASCII JSON output. This PR introduces a full `byte[]` rendering pipeline that eliminates the char-to-byte conversion entirely, adds SWAR (SIMD Within A Register) escape-character scanning, zero-allocation integer rendering, and a fused materializer that bypasses the upickle Visitor dispatch interface. ## Key Design Decisions 1. **byte[] pipeline over char[]**: `BaseByteRenderer` mirrors `BaseCharRenderer` but uses `upickle.core.ByteBuilder` (byte[]) instead of `CharBuilder` (char[]), writing directly to `OutputStream`. This eliminates the `OutputStreamWriter` UTF-8 encoding layer and halves buffer memory for ASCII content. 2. **SWAR escape-char scanning**: `CharSWAR` processes 8 bytes per iteration using bitwise parallel techniques (Hacker's Delight Ch. 6 zero-detection) to detect `"`, `\`, and control chars. Platform-specific implementations: JVM uses `VarHandle` for misaligned reads, Scala Native uses `Intrinsics.loadLong` + `ByteArray.atRawUnsafe`, JS falls back to scalar loops. 3. **Two-tier string rendering**: Short strings (< 128 chars) use a fused encode+check loop with zero allocation. Long strings (≥ 128 chars) use `getBytes(UTF-8)` + SWAR bulk scan + `arraycopy`. The SWAR pre-scan determines if the fast path (direct copy) can be taken, avoiding per-character escape processing for clean strings. 4. **Digit-pair lookup table**: Integer rendering uses two-digits-at-a-time conversion via `DIGIT_TENS`/`DIGIT_ONES` lookup tables, writing backward into a scratch buffer then bulk-copying. Eliminates `Long.toString` allocation for the most common numeric output. 5. **Fused materializer+renderer**: `ByteRenderer.materializeDirect()` walks the `Val` tree and writes JSON bytes directly, bypassing the upickle `Visitor` interface entirely (no `visitObject`/`visitArray`/`visitKey`/`visitValue`/`subVisitor` virtual dispatch). Uses `@switch` on `valTag` for O(1) type routing. Falls back to the generic `Materializer.apply0` path for deeply nested structures. 6. **Reusable visitor instances**: Pre-allocated `ArrVisitor`/`ObjVisitor` fields with a `Long` bitset for empty-state tracking (bit per nesting level, supports 64 levels). Eliminates per-array/per-object anonymous class allocation in the non-fused visitor path. 7. **Bulk indentation**: `renderIndent` uses `System.arraycopy` from a pre-allocated 64-byte spaces buffer instead of character-by-character append. 8. **Native fwrite direct stdout**: `NativeOutputStream` bypasses the Scala Native JVM compat layer (`PrintStream.write (synchronized)` → `FileOutputStream` → `FileChannelImpl` → `unistd.write`) with direct `stdio.fwrite(buf.at(off), 1, len, file)`. Eliminates per-write synchronization and syscall indirection. ## Modifications ### New files **`BaseByteRenderer.scala`** (shared `src/`): Byte-oriented JSON renderer extending `ujson.JsVisitor[OutputStream, OutputStream]`. Handles all JSON primitives, string rendering (short/long paths), integer rendering (digit-pair tables), and indentation. Provides `renderQuotedString` for the fused path. **`ByteRenderer.scala`** (shared `src/`): sjsonnet-specific byte renderer with custom double formatting (matching google/jsonnet output), empty `{ }`/`[ ]` rendering, reusable visitor instances, and the fused materializer (`materializeDirect`, `materializeChild`, `materializeDirectObj`, `materializeDirectArr`). **`CharSWAR.java`** (JVM `src-jvm/`): SWAR scanner using `VarHandle.get(byte[], offset)` for misaligned 8-byte reads. Handles both `String` (via `getChars` to char[]) and `byte[]` inputs. **`CharSWAR.scala`** (Native `src-native/`): SWAR scanner using `Intrinsics.loadLong` + `ByteArray.atRawUnsafe` for zero-overhead bulk reads. **`CharSWAR.scala`** (JS `src-js/`): Scalar fallback for Scala.js (no SWAR — JS lacks raw memory access). **`NativeOutputStream.scala`** (Native `src-native/`): Direct `fwrite`-based OutputStream for Scala Native, bypassing the JVM compat chain. ### Modified files **`SjsonnetMainBase.scala`**: File output and stdout paths now use `ByteRenderer` directly (bypassing `OutputStreamWriter`). Stdout path returns a sentinel value to avoid re-printing already-written output. Added `rawOutputStream` parameter to support Native fwrite bypass. **`SjsonnetMain.scala`** (Native): Passes `NativeOutputStream(stdio.stdout)` as `rawOutputStream`. **`Interpreter.scala`**: `materialize()` detects `ByteRenderer` and routes to the fused `materializeDirect()` path, bypassing the generic `Materializer.apply0` visitor dispatch. **`BaseCharRenderer.scala`**: `visitNonNullString` now uses `CharSWAR.hasEscapeChar` for pre-scanning. Added `writeLongDirect` with digit-pair lookup tables. Added companion object with lookup tables. **`Renderer.scala`**: `visitFloat64` inlined to avoid `RenderUtils.renderDouble` String allocation — uses `writeLongDirect` for integers, `BigDecimal` for whole-number doubles, `d.toString` for fractionals. **`Materializer.scala`**: Fixed `Apply`/`Apply0-3` pattern match arity for auto-TCO `strict` field (upstream `ecdd0b6`). ## Benchmark Results ### Hyperfine (Scala Native, `realistic2`, averaged over 2 rounds) | Config | Master (ms) | This PR (ms) | Speedup | |--------|:-----------:|:------------:|:-------:| | stdout | 270 ± 5 | 175 ± 6 | **1.55x (35% faster)** | | stdout `-p` | 250 ± 4 | 162 ± 3 | **1.54x (35% faster)** | | file `-o` | 449 ± 69 | 405 ± 69 | 1.11x (IO bound) | Output correctness verified: `diff` confirms byte-identical output between master and this PR. ## Analysis The byte[] pipeline optimization stacks four independent wins: 1. **OutputStreamWriter elimination** (~10%): Removing the char[]→UTF-8→byte[] conversion layer. Most impactful for file output where the full `OutputStreamWriter` synchronization overhead applies. 2. **SWAR escape scanning** (~5%): 8x throughput for escape-char detection on clean strings (the common case). The SWAR pre-scan gates a fast bulk-copy path, avoiding per-character processing. 3. **Fused materializer** (~15-20%): Eliminating Visitor interface virtual dispatch. On JVM with JIT, devirtualization handles most of this automatically. On Scala Native without JIT, every `visitObject`/`subVisitor`/`visitKey`/`visitValue`/`visitEnd` call is a vtable lookup + indirect branch — the fused path replaces all of these with direct method calls. 4. **Native fwrite bypass** (~5%): Eliminating `PrintStream` synchronized lock + `FileChannelImpl` indirection on every write. `stdio.fwrite` has internal buffering and batches small writes before syscall. ## Notes The `lazy_reverse_correctness.jsonnet` test failure on Scala 2.13.18 is a **pre-existing upstream bug** from PR #741 (lazy reverse array). Upstream master itself does not compile on 2.13 due to the auto-TCO pattern match arity issue (ecdd0b6), so this test was never run on 2.13 upstream. This PR fixes the compilation issue but exposes the runtime bug. This is not a regression introduced by this PR. ## Result - All test suites pass on Scala 3.3.7, JS, WASM, Native - Scala 2.13.18: 1 pre-existing upstream failure (`lazy_reverse_correctness.jsonnet`) - No regressions detected - Output is byte-identical to master for all test cases
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
std.reversewas allocating a new array and copying all elements in reverse order — O(n) time and space. For thestd_reversebenchmark (reversing large arrays repeatedly), this allocation overhead was measurable. jrsonnet uses a zero-copyReverseArraywrapper with index remapping, avoiding allocation entirely.Key Design Decision
Added a
_reversed: Booleanflag toVal.Arrrather than creating a separate wrapper class. This keeps the type hierarchy simple while achieving the same zero-copy benefit. When_reversedis true,value(i)returnsarr(length - 1 - i)instead ofarr(i).Trade-offs considered:
Modification
Val.scala—Val.Arr:_reversed: Booleanflag (private, mutable, single-threaded context)value(i)with reversed index mappingasLazyArray,asStrictArray,concat,foreach,forall,iteratorto handle reversed flagreversed(newPos): Arrfactory for zero-copy reversal (flips flag, reuses backing array)NoSuchElementExceptionon exhaustionstdlib/ArrayModule.scala—std.reverse:Val.Arr(pos, arrs.value.asArr.asLazyArray.reverse)toarrs.value.asArr.reversed(pos)New test:
lazy_reverse_correctness.jsonnet— 17 assertions covering:basic, empty, single-element, double-reverse, concat, filter, map, sort, foldl, slice, comparison, length, member, join, nested arrays, lazy thunks
Benchmark Results
JMH (JVM, Scala 3.3.7)
JVM shows neutral: JIT already optimizes array copy well. The benefit is primarily on Native.
Hyperfine (Scala Native, vs master and jrsonnet)
Analysis
The -6.8% improvement on Native confirms the zero-copy approach avoids allocation pressure that is more pronounced without JIT. The gap vs jrsonnet (38.4ms vs 25.9ms = 1.48x) remains due to other factors (array element access dispatch, thunk resolution overhead), not
std.reverseitself.No regressions detected on any benchmark.
References
_skipFieldCacheflag already used onVal.ObjResult
✅ All 55 JVM test suites pass
✅ 17 new regression test assertions
✅ -6.8% improvement on Native std_reverse benchmark
✅ No regressions on any other benchmark
✅ Scalafmt applied