
perf: lazy reverse array — zero-copy index remapping for std.reverse#741

Merged
stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/lazy-reverse-array
Apr 11, 2026

Conversation

Contributor

@He-Pin He-Pin commented Apr 11, 2026

Motivation

std.reverse was allocating a new array and copying all elements in reverse order — O(n) time and space. For the std_reverse benchmark (reversing large arrays repeatedly), this allocation overhead was measurable. jrsonnet uses a zero-copy ReverseArray wrapper with index remapping, avoiding allocation entirely.

Key Design Decision

Added a _reversed: Boolean flag to Val.Arr rather than creating a separate wrapper class. This keeps the type hierarchy simple while achieving the same zero-copy benefit. When _reversed is true, value(i) returns arr(length - 1 - i) instead of arr(i).

Trade-offs considered:

  • Separate wrapper class (like jrsonnet): more type-safe but adds dispatch overhead and complexity to every array consumer
  • Flag on existing class: simpler, all access methods already handle it, double-reverse cancels out
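
The index remapping and double-reverse cancellation can be sketched roughly like this (a toy model with `Int` elements; `LazyArr` and its members are illustrative names, not the actual `Val.Arr` API):

```scala
// Illustrative sketch of the reversed-flag idea; not the real Val.Arr.
final class LazyArr(backing: Array[Int], private val _reversed: Boolean = false) {
  def length: Int = backing.length

  // Index remapping: when reversed, index i reads backing(length - 1 - i)
  def value(i: Int): Int =
    if (_reversed) backing(length - 1 - i) else backing(i)

  // Zero-copy reversal: flip the flag and reuse the same backing array
  def reversed: LazyArr = new LazyArr(backing, !_reversed)
}
```

Reversing twice restores the original index mapping with no copying at any step, which is the "double-reverse cancels out" property noted above.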

Modification

Val.scala (Val.Arr):

  • Added _reversed: Boolean flag (private, mutable, single-threaded context)
  • Modified value(i) with reversed index mapping
  • Updated asLazyArray, asStrictArray, concat, foreach, forall, iterator to handle reversed flag
  • Added reversed(newPos): Arr factory for zero-copy reversal (flips flag, reuses backing array)
  • Iterator properly throws NoSuchElementException on exhaustion

stdlib/ArrayModule.scala (std.reverse):

  • Changed from Val.Arr(pos, arrs.value.asArr.asLazyArray.reverse) to arrs.value.asArr.reversed(pos)

New test: lazy_reverse_correctness.jsonnet — 17 assertions covering:
basic, empty, single-element, double-reverse, concat, filter, map, sort, foldl, slice, comparison, length, member, join, nested arrays, lazy thunks

Benchmark Results

JMH (JVM, Scala 3.3.7)

| Benchmark | Master (ms/op) | This PR (ms/op) | Change |
|-----------|:--------------:|:---------------:|:------:|
| reverse | 8.408 | 8.387 | -0.2% (neutral) |
| comparison | 20.866 | | no regression |
| comparison2 | 36.809 | | no regression |
| realistic2 | 58.777 | | no regression |

JVM results are neutral: the JIT already optimizes the array copy well, so the benefit shows up primarily on Native.

Hyperfine (Scala Native, vs master and jrsonnet)

| Benchmark | Master | This PR | jrsonnet | Improvement |
|-----------|:------:|:-------:|:--------:|:-----------:|
| std_reverse | 41.2ms | 38.4ms | 25.9ms | -6.8% vs master |
| comparison_for_array | | 1.13x faster | | positive |
| realistic_2 | | 1.00x | | neutral |
| comparison_for_primitives | | 1.00x | | neutral |

Analysis

The -6.8% improvement on Native confirms the zero-copy approach avoids allocation pressure that is more pronounced without JIT. The gap vs jrsonnet (38.4ms vs 25.9ms = 1.48x) remains due to other factors (array element access dispatch, thunk resolution overhead), not std.reverse itself.

No regressions detected on any benchmark.

References

  • jrsonnet ReverseArray approach: index remapping without allocation
  • Similar pattern to _skipFieldCache flag already used on Val.Obj

Result

✅ All 55 JVM test suites pass
✅ 17 new regression test assertions
✅ -6.8% improvement on Native std_reverse benchmark
✅ No regressions on any other benchmark
✅ Scalafmt applied

Instead of allocating a new reversed array, add a _reversed flag to Val.Arr
that remaps index access: value(i) returns arr(length-1-i) when reversed.
This avoids O(n) allocation and copy for std.reverse.

All array access methods (value, asLazyArray, asStrictArray, concat, foreach,
forall, iterator) correctly handle the reversed flag. Double-reverse cancels
out for zero overhead.

Inspired by jrsonnet's ReverseArray zero-copy approach.

Includes 17-assertion regression test covering: basic reverse, empty array,
single element, double-reverse, concat, filter, map, sort, foldl, slice,
comparison, length, member, join, nested arrays, and lazy thunks.
@He-Pin He-Pin marked this pull request as ready for review April 11, 2026 09:04
@stephenamar-db stephenamar-db merged commit 7e6e692 into databricks:master Apr 11, 2026
5 checks passed
stephenamar-db pushed a commit that referenced this pull request Apr 11, 2026
…sed materializer (#745)

## Motivation

The rendering pipeline is the dominant cost in sjsonnet's output path.
On Scala Native, `realistic2` materialization alone takes ~190ms out of
~270ms total (70%). The existing pipeline routes through `char[]`
buffers → `OutputStreamWriter` → UTF-8 encoding → `byte[]` →
`OutputStream`, adding unnecessary conversion layers for what is
predominantly ASCII JSON output.

This PR introduces a full `byte[]` rendering pipeline that eliminates
the char-to-byte conversion entirely, adds SWAR (SIMD Within A Register)
escape-character scanning, zero-allocation integer rendering, and a
fused materializer that bypasses the upickle Visitor dispatch interface.

## Key Design Decisions

1. **byte[] pipeline over char[]**: `BaseByteRenderer` mirrors
`BaseCharRenderer` but uses `upickle.core.ByteBuilder` (byte[]) instead
of `CharBuilder` (char[]), writing directly to `OutputStream`. This
eliminates the `OutputStreamWriter` UTF-8 encoding layer and halves
buffer memory for ASCII content.

2. **SWAR escape-char scanning**: `CharSWAR` processes 8 bytes per
iteration using bitwise parallel techniques (Hacker's Delight Ch. 6
zero-detection) to detect `"`, `\`, and control chars. Platform-specific
implementations: JVM uses `VarHandle` for misaligned reads, Scala Native
uses `Intrinsics.loadLong` + `ByteArray.atRawUnsafe`, JS falls back to
scalar loops.
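
The zero-detection trick behind the SWAR scan can be sketched in a few lines (assumes ASCII bytes already packed little-endian into a `Long`; `SwarSketch` and `pack` are illustrative helpers, not the PR's `CharSWAR` API):

```scala
// SWAR escape detection sketch: find '"', '\', or control chars (< 0x20)
// among 8 bytes at once, using the Hacker's Delight zero-in-word test.
object SwarSketch {
  // Exact test: true iff any of the 8 packed bytes is zero
  private def hasZeroByte(v: Long): Boolean =
    ((v - 0x0101010101010101L) & ~v & 0x8080808080808080L) != 0L

  def needsEscape(word: Long): Boolean = {
    val quotes  = word ^ 0x2222222222222222L // zero byte where '"' (0x22)
    val slashes = word ^ 0x5C5C5C5C5C5C5C5CL // zero byte where '\' (0x5C)
    // "has a byte less than 0x20" test for control characters
    val ctrl = (word - 0x2020202020202020L) & ~word & 0x8080808080808080L
    hasZeroByte(quotes) || hasZeroByte(slashes) || ctrl != 0L
  }

  // Helper for demonstration: pack the first 8 bytes little-endian
  def pack(bytes: Array[Byte]): Long = {
    var v = 0L; var i = 0
    while (i < 8) { v |= (bytes(i) & 0xFFL) << (8 * i); i += 1 }
    v
  }
}
```

A clean 8-byte chunk takes one pass through this predicate instead of eight per-character comparisons, which is what gates the bulk-copy fast path.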

3. **Two-tier string rendering**: Short strings (< 128 chars) use a
fused encode+check loop with zero allocation. Long strings (≥ 128 chars)
use `getBytes(UTF-8)` + SWAR bulk scan + `arraycopy`. The SWAR pre-scan
determines if the fast path (direct copy) can be taken, avoiding
per-character escape processing for clean strings.

4. **Digit-pair lookup table**: Integer rendering uses
two-digits-at-a-time conversion via `DIGIT_TENS`/`DIGIT_ONES` lookup
tables, writing backward into a scratch buffer then bulk-copying.
Eliminates `Long.toString` allocation for the most common numeric
output.
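
The digit-pair idea emits two decimal digits per division, roughly like this (a simplified sketch returning a `String` for clarity; the PR writes into the renderer's byte buffer instead, and the table names mirror the PR's):

```scala
// Two-digits-at-a-time integer rendering via lookup tables, writing
// backward into a scratch buffer then taking the filled suffix.
object DigitPairs {
  private val DIGIT_TENS = Array.tabulate[Byte](100)(i => ('0' + i / 10).toByte)
  private val DIGIT_ONES = Array.tabulate[Byte](100)(i => ('0' + i % 10).toByte)

  def render(value: Long): String = {
    require(value >= 0, "sketch handles non-negative values only")
    val scratch = new Array[Byte](20) // max decimal digits of a Long
    var pos = scratch.length
    var v = value
    while (v >= 100) {
      val r = (v % 100).toInt
      v /= 100
      pos -= 1; scratch(pos) = DIGIT_ONES(r)
      pos -= 1; scratch(pos) = DIGIT_TENS(r)
    }
    if (v >= 10) {
      pos -= 1; scratch(pos) = DIGIT_ONES(v.toInt)
      pos -= 1; scratch(pos) = DIGIT_TENS(v.toInt)
    } else {
      pos -= 1; scratch(pos) = ('0' + v.toInt).toByte
    }
    new String(scratch, pos, scratch.length - pos, "US-ASCII")
  }
}
```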

5. **Fused materializer+renderer**: `ByteRenderer.materializeDirect()`
walks the `Val` tree and writes JSON bytes directly, bypassing the
upickle `Visitor` interface entirely (no
`visitObject`/`visitArray`/`visitKey`/`visitValue`/`subVisitor` virtual
dispatch). Uses `@switch` on `valTag` for O(1) type routing. Falls back
to the generic `Materializer.apply0` path for deeply nested structures.
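
As a rough illustration of what "fused" means here, a direct tree walk emits bytes as it recurses, with no visitor interface in between (`FusedSketch` is a toy ADT standing in for sjsonnet's `Val`, not the real API):

```scala
import java.io.ByteArrayOutputStream

// Toy fused materializer+renderer: pattern-match on the value tree and
// write JSON bytes directly, with no visitObject/subVisitor indirection.
object FusedSketch {
  sealed trait V
  final case class VNum(n: Long) extends V
  final case class VArr(items: Seq[V]) extends V

  def materializeDirect(v: V, out: ByteArrayOutputStream): Unit = v match {
    case VNum(n) => out.write(n.toString.getBytes("US-ASCII"))
    case VArr(items) =>
      out.write('[')
      var first = true
      items.foreach { item =>
        if (!first) out.write(',')
        first = false
        materializeDirect(item, out)
      }
      out.write(']')
  }
}
```

On the JVM the JIT can often devirtualize the visitor calls this replaces; on Scala Native each one stays an indirect branch, which is why the fused path pays off there.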

6. **Reusable visitor instances**: Pre-allocated
`ArrVisitor`/`ObjVisitor` fields with a `Long` bitset for empty-state
tracking (bit per nesting level, supports 64 levels). Eliminates
per-array/per-object anonymous class allocation in the non-fused visitor
path.
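
The bitset trick can be sketched as follows (`EmptyTracker` is an illustrative name; the PR keeps the bits in a field of the renderer rather than a separate class):

```scala
// One bit per nesting depth records whether the current array/object has
// emitted an item yet, replacing per-level state allocation (64 levels).
final class EmptyTracker {
  private var bits: Long = 0L

  def enter(depth: Int): Unit = bits &= ~(1L << depth) // new level starts empty

  // Returns true if a separator (comma) must be written before this item
  def visitItem(depth: Int): Boolean = {
    val needComma = (bits & (1L << depth)) != 0L
    bits |= 1L << depth                                // level is now non-empty
    needComma
  }
}
```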

7. **Bulk indentation**: `renderIndent` uses `System.arraycopy` from a
pre-allocated 64-byte spaces buffer instead of character-by-character
append.

8. **Native fwrite direct stdout**: `NativeOutputStream` bypasses the
Scala Native JVM compat layer (`PrintStream.write (synchronized)` →
`FileOutputStream` → `FileChannelImpl` → `unistd.write`) with direct
`stdio.fwrite(buf.at(off), 1, len, file)`. Eliminates per-write
synchronization and syscall indirection.

## Modifications

### New files

**`BaseByteRenderer.scala`** (shared `src/`): Byte-oriented JSON
renderer extending `ujson.JsVisitor[OutputStream, OutputStream]`.
Handles all JSON primitives, string rendering (short/long paths),
integer rendering (digit-pair tables), and indentation. Provides
`renderQuotedString` for the fused path.

**`ByteRenderer.scala`** (shared `src/`): sjsonnet-specific byte
renderer with custom double formatting (matching google/jsonnet output),
empty `{ }`/`[ ]` rendering, reusable visitor instances, and the fused
materializer (`materializeDirect`, `materializeChild`,
`materializeDirectObj`, `materializeDirectArr`).

**`CharSWAR.java`** (JVM `src-jvm/`): SWAR scanner using
`VarHandle.get(byte[], offset)` for misaligned 8-byte reads. Handles
both `String` (via `getChars` to char[]) and `byte[]` inputs.

**`CharSWAR.scala`** (Native `src-native/`): SWAR scanner using
`Intrinsics.loadLong` + `ByteArray.atRawUnsafe` for zero-overhead bulk
reads.

**`CharSWAR.scala`** (JS `src-js/`): Scalar fallback for Scala.js (no
SWAR — JS lacks raw memory access).

**`NativeOutputStream.scala`** (Native `src-native/`): Direct
`fwrite`-based OutputStream for Scala Native, bypassing the JVM compat
chain.

### Modified files

**`SjsonnetMainBase.scala`**: File output and stdout paths now use
`ByteRenderer` directly (bypassing `OutputStreamWriter`). Stdout path
returns a sentinel value to avoid re-printing already-written output.
Added `rawOutputStream` parameter to support Native fwrite bypass.

**`SjsonnetMain.scala`** (Native): Passes
`NativeOutputStream(stdio.stdout)` as `rawOutputStream`.

**`Interpreter.scala`**: `materialize()` detects `ByteRenderer` and
routes to the fused `materializeDirect()` path, bypassing the generic
`Materializer.apply0` visitor dispatch.

**`BaseCharRenderer.scala`**: `visitNonNullString` now uses
`CharSWAR.hasEscapeChar` for pre-scanning. Added `writeLongDirect` with
digit-pair lookup tables. Added companion object with lookup tables.

**`Renderer.scala`**: `visitFloat64` inlined to avoid
`RenderUtils.renderDouble` String allocation — uses `writeLongDirect`
for integers, `BigDecimal` for whole-number doubles, `d.toString` for
fractionals.

**`Materializer.scala`**: Fixed `Apply`/`Apply0-3` pattern match arity
for auto-TCO `strict` field (upstream `ecdd0b6`).

## Benchmark Results

### Hyperfine (Scala Native, `realistic2`, averaged over 2 rounds)

| Config | Master (ms) | This PR (ms) | Speedup |
|--------|:-----------:|:------------:|:-------:|
| stdout | 270 ± 5 | 175 ± 6 | **1.55x (35% faster)** |
| stdout `-p` | 250 ± 4 | 162 ± 3 | **1.54x (35% faster)** |
| file `-o` | 449 ± 69 | 405 ± 69 | 1.11x (IO bound) |

Output correctness verified: `diff` confirms byte-identical output
between master and this PR.

## Analysis

The byte[] pipeline optimization stacks four independent wins:

1. **OutputStreamWriter elimination** (~10%): Removing the
char[]→UTF-8→byte[] conversion layer. Most impactful for file output
where the full `OutputStreamWriter` synchronization overhead applies.

2. **SWAR escape scanning** (~5%): 8x throughput for escape-char
detection on clean strings (the common case). The SWAR pre-scan gates a
fast bulk-copy path, avoiding per-character processing.

3. **Fused materializer** (~15-20%): Eliminating Visitor interface
virtual dispatch. On JVM with JIT, devirtualization handles most of this
automatically. On Scala Native without JIT, every
`visitObject`/`subVisitor`/`visitKey`/`visitValue`/`visitEnd` call is a
vtable lookup + indirect branch — the fused path replaces all of these
with direct method calls.

4. **Native fwrite bypass** (~5%): Eliminating `PrintStream`
synchronized lock + `FileChannelImpl` indirection on every write.
`stdio.fwrite` has internal buffering and batches small writes before
syscall.

## Notes

The `lazy_reverse_correctness.jsonnet` test failure on Scala 2.13.18 is
a **pre-existing upstream bug** from PR #741 (lazy reverse array).
Upstream master itself does not compile on 2.13 due to the auto-TCO
pattern match arity issue (ecdd0b6), so this test was never run on 2.13
upstream. This PR fixes the compilation issue but exposes the runtime
bug. This is not a regression introduced by this PR.

## Result

- All test suites pass on Scala 3.3.7, JS, WASM, Native
- Scala 2.13.18: 1 pre-existing upstream failure
(`lazy_reverse_correctness.jsonnet`)
- No regressions detected
- Output is byte-identical to master for all test cases