
refactor: extract RangeArr subclass from Arr to reduce memory footprint #772

Closed
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/range-arr-subclass

Conversation

Contributor

@He-Pin He-Pin commented Apr 12, 2026

Motivation

Follow-up to #771 (lazy range arrays). The initial implementation added _isRange (Boolean) and _rangeFrom (Int) as inline fields directly on Val.Arr. While functionally correct, these fields are pure overhead for the vast majority of arrays, which are never created via std.range.

On a 64-bit JVM with compressed oops, each Arr instance carries ~9 bytes of overhead (boolean + int + alignment padding) that only range arrays use.

Key Design Decision

Extract range-specific state into a dedicated RangeArr subclass instead of keeping it inline:

  • Arr: stays lean — only arr, _reversed, _concatLeft/Right, _length (the fields every array actually needs)
  • RangeArr extends Arr: adds rangeFrom: Int + size: Int; delegates to parent after materialization

This follows the same pattern as jrsonnet's specialized array variants (RangeArray, ReverseArray, etc.) where each representation is a distinct type.

Modification

Val.scala — Single file change:

  • Arr is no longer final; arr and _length widened to private[Val] for subclass access
  • isConcatView made final for Scala 2.x @inline compatibility
  • Removed _isRange, _rangeFrom, isRange, and materializeRange() from Arr
  • Removed range-specific branches from Arr.value(), eval(), asLazyArray(), reversed()
  • Added RangeArr(pos, rangeFrom, size) extends Arr(pos, null) with overrides for value, eval, asLazyArray, reversed
  • Arr.range() factory now returns new RangeArr(...) instead of mutating a plain Arr
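The shape of the extraction can be sketched as follows. This is a minimal illustration of the subclass pattern only, with simplified names and element types; the real Val.Arr stores lazy values and has a different API:

```scala
// Sketch: the base class keeps only the fields every array needs; the
// lazily materialized range state lives on a subclass (names illustrative).
class Arr(protected var elems: Array[Int]) {
  def length: Int = elems.length
  def value(i: Int): Int = elems(i)
}

final class RangeArr(val rangeFrom: Int, val size: Int) extends Arr(null) {
  // Materialize once, then delegate to the parent representation.
  private def materialize(): Unit =
    if (elems == null) elems = Array.tabulate(size)(i => rangeFrom + i)
  override def length: Int = size // length is known without materializing
  override def value(i: Int): Int = { materialize(); super.value(i) }
}
```

Plain arrays pay nothing for the range fields, while range arrays defer allocation until an element is actually read.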

Benchmark Results

JMH (JVM, Scala 3.3.7)

No regression on any benchmark. Key results:

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|-----------|----------------|---------------|--------|
| comparison | 0.028 | 0.029 | ≈ same |
| reverse | 6.909 | 6.912 | ≈ same |
| bench.06 | 2.117 | 2.097 | ≈ same |
| realistic2 | 60.2 | 60.4 | ≈ same |

Hyperfine (Scala Native vs jrsonnet)

| Benchmark | sjsonnet native | jrsonnet | Ratio |
|-----------|-----------------|----------|-------|
| comparison | 1.2 ms | 7.4 ms | sjsonnet 6.3x faster |
| reverse | 18.1 ms | 24.4 ms | sjsonnet 1.35x faster |

Analysis

Pure refactoring — moves existing range logic into a subclass without algorithmic changes. The virtual dispatch overhead for RangeArr overrides is negligible (confirmed by JMH). The memory savings benefit every non-range Arr instance in the program.

Result

✅ All tests pass (./mill __.test — all platforms × all Scala versions)
✅ No JMH regression
✅ No native benchmark regression
✅ ~9 bytes saved per non-range Arr instance


JMH Benchmark Results (vs master 0d13274)

| Benchmark | Master (ms/op) | This PR (ms/op) | Change |
|-----------|----------------|-----------------|--------|
| assertions | 0.207 | 0.212 | +2.4% (regressed) |
| base64 | 0.156 | 0.154 | -1.3% |
| base64Decode | 0.123 | 0.116 | -5.7% (improved) |
| base64DecodeBytes | 5.899 | 5.790 | -1.8% |
| base64_byte_array | 0.803 | 0.768 | -4.4% (improved) |
| bench.01 | 0.052 | 0.051 | -1.9% |
| bench.02 | 35.401 | 35.096 | -0.9% |
| bench.03 | 9.583 | 9.954 | +3.9% (regressed) |
| bench.04 | 0.122 | 0.119 | -2.5% (improved) |
| bench.06 | 0.224 | 0.217 | -3.1% (improved) |
| bench.07 | 3.332 | 3.177 | -4.7% (improved) |
| bench.08 | 0.038 | 0.039 | +2.6% (regressed) |
| bench.09 | 0.041 | 0.042 | +2.4% (regressed) |
| comparison | 0.028 | 0.028 | +0.0% |
| comparison2 | 18.681 | 17.886 | -4.3% (improved) |
| escapeStringJson | 0.032 | 0.031 | -3.1% (improved) |
| foldl | 0.077 | 0.075 | -2.6% (improved) |
| gen_big_object | 0.918 | 0.953 | +3.8% (regressed) |
| large_string_join | 0.555 | 0.512 | -7.7% (improved) |
| large_string_template | 1.600 | 1.639 | +2.4% (regressed) |
| lstripChars | 0.113 | 0.114 | +0.9% |
| manifestJsonEx | 0.052 | 0.051 | -1.9% |
| manifestTomlEx | 0.069 | 0.070 | +1.4% |
| manifestYamlDoc | 0.055 | 0.058 | +5.5% (regressed) |
| member | 0.656 | 0.637 | -2.9% (improved) |
| parseInt | 0.032 | 0.032 | +0.0% |
| realistic1 | 1.661 | 1.757 | +5.8% (regressed) |
| realistic2 | 57.541 | 56.998 | -0.9% |
| reverse | 6.717 | 7.172 | +6.8% (regressed) |
| rstripChars | 0.119 | 0.123 | +3.4% (regressed) |
| setDiff | 0.431 | 0.413 | -4.2% (improved) |
| setInter | 0.371 | 0.372 | +0.3% |
| setUnion | 0.604 | 0.605 | +0.2% |
| stripChars | 0.117 | 0.117 | +0.0% |
| substr | 0.057 | 0.057 | +0.0% |

Summary: 11 improvements, 10 regressions, 14 neutral
Platform: Apple Silicon, JMH single-shot avg

Move lazy range state (_isRange, _rangeFrom) from inline fields in Arr
to a dedicated RangeArr subclass. This saves ~9 bytes per non-range Arr
instance (boolean + int + alignment padding), benefiting the common case
where arrays are not created via std.range.

Key changes:
- Arr is no longer final; RangeArr extends Arr with range-specific fields
- arr and _length visibility widened to private[Val] for subclass access
- isConcatView made final for Scala 2.x @inline compatibility
- Range-specific branches removed from Arr.value/eval/asLazyArray/reversed
- RangeArr overrides value/eval/asLazyArray/reversed with range logic
- Arr.range() factory now returns RangeArr instances

Upstream: follow-up to databricks#771 (lazy range arrays)
@He-Pin He-Pin marked this pull request as ready for review April 12, 2026 17:16
Comment thread sjsonnet/src/sjsonnet/Val.scala
@He-Pin He-Pin marked this pull request as draft April 12, 2026 18:57
@He-Pin He-Pin marked this pull request as ready for review April 13, 2026 10:12
@He-Pin He-Pin marked this pull request as draft April 13, 2026 11:56
@He-Pin He-Pin closed this Apr 13, 2026
stephenamar-db pushed a commit that referenced this pull request Apr 13, 2026
## Motivation

On Scala Native, `java.util.Base64` is a pure-Scala implementation that
uses Wrapper objects, `@tailrec` recursive `iterate()`, and per-byte
pattern matching — significantly slower than HotSpot's intrinsic-backed
implementation.

Beyond the raw codec, `base64DecodeBytes` was creating `Array[Eval](N)`
and filling each slot with `Val.cachedNum` — N allocations for an N-byte
decode. The materializer then needed per-element type dispatch to render
these arrays. And `base64` encode output (guaranteed ASCII-safe) was
still being scanned for JSON escape characters. `Val.Arr` carried inline
`_isRange`/`_byteData` fields that bloated every regular array instance
(~13 bytes wasted per non-specialized array).

## Modification

### 1. Platform-agnostic `FastBase64` encoder/decoder
- `ENCODE_TABLE` (char[64]) and `DECODE_TABLE` (int[256]) pre-computed
lookup tables
- `encodeString()`: ASCII fast path does direct char→char encoding
without intermediate `byte[]`
- `decodeToString()` / `decodeToBytes()`: Direct string→bytes via lookup
table
- ISO-8859-1 compatibility: chars > 0xFF → 0x3F ('?') matching
`java.util.Base64` behavior
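As a rough illustration of the table-driven approach, the sketch below builds the two lookup tables and a minimal RFC 4648 encoder. This is a simplified stand-in, not the real `FastBase64`, which additionally handles the ASCII fast path, buffer reuse, and the > 0xFF substitution:

```scala
// Pre-computed lookup tables (sketch of the ENCODE_TABLE/DECODE_TABLE idea).
val ENCODE_TABLE: Array[Char] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray

// 256-entry inverse table: -1 marks characters outside the alphabet.
val DECODE_TABLE: Array[Int] = {
  val t = Array.fill(256)(-1)
  for (i <- ENCODE_TABLE.indices) t(ENCODE_TABLE(i)) = i
  t
}

def encode(in: Array[Byte]): String = {
  val sb = new StringBuilder
  var i = 0
  while (i + 2 < in.length) { // full 3-byte groups -> 4 output chars
    val n = ((in(i) & 0xff) << 16) | ((in(i + 1) & 0xff) << 8) | (in(i + 2) & 0xff)
    sb.append(ENCODE_TABLE(n >>> 18)).append(ENCODE_TABLE((n >>> 12) & 63))
      .append(ENCODE_TABLE((n >>> 6) & 63)).append(ENCODE_TABLE(n & 63))
    i += 3
  }
  (in.length - i) match { // trailing 1 or 2 bytes, '=' padded
    case 1 =>
      val n = (in(i) & 0xff) << 16
      sb.append(ENCODE_TABLE(n >>> 18)).append(ENCODE_TABLE((n >>> 12) & 63)).append("==")
    case 2 =>
      val n = ((in(i) & 0xff) << 16) | ((in(i + 1) & 0xff) << 8)
      sb.append(ENCODE_TABLE(n >>> 18)).append(ENCODE_TABLE((n >>> 12) & 63))
        .append(ENCODE_TABLE((n >>> 6) & 63)).append('=')
    case _ => ()
  }
  sb.toString
}
```

The point of the tables is that both directions become branch-free array indexing per character, which is what the SIMD paths then vectorize.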

### 2. C FFI SIMD base64 for Scala Native (`sjsonnet_base64.c`)
- **AArch64 NEON**: `vld3`/`vst4` interleaved load/store + `vqtbl4q`
64-byte lookup for encode; `vbslq`/`vmovl_u8`/`vmovn_u16` for byte↔char
widening/narrowing
- **x86_64**: SSSE3/AVX2/AVX-512 VBMI paths via
`pshufb`/`vpshufb`/`vpermi2b`
- **Fallback**: Scalar with loop unrolling for other architectures
- `sjsonnet_base64_decode_validated()`: Single-pass validation + decode
with specific error codes
- RFC 4648 compliant with '=' padding
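The validate-while-decoding idea (which `sjsonnet_base64_decode_validated()` implements in C with SIMD and specific error codes) can be sketched in scalar Scala. This version is illustrative only and collapses all error codes into `None`:

```scala
// Inverse lookup table: -1 marks characters outside the base64 alphabet.
val DECODE_TABLE: Array[Int] = {
  val alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
  val t = Array.fill(256)(-1)
  for (i <- alphabet.indices) t(alphabet.charAt(i)) = i
  t
}

// Single-pass decode: validate each character while accumulating bits.
def decode(s: String): Option[Array[Byte]] = {
  val in = s.takeWhile(_ != '=') // strip trailing '=' padding
  val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
  var acc = 0; var bits = 0; var i = 0; var ok = true
  while (ok && i < in.length) {
    val c = in.charAt(i)
    val v = if (c > 0xff) -1 else DECODE_TABLE(c)
    if (v < 0) ok = false // invalid character: fail fast
    else {
      acc = (acc << 6) | v
      bits += 6
      if (bits >= 8) { bits -= 8; out += ((acc >>> bits) & 0xff).toByte }
    }
    i += 1
  }
  if (ok) Some(out.toArray) else None
}
```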

### 3. Native-specific optimizations
- Reusable module-level buffers (safe: Scala Native is single-threaded)
— eliminates per-call array allocations
- ASCII fast-path in `encodeString`: skip UTF-8 encoding for pure ASCII
strings
- Direct char array construction instead of charset lookup

### 4. `RangeArr` and `ByteArr` subclasses of `Val.Arr`
- `Val.Arr` changed from `final class` to non-final `class`, enabling
specialization
- **`RangeArr extends Arr`**: Lazy integer range — keeps `rangeFrom`
field out of regular arrays, saving ~9 bytes per non-range array (merges
#772)
- **`ByteArr extends Arr`**: Compact `Array[Byte]` backing store for
0–255 integer arrays
- `byteData` is an immutable `val` — never cleared after
materialization, guaranteeing `rawBytes` is always non-null for safe
multi-use
- `reversed()` materializes first to keep `value()`/`eval()` simple and
avoid reversed-index bugs
- `rawBytes` accessor enables zero-copy fast paths in `base64` encode
and materializer
- Callers use pattern match (`case ba: Val.ByteArr =>`) instead of
null-returning `rawBytes` on base class
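A minimal sketch of the byte-backed representation (names illustrative; the real `ByteArr` extends `Val.Arr` and produces lazy `Val` elements):

```scala
// Illustrative byte-backed array: a compact Array[Byte] store for arrays
// of 0-255 integers. rawBytes is an immutable val that is never cleared,
// so zero-copy consumers (e.g. a base64 encoder) can read it repeatedly.
final class ByteArr(val rawBytes: Array[Byte]) {
  def length: Int = rawBytes.length
  // Bytes are signed on the JVM; mask to recover the 0-255 element value.
  def value(i: Int): Int = rawBytes(i) & 0xff
  // reversed() copies into a fresh array rather than tracking a reversed
  // flag, keeping value() free of index arithmetic (mirrors the PR's choice).
  def reversed: ByteArr = new ByteArr(rawBytes.reverse)
}
```

One byte per element replaces one boxed lazy reference per element, which is where the footprint win for 0-255 integer arrays comes from.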

### 5. Materializer fast-path for byte arrays
- Recursive, iterative, and fused ByteRenderer paths all detect
`ByteArr` via pattern match
- Skip `value(i)` lookup + type dispatch + `asDouble` conversion
- Directly emit `visitFloat64((bytes(i) & 0xff).toDouble)` in a tight
loop
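The dispatch shape described above can be sketched as follows; the types and the `visitFloat64` callback are stand-ins for the real materializer and visitor API:

```scala
// Sketch: the materializer pattern-matches the byte-backed case and emits
// numbers from raw bytes in a tight loop, skipping per-element value lookup
// and type dispatch (types illustrative).
sealed trait Arr
final case class PlainArr(values: Array[Double]) extends Arr
final case class ByteArr(rawBytes: Array[Byte]) extends Arr

def materializeNumbers(a: Arr, visitFloat64: Double => Unit): Unit = a match {
  case ByteArr(bytes) =>
    var i = 0
    while (i < bytes.length) {
      visitFloat64((bytes(i) & 0xff).toDouble) // direct emit, no dispatch
      i += 1
    }
  case PlainArr(values) =>
    // Generic path: the real code does per-element type dispatch here.
    values.foreach(visitFloat64)
}
```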

### 6. ASCII-safe string rendering
- `Val.Str._asciiSafe` flag marks strings known to contain only
printable ASCII (no JSON escaping needed)
- `Val.Str.asciiSafe(pos, s)` factory for creating flagged strings
- `BaseByteRenderer.renderAsciiSafeString()` skips SWAR escape scanning
and UTF-8 encoding — writes bytes directly from chars
- `base64` encode output is marked as ASCII-safe since base64 alphabet
is `[A-Za-z0-9+/=]`
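The shortcut rests on a simple invariant, sketched below with illustrative names (the real renderer sets the flag at construction time rather than re-scanning, and its slow path does full SWAR-scanned JSON escaping):

```scala
// A string is "ASCII-safe" if it contains only printable ASCII and none of
// the characters JSON requires escaping, so its chars can be written as
// bytes directly with no escape scan and no UTF-8 encoder.
def isAsciiSafe(s: String): Boolean =
  s.forall(c => c >= 0x20 && c <= 0x7e && c != '"' && c != '\\')

def renderString(s: String, out: java.io.ByteArrayOutputStream): Unit =
  if (isAsciiSafe(s)) {
    out.write('"')
    var i = 0
    while (i < s.length) { out.write(s.charAt(i).toInt); i += 1 }
    out.write('"')
  } else {
    // Simplistic slow-path stand-in: escape backslash and quote, then
    // UTF-8 encode. The real renderer does full JSON escaping.
    val escaped = s.replace("\\", "\\\\").replace("\"", "\\\"")
    out.write(("\"" + escaped + "\"").getBytes("UTF-8"))
  }
```

Base64 output always satisfies the predicate, since its alphabet is `[A-Za-z0-9+/=]`, so flagging it at creation avoids the scan entirely.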

### 7. `EncodingModule` updates
- `base64DecodeBytes`: Uses `Val.Arr.fromBytes(pos, decoded)` — one
allocation instead of N
- `base64` encode: Pattern matches `ByteArr` for zero-copy bypass;
output marked `asciiSafe`

## Benchmark Results

### JMH (JVM, Scala 3.3.7, Apple Silicon M4 Max)

| Benchmark | Master (ms/op) | PR (ms/op) | Change |
|-----------|---------------|------------|--------|
| base64 | 0.153 | 0.145 | **-5.2%** |
| base64Decode | 0.117 | 0.115 | -1.7% |
| base64DecodeBytes | 5.692 | 5.109 | **-10.2%** |
| base64_byte_array | 0.757 | 0.758 | ~same |
| base64_stress | — | 0.188 | (new) |

### Scala Native (hyperfine -N, 30 runs, Apple Silicon M4 Max)

Compared against jrsonnet **0.5.0-pre98** (built from source, `cargo build --release`).

| Benchmark | sjsonnet master | sjsonnet PR | jrsonnet 0.5.0 | PR vs master | PR vs jrsonnet |
|-----------|-----------------|-------------|----------------|--------------|----------------|
| base64 | 8.7ms | 6.5ms | 4.4ms | **1.34× faster** | 1.47× slower |
| base64Decode | 7.3ms | 6.8ms | 4.3ms | 1.07× faster | 1.60× slower |
| base64DecodeBytes | 28.7ms | 13.5ms | 20.1ms | **2.13× faster** | **1.50× faster** |
| base64_byte_array | 10.5ms | 8.5ms | 17.3ms | **1.23× faster** | **2.02× faster** |
| base64_stress | 6.6ms | 6.3ms | 5.0ms | ~same | 1.28× slower |

**Compute-heavy benchmarks** (`base64DecodeBytes`, `base64_byte_array`):
sjsonnet significantly outperforms jrsonnet — 1.50× and 2.02× faster
respectively.

**Small benchmarks** (`base64`, `base64Decode`, `base64_stress`):
jrsonnet is faster due to lower startup overhead (~3ms vs ~5ms). The
actual base64 computation time is comparable; the gap is dominated by
process startup.

## Files Changed

| File | Change |
|------|--------|
| `sjsonnet/src/sjsonnet/Val.scala` | `Arr` non-final, `RangeArr` + `ByteArr` subclasses, `_asciiSafe` flag, `asciiSafe` factory |
| `sjsonnet/src/sjsonnet/Materializer.scala` | ByteArr pattern-match fast path in recursive + iterative paths |
| `sjsonnet/src/sjsonnet/ByteRenderer.scala` | ByteArr fast path in fused materializer + ASCII-safe string dispatch |
| `sjsonnet/src/sjsonnet/BaseByteRenderer.scala` | `renderAsciiSafeString()` for escape-free rendering |
| `sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala` | `fromBytes` for DecodeBytes, ByteArr match for encode, `asciiSafe` for output |
| `sjsonnet/src-js/sjsonnet/stdlib/FastBase64.scala` | Pure Scala implementation (JS/WASM) |
| `sjsonnet/src-jvm/sjsonnet/stdlib/FastBase64.scala` | Delegates to `java.util.Base64` (unchanged behavior) |
| `sjsonnet/src-native/sjsonnet/stdlib/FastBase64.scala` | C FFI wrappers + buffer reuse + ASCII fast paths |
| `sjsonnet/resources/scala-native/sjsonnet_base64.c` | SIMD C implementation (NEON/SSSE3/AVX2/AVX-512 + scalar fallback) |
| `sjsonnet/test/resources/new_test_suite/byte_arr_correctness.jsonnet` | Regression tests for ByteArr (multi-use, reverse, concat, round-trip) |
| `sjsonnet/test/resources/new_test_suite/range_arr_correctness.jsonnet` | Regression tests for RangeArr correctness |
| `bench/resources/go_suite/base64_stress.jsonnet` | New benchmark for mixed encode/decode stress test |

## Result

- base64DecodeBytes **2.13× faster** than master, **1.50× faster** than
jrsonnet 0.5.0
- base64_byte_array **2.02× faster** than jrsonnet 0.5.0
- JVM base64DecodeBytes improved **10.2%** vs master
- All JVM, JS, and Native tests pass