refactor: extract RangeArr subclass from Arr to reduce memory footprint #772
Closed
He-Pin wants to merge 1 commit into databricks:master from
Move lazy range state (`_isRange`, `_rangeFrom`) from inline fields in `Arr` to a dedicated `RangeArr` subclass. This saves ~9 bytes per non-range `Arr` instance (boolean + int + alignment padding), benefiting the common case where arrays are not created via `std.range`.

Key changes:
- `Arr` is no longer final; `RangeArr` extends `Arr` with range-specific fields
- `arr` and `_length` visibility widened to `private[Val]` for subclass access
- `isConcatView` made final for Scala 2.x `@inline` compatibility
- Range-specific branches removed from `Arr.value`/`eval`/`asLazyArray`/`reversed`
- `RangeArr` overrides `value`/`eval`/`asLazyArray`/`reversed` with range logic
- `Arr.range()` factory now returns `RangeArr` instances

Upstream: follow-up to databricks#771 (lazy range arrays)
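A rough sketch of the shape of this split (class and field names here are simplified stand-ins, not the actual sjsonnet definitions):

```scala
// Simplified model: the base class keeps only the backing array, while
// range state (rangeFrom, size) lives on the subclass, so plain Arr
// instances pay no extra field cost.
class Arr(protected var elems: Array[Int]) {
  def length: Int = elems.length
  def value(i: Int): Int = elems(i)
}

final class RangeArr(rangeFrom: Int, size: Int) extends Arr(null) {
  override def length: Int = size
  // Materialize lazily: build the backing array only on first access,
  // then delegate to the plain-array representation.
  override def value(i: Int): Int = {
    if (elems == null) elems = Array.tabulate(size)(rangeFrom + _)
    elems(i)
  }
}
```

Moving the fields onto a subclass trades a virtual dispatch on range arrays (rare) for smaller object headers on all other arrays (common).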
He-Pin commented on Apr 12, 2026
stephenamar-db pushed a commit that referenced this pull request on Apr 13, 2026:
## Motivation
On Scala Native, `java.util.Base64` is a pure-Scala implementation that
uses Wrapper objects, `@tailrec` recursive `iterate()`, and per-byte
pattern matching — significantly slower than HotSpot's intrinsic-backed
implementation.
Beyond the raw codec, `base64DecodeBytes` was creating `Array[Eval](N)`
and filling each slot with `Val.cachedNum` — N allocations for an N-byte
decode. The materializer then needed per-element type dispatch to render
these arrays. And `base64` encode output (guaranteed ASCII-safe) was
still being scanned for JSON escape characters. `Val.Arr` carried inline
`_isRange`/`_byteData` fields that bloated every regular array instance
(~13 bytes wasted per non-specialized array).
## Modification
### 1. Platform-agnostic `FastBase64` encoder/decoder
- `ENCODE_TABLE` (char[64]) and `DECODE_TABLE` (int[256]) pre-computed
lookup tables
- `encodeString()`: ASCII fast path does direct char→char encoding
without intermediate `byte[]`
- `decodeToString()` / `decodeToBytes()`: Direct string→bytes via lookup
table
- ISO-8859-1 compatibility: chars > 0xFF → 0x3F ('?') matching
`java.util.Base64` behavior
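The lookup-table approach can be illustrated with a simplified encoder. `Base64EncodeSketch` is illustrative only, not the actual `FastBase64` code, though the `ENCODE_TABLE` name mirrors the description above:

```scala
object Base64EncodeSketch {
  // Pre-computed encode table, one char per 6-bit value.
  private val ENCODE_TABLE: Array[Char] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray

  // Direct char-by-char encoding with no intermediate byte[] for the
  // output; '=' padding per RFC 4648.
  def encode(bytes: Array[Byte]): String = {
    val sb = new StringBuilder
    var i = 0
    while (i + 2 < bytes.length) {
      // Pack 3 bytes into 24 bits, emit 4 sextets.
      val n = ((bytes(i) & 0xff) << 16) | ((bytes(i + 1) & 0xff) << 8) | (bytes(i + 2) & 0xff)
      sb.append(ENCODE_TABLE((n >> 18) & 0x3f))
      sb.append(ENCODE_TABLE((n >> 12) & 0x3f))
      sb.append(ENCODE_TABLE((n >> 6) & 0x3f))
      sb.append(ENCODE_TABLE(n & 0x3f))
      i += 3
    }
    (bytes.length - i) match {
      case 1 => // one trailing byte -> 2 chars + "=="
        val n = (bytes(i) & 0xff) << 16
        sb.append(ENCODE_TABLE((n >> 18) & 0x3f))
        sb.append(ENCODE_TABLE((n >> 12) & 0x3f))
        sb.append("==")
      case 2 => // two trailing bytes -> 3 chars + "="
        val n = ((bytes(i) & 0xff) << 16) | ((bytes(i + 1) & 0xff) << 8)
        sb.append(ENCODE_TABLE((n >> 18) & 0x3f))
        sb.append(ENCODE_TABLE((n >> 12) & 0x3f))
        sb.append(ENCODE_TABLE((n >> 6) & 0x3f))
        sb.append('=')
      case _ => // input length was a multiple of 3; nothing to pad
    }
    sb.toString
  }
}
```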
### 2. C FFI SIMD base64 for Scala Native (`sjsonnet_base64.c`)
- **AArch64 NEON**: `vld3`/`vst4` interleaved load/store + `vqtbl4q`
64-byte lookup for encode; `vbslq`/`vmovl_u8`/`vmovn_u16` for byte↔char
widening/narrowing
- **x86_64**: SSSE3/AVX2/AVX-512 VBMI paths via
`pshufb`/`vpshufb`/`vpermi2b`
- **Fallback**: Scalar with loop unrolling for other architectures
- `sjsonnet_base64_decode_validated()`: Single-pass validation + decode
with specific error codes
- RFC 4648 compliant with '=' padding
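The single-pass validate-and-decode idea of the scalar fallback, sketched in Scala rather than C; `Base64DecodeSketch` is hypothetical, and `None` stands in for the C function's specific error codes:

```scala
object Base64DecodeSketch {
  // DECODE_TABLE maps a char code to its 6-bit value, or -1 if invalid.
  private val DECODE_TABLE: Array[Int] = {
    val alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    val t = Array.fill(256)(-1)
    for (i <- alphabet.indices) t(alphabet(i)) = i
    t
  }

  // Validation and decoding happen in the same pass over the input.
  def decode(s: String): Option[Array[Byte]] = {
    val trimmed = s.takeWhile(_ != '=') // drop RFC 4648 '=' padding
    val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
    var acc = 0
    var bits = 0
    var i = 0
    while (i < trimmed.length) {
      val c = trimmed.charAt(i)
      val v = if (c > 0xff) -1 else DECODE_TABLE(c)
      if (v < 0) return None // invalid character found mid-pass
      acc = (acc << 6) | v   // accumulate 6 bits per input char
      bits += 6
      if (bits >= 8) { bits -= 8; out += ((acc >> bits) & 0xff).toByte }
      i += 1
    }
    Some(out.toArray)
  }
}
```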
### 3. Native-specific optimizations
- Reusable module-level buffers (safe: Scala Native is single-threaded)
— eliminates per-call array allocations
- ASCII fast-path in `encodeString`: skip UTF-8 encoding for pure ASCII
strings
- Direct char array construction instead of charset lookup
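The buffer-reuse pattern might look like this minimal sketch; `BufferReuse` is a hypothetical name, and the safety argument rests entirely on the single-threaded assumption stated above:

```scala
object BufferReuse {
  // Module-level scratch buffer. Safe to share only because Scala
  // Native is single-threaded (per the description above) -- this
  // would be a data race on the JVM.
  private var scratch = new Array[Byte](64)

  // Grow-only: callers get a buffer at least n bytes long, and repeat
  // calls reuse the same allocation instead of allocating per call.
  def borrow(n: Int): Array[Byte] = {
    if (scratch.length < n) scratch = new Array[Byte](n)
    scratch
  }
}
```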
### 4. `RangeArr` and `ByteArr` subclasses of `Val.Arr`
- `Val.Arr` changed from `final class` to non-final `class`, enabling
specialization
- **`RangeArr extends Arr`**: Lazy integer range — keeps `rangeFrom`
field out of regular arrays, saving ~9 bytes per non-range array (merges
#772)
- **`ByteArr extends Arr`**: Compact `Array[Byte]` backing store for
0–255 integer arrays
- `byteData` is an immutable `val` — never cleared after
materialization, guaranteeing `rawBytes` is always non-null for safe
multi-use
- `reversed()` materializes first to keep `value()`/`eval()` simple and
avoid reversed-index bugs
- `rawBytes` accessor enables zero-copy fast paths in `base64` encode
and materializer
- Callers use pattern match (`case ba: Val.ByteArr =>`) instead of
null-returning `rawBytes` on base class
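A minimal model of the `ByteArr` idea and the pattern-match fast path (class shapes simplified; these are not the sjsonnet definitions):

```scala
// Simplified stand-ins for Val.Arr / Val.ByteArr.
class Arr(val items: Array[Int]) {
  def length: Int = items.length
  def value(i: Int): Int = items(i)
}

// Compact backing store: one byte per element for 0-255 integers.
// rawBytes is an immutable val, so it stays non-null across uses.
final class ByteArr(val rawBytes: Array[Byte]) extends Arr(null) {
  override def length: Int = rawBytes.length
  override def value(i: Int): Int = rawBytes(i) & 0xff // unsigned view
}

object ByteArrOps {
  // Callers pattern-match for the zero-copy fast path rather than
  // calling a null-returning accessor on the base class.
  def sum(a: Arr): Int = a match {
    case ba: ByteArr => ba.rawBytes.foldLeft(0)((acc, b) => acc + (b & 0xff))
    case other       => (0 until other.length).map(other.value).sum
  }
}
```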
### 5. Materializer fast-path for byte arrays
- Recursive, iterative, and fused ByteRenderer paths all detect
`ByteArr` via pattern match
- Skip `value(i)` lookup + type dispatch + `asDouble` conversion
- Directly emit `visitFloat64((bytes(i) & 0xff).toDouble)` in a tight
loop
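The tight loop can be sketched as follows, with `Visitor` as a hypothetical stand-in for the renderer interface:

```scala
trait Visitor { def visitFloat64(d: Double): Unit }

object ByteArrRender {
  // Emit each byte as an unsigned double in a tight loop: no
  // per-element Val lookup, type dispatch, or asDouble conversion.
  def renderBytes(bytes: Array[Byte], v: Visitor): Unit = {
    var i = 0
    while (i < bytes.length) {
      v.visitFloat64((bytes(i) & 0xff).toDouble)
      i += 1
    }
  }
}
```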
### 6. ASCII-safe string rendering
- `Val.Str._asciiSafe` flag marks strings known to contain only
printable ASCII (no JSON escaping needed)
- `Val.Str.asciiSafe(pos, s)` factory for creating flagged strings
- `BaseByteRenderer.renderAsciiSafeString()` skips SWAR escape scanning
and UTF-8 encoding — writes bytes directly from chars
- `base64` encode output is marked as ASCII-safe since base64 alphabet
is `[A-Za-z0-9+/=]`
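A sketch of the ASCII-safe dispatch; the flag and method shapes are modelled on the description above, not copied from sjsonnet:

```scala
import java.io.ByteArrayOutputStream

object AsciiSafeRender {
  // If the producer guarantees pure printable ASCII (as base64 output
  // does), every char is a single byte and no JSON escape can occur,
  // so the escape scan and UTF-8 encoding are skipped entirely.
  def renderString(s: String, asciiSafe: Boolean, out: ByteArrayOutputStream): Unit = {
    out.write('"')
    if (asciiSafe) {
      var i = 0
      while (i < s.length) { out.write(s.charAt(i).toInt); i += 1 }
    } else {
      // General path (abbreviated): escape JSON specials, then UTF-8
      // encode the result.
      val escaped = s.flatMap {
        case '"'  => "\\\""
        case '\\' => "\\\\"
        case '\n' => "\\n"
        case c    => c.toString
      }
      out.write(escaped.getBytes("UTF-8"))
    }
    out.write('"')
  }
}
```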
### 7. `EncodingModule` updates
- `base64DecodeBytes`: Uses `Val.Arr.fromBytes(pos, decoded)` — one
allocation instead of N
- `base64` encode: Pattern matches `ByteArr` for zero-copy bypass;
output marked `asciiSafe`
## Benchmark Results
### JMH (JVM, Scala 3.3.7, Apple Silicon M4 Max)
| Benchmark | Master (ms/op) | PR (ms/op) | Change |
|-----------|---------------|------------|--------|
| base64 | 0.153 | 0.145 | **-5.2%** |
| base64Decode | 0.117 | 0.115 | -1.7% |
| base64DecodeBytes | 5.692 | 5.109 | **-10.2%** |
| base64_byte_array | 0.757 | 0.758 | ~same |
| base64_stress | — | 0.188 | (new) |
### Scala Native (hyperfine -N, 30 runs, Apple Silicon M4 Max)
Compared against jrsonnet **0.5.0-pre98** (built from source, `cargo
build --release`).
| Benchmark | sjsonnet master | sjsonnet PR | jrsonnet 0.5.0 | PR vs master | PR vs jrsonnet |
|-----------|----------------|-------------|----------------|--------------|----------------|
| base64 | 8.7ms | 6.5ms | 4.4ms | **1.34× faster** | 1.47× slower |
| base64Decode | 7.3ms | 6.8ms | 4.3ms | 1.07× faster | 1.60× slower |
| base64DecodeBytes | 28.7ms | 13.5ms | 20.1ms | **2.13× faster** | **1.50× faster** |
| base64_byte_array | 10.5ms | 8.5ms | 17.3ms | **1.23× faster** | **2.02× faster** |
| base64_stress | 6.6ms | 6.3ms | 5.0ms | ~same | 1.28× slower |
**Compute-heavy benchmarks** (`base64DecodeBytes`, `base64_byte_array`):
sjsonnet significantly outperforms jrsonnet — 1.50× and 2.02× faster
respectively.
**Small benchmarks** (`base64`, `base64Decode`, `base64_stress`):
jrsonnet is faster due to lower startup overhead (~3ms vs ~5ms). The
actual base64 computation time is comparable; the gap is dominated by
process startup.
## Files Changed
| File | Change |
|------|--------|
| `sjsonnet/src/sjsonnet/Val.scala` | `Arr` non-final, `RangeArr` + `ByteArr` subclasses, `_asciiSafe` flag, `asciiSafe` factory |
| `sjsonnet/src/sjsonnet/Materializer.scala` | ByteArr pattern-match fast path in recursive + iterative paths |
| `sjsonnet/src/sjsonnet/ByteRenderer.scala` | ByteArr fast path in fused materializer + ASCII-safe string dispatch |
| `sjsonnet/src/sjsonnet/BaseByteRenderer.scala` | `renderAsciiSafeString()` for escape-free rendering |
| `sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala` | `fromBytes` for DecodeBytes, ByteArr match for encode, `asciiSafe` for output |
| `sjsonnet/src-js/sjsonnet/stdlib/FastBase64.scala` | Pure Scala implementation (JS/WASM) |
| `sjsonnet/src-jvm/sjsonnet/stdlib/FastBase64.scala` | Delegates to `java.util.Base64` (unchanged behavior) |
| `sjsonnet/src-native/sjsonnet/stdlib/FastBase64.scala` | C FFI wrappers + buffer reuse + ASCII fast paths |
| `sjsonnet/resources/scala-native/sjsonnet_base64.c` | SIMD C implementation (NEON/SSSE3/AVX2/AVX-512 + scalar fallback) |
| `sjsonnet/test/resources/new_test_suite/byte_arr_correctness.jsonnet` | Regression tests for ByteArr (multi-use, reverse, concat, round-trip) |
| `sjsonnet/test/resources/new_test_suite/range_arr_correctness.jsonnet` | Regression tests for RangeArr correctness |
| `bench/resources/go_suite/base64_stress.jsonnet` | New benchmark for mixed encode/decode stress test |
## Result
- base64DecodeBytes **2.13× faster** than master, **1.50× faster** than
jrsonnet 0.5.0
- base64_byte_array **2.02× faster** than jrsonnet 0.5.0
- JVM base64DecodeBytes improved **10.2%** vs master
- All JVM, JS, and Native tests pass
Motivation
Follow-up to #771 (lazy range arrays). The initial implementation added `_isRange` (Boolean) and `_rangeFrom` (Int) as inline fields directly on `Val.Arr`. While functionally correct, these fields are wasted memory for the vast majority of arrays that are not created via `std.range`. On a 64-bit JVM with compressed oops, each `Arr` instance carries ~9 bytes of overhead (boolean + int + alignment padding) that only range arrays use.
Key Design Decision
Extract range-specific state into a dedicated `RangeArr` subclass instead of keeping it inline:
- `Arr`: stays lean — only `arr`, `_reversed`, `_concatLeft`/`Right`, `_length` (the fields every array actually needs)
- `RangeArr extends Arr`: adds `rangeFrom: Int` + `size: Int`; delegates to parent after materialization
This follows the same pattern as jrsonnet's specialized array variants (`RangeArray`, `ReverseArray`, etc.), where each representation is a distinct type.
Modification
`Val.scala` — single file changed:
- `Arr` is no longer `final`; `arr` and `_length` widened to `private[Val]` for subclass access
- `isConcatView` made `final` for Scala 2.x `@inline` compatibility
- `_isRange`, `_rangeFrom`, `isRange`, and `materializeRange()` removed from `Arr`, along with the range branches in `Arr.value()`, `eval()`, `asLazyArray()`, `reversed()`
- `RangeArr(pos, rangeFrom, size) extends Arr(pos, null)` added, with overrides for `value`, `eval`, `asLazyArray`, `reversed`
- `Arr.range()` factory now returns `new RangeArr(...)` instead of mutating a plain `Arr`
Benchmark Results
JMH (JVM, Scala 3.3.7)
No regression on any benchmark. Key results:
Hyperfine (Scala Native vs jrsonnet)
Analysis
Pure refactoring — moves existing range logic into a subclass without algorithmic changes. The virtual dispatch overhead of the `RangeArr` overrides is negligible (confirmed by JMH). The memory savings benefit every non-range `Arr` instance in the program.
References
`crates/jrsonnet-evaluator/src/arr/spec.rs`
Result
✅ All tests pass (`./mill __.test` — all platforms × all Scala versions)
✅ No JMH regression
✅ No native benchmark regression
✅ ~9 bytes saved per non-range Arr instance
JMH Benchmark Results (vs master 0d13274)
Summary: 11 improvements, 10 regressions, 14 neutral
Platform: Apple Silicon, JMH single-shot avg