
refactor: extract RangeArr subclass from Arr to reduce memory footprint #772

Closed
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/range-arr-subclass

Conversation

Contributor

@He-Pin He-Pin commented Apr 12, 2026

Motivation

Follow-up to #771 (lazy range arrays). The initial implementation added _isRange (Boolean) and _rangeFrom (Int) as inline fields directly on Val.Arr. While functionally correct, these fields are pure overhead for the vast majority of arrays, which are never created via std.range.

On a 64-bit JVM with compressed oops, each Arr instance carries ~9 bytes of overhead (boolean + int + alignment padding) that only range arrays use.

Key Design Decision

Extract range-specific state into a dedicated RangeArr subclass instead of keeping it inline:

  • Arr: stays lean — only arr, _reversed, _concatLeft/Right, _length (the fields every array actually needs)
  • RangeArr extends Arr: adds rangeFrom: Int + size: Int; delegates to parent after materialization

This follows the same pattern as jrsonnet's specialized array variants (RangeArray, ReverseArray, etc.) where each representation is a distinct type.

Modification

Val.scala — Single file change:

  • Arr is no longer final; arr and _length widened to private[Val] for subclass access
  • isConcatView made final for Scala 2.x @inline compatibility
  • Removed _isRange, _rangeFrom, isRange, and materializeRange() from Arr
  • Removed range-specific branches from Arr.value(), eval(), asLazyArray(), reversed()
  • Added RangeArr(pos, rangeFrom, size) extends Arr(pos, null) with overrides for value, eval, asLazyArray, reversed
  • Arr.range() factory now returns new RangeArr(...) instead of mutating a plain Arr
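The shape of the extraction can be sketched as follows. This is a minimal illustration of the subclass pattern only, with simplified names and element types; the real Val.Arr stores lazy values and has a different API:

```scala
// Sketch: the base class keeps only the fields every array needs; the
// lazily materialized range state lives on a subclass (names illustrative).
class Arr(protected var elems: Array[Int]) {
  def length: Int = elems.length
  def value(i: Int): Int = elems(i)
}

final class RangeArr(val rangeFrom: Int, val size: Int) extends Arr(null) {
  // Materialize once, then delegate to the parent representation.
  private def materialize(): Unit =
    if (elems == null) elems = Array.tabulate(size)(i => rangeFrom + i)
  override def length: Int = size // length is known without materializing
  override def value(i: Int): Int = { materialize(); super.value(i) }
}
```

Plain arrays pay nothing for the range fields, while range arrays defer allocation until an element is actually read.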

Benchmark Results

JMH (JVM, Scala 3.3.7)

No regression on any benchmark. Key results:

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|-----------|----------------|---------------|--------|
| comparison | 0.028 | 0.029 | ≈ same |
| reverse | 6.909 | 6.912 | ≈ same |
| bench.06 | 2.117 | 2.097 | ≈ same |
| realistic2 | 60.2 | 60.4 | ≈ same |

Hyperfine (Scala Native vs jrsonnet)

| Benchmark | sjsonnet native | jrsonnet | Ratio |
|-----------|-----------------|----------|-------|
| comparison | 1.2 ms | 7.4 ms | sjsonnet 6.3x faster |
| reverse | 18.1 ms | 24.4 ms | sjsonnet 1.35x faster |

Analysis

Pure refactoring — moves existing range logic into a subclass without algorithmic changes. The virtual dispatch overhead for RangeArr overrides is negligible (confirmed by JMH). The memory savings benefit every non-range Arr instance in the program.

Result

✅ All tests pass (./mill __.test — all platforms × all Scala versions)
✅ No JMH regression
✅ No native benchmark regression
✅ ~9 bytes saved per non-range Arr instance


JMH Benchmark Results (vs master 0d13274)

| Benchmark | Master (ms/op) | This PR (ms/op) | Change |
|-----------|----------------|-----------------|--------|
| assertions | 0.207 | 0.212 | +2.4% (regressed) |
| base64 | 0.156 | 0.154 | -1.3% |
| base64Decode | 0.123 | 0.116 | -5.7% (improved) |
| base64DecodeBytes | 5.899 | 5.790 | -1.8% |
| base64_byte_array | 0.803 | 0.768 | -4.4% (improved) |
| bench.01 | 0.052 | 0.051 | -1.9% |
| bench.02 | 35.401 | 35.096 | -0.9% |
| bench.03 | 9.583 | 9.954 | +3.9% (regressed) |
| bench.04 | 0.122 | 0.119 | -2.5% (improved) |
| bench.06 | 0.224 | 0.217 | -3.1% (improved) |
| bench.07 | 3.332 | 3.177 | -4.7% (improved) |
| bench.08 | 0.038 | 0.039 | +2.6% (regressed) |
| bench.09 | 0.041 | 0.042 | +2.4% (regressed) |
| comparison | 0.028 | 0.028 | +0.0% |
| comparison2 | 18.681 | 17.886 | -4.3% (improved) |
| escapeStringJson | 0.032 | 0.031 | -3.1% (improved) |
| foldl | 0.077 | 0.075 | -2.6% (improved) |
| gen_big_object | 0.918 | 0.953 | +3.8% (regressed) |
| large_string_join | 0.555 | 0.512 | -7.7% (improved) |
| large_string_template | 1.600 | 1.639 | +2.4% (regressed) |
| lstripChars | 0.113 | 0.114 | +0.9% |
| manifestJsonEx | 0.052 | 0.051 | -1.9% |
| manifestTomlEx | 0.069 | 0.070 | +1.4% |
| manifestYamlDoc | 0.055 | 0.058 | +5.5% (regressed) |
| member | 0.656 | 0.637 | -2.9% (improved) |
| parseInt | 0.032 | 0.032 | +0.0% |
| realistic1 | 1.661 | 1.757 | +5.8% (regressed) |
| realistic2 | 57.541 | 56.998 | -0.9% |
| reverse | 6.717 | 7.172 | +6.8% (regressed) |
| rstripChars | 0.119 | 0.123 | +3.4% (regressed) |
| setDiff | 0.431 | 0.413 | -4.2% (improved) |
| setInter | 0.371 | 0.372 | +0.3% |
| setUnion | 0.604 | 0.605 | +0.2% |
| stripChars | 0.117 | 0.117 | +0.0% |
| substr | 0.057 | 0.057 | +0.0% |

Summary: 11 improvements, 10 regressions, 14 neutral
Platform: Apple Silicon, JMH single-shot avg

Move lazy range state (_isRange, _rangeFrom) from inline fields in Arr
to a dedicated RangeArr subclass. This saves ~9 bytes per non-range Arr
instance (boolean + int + alignment padding), benefiting the common case
where arrays are not created via std.range.

Key changes:
- Arr is no longer final; RangeArr extends Arr with range-specific fields
- arr and _length visibility widened to private[Val] for subclass access
- isConcatView made final for Scala 2.x @inline compatibility
- Range-specific branches removed from Arr.value/eval/asLazyArray/reversed
- RangeArr overrides value/eval/asLazyArray/reversed with range logic
- Arr.range() factory now returns RangeArr instances

Upstream: follow-up to databricks#771 (lazy range arrays)
@He-Pin He-Pin marked this pull request as ready for review April 12, 2026 17:16
Comment thread sjsonnet/src/sjsonnet/Val.scala
@He-Pin He-Pin marked this pull request as draft April 12, 2026 18:57
@He-Pin He-Pin marked this pull request as ready for review April 13, 2026 10:12
@He-Pin He-Pin marked this pull request as draft April 13, 2026 11:56
@He-Pin He-Pin closed this Apr 13, 2026
stephenamar-db pushed a commit that referenced this pull request Apr 13, 2026
## Motivation

On Scala Native, `java.util.Base64` is a pure-Scala implementation that
uses Wrapper objects, `@tailrec` recursive `iterate()`, and per-byte
pattern matching — significantly slower than HotSpot's intrinsic-backed
implementation.

Beyond the raw codec, `base64DecodeBytes` was creating `Array[Eval](N)`
and filling each slot with `Val.cachedNum` — N allocations for an N-byte
decode. The materializer then needed per-element type dispatch to render
these arrays. And `base64` encode output (guaranteed ASCII-safe) was
still being scanned for JSON escape characters. `Val.Arr` carried inline
`_isRange`/`_byteData` fields that bloated every regular array instance
(~13 bytes wasted per non-specialized array).

## Modification

### 1. Platform-agnostic `FastBase64` encoder/decoder
- `ENCODE_TABLE` (char[64]) and `DECODE_TABLE` (int[256]) pre-computed
lookup tables
- `encodeString()`: ASCII fast path does direct char→char encoding
without intermediate `byte[]`
- `decodeToString()` / `decodeToBytes()`: Direct string→bytes via lookup
table
- ISO-8859-1 compatibility: chars > 0xFF → 0x3F ('?') matching
`java.util.Base64` behavior
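As a rough illustration of the table-driven approach, the sketch below builds the two lookup tables and a minimal RFC 4648 encoder. This is a simplified stand-in, not the real `FastBase64`, which additionally handles the ASCII fast path, buffer reuse, and the > 0xFF substitution:

```scala
// Pre-computed lookup tables (sketch of the ENCODE_TABLE/DECODE_TABLE idea).
val ENCODE_TABLE: Array[Char] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray

// 256-entry inverse table: -1 marks characters outside the alphabet.
val DECODE_TABLE: Array[Int] = {
  val t = Array.fill(256)(-1)
  for (i <- ENCODE_TABLE.indices) t(ENCODE_TABLE(i)) = i
  t
}

def encode(in: Array[Byte]): String = {
  val sb = new StringBuilder
  var i = 0
  while (i + 2 < in.length) { // full 3-byte groups -> 4 output chars
    val n = ((in(i) & 0xff) << 16) | ((in(i + 1) & 0xff) << 8) | (in(i + 2) & 0xff)
    sb.append(ENCODE_TABLE(n >>> 18)).append(ENCODE_TABLE((n >>> 12) & 63))
      .append(ENCODE_TABLE((n >>> 6) & 63)).append(ENCODE_TABLE(n & 63))
    i += 3
  }
  (in.length - i) match { // trailing 1 or 2 bytes, '=' padded
    case 1 =>
      val n = (in(i) & 0xff) << 16
      sb.append(ENCODE_TABLE(n >>> 18)).append(ENCODE_TABLE((n >>> 12) & 63)).append("==")
    case 2 =>
      val n = ((in(i) & 0xff) << 16) | ((in(i + 1) & 0xff) << 8)
      sb.append(ENCODE_TABLE(n >>> 18)).append(ENCODE_TABLE((n >>> 12) & 63))
        .append(ENCODE_TABLE((n >>> 6) & 63)).append('=')
    case _ => ()
  }
  sb.toString
}
```

The point of the tables is that both directions become branch-free array indexing per character, which is what the SIMD paths then vectorize.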

### 2. C FFI SIMD base64 for Scala Native (`sjsonnet_base64.c`)
- **AArch64 NEON**: `vld3`/`vst4` interleaved load/store + `vqtbl4q`
64-byte lookup for encode; `vbslq`/`vmovl_u8`/`vmovn_u16` for byte↔char
widening/narrowing
- **x86_64**: SSSE3/AVX2/AVX-512 VBMI paths via
`pshufb`/`vpshufb`/`vpermi2b`
- **Fallback**: Scalar with loop unrolling for other architectures
- `sjsonnet_base64_decode_validated()`: Single-pass validation + decode
with specific error codes
- RFC 4648 compliant with '=' padding
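The validate-while-decoding idea (which `sjsonnet_base64_decode_validated()` implements in C with SIMD and specific error codes) can be sketched in scalar Scala. This version is illustrative only and collapses all error codes into `None`:

```scala
// Inverse lookup table: -1 marks characters outside the base64 alphabet.
val DECODE_TABLE: Array[Int] = {
  val alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
  val t = Array.fill(256)(-1)
  for (i <- alphabet.indices) t(alphabet.charAt(i)) = i
  t
}

// Single-pass decode: validate each character while accumulating bits.
def decode(s: String): Option[Array[Byte]] = {
  val in = s.takeWhile(_ != '=') // strip trailing '=' padding
  val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
  var acc = 0; var bits = 0; var i = 0; var ok = true
  while (ok && i < in.length) {
    val c = in.charAt(i)
    val v = if (c > 0xff) -1 else DECODE_TABLE(c)
    if (v < 0) ok = false // invalid character: fail fast
    else {
      acc = (acc << 6) | v
      bits += 6
      if (bits >= 8) { bits -= 8; out += ((acc >>> bits) & 0xff).toByte }
    }
    i += 1
  }
  if (ok) Some(out.toArray) else None
}
```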

### 3. Native-specific optimizations
- Reusable module-level buffers (safe: Scala Native is single-threaded)
— eliminates per-call array allocations
- ASCII fast-path in `encodeString`: skip UTF-8 encoding for pure ASCII
strings
- Direct char array construction instead of charset lookup

### 4. `RangeArr` and `ByteArr` subclasses of `Val.Arr`
- `Val.Arr` changed from `final class` to non-final `class`, enabling
specialization
- **`RangeArr extends Arr`**: Lazy integer range — keeps `rangeFrom`
field out of regular arrays, saving ~9 bytes per non-range array (merges
#772)
- **`ByteArr extends Arr`**: Compact `Array[Byte]` backing store for
0–255 integer arrays
- `byteData` is an immutable `val` — never cleared after
materialization, guaranteeing `rawBytes` is always non-null for safe
multi-use
- `reversed()` materializes first to keep `value()`/`eval()` simple and
avoid reversed-index bugs
- `rawBytes` accessor enables zero-copy fast paths in `base64` encode
and materializer
- Callers use pattern match (`case ba: Val.ByteArr =>`) instead of
null-returning `rawBytes` on base class
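A minimal sketch of the byte-backed representation (names illustrative; the real `ByteArr` extends `Val.Arr` and produces lazy `Val` elements):

```scala
// Illustrative byte-backed array: a compact Array[Byte] store for arrays
// of 0-255 integers. rawBytes is an immutable val that is never cleared,
// so zero-copy consumers (e.g. a base64 encoder) can read it repeatedly.
final class ByteArr(val rawBytes: Array[Byte]) {
  def length: Int = rawBytes.length
  // Bytes are signed on the JVM; mask to recover the 0-255 element value.
  def value(i: Int): Int = rawBytes(i) & 0xff
  // reversed() copies into a fresh array rather than tracking a reversed
  // flag, keeping value() free of index arithmetic (mirrors the PR's choice).
  def reversed: ByteArr = new ByteArr(rawBytes.reverse)
}
```

One byte per element replaces one boxed lazy reference per element, which is where the footprint win for 0-255 integer arrays comes from.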

### 5. Materializer fast-path for byte arrays
- Recursive, iterative, and fused ByteRenderer paths all detect
`ByteArr` via pattern match
- Skip `value(i)` lookup + type dispatch + `asDouble` conversion
- Directly emit `visitFloat64((bytes(i) & 0xff).toDouble)` in a tight
loop
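The dispatch shape described above can be sketched as follows; the types and the `visitFloat64` callback are stand-ins for the real materializer and visitor API:

```scala
// Sketch: the materializer pattern-matches the byte-backed case and emits
// numbers from raw bytes in a tight loop, skipping per-element value lookup
// and type dispatch (types illustrative).
sealed trait Arr
final case class PlainArr(values: Array[Double]) extends Arr
final case class ByteArr(rawBytes: Array[Byte]) extends Arr

def materializeNumbers(a: Arr, visitFloat64: Double => Unit): Unit = a match {
  case ByteArr(bytes) =>
    var i = 0
    while (i < bytes.length) {
      visitFloat64((bytes(i) & 0xff).toDouble) // direct emit, no dispatch
      i += 1
    }
  case PlainArr(values) =>
    // Generic path: the real code does per-element type dispatch here.
    values.foreach(visitFloat64)
}
```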

### 6. ASCII-safe string rendering
- `Val.Str._asciiSafe` flag marks strings known to contain only
printable ASCII (no JSON escaping needed)
- `Val.Str.asciiSafe(pos, s)` factory for creating flagged strings
- `BaseByteRenderer.renderAsciiSafeString()` skips SWAR escape scanning
and UTF-8 encoding — writes bytes directly from chars
- `base64` encode output is marked as ASCII-safe since base64 alphabet
is `[A-Za-z0-9+/=]`
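The shortcut rests on a simple invariant, sketched below with illustrative names (the real renderer sets the flag at construction time rather than re-scanning, and its slow path does full SWAR-scanned JSON escaping):

```scala
// A string is "ASCII-safe" if it contains only printable ASCII and none of
// the characters JSON requires escaping, so its chars can be written as
// bytes directly with no escape scan and no UTF-8 encoder.
def isAsciiSafe(s: String): Boolean =
  s.forall(c => c >= 0x20 && c <= 0x7e && c != '"' && c != '\\')

def renderString(s: String, out: java.io.ByteArrayOutputStream): Unit =
  if (isAsciiSafe(s)) {
    out.write('"')
    var i = 0
    while (i < s.length) { out.write(s.charAt(i).toInt); i += 1 }
    out.write('"')
  } else {
    // Simplistic slow-path stand-in: escape backslash and quote, then
    // UTF-8 encode. The real renderer does full JSON escaping.
    val escaped = s.replace("\\", "\\\\").replace("\"", "\\\"")
    out.write(("\"" + escaped + "\"").getBytes("UTF-8"))
  }
```

Base64 output always satisfies the predicate, since its alphabet is `[A-Za-z0-9+/=]`, so flagging it at creation avoids the scan entirely.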

### 7. `EncodingModule` updates
- `base64DecodeBytes`: Uses `Val.Arr.fromBytes(pos, decoded)` — one
allocation instead of N
- `base64` encode: Pattern matches `ByteArr` for zero-copy bypass;
output marked `asciiSafe`

## Benchmark Results

### JMH (JVM, Scala 3.3.7, Apple Silicon M4 Max)

| Benchmark | Master (ms/op) | PR (ms/op) | Change |
|-----------|---------------|------------|--------|
| base64 | 0.153 | 0.145 | **-5.2%** |
| base64Decode | 0.117 | 0.115 | -1.7% |
| base64DecodeBytes | 5.692 | 5.109 | **-10.2%** |
| base64_byte_array | 0.757 | 0.758 | ~same |
| base64_stress | — | 0.188 | (new) |

### Scala Native (hyperfine -N, 30 runs, Apple Silicon M4 Max)

Compared against jrsonnet **0.5.0-pre98** (built from source, `cargo build --release`).

| Benchmark | sjsonnet master | sjsonnet PR | jrsonnet 0.5.0 | PR vs master | PR vs jrsonnet |
|-----------|-----------------|-------------|----------------|--------------|----------------|
| base64 | 8.7ms | 6.5ms | 4.4ms | **1.34× faster** | 1.47× slower |
| base64Decode | 7.3ms | 6.8ms | 4.3ms | 1.07× faster | 1.60× slower |
| base64DecodeBytes | 28.7ms | 13.5ms | 20.1ms | **2.13× faster** | **1.50× faster** |
| base64_byte_array | 10.5ms | 8.5ms | 17.3ms | **1.23× faster** | **2.02× faster** |
| base64_stress | 6.6ms | 6.3ms | 5.0ms | ~same | 1.28× slower |

**Compute-heavy benchmarks** (`base64DecodeBytes`, `base64_byte_array`):
sjsonnet significantly outperforms jrsonnet — 1.50× and 2.02× faster
respectively.

**Small benchmarks** (`base64`, `base64Decode`, `base64_stress`):
jrsonnet is faster due to lower startup overhead (~3ms vs ~5ms). The
actual base64 computation time is comparable; the gap is dominated by
process startup.

## Files Changed

| File | Change |
|------|--------|
| `sjsonnet/src/sjsonnet/Val.scala` | `Arr` non-final, `RangeArr` + `ByteArr` subclasses, `_asciiSafe` flag, `asciiSafe` factory |
| `sjsonnet/src/sjsonnet/Materializer.scala` | ByteArr pattern-match fast path in recursive + iterative paths |
| `sjsonnet/src/sjsonnet/ByteRenderer.scala` | ByteArr fast path in fused materializer + ASCII-safe string dispatch |
| `sjsonnet/src/sjsonnet/BaseByteRenderer.scala` | `renderAsciiSafeString()` for escape-free rendering |
| `sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala` | `fromBytes` for DecodeBytes, ByteArr match for encode, `asciiSafe` for output |
| `sjsonnet/src-js/sjsonnet/stdlib/FastBase64.scala` | Pure Scala implementation (JS/WASM) |
| `sjsonnet/src-jvm/sjsonnet/stdlib/FastBase64.scala` | Delegates to `java.util.Base64` (unchanged behavior) |
| `sjsonnet/src-native/sjsonnet/stdlib/FastBase64.scala` | C FFI wrappers + buffer reuse + ASCII fast paths |
| `sjsonnet/resources/scala-native/sjsonnet_base64.c` | SIMD C implementation (NEON/SSSE3/AVX2/AVX-512 + scalar fallback) |
| `sjsonnet/test/resources/new_test_suite/byte_arr_correctness.jsonnet` | Regression tests for ByteArr (multi-use, reverse, concat, round-trip) |
| `sjsonnet/test/resources/new_test_suite/range_arr_correctness.jsonnet` | Regression tests for RangeArr correctness |
| `bench/resources/go_suite/base64_stress.jsonnet` | New benchmark for mixed encode/decode stress test |

## Result

- base64DecodeBytes **2.13× faster** than master, **1.50× faster** than
jrsonnet 0.5.0
- base64_byte_array **2.02× faster** than jrsonnet 0.5.0
- JVM base64DecodeBytes improved **10.2%** vs master
- All JVM, JS, and Native tests pass