perf: optimize sort allocation paths in std.sort and set operations #752

Merged
stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/primitive-double-sort
Apr 11, 2026

He-Pin commented Apr 11, 2026

Motivation

std.sort and set operations (std.setDiff, std.setInter, std.setUnion) internally sort arrays but use allocation-heavy patterns: .map() for forcing lazy values, .map().sortBy() creating intermediate copies, and allocating a new Array(1) per key function call.

Key Design Decisions

  • Keep in-place Comparator sort for numerics rather than primitive double sort + reconstruction. While primitive Arrays.sort(double[]) is faster on JVM, the Val.cachedNum reconstruction step creates GC pressure on Scala Native (measured 1.26x regression). In-place Comparator sort avoids any reconstruction.
  • Reuse the argument buffer for key function calls: a single one-element Array[Val] is shared across all iterations instead of allocating Array(v.value) per element.
  • While-loops instead of .map(): eliminates closure allocation, iterator overhead, and intermediate array copies.
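
The buffer-reuse and while-loop decisions can be illustrated with a minimal sketch. This is not the sjsonnet code: `ArgBufSketch`, `KeyFn`, and both method names are hypothetical, and plain `String`/`Int` stand in for `Val` and the key type. It also assumes the key function never retains a reference to its argument array, which is what makes sharing one buffer safe.

```scala
object ArgBufSketch {
  // Hypothetical stand-in for a key function that receives its arguments as an array.
  type KeyFn = Array[String] => Int

  // Baseline: one closure plus one fresh single-element array per element.
  def keysWithMap(vs: Array[String], f: KeyFn): Array[Int] =
    vs.map(v => f(Array(v)))

  // Buffer reuse: one argument array for the whole loop, overwritten on each iteration.
  // Only safe if f does not keep a reference to the array after it returns.
  def keysWithSharedBuf(vs: Array[String], f: KeyFn): Array[Int] = {
    val argBuf = new Array[String](1)   // allocated once
    val out = new Array[Int](vs.length) // pre-allocated result
    var i = 0
    while (i < vs.length) {
      argBuf(0) = vs(i)
      out(i) = f(argBuf)
      i += 1
    }
    out
  }
}
```

Both variants compute the same keys; the second performs two array allocations in total rather than one per element.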

Modifications

sjsonnet/src/sjsonnet/stdlib/SetModule.scala:

  1. Key function path: Reuse single-element argBuf across all key function calls, use while-loop for key computation
  2. Result construction: Pre-allocated Array[Eval] with while-loop instead of sortedIndices.map(i => vs(i))
  3. Strict force: while loop with pre-allocated Array[Val] instead of vs.map(_.value)
  4. String sort (no key): In-place Arrays.sort with Comparator instead of .map(_.cast[Val.Str]).sortBy(_.asString) (2 intermediate array copies)
  5. Array sort (no key): In-place Arrays.sort with Comparator instead of .map(_.cast[Val.Arr]).sortBy(identity) (2 intermediate copies)
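
As a rough illustration of items 4 and 5, the sketch below contrasts a `.map(...).sortBy(...)` pipeline with an in-place `java.util.Arrays.sort` driven by a `Comparator`. The `Value` and `Str` types and the `InPlaceSortSketch` object are hypothetical stand-ins for `Val` and `Val.Str`, and `String.compareTo` is only a placeholder for whatever ordering the real code applies.

```scala
import java.util.{Arrays, Comparator}

object InPlaceSortSketch {
  // Hypothetical stand-ins for Val and Val.Str.
  sealed trait Value
  final case class Str(asString: String) extends Value

  // Copying variant: map allocates one array, sortBy allocates another.
  def sortCopying(vs: Array[Value]): Array[Str] =
    vs.map(_.asInstanceOf[Str]).sortBy(_.asString)

  // Comparator that casts at comparison time, so no pre-cast array is needed.
  private val byString: Comparator[Value] = new Comparator[Value] {
    def compare(a: Value, b: Value): Int =
      a.asInstanceOf[Str].asString.compareTo(b.asInstanceOf[Str].asString)
  }

  // In-place variant: sorts the given array directly and returns it.
  def sortInPlace(vs: Array[Value]): Array[Value] = {
    Arrays.sort(vs, byString)
    vs
  }
}
```

The in-place variant mutates its argument, so it only fits where the caller owns the array, for example the freshly forced array produced by the strict-force step.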

Benchmark Results

JMH (JVM, single iteration)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|---|---|---|---|
| bench.06 (sort) | 0.359 | 0.251 | -30.1% |
| setDiff | 0.533 | 0.446 | -16.3% |
| setInter | 0.367 | 0.386 | neutral |
| setUnion | 0.727 | 0.677 | -6.9% |

Scala Native hyperfine (-N --warmup 10)

| Benchmark | Before (ms) | After (ms) | Speedup |
|---|---|---|---|
| bench.06 (sort, 30 runs) | 7.6 ± 0.4 | 5.5 ± 0.2 | 1.39x |
| setDiff (20 runs) | 8.8 ± 0.6 | 7.7 ± 0.6 | 1.13x |
| setInter (20 runs) | 8.6 ± 1.3 | 8.3 ± 0.8 | neutral |

Analysis

The allocation reduction benefits both JVM and Native, but the impact is more pronounced on Native, where GC overhead is higher. The sort benchmark sees the largest improvement because it exercises all the optimized paths (force + sort + result construction). Set operations improve only moderately because the merge-based intersection/difference calls sort just once, on already-sorted inputs.
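
For readers unfamiliar with the term, a merge-based intersection walks two sorted inputs with one cursor each, so the sort is a one-time precondition rather than part of the main loop. The sketch below is a generic illustration over `Array[Int]` with hypothetical names; it is not the sjsonnet implementation and assumes both inputs are already sorted ascending and duplicate-free.

```scala
import scala.collection.mutable.ArrayBuffer

object MergeSetSketch {
  // Intersection of two ascending, duplicate-free arrays in O(a.length + b.length).
  def intersectSorted(a: Array[Int], b: Array[Int]): Array[Int] = {
    val out = new ArrayBuffer[Int](math.min(a.length, b.length))
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      if (a(i) < b(j)) i += 1      // a(i) cannot be in b: advance a
      else if (a(i) > b(j)) j += 1 // b(j) cannot be in a: advance b
      else {                       // equal: element is in both
        out += a(i)
        i += 1
        j += 1
      }
    }
    out.toArray
  }
}
```

Because the merge loop is linear and unchanged by this PR, cheaper sorting can only reduce part of a set operation's total cost.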

References

  • Upstream exploration: he-pin/sjsonnet jit branch b1f64df0

Result

Sort and set operations are faster on both JVM and Scala Native with zero semantic changes. All existing tests pass.

He-Pin force-pushed the perf/primitive-double-sort branch from b6906f6 to 42d465a on April 11, 2026 20:14

Reduce allocation overhead in sort and set operations:

1. Key function path: reuse a single-element argument buffer across all
   key function calls, avoiding Array(1) allocation per element.

2. Result construction: use while-loop with pre-allocated array instead
   of sortedIndices.map() to avoid closure + iterator allocation.

3. Strict force: use while-loop instead of vs.map(_.value) to avoid
   closure and intermediate array allocation.

4. String sort (no key): in-place Comparator sort via Arrays.sort instead
   of .map(_.cast[Val.Str]).sortBy(_.asString) which creates two
   intermediate array copies.

5. Array sort (no key): in-place Comparator sort via Arrays.sort instead
   of .map(_.cast[Val.Arr]).sortBy(identity) with intermediate copies.

JMH (bench.06 sort, 1 iteration):
  Before: 0.359 ms/op
  After:  0.251 ms/op  (-30.1%)

Scala Native hyperfine (bench.06, --warmup 10, -N, 30 runs):
  Before: 7.6 ± 0.4 ms
  After:  5.5 ± 0.2 ms  (1.39x faster)

Set operations (native, 20 runs):
  setDiff:  8.8 → 7.7 ms (1.13x faster)
  setInter: 8.6 → 8.3 ms (neutral)

Upstream: he-pin/sjsonnet jit branch b1f64df

He-Pin force-pushed the perf/primitive-double-sort branch from 42d465a to aa7ee14 on April 11, 2026 20:53
He-Pin marked this pull request as ready for review on April 11, 2026 21:47
stephenamar-db merged commit a17ec44 into databricks:master on Apr 11, 2026
4 of 5 checks passed