
Sjsonnet performance improvements #117

Merged (52 commits) on Apr 8, 2021

Conversation

@szeiger (Collaborator) commented Apr 6, 2021

This is the collection of changes that @ahirreddy and I made to improve Sjsonnet performance.

  • I added an sbt build for building on Scala 2.13 JVM only, in order to simplify testing and benchmarking. This includes a JMH benchmark (which relies on Databricks-internal Jsonnet code for the numbers I'm quoting here; YMMV when running it with other test sources).
  • Acyclic is disabled in the Mill build because it caused a compiler crash.
  • I left the full set of commits; we may or may not want to squash them when merging, but they might come in handy if any problems show up afterwards.
  • The changes touch most of the code; we left no stone unturned.

Performance baseline for me was Ahir's branch (which should already be 1/3 faster than the current release) running on Zulu 11:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  2592.522 ± 46.356  ms/op

Switching to OpenJDK 16:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  2113.235 ± 80.190  ms/op

And my various improvements on top of that:

after apply / func simplification:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1940.899 ± 23.091  ms/op

after pos cleanup:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1864.302 ± 16.638  ms/op

scopes with arrays:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1778.695 ± 32.388  ms/op

scope option unwrapping:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1695.916 ± 16.999  ms/op

array in arr:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1607.315 ± 21.366  ms/op

improve scope handling:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1572.612 ± 25.555  ms/op

remove filescope threading:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1540.017 ± 14.426  ms/op

simplify:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1529.828 ± 11.489  ms/op

Val$Obj optimization:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1510.550 ± 13.998  ms/op

Stricter Val$Func:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1450.644 ± 21.938  ms/op

Fast path for simple function calls:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1324.096 ± 12.077  ms/op

Cache visible keys:

[info] Benchmark           Mode  Cnt     Score   Error  Units
[info] MainBenchmark.main  avgt   20  1151.553 ± 9.346  ms/op

Simplify scope handling:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1036.262 ± 15.039  ms/op

Val literals:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt   20  996.629 ± 9.548  ms/op

Remove Parened:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  985.715 ± 10.770  ms/op

Java LinkedHashMap:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt   20  977.546 ± 7.460  ms/op

Improved BinaryOp handling:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  966.175 ± 12.388  ms/op

Remove more Options:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  965.484 ± 15.138  ms/op

Improve Std:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  809.635 ± 10.877  ms/op

Equality checks without Materializer:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  789.823 ± 11.598  ms/op

Simplify member handling:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  785.889 ± 5.099  ms/op

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  778.887 ± 4.786  ms/op

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  768.794 ± 3.908  ms/op

Improved equality:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  763.693 ± 3.704  ms/op

Optimize super handling:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  754.915 ± 7.295  ms/op

Static objects:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  751.757 ± 6.697  ms/op

Hottest methods in the final state:
[Screenshot: profiler view of the hottest methods]

High-level overview of important changes:

  • In most cases, anything that removes allocations and indirections is faster. Obvious candidates for removal are tuples, Options and boxed primitives. For example, an Array[(Int, Option[T])] carries an extra Tuple2, an extra boxed Integer and potentially an extra Some per element. Replacing it with two separate arrays, Array[Int] and Array[T] (using null instead of None), is always going to be faster. There is not even a locality argument in favor of the tupled version, because the tuple only contains references; overall locality is worse (in the new version at least the Ints are colocated). Many commits in this PR are concerned with removing more and more unnecessary objects. A two-array sketch follows after this list.
  • Picking the fastest collection types between Scala and Java. We're now using a mishmash of Java and Scala types for performance reasons: java.util.BitSet and java.util.LinkedHashMap but scala.collection.mutable.HashMap. The choice is easier for sequence types: Just use plain arrays.
  • Prefer primitive array iteration (plain index-based while loops) over higher-order functions like map. The scalac optimizer could inline all of these calls, but we can't allow inlining from the standard library because of binary compatibility requirements, so we have to do the rewriting manually; a before/after sketch follows after this list.
  • Improved position handling. This also falls under removing allocations. Previously, positions were tracked in the parser as Int offsets into the source file and combined with file names at each evaluation to create a Position object. An AST node can be evaluated many times, so it is better to create the Position objects directly in the parser. This is also more correct: an offset is always tied to a specific file.
  • Rewriting Lazy. The old approach relied on the usual Scala mechanisms:
    class Lazy(f: => Val) { lazy val force: Val = f }
    
    This has three potential performance issues: we have to allocate two objects every time (a subclass of Function0 for the thunk, plus the actual Lazy), the lazy val requires a separate Boolean flag to keep track of the state, and lazy val uses thread-safe double-checked locking. We can cut the allocations in half by making Lazy a SAM type, handle the state ourselves by encoding a missing value as null (we know that f can never return null), and omit the locking (we don't need thread-safety). A sketch of the rewritten version follows after this list.
  • Let dispatch work for you, not against you. Virtual calls should target class types rather than interface types wherever possible, and matching on types should always use effectively final class types. Big match expressions like the ones in visitExpr and visitBinaryOp are very efficient: the Scala compiler removes the tuples and you end up with a nice table dispatch on final class types. We avoid separate handling of special cases (like the short-circuiting && and || operators) because a single table-based dispatch is faster.
  • Add fast paths. Function calls don't need to be validated if the number of arguments matches the number of parameters and there are no named arguments. The scope doesn't even have to be populated with potential default values because we already know they are not going to be used. This makes a large number of function calls vastly faster.
  • Allow the parser to produce literal Vals. There is no need to have the parser emit, for example, an Expr.Str(s), only to get it turned into a Val.Str(s) immediately by the evaluator. All literals are emitted directly as Vals and the evaluator can skip right over them.
  • Simplify the AST. For example, Expr.Parened(e) was used for an expression in parentheses. This information is not needed for evaluating the expression. The result is the same as evaluating e directly, so we don't even have to produce the node first.
  • Improve standard library abstractions. The Jsonnet standard library uses a set of builtin methods to avoid some boilerplate. These had a common builtin0 abstraction for arbitrary arities, which required wrapping values in arrays. We only need three different arities, and avoiding this abstraction actually makes the code simpler (in addition to removing the unnecessary arrays). For the most common functions we no longer use builtin at all; a direct implementation with a bit more boilerplate is even faster.
  • Don't materialize to JSON only to check equality of two values. We can at least avoid the unnecessary allocations with a dedicated equality check method, and if the values are not equal, we can shortcut the evaluation.
  • Optimize common standard library functions. For example, splitting a string on a single character used to quote the character into a regular expression, which then got compiled just so the regex engine could do the splitting. This can be implemented much more efficiently; see the sketch after this list.
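
To make the unboxing point concrete, here is a minimal sketch of the two-array rewrite. It is illustrative only; the class and method names are invented and do not appear in Sjsonnet:

    // Tupled version: every element drags along a Tuple2, a boxed Integer
    // and possibly a Some.
    class BoxedBindings[T](entries: Array[(Int, Option[T])]) {
      def id(i: Int): Int = entries(i)._1
      def value(i: Int): Option[T] = entries(i)._2
    }

    // Unboxed version: one primitive Int array plus one reference array,
    // with null standing in for None.
    class UnboxedBindings[T <: AnyRef](ids: Array[Int], values: Array[T]) {
      def id(i: Int): Int = ids(i)       // no boxing, no tuple indirection
      def value(i: Int): T = values(i)   // null means "absent"
    }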
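
And for the higher-order-function point, the kind of manual rewrite applied throughout, shown on a toy example:

    // Collection combinator: concise, but it goes through a closure and
    // generic builder machinery.
    def incrementAll(xs: Array[Int]): Array[Int] = xs.map(_ + 1)

    // Manual rewrite: a plain index-based while loop over a primitive array,
    // with no closures and no intermediate allocations.
    def incrementAllFast(xs: Array[Int]): Array[Int] = {
      val out = new Array[Int](xs.length)
      var i = 0
      while (i < xs.length) {
        out(i) = xs(i) + 1
        i += 1
      }
      out
    }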
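
The rewritten Lazy looks roughly like the following. This is a sketch of the approach, not the exact Sjsonnet class:

    sealed trait Val  // stand-in for Sjsonnet's value type

    // Lazy as a single-abstract-method type with hand-rolled, non-thread-safe
    // memoization. compute() never returns null, so null itself encodes
    // "not yet forced": no separate Boolean flag, no double-checked locking,
    // and only one allocation per lazy value at the call site.
    abstract class Lazy {
      def compute(): Val
      private[this] var cached: Val = null
      final def force: Val = {
        if (cached == null) cached = compute()
        cached
      }
    }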
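
Finally, a character-based split that avoids the regex engine entirely could look like this (the function name is invented and the actual implementation in this PR may differ):

    // Split on a single character without quoting it into a regex and
    // compiling a Pattern for every call.
    def splitOnChar(s: String, sep: Char): Array[String] = {
      val out = scala.collection.mutable.ArrayBuffer.empty[String]
      var start = 0
      var i = 0
      while (i < s.length) {
        if (s.charAt(i) == sep) {
          out += s.substring(start, i)
          start = i + 1
        }
        i += 1
      }
      out += s.substring(start)  // trailing segment (may be empty)
      out.toArray
    }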

@szeiger (Collaborator, Author) commented Apr 6, 2021

I've run some tests with bin/jsonnet_output in universe. I already added a fix for the parse cache; there are one or two small bugs remaining before the output is identical.

@szeiger (Collaborator, Author) commented Apr 7, 2021

Added yet another bug fix. bin/jsonnet_output in universe now produces no differences with the new version.

@lihaoyi-databricks (Contributor) left a comment

@szeiger I read through all the commits. I don't really have any code-level feedback, but I trust that the test suites (both in this repo, and in our work codebase using Sjsonnet) would be enough to catch any correctness issues (including on the Scala.js and Scala.Native backends), and I trust that the benchmarks are sufficient to ensure the performance improvement is as stated.

Some high-level feedback:

  1. Could you write up a few-paragraph summary of the major things that happened in this PR as part of its description? e.g. "unboxed options", "convert collection combinators to while-loops", "re-use literal AST nodes", "replace bitsets", "avoid position allocations", "optimize std", and so on, along with some explanation of why each change was helpful. That would definitely help future reviewers see at a glance what your high-level approach was, vs. having to reconstruct your intentions from the line diff like I did when reviewing this.

  2. I noticed you turned on -opt in the SBT build. Should we turn it on in the Mill build? Or should we only publish from the SBT build?

  3. If we're publishing artifacts compiled with -opt enabled, that will break Scala binary compatibility, right? Would that cause an issue given that we're still using 2.13.3 in our work code, but this repo is on 2.13.4?

  4. On a similar note, should we publish both optimized (-opt) and non-optimized artifacts? There are folks using Sjsonnet as a library rather than an executable, and they may be on Scala point versions different from what we use at Databricks.

@szeiger (Collaborator, Author) commented Apr 7, 2021

Oh, I thought Ahir already added the optimizer options to the Mill build. Maybe that's why it wasn't as fast in universe as I expected? I never benchmarked without optimizations. There's no need to publish two versions or to turn optimization off, but we should only inline from our own codebase. (At this point there aren't many standard library calls left that are worth inlining anyway.) This will not affect binary compatibility in any way.
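
For reference, restricting the inliner to the project's own code is done with scalac's -opt-inline-from patterns; the following is an illustrative sbt snippet and the exact settings in the build may differ:

    // Enable the Scala 2.13 optimizer but only allow inlining from sjsonnet's
    // own packages, so nothing from the standard library (or any other
    // dependency) gets baked into the published artifacts.
    scalacOptions ++= Seq(
      "-opt:l:inline",
      "-opt-inline-from:sjsonnet.**"
    )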

@szeiger (Collaborator, Author) commented Apr 7, 2021

I added the optimizer settings to the Mill build. It doesn't make a big difference to performance.

@lihaoyi-databricks (Contributor):

If it doesn't make much difference, let's leave it out. That would simplify things and avoid questions about binary compatibility.

@szeiger (Collaborator, Author) commented Apr 7, 2021

It doesn't cause any binary compatibility issues. We're only inlining from sjsonnet, nothing else.

@lihaoyi-databricks (Contributor):

Oh got it

@szeiger merged commit 781a5b6 into databricks:master on Apr 8, 2021