
Sjsonnet performance improvements #117

Merged (52 commits) on Apr 8, 2021

Conversation

@szeiger (Collaborator) commented Apr 6, 2021

This is the collection of changes that @ahirreddy and I made to improve Sjsonnet performance.

  • I added an sbt build for building on Scala 2.13 JVM only, in order to simplify testing and benchmarking. This includes a JMH benchmark (which relies on Databricks-internal Jsonnet code for the numbers I'm quoting here; YMMV when running it with other test sources).
  • Acyclic is disabled in the Mill build because it caused a compiler crash.
  • I left the full set of commits; we may or may not want to squash them when merging, but they might come in handy if any problems show up afterwards.
  • The changes touch most of the code; we left no stone unturned.

Performance baseline for me was Ahir's branch (which should already be 1/3 faster than the current release) running on Zulu 11:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  2592.522 ± 46.356  ms/op

Switching to OpenJDK 16:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  2113.235 ± 80.190  ms/op

And my various improvements on top of that:

after apply / func simplification:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1940.899 ± 23.091  ms/op

after pos cleanup:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1864.302 ± 16.638  ms/op

scopes with arrays:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1778.695 ± 32.388  ms/op

scope option unwrapping:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1695.916 ± 16.999  ms/op

array in arr:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1607.315 ± 21.366  ms/op

improve scope handling:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1572.612 ± 25.555  ms/op

remove filescope threading:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1540.017 ± 14.426  ms/op

simplify:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1529.828 ± 11.489  ms/op

Val$Obj optimization:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1510.550 ± 13.998  ms/op

Stricter Val$Func:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1450.644 ± 21.938  ms/op

Fast path for simple function calls:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1324.096 ± 12.077  ms/op

Cache visible keys:

[info] Benchmark           Mode  Cnt     Score   Error  Units
[info] MainBenchmark.main  avgt   20  1151.553 ± 9.346  ms/op

Simplify scope handling:

[info] Benchmark           Mode  Cnt     Score    Error  Units
[info] MainBenchmark.main  avgt   20  1036.262 ± 15.039  ms/op

Val literals:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt   20  996.629 ± 9.548  ms/op

Remove Parened:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  985.715 ± 10.770  ms/op

Java LinkedHashMap:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt   20  977.546 ± 7.460  ms/op

Improved BinaryOp handling:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  966.175 ± 12.388  ms/op

Remove more Options:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  965.484 ± 15.138  ms/op

Improve Std:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  809.635 ± 10.877  ms/op

Equality checks without Materializer:

[info] Benchmark           Mode  Cnt    Score    Error  Units
[info] MainBenchmark.main  avgt   20  789.823 ± 11.598  ms/op

Simplify member handling:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  785.889 ± 5.099  ms/op

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  778.887 ± 4.786  ms/op

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  768.794 ± 3.908  ms/op

Improved equality:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  763.693 ± 3.704  ms/op

Optimize super handling:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  754.915 ± 7.295  ms/op

Static objects:

[info] Benchmark           Mode  Cnt    Score   Error  Units
[info] MainBenchmark.main  avgt  160  751.757 ± 6.697  ms/op

Hottest methods in the final state:
[Screenshot: profiler view of the hottest methods]

High-level overview of important changes:

  • In most cases, anything that removes allocations and indirections is faster. Obvious candidates for removal are tuples, Options and boxed primitives. For example, an Array[(Int, Option[T])] carries an extra Tuple2, an extra boxed Integer and potentially an extra Some per element. Replacing it with two separate arrays, Array[Int] and Array[T] (using null instead of None), is always going to be faster. There is not even a locality argument in favor of the tupled version, because the tuple only contains references; overall locality is worse (in the new version at least the Ints are colocated). Many commits in this PR are concerned with removing more and more unnecessary objects. A two-array sketch follows after this list.
  • Picking the fastest collection types between Scala and Java. We're now using a mishmash of Java and Scala types for performance reasons: java.util.BitSet and java.util.LinkedHashMap but scala.collection.mutable.HashMap. The choice is easier for sequence types: Just use plain arrays.
  • Prefer primitive array iteration (plain index-based while loops) over higher-order functions like map. The scalac optimizer could inline all of these calls, but we can't allow inlining from the standard library because of binary compatibility requirements, so we have to do the rewriting manually; a before/after sketch follows after this list.
  • Improved position handling. This also falls under removing allocations. Previously, positions were tracked in the parser as Int offsets into the source file and combined with file names at each evaluation to create a Position object. An AST node can be evaluated many times, so it is better to create the Position objects directly in the parser. This is also more correct: an offset is always tied to a specific file.
  • Rewriting Lazy. The old approach relied on the usual Scala mechanisms:
    class Lazy(f: => Val) { lazy val force: Val = f }
    
    This has three potential performance issues: we have to allocate two objects every time (a subclass of Function0 for the thunk, plus the actual Lazy), the lazy val requires a separate Boolean flag to keep track of the state, and lazy val uses thread-safe double-checked locking. We can cut the allocations in half by making Lazy a SAM type, handle the state ourselves by encoding a missing value as null (we know that f can never return null), and omit the locking (we don't need thread-safety). A sketch of the rewritten version follows after this list.
  • Let dispatch work for you, not against you. Virtual calls should target class types rather than interface types wherever possible, and matching on types should always use effectively final class types. Big match expressions like the ones in visitExpr and visitBinaryOp are very efficient: the Scala compiler removes the tuples and you end up with a nice table dispatch on final class types. We avoid separate handling of special cases (like the short-circuiting && and || operators) because a single table-based dispatch is faster.
  • Add fast paths. Function calls don't need to be validated if the number of arguments matches the number of parameters and there are no named arguments. The scope doesn't even have to be populated with potential default values because we already know they are not going to be used. This makes a large number of function calls vastly faster.
  • Allow the parser to produce literal Vals. There is no need to have the parser emit, for example, an Expr.Str(s), only to get it turned into a Val.Str(s) immediately by the evaluator. All literals are emitted directly as Vals and the evaluator can skip right over them.
  • Simplify the AST. For example, Expr.Parened(e) was used for an expression in parentheses. This information is not needed for evaluating the expression. The result is the same as evaluating e directly, so we don't even have to produce the node first.
  • Improve standard library abstractions. The Jsonnet standard library uses a set of builtin methods to avoid some boilerplate. These had a common builtin0 abstraction for arbitrary arities, which required wrapping values in arrays. We only need three different arities, and avoiding this abstraction actually makes the code simpler (in addition to removing the unnecessary arrays). For the most common functions we no longer use builtin at all; a direct implementation with a bit more boilerplate is even faster.
  • Don't materialize to JSON only to check equality of two values. We can at least avoid the unnecessary allocations with a dedicated equality check method, and if the values are not equal, we can shortcut the evaluation.
  • Optimize common standard library functions. For example, splitting a string on a single character used to quote the character into a regular expression, which then got compiled just so the regex engine could do the splitting. This can be implemented much more efficiently; see the sketch after this list.
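
To make the unboxing point concrete, here is a minimal sketch of the two-array rewrite. It is illustrative only; the class and method names are invented and do not appear in Sjsonnet:

    // Tupled version: every element drags along a Tuple2, a boxed Integer
    // and possibly a Some.
    class BoxedBindings[T](entries: Array[(Int, Option[T])]) {
      def id(i: Int): Int = entries(i)._1
      def value(i: Int): Option[T] = entries(i)._2
    }

    // Unboxed version: one primitive Int array plus one reference array,
    // with null standing in for None.
    class UnboxedBindings[T <: AnyRef](ids: Array[Int], values: Array[T]) {
      def id(i: Int): Int = ids(i)       // no boxing, no tuple indirection
      def value(i: Int): T = values(i)   // null means "absent"
    }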
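
And for the higher-order-function point, the kind of manual rewrite applied throughout, shown on a toy example:

    // Collection combinator: concise, but it goes through a closure and
    // generic builder machinery.
    def incrementAll(xs: Array[Int]): Array[Int] = xs.map(_ + 1)

    // Manual rewrite: a plain index-based while loop over a primitive array,
    // with no closures and no intermediate allocations.
    def incrementAllFast(xs: Array[Int]): Array[Int] = {
      val out = new Array[Int](xs.length)
      var i = 0
      while (i < xs.length) {
        out(i) = xs(i) + 1
        i += 1
      }
      out
    }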
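
The rewritten Lazy looks roughly like the following. This is a sketch of the approach, not the exact Sjsonnet class:

    sealed trait Val  // stand-in for Sjsonnet's value type

    // Lazy as a single-abstract-method type with hand-rolled, non-thread-safe
    // memoization. compute() never returns null, so null itself encodes
    // "not yet forced": no separate Boolean flag, no double-checked locking,
    // and only one allocation per lazy value at the call site.
    abstract class Lazy {
      def compute(): Val
      private[this] var cached: Val = null
      final def force: Val = {
        if (cached == null) cached = compute()
        cached
      }
    }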
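
Finally, a character-based split that avoids the regex engine entirely could look like this (the function name is invented and the actual implementation in this PR may differ):

    // Split on a single character without quoting it into a regex and
    // compiling a Pattern for every call.
    def splitOnChar(s: String, sep: Char): Array[String] = {
      val out = scala.collection.mutable.ArrayBuffer.empty[String]
      var start = 0
      var i = 0
      while (i < s.length) {
        if (s.charAt(i) == sep) {
          out += s.substring(start, i)
          start = i + 1
        }
        i += 1
      }
      out += s.substring(start)  // trailing segment (may be empty)
      out.toArray
    }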

@szeiger (Collaborator, Author) commented Apr 6, 2021

I've run some tests with bin/jsonnet_output in universe. I already added a fix for the parse cache; there are one or two small bugs remaining before the output is identical.

@szeiger (Collaborator, Author) commented Apr 7, 2021

Added yet another bug fix. bin/jsonnet_output in universe now produces no differences with the new version.

@lihaoyi-databricks (Contributor) left a comment

@szeiger I read through all the commits. I don't really have any code-level feedback, but I trust that the test suites (both in this repo, and in our work codebase using Sjsonnet) would be enough to catch any correctness issues (including on the Scala.js and Scala.Native backends), and I trust that the benchmarks are sufficient to ensure the performance improvement is as stated.

Some high-level feedback:

  1. Could you write up a few-paragraph summary of the major things that happened in this PR as part of its description? e.g. "unboxed options", "convert collection combinators to while-loops", "re-use literal AST nodes", "replace bitsets", "avoid position allocations", "optimize std", and so on, along with some explanation of why each change was helpful. That would definitely help future reviewers see at a glance what your high-level approach was, vs. having to reconstruct your intentions from the line diff like I did when reviewing this.

  2. I noticed you turned on -opt in the SBT build. Should we turn it on in the Mill build? Or should we only publish from the SBT build?

  3. If we're publishing artifacts compiled with -opt enabled, that will break Scala binary compatibility, right? Would that cause an issue given that we're still using 2.13.3 in our work code, but this repo is on 2.13.4?

  4. On a similar note, should we publish both optimized (-opt) and non-optimized artifacts? There are folks using Sjsonnet as a library rather than an executable, and they may be on Scala point versions different from what we use at Databricks.

@szeiger (Collaborator, Author) commented Apr 7, 2021

Oh, I thought Ahir already added the optimizer options to the Mill build. Maybe that's why it wasn't as fast in universe as I expected? I never benchmarked without optimizations. There's no need to publish two versions or to turn optimization off, but we should only inline from our own codebase. (At this point there aren't many standard library calls left that are worth inlining anyway.) This will not affect binary compatibility in any way.
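
For reference, restricting the inliner to the project's own code is done with scalac's -opt-inline-from patterns; the following is an illustrative sbt snippet and the exact settings in the build may differ:

    // Enable the Scala 2.13 optimizer but only allow inlining from sjsonnet's
    // own packages, so nothing from the standard library (or any other
    // dependency) gets baked into the published artifacts.
    scalacOptions ++= Seq(
      "-opt:l:inline",
      "-opt-inline-from:sjsonnet.**"
    )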

@szeiger (Collaborator, Author) commented Apr 7, 2021

I added the optimizer settings to the Mill build. It doesn't make a big difference to performance.

@lihaoyi-databricks (Contributor):

If it doesn't make much difference, let's leave it out. That would simplify things and avoid questions about binary compatibility.

@szeiger (Collaborator, Author) commented Apr 7, 2021

It doesn't cause any binary compatibility issues. We're only inlining from sjsonnet, nothing else.

@lihaoyi-databricks (Contributor):

Oh got it

@szeiger merged commit 781a5b6 into databricks:master on Apr 8, 2021