Dart AOT performance issue #37455

renatoathaydes · 2019-07-06T10:09:54Z

Dart SDK Version (dart --version)

Dart VM version: 2.4.0 (Unknown timestamp) on "linux_x64"

Whether you are using Windows, MacOSX, or Linux (if applicable)

Linux renato 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Sample code

The code I am using to benchmark different Dart compilation modes is a simple brainfuck implementation.

This is the bf program I am using for benchmarking.

When running with dart as a script, I get the following output:

➜  dart-bf git:(master) ✗ time dart bin/dart-bf.dart bf/example.bf
dart bin/dart-bf.dart bf/example.bf   3,21s  user 0,13s system 112% cpu 2,956 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                144 MB
page faults from disk:     0
other page faults:         39866

With dartaotruntime:

➜  dart-bf git:(master) ✗ dart2aot bin/dart-bf.dart dart-bf.aot
➜  dart-bf git:(master) ✗ time dartaotruntime dart-bf.aot bf/example.bf
dartaotruntime dart-bf.aot bf/example.bf   11,42s  user 0,00s system 99% cpu 11,421 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                15 MB
page faults from disk:     0
other page faults:         2680

With dart --snapshot-kind=kernel:

➜  dart-bf git:(master) ✗ dart --snapshot=main.snapshot --snapshot-kind=kernel bin/dart-bf.dart
➜  dart-bf git:(master) ✗ time dart main.snapshot bf/example.bf
dart main.snapshot bf/example.bf   2,47s  user 0,04s system 106% cpu 2,348 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                51 MB
page faults from disk:     0
other page faults:         10598

Summary:

| Method | User time (secs) | Max memory (MB) |
| --- | --- | --- |
| dart script | 3.21 | 144 |
| dartaotruntime | 11.42 | 15 |
| dart --snapshot-kind=kernel | 2.47 | 51 |

It seems that dartaotruntime's performance is too far behind the alternatives: about 3.5 to 4.6 times slower in user time. This might be because it has no JIT, unlike the other options. Still, I think it would be reasonable to make further optimizations to get more predictable performance.

BTW: the snapshot performance is amazing, pretty close to Go and Java!


mraleph · 2019-07-08T08:02:50Z

Thank you for the detailed bug report and the reproduction.

It is known that, depending on the workload, AOT might exhibit worse (sometimes much worse) performance characteristics than JIT.

I took a quick look and I think there are ultimately two reasons why the AOT version is slower:

1. TFA (type flow analysis) fails to infer the type of Loop.ops (it should be inferred as _GrowableList). (/cc @alexmarkov is this because Program._parseOps is recursive?) As a result we don't inline anything called on ops, which means, for example, that the loop below allocates closures and calls forEach dynamically, making it considerably more expensive than a normal loop. (JIT would actually inline forEach and eliminate the closure allocation.)

  @override
  void call(Tape tape) {
    while (tape.current > 0) {
      ops.forEach((op) => op(tape));
    }
  }

2. AOT does not do polymorphic inlining at op.call, so this call becomes a major performance sink compared to JIT, which does polymorphic inlining. Realistically I think this can only be addressed if we introduce faster dispatch mechanisms (e.g. virtual / interface dispatch tables), because doing polymorphic inlining would go against AOT's goal of keeping code size under control.

/cc @mkustermann
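To make the two points in this comment concrete, here is a minimal sketch rather than the benchmark's actual source. `Tape.cells`, `Tape.pos`, `Tape.inc`, `Inc`, `ForEachLoop`, `IndexedLoop` and `run` are hypothetical names introduced for illustration; only `Tape`, `tape.current`, `ops` and the `forEach` loop shape come from the thread. The second loop class shows the "normal loop" shape that needs no closure, and both loops call `op(tape)` through the `Op` interface, which is the polymorphic call site discussed above.

```dart
// Hypothetical stand-ins for the benchmark's types (not the real source).
class Tape {
  final List<int> cells = List<int>.filled(16, 0);
  int pos = 0;

  int get current => cells[pos];

  void inc(int delta) {
    cells[pos] += delta;
  }
}

abstract class Op {
  void call(Tape tape);
}

class Inc implements Op {
  final int delta;
  Inc(this.delta);

  @override
  void call(Tape tape) {
    tape.inc(delta);
  }
}

// Shape quoted in the issue: a closure is passed to forEach on every
// iteration of the while loop.
class ForEachLoop implements Op {
  final List<Op> ops;
  ForEachLoop(this.ops);

  @override
  void call(Tape tape) {
    while (tape.current > 0) {
      ops.forEach((op) => op(tape));
    }
  }
}

// Same behavior written as a plain indexed loop: no closure, no forEach.
// ops[i](tape) is still a call through the Op interface.
class IndexedLoop implements Op {
  final List<Op> ops;
  IndexedLoop(this.ops);

  @override
  void call(Tape tape) {
    while (tape.current > 0) {
      for (var i = 0; i < ops.length; i++) {
        ops[i](tape);
      }
    }
  }
}

// Runs a loop op on a tape whose current cell starts at 5 and returns
// the final value of that cell.
int run(Op loop) {
  final tape = Tape()..inc(5);
  loop(tape);
  return tape.current;
}

void main() {
  print(run(ForEachLoop([Inc(-1)]))); // prints 0
  print(run(IndexedLoop([Inc(-1)]))); // prints 0
}
```

Both classes compute the same result; whether the indexed form is measurably faster under AOT is not something this thread reports, so treat it as an illustration of the difference in code shape only.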

alexmarkov · 2019-07-08T21:31:39Z

I can confirm that the type of Loop.ops is not inferred because Program._parseOps is recursive. In particular, while analyzing Program._parseOps, the result of the recursive call into Program._parseOps is approximated using its static type and is then passed to the constructor of Loop, where it is used to initialize Loop.ops. So the field gets an approximate type. The approximation for recursive calls is suboptimal, but it allows us to cut down compilation time.

However, in this particular case the result type of Program._parseOps does not depend on the incoming parameters, so it might be possible to use it for recursive calls even without extra iterations of the analysis.
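The situation described above can be sketched as follows. This is a hypothetical reconstruction of the shape of `Program._parseOps`, since the benchmark source is not reproduced in this thread; `parseOps`, `Inc` and the `Iterator<String>` parameter are assumptions. The point it illustrates is that every return path yields the list created by the `<Op>[]` literal, so the concrete result type is the same regardless of what the recursive call returns.

```dart
// Hypothetical reconstruction; names other than Loop and ops are assumed.
abstract class Op {}

class Inc implements Op {}

class Loop implements Op {
  // Static type is List<Op>. Per the thread, the analysis should be able
  // to infer the concrete growable list class for this field.
  final List<Op> ops;
  Loop(this.ops);
}

// Every return statement returns `ops`, which is always created by the
// same list literal, so the result type does not depend on the arguments
// or on the result of the recursive call.
List<Op> parseOps(Iterator<String> chars) {
  final ops = <Op>[];
  while (chars.moveNext()) {
    switch (chars.current) {
      case '+':
        ops.add(Inc());
        break;
      case '[':
        // Recursive call: its result flows into the Loop.ops field.
        ops.add(Loop(parseOps(chars)));
        break;
      case ']':
        return ops;
    }
  }
  return ops;
}

void main() {
  final ops = parseOps('+[+[+]]'.split('').iterator);
  final outer = ops[1] as Loop;
  print(ops.length); // prints 2
  print(outer.ops.length); // prints 2
}
```

If the result of the recursive call is approximated by the static type `List<Op>`, the `Loop.ops` field receives that approximation even though only one concrete list class ever reaches it.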

In general case, TFA approximates results of recursive calls using static types. However, if result type of a function does not depend on the flow inside its body, it cannot change and it can be used in case of recursive calls instead of a static type.

This improves micro-benchmark from #37455:
Before: 0m11.506s
After: 0m7.324s

Issue: #37455
Change-Id: I967d7add906c8dbd59dbbea1b993e1b4e1733514
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/108500
Commit-Queue: Alexander Markov <alexmarkov@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>

alexmarkov · 2019-07-10T19:29:39Z

318a482 fixed the first problem mentioned by @mraleph: the actual type of Loop.ops is now inferred.
This improves the benchmark on Dart AOT from 11.506s to 7.324s on my machine.

mraleph added the area-vm label on Jul 8, 2019.
