
VM: const array indexing modulo #37789

Closed
jtmcdole opened this issue Aug 8, 2019 · 19 comments
Assignees: aartbik
Labels: area-vm, P2, type-performance

Comments

jtmcdole (Contributor) commented Aug 8, 2019

My code for number of trailing zeros should be pretty fast: https://gist.github.com/jtmcdole/297434f327077dbfe5fb19da3b4ef5be

But indexing a const array whose length equals the modulus still incurs a range check, even though the modulo result is provably in bounds. Also, modulo is apparently slow in all arm64 and arm32 Dart VM snapshots?

https://github.com/dart-lang/sdk/blob/master/runtime/vm/cpu_arm.cc#L227
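For reference, the linked gist's trick works as follows: `(x & -x)` isolates the lowest set bit, and because the 32 powers of two are all distinct modulo the prime 37 (131 for the 64-bit variant), that value perfect-hashes into a small lookup table. An illustrative Python sketch of the 32-bit variant (the table is rebuilt here rather than copied from the gist):

```python
# Build the 37-entry lookup table: each 32-bit power of two lands in a
# distinct slot under % 37, so (x & -x) % 37 perfect-hashes the low bit.
_ntz_lut32 = [0] * 37
for i in range(32):
    _ntz_lut32[(1 << i) % 37] = i

def ntz32(x):
    """Number of trailing zeros of a 32-bit value (32 for x == 0)."""
    x &= 0xFFFFFFFF
    if x == 0:
        return 32
    return _ntz_lut32[(x & -x) % 37]
```

The range check at issue arises from the `% 37` index expression: the table has exactly 37 entries, so the index can never be out of range.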

a-siva added the area-vm, type-performance, and P2 labels Aug 8, 2019
a-siva (Contributor) commented Aug 8, 2019

/cc @aartbik

aartbik self-assigned this Aug 8, 2019
aartbik (Contributor) commented Aug 13, 2019

Let's start with the bounds check. For the constant arrays, we see the array lengths as constants, so given the modulo operator it is indeed low-hanging fruit to eliminate the check altogether. Coming up!
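A one-line illustration of the range fact being exploited here (Dart's `%`, like Python's, returns a non-negative remainder for a positive divisor, so the index in `a[i % a.length]` on a const list is provably in bounds):

```python
# For a positive constant divisor d, x % d always lies in [0, d - 1],
# regardless of the sign of x -- the interval range analysis derives.
def remainder_range(d, samples):
    rems = [x % d for x in samples]
    return min(rems), max(rems)

lo, hi = remainder_range(37, range(-10_000, 10_000))
assert (lo, hi) == (0, 36)
```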

aartbik (Contributor) commented Aug 13, 2019

I completed a CL that improves range analysis for modulo operators, which means the bounds check is now eliminated. Performance of a micro-benchmark (on an Intel Xeon 2.60GHz desktop for the time being) before this improvement:

JIT
DartNTZ.NTZ32(RunTime): 439.99929036515624 us.
DartNTZ.NTZ64(RunTime): 373.55112268907567 us.

AOT
DartNTZ.NTZ32(RunTime): 562.560742407199 us.
DartNTZ.NTZ64(RunTime): 560.3938464555898 us.

And after the range analysis improvements:

JIT
DartNTZ.NTZ32(RunTime): 377.65250944108766 us.
DartNTZ.NTZ64(RunTime): 357.13150723085164 us.

AOT
DartNTZ.NTZ32(RunTime): 518.6002999740732 us.
DartNTZ.NTZ64(RunTime): 536.2601919571046 us.

dart-bot pushed a commit that referenced this issue Aug 14, 2019
Rationale:
Improves range analysis over MOD with the goal
of removing more bounds checks in a[i % length]
array accesses.

#37789

Change-Id: I1fb275d53a00400e2343628295fa010ac9b21786
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/112933
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Vyacheslav Egorov <vegorov@google.com>
aartbik (Contributor) commented Aug 22, 2019

The performance of modulo (or division, for that matter) on ARM is clearly much "higher" hanging fruit. There is probably not much we can do about the assumptions regarding which processor on which device supports integer division across all potential target architectures. However, perhaps we can do slightly better for the x % c operator with constant c (currently, and to a certain extent, we do that for powers of two only). Investigating...
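For background, the usual strengthening for a constant divisor is a Granlund-Montgomery style multiply-and-shift: replace the divide by a multiplication with a precomputed "magic" constant. A hedged Python sketch for unsigned 32-bit division by 37 (the constant is derived here and is not necessarily what the VM emits):

```python
# Choose p = 32 + ceil(log2(37)) and m = floor(2**p / 37) + 1; the
# round-up error bound m*37 - 2**p <= 2**(p - 32) holds here (33 <= 64),
# so (x * m) >> p equals x // 37 for every 32-bit unsigned x.
P = 38
M = (1 << P) // 37 + 1  # 7429132621

def magic_div37(x):
    return (x * M) >> P

def magic_mod37(x):
    return x - magic_div37(x) * 37

for x in (0, 1, 36, 37, 38, 12345, 2**32 - 1):
    assert magic_div37(x) == x // 37
    assert magic_mod37(x) == x % 37
```

On hardware, the multiply-high plus shift typically beats an integer divide by a wide margin, which is the payoff being investigated here.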

ds84182 (Contributor) commented Aug 23, 2019

I imagine you could emit a call directly to the ARM builtins (__aeabi_idiv and friends). It would probably be faster than transitioning in and out of the VM.

aartbik (Contributor) commented Aug 23, 2019

Note, to clarify: the cpu_arm code linked above determines what to generate when compiling for arm32, whether running on arm32 or arm64. When compiling for arm64, cpu_arm64 is used, and you should unconditionally see a division, e.g. for ntz64:

0x7f81247788f0 d2801062 movz r2, 0x83 // 131
0x7f81247788f4 9ac20c85 sdiv r5, r4, r2

aartbik (Contributor) commented Aug 26, 2019

Intel results on our performance cluster are in (from bounds check elimination only):

  | Benchmark     | Improvement | Mode     | Processor     | OS         | Noise Level
  | DartNTZ.NTZ32 | +3.88 %     | dart     | Intel Core i5 | linux-ia32 | 1.09 x noise
  | DartNTZ.NTZ32 | +5.57 %     | dart     | Intel Core i5 | linux-x64  | 1.07 x noise
  | DartNTZ.NTZ32 | +13.7 %     | dart-aot | Intel Core i5 | linux-x64  | 3.01 x noise
  | DartNTZ.NTZ32 | +12.5 %     | dart-aot | Intel Xeon    | linux-x64  | 1.99 x noise
  | DartNTZ.NTZ64 | +5.29 %     | dart     | Intel Core i5 | linux-x64  | 2.42 x noise
  | DartNTZ.NTZ64 | +10.5 %     | dart-aot | Intel Core i5 | linux-x64  | 3.16 x noise
  | DartNTZ.NTZ64 | +12.8 %     | dart-aot | Intel Xeon    | linux-x64  | 1.41 x noise

jtmcdole (Contributor, Author) commented:
How do I read those results? Does that mean they are faster?

aartbik (Contributor) commented Aug 26, 2019

@jtmcdole I added a header. The second column shows the improvement; the noise level indicates how reliable the improvement is compared to the noise the benchmark has between runs (> 1 is confident, and higher is more confident).

dart-bot pushed a commit that referenced this issue Aug 28, 2019
Rationale:
Needed for an upcoming improvement of DIV/MOD by constant.

#37789

Change-Id: I3966049a3f14fd3becec797c4bc0038f2852bc1b
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114680
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
dart-bot pushed a commit that referenced this issue Aug 29, 2019
Rationale:
Mostly an AOT 64-bit improvement of DIV and MOD by constant
on x86_64. Also a few minor supporting optimizations
that were missed. With unit tests.

#37789

Change-Id: I2f12d6a24bbe983324521d7d4d778f20b34d18a1
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114901
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Ryan Macnak <rmacnak@google.com>
aartbik (Contributor) commented Aug 29, 2019

With the latest DIV/MOD by magic improvement (x86_64 only for now), we get a nice boost:

AOT
DartNTZ.NTZ32(RunTime): 385.1173925683481 us.
DartNTZ.NTZ64(RunTime): 384.06057430875575 us.

However, I overlooked something completely so far. The omission of the type in

 ntz32(int x) => _ntzLut32[(x & -x) % 37];
 ntz64(int x) => _ntzLut64[(x & -x) % 131];

has a big impact on AOT performance (the static type of any use remains dynamic, as can be seen by inspecting the intermediate representation and the generated code). If I use

 int ntz32(int x) => _ntzLut32[(x & -x) % 37];
 int ntz64(int x) => _ntzLut64[(x & -x) % 131];

I get a much higher boost, with AOT now outperforming JIT by far:

JIT
DartNTZ.NTZ32(RunTime): 374.6525572204532 us.
DartNTZ.NTZ64(RunTime): 355.48469912919853 us.

AOT
DartNTZ.NTZ32(RunTime): 187.82733552446237 us.
DartNTZ.NTZ64(RunTime): 188.34485271682834 us.

So one obvious step for our customer would be to simply add this 'int'.

Talking with @alexmarkov and @rmacnak-google, we debated whether automatically inferring the int type of the const array, and using it to infer the return type of the methods, would be worthwhile...

@mraleph @mkustermann

aartbik (Contributor) commented Aug 29, 2019

IR of the loop body without the type:

B3[target]:24
    PushArgument(v7)
    v35 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v37 <- BinaryInt64Op(& [tr], v9 T{int}, v35) [-9223372036854775808, 9223372036854775807] T{int}
    v40 <- BinaryInt64Op(% [tr], v37, v73) [0, 36] T{int}
    v67 <- BoxInt64(v40) [0, 36] T{_Smi}
  0:     v48 <- LoadIndexed(v50, v67)
    AssertAssignable:34(v48 T{*?}, num, '', instantiator_type_args(v0), function_type_args(v0)) T{num?}
    PushArgument(v48 T{num?})
    v17 <- StaticCall:38( +<0> v7, v48 T{num?}, using unchecked entrypoint, recognized_kind = Integer_add, result_type = T{!null}) T{num}
    AssertAssignable:40(v17, int, '', instantiator_type_args(v0), function_type_args(v0)) T{int}
    v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v75) [-9223372036854775808, 9223372036854775807] T{int}
    goto:46 B5

IR of the loop body with the type (note that automatic type inference from the array contents could even remove the null check in this version!):

B3[target]:24
    v37 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v39 <- BinaryInt64Op(& [tr], v9 T{int}, v37) [-9223372036854775808, 9223372036854775807] T{int}
    v42 <- BinaryInt64Op(% [tr], v39, v81) [0, 36] T{int}
    v71 <- BoxInt64(v42) [0, 36] T{_Smi}
  0:     v50 <- LoadIndexed(v52, v71)
    CheckNull:36(v50 T{int?}) [-9223372036854775808, 9223372036854775807] T{int}
    v73 <- UnboxInt64(v50 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v17 <- BinaryInt64Op(+ [tr], v7 T{int}, v73 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v83) [-9223372036854775808, 9223372036854775807] T{int}
    goto:42 B5

alexmarkov (Contributor) commented:
Created CL to infer type for indexing into a constant list.

Results on my system (AOT, linux/x64):

Before:
DartNTZ.NTZ32(RunTime): 351.9357010381841 us.
DartNTZ.NTZ64(RunTime): 351.99127771911304 us.

After:
DartNTZ.NTZ32(RunTime): 150.19256867162275 us.
DartNTZ.NTZ64(RunTime): 149.48622578475337 us.

jtmcdole (Contributor, Author) commented:
That's pretty awesome. And man, I would have assumed the inference would have picked that up. Note to codefu: add types on the LHS.

aartbik (Contributor) commented Aug 30, 2019

Results of the x86_64 change. I will also include the type inference results when they become available, which will span architectures. After that, we need to revisit using "the magic" for ARM64.

  | Benchmark     | Improvement | Mode     | Processor     | OS        | Noise Level
  | DartNTZ.NTZ32 | +11.9 %     | dart-aot | Intel Core i5 | linux-x64 | 2.97 x noise
  | DartNTZ.NTZ32 | +18.6 %     | dart-aot | Intel Xeon    | linux-x64 | 1.73 x noise
  | DartNTZ.NTZ64 | +11.6 %     | dart-aot | Intel Core i5 | linux-x64 | 3.87 x noise
  | DartNTZ.NTZ64 | +23.9 %     | dart-aot | Intel Xeon    | linux-x64 | 2.88 x noise

dart-bot pushed a commit that referenced this issue Aug 30, 2019
Before:
DartNTZ.NTZ32(RunTime): 351.9357010381841 us.
DartNTZ.NTZ64(RunTime): 351.99127771911304 us.

After:
DartNTZ.NTZ32(RunTime): 150.19256867162275 us.
DartNTZ.NTZ64(RunTime): 149.48622578475337 us.

Issue: #37789
Change-Id: Ie8b16b2602a15140df9f0f21199ee9eb2bbf158c
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/115012
Reviewed-by: Aart Bik <ajcbik@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Samir Jindel <sjindel@google.com>
Commit-Queue: Alexander Markov <alexmarkov@google.com>
aartbik (Contributor) commented Aug 30, 2019

Type inference preliminary improvements \o/

  | Benchmark     | Improvement | Mode     | Processor     | OS
  | DartNTZ.NTZ32 | +118 %      | dart-aot | Intel Core i5 | linux-x64
  | DartNTZ.NTZ32 | +120 %      | dart-aot | Intel Xeon    | linux-x64
  | DartNTZ.NTZ64 | +119 %      | dart-aot | Intel Core i5 | linux-x64
  | DartNTZ.NTZ64 | +119 %      | dart-aot | Intel Xeon    | linux-x64
  | DartNTZ.NTZ32 | +24 %       | dart-aot | Odroid-C2     | linux-armv7
  | DartNTZ.NTZ64 | +27 %       | dart-aot | Odroid-C2     | linux-armv7
  | DartNTZ.NTZ32 | +89 %       | dart-aot | Odroid-C2     | linux-armv8
  | DartNTZ.NTZ64 | +102 %      | dart-aot | Odroid-C2     | linux-armv8

A much better loop body, with no null check:

B3[target]:24 specu repr=tagged
    v35 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v37 <- BinaryInt64Op(& [tr], v9 T{int}, v35) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v40 <- BinaryInt64Op(% [tr], v37, v79) [0, 36] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v69 <- BoxInt64(v40) [0, 36] T{_Smi} specu repr=tagged tp=T{_Smi} #uses=1 #env-uses=0
  0:     v48 <- LoadIndexed(v50, v69) specu repr=tagged tp=T{*?} #uses=1 #env-uses=0
    v71 <- UnboxInt64(v48 T{_Smi}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v17 <- BinaryInt64Op(+ [tr], v7 T{int}, v71 T{_Smi}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v81) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    goto:42 B5 specu repr=tagged

aartbik (Contributor) commented Aug 31, 2019

To consider before closing this bug

  • On ARM64, the magic sequence performs worse than the direct division/modulo; my friends at ARM are pondering this (most likely because, although sdiv is high-latency, the latency is data-dependent; perhaps 131 and 37 are just lucky).

  • What to do on ARM32, if anything. We pondered making native calls in the past (AOT: implement non-speculative int64 versions of MOD and TRUNCDIV #33967) and decided against it. Perhaps we need to reconsider? At the very least, the failure to recognize the operator negatively impacts our inlining heuristics.

aartbik (Contributor) commented Sep 10, 2019

Several comments to wrap this up.

  • I got excellent feedback from my ARM friends. In general, the "magic" sequence has consistently lower latency than the sdiv instruction, although sdiv can sometimes luck out (not just for the specific right-hand-side values 37 and 131, but depending on the actual left-hand-side value as well). They strongly recommend using the magic sequence. Perhaps my benchmark just uses lucky inputs. Therefore I am going to put the magic sequence under our "O3" flag; once the code is available, you can experiment with whether the magic sequence works better in your particular case.

  • On ARM32, the modulo operation is ultimately implemented as a static call into _moduloFromInteger, so no further action will be taken here; I don't expect there is much performance left on the table.

  @pragma("vm:non-nullable-result-type")
  num operator %(num other) {
    if ((other is int) && (other == 0)) {
      throw const IntegerDivisionByZeroException();
    }
    return other._moduloFromInteger(this);
  }

  int _moduloFromInteger(int other) native "Integer_moduloFromInteger";

One last comment:

  • You are probably not even interested in div/mod performance per se, but want a very fast "trailing zeros" implementation. Feel free to file a feature request for direct support for this in our library (and assign it to me). On all architectures, a much faster way of implementing this operation exists (clearly, we don't want to resort to "idiom recognition" on your table + function).
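One classic faster software formulation, for what it's worth, is the de Bruijn multiply (Leiserson-Prokop): it replaces the modulo with a multiply and a shift into a 32-entry table. An illustrative Python sketch (the standard constant 0x077CB531 is a de Bruijn sequence, so all 32 shifted 5-bit windows are distinct; the table is derived here rather than hard-coded):

```python
# De Bruijn trailing-zero count: multiplying the isolated lowest bit by a
# de Bruijn constant puts a unique 5-bit pattern in the top bits, which
# indexes a 32-entry table built once up front.
_DEBRUIJN = 0x077CB531
_table = [0] * 32
for i in range(32):
    _table[((1 << i) * _DEBRUIJN & 0xFFFFFFFF) >> 27] = i

def ntz32_debruijn(x):
    x &= 0xFFFFFFFF
    if x == 0:
        return 32
    return _table[((x & -x) * _DEBRUIJN & 0xFFFFFFFF) >> 27]

assert all(ntz32_debruijn(1 << i) == i for i in range(32))
```

A library intrinsic would still beat this on hardware with a native ctz (or rbit + clz on ARM), which is the point of the feature-request suggestion above.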

dart-bot pushed a commit that referenced this issue Sep 10, 2019
Rationale:
Follow up on prior CL, now with the ARM64 version.
Unit tests were included in that previous CL.
(https://dart-review.googlesource.com/c/sdk/+/114901)

#37789

Change-Id: I6177b48743cd615914dbfbddab166a6c71ef71c7
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/115064
Reviewed-by: Ryan Macnak <rmacnak@google.com>
Commit-Queue: Aart Bik <ajcbik@google.com>
aartbik closed this as completed Sep 11, 2019
mraleph (Member) commented Sep 11, 2019

On ARM32, the modulo operation is ultimately implemented as a static call into _moduloFromInteger, so no further action will be taken here; I don't expect there is much performance left on the table.

I would guess (out of thin air, but based on the fact that our native calls are super heavy) that calling the ARM ABI function directly would probably be at least 2x faster than the current implementation, which jumps into the runtime.

The implementation is fairly straightforward, so it might make sense to just do it for the sake of engineering excellence.

aartbik (Contributor) commented Sep 11, 2019

@mraleph Happy to investigate that too if your intuition is that there is more performance to be had there! But wouldn't it be true engineering excellence if we actually made native calls faster? As it is, we have a lot of different mechanisms (code selection, intrinsifier, native methods, etc.); adding yet another one (a call into the ARM ABI) seems like spreading our logic rather thin.

@jtmcdole please also pay attention to the last comment above: if speeding up trailing zeros is the endgame, perhaps we should add native support in Dart, so that the compiler can pick the best instruction on any given architecture.

tekknolagi pushed a commit to tekknolagi/dart-assembler that referenced this issue Nov 3, 2020
Rationale:
Needed for an upcoming improvement of DIV/MOD by constant.

dart-lang#37789

Change-Id: I3966049a3f14fd3becec797c4bc0038f2852bc1b
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114680
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>