
VM: const array indexing modulo #37789

Closed
jtmcdole opened this issue Aug 8, 2019 · 19 comments
Assignees: aartbik
Labels: area-vm, P2, type-performance

Comments

jtmcdole (Contributor) commented Aug 8, 2019

My code for number of trailing zeros should be pretty fast: https://gist.github.com/jtmcdole/297434f327077dbfe5fb19da3b4ef5be

But indexing a const array whose length equals the modulus still incurs a range check, even though the modulo result is provably in bounds. Also, modulo is apparently slow in all arm64 and arm32 Dart VM snapshots?

https://github.com/dart-lang/sdk/blob/master/runtime/vm/cpu_arm.cc#L227
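For reference, the linked gist's trick works as follows: `(x & -x)` isolates the lowest set bit, and because the 32 powers of two are all distinct modulo the prime 37 (131 for the 64-bit variant), that value perfect-hashes into a small lookup table. An illustrative Python sketch of the 32-bit variant (the table is rebuilt here rather than copied from the gist):

```python
# Build the 37-entry lookup table: each 32-bit power of two lands in a
# distinct slot under % 37, so (x & -x) % 37 perfect-hashes the low bit.
_ntz_lut32 = [0] * 37
for i in range(32):
    _ntz_lut32[(1 << i) % 37] = i

def ntz32(x):
    """Number of trailing zeros of a 32-bit value (32 for x == 0)."""
    x &= 0xFFFFFFFF
    if x == 0:
        return 32
    return _ntz_lut32[(x & -x) % 37]
```

The range check at issue arises from the `% 37` index expression: the table has exactly 37 entries, so the index can never be out of range.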

a-siva added the area-vm, type-performance, and P2 labels Aug 8, 2019
a-siva (Contributor) commented Aug 8, 2019

/cc @aartbik

aartbik self-assigned this Aug 8, 2019
aartbik (Contributor) commented Aug 13, 2019

Let's start with the bounds check. For the constant arrays, we see the array lengths as constants, so given the modulo operator it is indeed low-hanging fruit to eliminate the check altogether. Coming up!
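A one-line illustration of the range fact being exploited here (Dart's `%`, like Python's, returns a non-negative remainder for a positive divisor, so the index in `a[i % a.length]` on a const list is provably in bounds):

```python
# For a positive constant divisor d, x % d always lies in [0, d - 1],
# regardless of the sign of x -- the interval range analysis derives.
def remainder_range(d, samples):
    rems = [x % d for x in samples]
    return min(rems), max(rems)

lo, hi = remainder_range(37, range(-10_000, 10_000))
assert (lo, hi) == (0, 36)
```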

aartbik (Contributor) commented Aug 13, 2019

I completed a CL that improves range analysis for modulo operators, which means the bounds check is now eliminated. Performance of a micro-benchmark (on an Intel Xeon 2.60GHz desktop for the time being) before this improvement:

JIT
DartNTZ.NTZ32(RunTime): 439.99929036515624 us.
DartNTZ.NTZ64(RunTime): 373.55112268907567 us.

AOT
DartNTZ.NTZ32(RunTime): 562.560742407199 us.
DartNTZ.NTZ64(RunTime): 560.3938464555898 us.

And after the range analysis improvements:

JIT
DartNTZ.NTZ32(RunTime): 377.65250944108766 us.
DartNTZ.NTZ64(RunTime): 357.13150723085164 us.

AOT
DartNTZ.NTZ32(RunTime): 518.6002999740732 us.
DartNTZ.NTZ64(RunTime): 536.2601919571046 us.

dart-bot pushed a commit that referenced this issue Aug 14, 2019
Rationale:
Improves range analysis over MOD with the goal
of removing more bounds checks in a[i % length]
array accesses.

#37789

Change-Id: I1fb275d53a00400e2343628295fa010ac9b21786
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/112933
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Vyacheslav Egorov <vegorov@google.com>
aartbik (Contributor) commented Aug 22, 2019

The performance of modulo (or division, for that matter) on ARM is clearly much "higher" hanging fruit. There is probably not much we can do about the assumptions regarding which processor on which device supports integer division across all potential target architectures. However, perhaps we can do slightly better for the x % c operator with constant c (currently, and to a certain extent, we do that for powers of two only). Investigating...
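For background, the usual strengthening for a constant divisor is a Granlund-Montgomery style multiply-and-shift: replace the divide by a multiplication with a precomputed "magic" constant. A hedged Python sketch for unsigned 32-bit division by 37 (the constant is derived here and is not necessarily what the VM emits):

```python
# Choose p = 32 + ceil(log2(37)) and m = floor(2**p / 37) + 1; the
# round-up error bound m*37 - 2**p <= 2**(p - 32) holds here (33 <= 64),
# so (x * m) >> p equals x // 37 for every 32-bit unsigned x.
P = 38
M = (1 << P) // 37 + 1  # 7429132621

def magic_div37(x):
    return (x * M) >> P

def magic_mod37(x):
    return x - magic_div37(x) * 37

for x in (0, 1, 36, 37, 38, 12345, 2**32 - 1):
    assert magic_div37(x) == x // 37
    assert magic_mod37(x) == x % 37
```

On hardware, the multiply-high plus shift typically beats an integer divide by a wide margin, which is the payoff being investigated here.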

ds84182 (Contributor) commented Aug 23, 2019

I imagine you could emit a call directly to the ARM builtins (__aeabi_idiv and friends). It would probably be faster than transitioning in and out of the VM.

aartbik (Contributor) commented Aug 23, 2019

Note, to clarify: the cpu_arm code linked above determines what to generate when compiling for arm32, whether running on arm32 or arm64. When compiling for arm64, cpu_arm64 is used, and you should unconditionally see a division, e.g. for ntz64:

0x7f81247788f0 d2801062 movz r2, 0x83 // 131
0x7f81247788f4 9ac20c85 sdiv r5, r4, r2

aartbik (Contributor) commented Aug 26, 2019

Intel results on our performance cluster are in (from bounds check elimination only):

  | Benchmark     | Improvement | Mode     | Processor     | OS         | Noise Level
  | DartNTZ.NTZ32 | +3.88 %     | dart     | Intel Core i5 | linux-ia32 | 1.09 x noise
  | DartNTZ.NTZ32 | +5.57 %     | dart     | Intel Core i5 | linux-x64  | 1.07 x noise
  | DartNTZ.NTZ32 | +13.7 %     | dart-aot | Intel Core i5 | linux-x64  | 3.01 x noise
  | DartNTZ.NTZ32 | +12.5 %     | dart-aot | Intel Xeon    | linux-x64  | 1.99 x noise
  | DartNTZ.NTZ64 | +5.29 %     | dart     | Intel Core i5 | linux-x64  | 2.42 x noise
  | DartNTZ.NTZ64 | +10.5 %     | dart-aot | Intel Core i5 | linux-x64  | 3.16 x noise
  | DartNTZ.NTZ64 | +12.8 %     | dart-aot | Intel Xeon    | linux-x64  | 1.41 x noise

jtmcdole (Contributor, Author) commented:
How do I read those results? Does that mean they are faster?

aartbik (Contributor) commented Aug 26, 2019

@jtmcdole I added a header. The second column shows the improvement; the noise level indicates how reliable the improvement is compared to the noise the benchmark has between runs (> 1 is confident, and higher is more confident).

dart-bot pushed a commit that referenced this issue Aug 28, 2019
Rationale:
Needed for an upcoming improvement of DIV/MOD by constant.

#37789

Change-Id: I3966049a3f14fd3becec797c4bc0038f2852bc1b
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114680
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
dart-bot pushed a commit that referenced this issue Aug 29, 2019
Rationale:
Mostly an AOT 64-bit improvement of DIV and MOD by constant
on x86_64. Also a few minor supporting optimizations
that were missed. With unit tests.

#37789

Change-Id: I2f12d6a24bbe983324521d7d4d778f20b34d18a1
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114901
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Ryan Macnak <rmacnak@google.com>
aartbik (Contributor) commented Aug 29, 2019

With the latest DIV/MOD by magic improvement (x86_64 only for now), we get a nice boost:

AOT
DartNTZ.NTZ32(RunTime): 385.1173925683481 us.
DartNTZ.NTZ64(RunTime): 384.06057430875575 us.

However, I overlooked something completely so far. The omission of the type in

 ntz32(int x) => _ntzLut32[(x & -x) % 37];
 ntz64(int x) => _ntzLut64[(x & -x) % 131];

has a big impact on AOT performance (the static type of any use remains dynamic, as can be seen by inspecting the intermediate representation and the generated code). If I use

 int ntz32(int x) => _ntzLut32[(x & -x) % 37];
 int ntz64(int x) => _ntzLut64[(x & -x) % 131];

I get a much higher boost, with AOT now outperforming JIT by far:

JIT
DartNTZ.NTZ32(RunTime): 374.6525572204532 us.
DartNTZ.NTZ64(RunTime): 355.48469912919853 us.

AOT
DartNTZ.NTZ32(RunTime): 187.82733552446237 us.
DartNTZ.NTZ64(RunTime): 188.34485271682834 us.

So one obvious step for our customer would be to simply add this 'int'.

Talking with @alexmarkov and @rmacnak-google, we debated whether automatically inferring the int type of the const array, and using it to infer the return type of the methods, would be worthwhile...

@mraleph @mkustermann

aartbik (Contributor) commented Aug 29, 2019

IR of the loop body without the type:

B3[target]:24
    PushArgument(v7)
    v35 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v37 <- BinaryInt64Op(& [tr], v9 T{int}, v35) [-9223372036854775808, 9223372036854775807] T{int}
    v40 <- BinaryInt64Op(% [tr], v37, v73) [0, 36] T{int}
    v67 <- BoxInt64(v40) [0, 36] T{_Smi}
  0:     v48 <- LoadIndexed(v50, v67)
    AssertAssignable:34(v48 T{*?}, num, '', instantiator_type_args(v0), function_type_args(v0)) T{num?}
    PushArgument(v48 T{num?})
    v17 <- StaticCall:38( +<0> v7, v48 T{num?}, using unchecked entrypoint, recognized_kind = Integer_add, result_type = T{!null}) T{num}
    AssertAssignable:40(v17, int, '', instantiator_type_args(v0), function_type_args(v0)) T{int}
    v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v75) [-9223372036854775808, 9223372036854775807] T{int}
    goto:46 B5

IR of the loop body with the type (note that automatic type inference from the array contents could even remove the null check in this version!):

B3[target]:24
    v37 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v39 <- BinaryInt64Op(& [tr], v9 T{int}, v37) [-9223372036854775808, 9223372036854775807] T{int}
    v42 <- BinaryInt64Op(% [tr], v39, v81) [0, 36] T{int}
    v71 <- BoxInt64(v42) [0, 36] T{_Smi}
  0:     v50 <- LoadIndexed(v52, v71)
    CheckNull:36(v50 T{int?}) [-9223372036854775808, 9223372036854775807] T{int}
    v73 <- UnboxInt64(v50 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v17 <- BinaryInt64Op(+ [tr], v7 T{int}, v73 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
    v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v83) [-9223372036854775808, 9223372036854775807] T{int}
    goto:42 B5

alexmarkov (Contributor) commented:
Created CL to infer type for indexing into a constant list.

Results on my system (AOT, linux/x64):

Before:
DartNTZ.NTZ32(RunTime): 351.9357010381841 us.
DartNTZ.NTZ64(RunTime): 351.99127771911304 us.

After:
DartNTZ.NTZ32(RunTime): 150.19256867162275 us.
DartNTZ.NTZ64(RunTime): 149.48622578475337 us.

jtmcdole (Contributor, Author) commented:
That's pretty awesome. And man, I would have assumed the inference would have picked that up. Note to codefu: add types on the LHS.

aartbik (Contributor) commented Aug 30, 2019

Results of the x86_64 change. I will also include the type inference results when they become available, which will span architectures. After that, we need to revisit using "the magic" for ARM64.

  | Benchmark     | Improvement | Mode     | Processor     | OS        | Noise Level
  | DartNTZ.NTZ32 | +11.9 %     | dart-aot | Intel Core i5 | linux-x64 | 2.97 x noise
  | DartNTZ.NTZ32 | +18.6 %     | dart-aot | Intel Xeon    | linux-x64 | 1.73 x noise
  | DartNTZ.NTZ64 | +11.6 %     | dart-aot | Intel Core i5 | linux-x64 | 3.87 x noise
  | DartNTZ.NTZ64 | +23.9 %     | dart-aot | Intel Xeon    | linux-x64 | 2.88 x noise

dart-bot pushed a commit that referenced this issue Aug 30, 2019
Before:
DartNTZ.NTZ32(RunTime): 351.9357010381841 us.
DartNTZ.NTZ64(RunTime): 351.99127771911304 us.

After:
DartNTZ.NTZ32(RunTime): 150.19256867162275 us.
DartNTZ.NTZ64(RunTime): 149.48622578475337 us.

Issue: #37789
Change-Id: Ie8b16b2602a15140df9f0f21199ee9eb2bbf158c
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/115012
Reviewed-by: Aart Bik <ajcbik@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Samir Jindel <sjindel@google.com>
Commit-Queue: Alexander Markov <alexmarkov@google.com>
aartbik (Contributor) commented Aug 30, 2019

Type inference preliminary improvements \o/

  | Benchmark     | Improvement | Mode     | Processor     | OS
  | DartNTZ.NTZ32 | +118 %      | dart-aot | Intel Core i5 | linux-x64
  | DartNTZ.NTZ32 | +120 %      | dart-aot | Intel Xeon    | linux-x64
  | DartNTZ.NTZ64 | +119 %      | dart-aot | Intel Core i5 | linux-x64
  | DartNTZ.NTZ64 | +119 %      | dart-aot | Intel Xeon    | linux-x64
  | DartNTZ.NTZ32 | +24 %       | dart-aot | Odroid-C2     | linux-armv7
  | DartNTZ.NTZ64 | +27 %       | dart-aot | Odroid-C2     | linux-armv7
  | DartNTZ.NTZ32 | +89 %       | dart-aot | Odroid-C2     | linux-armv8
  | DartNTZ.NTZ64 | +102 %      | dart-aot | Odroid-C2     | linux-armv8

A much better loop body, with no null check:

B3[target]:24 specu repr=tagged
    v35 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v37 <- BinaryInt64Op(& [tr], v9 T{int}, v35) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v40 <- BinaryInt64Op(% [tr], v37, v79) [0, 36] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v69 <- BoxInt64(v40) [0, 36] T{_Smi} specu repr=tagged tp=T{_Smi} #uses=1 #env-uses=0
  0:     v48 <- LoadIndexed(v50, v69) specu repr=tagged tp=T{*?} #uses=1 #env-uses=0
    v71 <- UnboxInt64(v48 T{_Smi}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v17 <- BinaryInt64Op(+ [tr], v7 T{int}, v71 T{_Smi}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v81) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
    goto:42 B5 specu repr=tagged

aartbik (Contributor) commented Aug 31, 2019

To consider before closing this bug

  • On ARM64, the magic sequence performs worse than the direct division/modulo; my friends at ARM are pondering this (most likely because, although sdiv is high-latency, the latency is data-dependent; perhaps 131 and 37 are just lucky).

  • What to do on ARM32, if anything. We pondered making native calls in the past (AOT: implement non-speculative int64 versions of MOD and TRUNCDIV #33967) and decided against it. Perhaps we need to reconsider? At the very least, the failure to recognize the operator negatively impacts our inlining heuristics.

aartbik (Contributor) commented Sep 10, 2019

Several comments to wrap this up.

  • I got excellent feedback from my ARM friends. In general, the "magic" sequence has consistently lower latency than the sdiv instruction, although sdiv can sometimes luck out (not just for the specific right-hand-side values 37 and 131, but depending on the actual left-hand-side value as well). They strongly recommend using the magic sequence. Perhaps my benchmark just uses lucky inputs. Therefore I am going to put the magic sequence under our "O3" flag; once the code is available, you can experiment with whether the magic sequence works better in your particular case.

  • On ARM32, the modulo operation is ultimately implemented as a static call into _moduloFromInteger, so no further action will be taken here; I don't expect there is much performance left on the table.

  @pragma("vm:non-nullable-result-type")
  num operator %(num other) {
    if ((other is int) && (other == 0)) {
      throw const IntegerDivisionByZeroException();
    }
    return other._moduloFromInteger(this);
  }

  int _moduloFromInteger(int other) native "Integer_moduloFromInteger";

One last comment:

  • You are probably not even interested in div/mod performance per se, but want a very fast "trailing zeros" implementation. Feel free to file a feature request for direct support for this in our library (and assign it to me). On all architectures, a much faster way of implementing this operation exists (clearly, we don't want to resort to "idiom recognition" on your table + function).
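One classic faster software formulation, for what it's worth, is the de Bruijn multiply (Leiserson-Prokop): it replaces the modulo with a multiply and a shift into a 32-entry table. An illustrative Python sketch (the standard constant 0x077CB531 is a de Bruijn sequence, so all 32 shifted 5-bit windows are distinct; the table is derived here rather than hard-coded):

```python
# De Bruijn trailing-zero count: multiplying the isolated lowest bit by a
# de Bruijn constant puts a unique 5-bit pattern in the top bits, which
# indexes a 32-entry table built once up front.
_DEBRUIJN = 0x077CB531
_table = [0] * 32
for i in range(32):
    _table[((1 << i) * _DEBRUIJN & 0xFFFFFFFF) >> 27] = i

def ntz32_debruijn(x):
    x &= 0xFFFFFFFF
    if x == 0:
        return 32
    return _table[((x & -x) * _DEBRUIJN & 0xFFFFFFFF) >> 27]

assert all(ntz32_debruijn(1 << i) == i for i in range(32))
```

A library intrinsic would still beat this on hardware with a native ctz (or rbit + clz on ARM), which is the point of the feature-request suggestion above.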

dart-bot pushed a commit that referenced this issue Sep 10, 2019
Rationale:
Follow up on prior CL, now with the ARM64 version.
Unit tests were included in that previous CL.
(https://dart-review.googlesource.com/c/sdk/+/114901)

#37789

Change-Id: I6177b48743cd615914dbfbddab166a6c71ef71c7
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/115064
Reviewed-by: Ryan Macnak <rmacnak@google.com>
Commit-Queue: Aart Bik <ajcbik@google.com>
aartbik closed this as completed Sep 11, 2019
mraleph (Member) commented Sep 11, 2019

On ARM32, the modulo operation is ultimately implemented as a static call into _moduloFromInteger, so no further action will be taken here; I don't expect there is much performance left on the table.

I would guess (out of thin air, but based on the fact that our native calls are super heavy) that calling the ARM ABI function directly would probably be at least 2x faster than the current implementation, which jumps into the runtime.

The implementation is fairly straightforward, so it might make sense to just do it for the sake of engineering excellence.

aartbik (Contributor) commented Sep 11, 2019

@mraleph Happy to investigate that too if your intuition is that there is more performance to be had there! But wouldn't it be true engineering excellence if we actually made native calls faster? As it is, we have a lot of different mechanisms (code selection, intrinsifier, native methods, etc.); adding yet another one (a call into the ARM ABI) seems like spreading our logic rather thin.

@jtmcdole please also pay attention to the last comment above: if speeding up trailing zeros is the endgame, perhaps we should add native support in Dart, so that the compiler can pick the best instruction on any given architecture.

tekknolagi pushed a commit to tekknolagi/dart-assembler that referenced this issue Nov 3, 2020
Rationale:
Needed for an upcoming improvement of DIV/MOD by constant.

dart-lang#37789

Change-Id: I3966049a3f14fd3becec797c4bc0038f2852bc1b
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114680
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>