VM: const array indexing modulo #37789
/cc @aartbik
Let's start with the bounds check. For the constant arrays, we see the array lengths as constants, so it is indeed low-hanging fruit to eliminate the check altogether given the modulo operator. Coming up!
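For reference, the pattern under discussion can be sketched as follows (a hypothetical reduction, not the reporter's exact benchmark code). Since Dart's `%` yields a non-negative result for a positive divisor, the index is provably within `[0, length - 1]` once the list length is a compile-time constant:

```dart
// Hypothetical sketch: a const lookup table indexed by `i % table.length`.
// The table contents and names are illustrative only.
const List<int> _table = [0, 1, 2, 3, 4, 5, 6];

int lookup(int i) {
  // In Dart, `i % 7` is always in [0, 6], even for negative `i`, so an
  // optimizing compiler can prove the index is in bounds and drop the check.
  return _table[i % _table.length];
}

void main() {
  print(lookup(10)); // 10 % 7 == 3
  print(lookup(-1)); // -1 % 7 == 6
}
```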
I completed a CL that improves range analysis for modulo operators. This implies the bounds check is now eliminated. Performance of a micro-benchmark (on an Intel Xeon 2.60GHz desktop for the time being) before this improvement:

JIT
DartNTZ.NTZ32(RunTime): 439.99929036515624 us.
DartNTZ.NTZ64(RunTime): 373.55112268907567 us.
AOT
DartNTZ.NTZ32(RunTime): 562.560742407199 us.
DartNTZ.NTZ64(RunTime): 560.3938464555898 us.

And after the range analysis improvements:

JIT
DartNTZ.NTZ32(RunTime): 377.65250944108766 us.
DartNTZ.NTZ64(RunTime): 357.13150723085164 us.
AOT
DartNTZ.NTZ32(RunTime): 518.6002999740732 us.
DartNTZ.NTZ64(RunTime): 536.2601919571046 us.
Rationale: Improves range analysis over MOD with the goal of removing more bounds checks in a[i % length] array accesses. #37789
Change-Id: I1fb275d53a00400e2343628295fa010ac9b21786
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/112933
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Vyacheslav Egorov <vegorov@google.com>
The performance of modulo (or division, for that matter) on ARM is clearly much "higher" hanging fruit. There is probably not much we can do about the assumptions regarding which processor on which device supports integer division across all potential target architectures. However, perhaps we can do slightly better for the constant x % c operation for various c (currently, and to a certain extent, we do that for powers of two only). Investigating...
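For reference, the usual strength reduction here replaces division by a constant with a multiply-and-shift using a precomputed "magic" constant. A minimal sketch for 16-bit inputs and d = 37 (the constants below are derived for this illustration only and are not the VM's actual codegen parameters; 16-bit inputs keep the intermediate product well inside a 64-bit int):

```dart
// Magic-number division: for d = 37 and 0 <= x < 2^16, pick p = 22 and
// m = ceil(2^p / d) = 113360. Then (x * m) >> p == x ~/ d, because
// m*d - 2^p = 16 <= 2^(p - 16) = 64 (the round-up correctness condition).
const int d = 37;
const int p = 22;
const int m = 113360; // ceil(2^22 / 37)

int fastDiv37(int x) => (x * m) >> p;
int fastMod37(int x) => x - fastDiv37(x) * d;

void main() {
  // Exhaustively verify against the built-in operators for all 16-bit inputs.
  for (int x = 0; x < 1 << 16; x++) {
    if (fastDiv37(x) != x ~/ d || fastMod37(x) != x % d) {
      throw StateError('mismatch at $x');
    }
  }
  print('fast div/mod by 37 matches for all 16-bit inputs');
}
```

The compiler plays the same game with wider constants and an extra fix-up for signed operands; the sketch only shows the core multiply-shift idea.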
I imagine you could emit a call directly to the ARM builtins.
Note, to clarify: the cpu_arm code linked above determines what to generate when compiling for arm32, whether running on arm32 or arm64. When compiling for arm64, cpu_arm64 is used, and you should unconditionally see a division, e.g. for ntz64:

0x7f81247788f0    d2801062    movz r2, 0x83  // 131
Intel results on our performance cluster are in (from bounds check elimination only):
How do I read those results? Does that mean they are faster?
@jtmcdole I added a header. The second column shows the improvement; the noise level indicates how reliable the improvement is compared to the noise the benchmark has between runs (> 1 is confident, and higher is more confident).
Rationale: Needed for an upcoming improvement of DIV/MOD by constant. #37789
Change-Id: I3966049a3f14fd3becec797c4bc0038f2852bc1b
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114680
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
Rationale: Mostly an AOT 64-bit improvement of DIV and MOD by constant on x86_64. Also a few minor supporting optimizations that were missed. With unit tests. #37789
Change-Id: I2f12d6a24bbe983324521d7d4d778f20b34d18a1
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/114901
Commit-Queue: Aart Bik <ajcbik@google.com>
Reviewed-by: Ryan Macnak <rmacnak@google.com>
With the latest DIV/MOD by magic improvement (x86_64 only for now), we get a nice boost:

AOT
DartNTZ.NTZ32(RunTime): 385.1173925683481 us.
DartNTZ.NTZ64(RunTime): 384.06057430875575 us.

However, I overlooked something completely so far. The omission of the return type in

ntz32(int x) => _ntzLut32[(x & -x) % 37];
ntz64(int x) => _ntzLut64[(x & -x) % 131];

has a big impact on AOT performance (the static type of any use remains dynamic, as can be seen by inspecting the intermediate representation and the generated code). If I use

int ntz32(int x) => _ntzLut32[(x & -x) % 37];
int ntz64(int x) => _ntzLut64[(x & -x) % 131];

I get a much bigger boost, with AOT now outperforming JIT by far:

JIT
DartNTZ.NTZ32(RunTime): 374.6525572204532 us.
DartNTZ.NTZ64(RunTime): 355.48469912919853 us.
AOT
DartNTZ.NTZ32(RunTime): 187.82733552446237 us.
DartNTZ.NTZ64(RunTime): 188.34485271682834 us.

So one obvious step for our customer would be to simply add this 'int'. Talking with @alexmarkov and @rmacnak-google, we debated whether automatically inferring the int type of the const array and using that to infer the return type of the methods would be worthwhile...
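The fix is literally a one-token change. A hypothetical side-by-side (table contents and names illustrative, not the benchmark's actual tables):

```dart
const List<int> _lut = [10, 11, 12, 13, 14, 15, 16];

// Without a return type, the static type is `dynamic`: AOT must box the
// loaded element and guard its uses with runtime checks.
lookupDynamic(int x) => _lut[x % 7];

// With an explicit `int` return type, AOT can keep the value unboxed.
int lookupTyped(int x) => _lut[x % 7];

void main() {
  // Behavior is identical; only the generated code differs.
  print(lookupTyped(9) == lookupDynamic(9)); // true
}
```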
IR of loop-body without type:

B3[target]:24
PushArgument(v7)
v35 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
v37 <- BinaryInt64Op(& [tr], v9 T{int}, v35) [-9223372036854775808, 9223372036854775807] T{int}
v40 <- BinaryInt64Op(% [tr], v37, v73) [0, 36] T{int}
v67 <- BoxInt64(v40) [0, 36] T{_Smi}
0: v48 <- LoadIndexed(v50, v67)
AssertAssignable:34(v48 T{*?}, num, '', instantiator_type_args(v0), function_type_args(v0)) T{num?}
PushArgument(v48 T{num?})
v17 <- StaticCall:38( +<0> v7, v48 T{num?}, using unchecked entrypoint, recognized_kind = Integer_add, result_type = T{!null}) T{num}
AssertAssignable:40(v17, int, '', instantiator_type_args(v0), function_type_args(v0)) T{int}
v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v75) [-9223372036854775808, 9223372036854775807] T{int}
goto:46 B5
IR of loop-body with type (note that automatic type inference by looking at the array contents could even remove the null check in this version!):

B3[target]:24
v37 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
v39 <- BinaryInt64Op(& [tr], v9 T{int}, v37) [-9223372036854775808, 9223372036854775807] T{int}
v42 <- BinaryInt64Op(% [tr], v39, v81) [0, 36] T{int}
v71 <- BoxInt64(v42) [0, 36] T{_Smi}
0: v50 <- LoadIndexed(v52, v71)
CheckNull:36(v50 T{int?}) [-9223372036854775808, 9223372036854775807] T{int}
v73 <- UnboxInt64(v50 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
v17 <- BinaryInt64Op(+ [tr], v7 T{int}, v73 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v83) [-9223372036854775808, 9223372036854775807] T{int}
goto:42 B5
Created CL to infer type for indexing into a constant list. Results on my system (AOT, linux/x64):
That's pretty awesome - and man, I would have assumed the inference would have picked that up. Note to codefu: add types on the LHS.
Results of the x86_64 change. I will also include the type inference results when they become available; those will be across architectures. After that we need to revisit using "the magic" for ARM64.
Before:
DartNTZ.NTZ32(RunTime): 351.9357010381841 us.
DartNTZ.NTZ64(RunTime): 351.99127771911304 us.
After:
DartNTZ.NTZ32(RunTime): 150.19256867162275 us.
DartNTZ.NTZ64(RunTime): 149.48622578475337 us.
Issue: #37789
Change-Id: Ie8b16b2602a15140df9f0f21199ee9eb2bbf158c
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/115012
Reviewed-by: Aart Bik <ajcbik@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Samir Jindel <sjindel@google.com>
Commit-Queue: Alexander Markov <alexmarkov@google.com>
Type inference preliminary improvements \o/
Much better loop-body, no null check:

B3[target]:24 specu repr=tagged
v35 <- UnaryInt64Op(unary-, v9 T{int}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
v37 <- BinaryInt64Op(& [tr], v9 T{int}, v35) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
v40 <- BinaryInt64Op(% [tr], v37, v79) [0, 36] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
v69 <- BoxInt64(v40) [0, 36] T{_Smi} specu repr=tagged tp=T{_Smi} #uses=1 #env-uses=0
0: v48 <- LoadIndexed(v50, v69) specu repr=tagged tp=T{*?} #uses=1 #env-uses=0
v71 <- UnboxInt64(v48 T{_Smi}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
v17 <- BinaryInt64Op(+ [tr], v7 T{int}, v71 T{_Smi}) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
v20 <- BinaryInt64Op(+ [tr], v9 T{int}, v81) [-9223372036854775808, 9223372036854775807] T{int} repr=int64 tp=T{int} #uses=1 #env-uses=0
goto:42 B5 specu repr=tagged
To consider before closing this bug
Several comments to wrap this up.
@pragma("vm:non-nullable-result-type")
num operator %(num other) {
  if ((other is int) && (other == 0)) {
    throw const IntegerDivisionByZeroException();
  }
  return other._moduloFromInteger(this);
}

int _moduloFromInteger(int other) native "Integer_moduloFromInteger";

One last comment:
Rationale: Follow up on prior CL, now with the ARM64 version. Unit tests were included in that previous CL (https://dart-review.googlesource.com/c/sdk/+/114901). #37789
Change-Id: I6177b48743cd615914dbfbddab166a6c71ef71c7
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/115064
Reviewed-by: Ryan Macnak <rmacnak@google.com>
Commit-Queue: Aart Bik <ajcbik@google.com>
I would guess (out of thin air, but based on the fact that our native calls are super heavy) that calling the ARM ABI function directly would probably be at least 2x faster than the current implementation, which jumps into the runtime. The implementation is fairly straightforward, so it might well make sense to just do it for the sake of engineering excellence.
@mraleph Happy to investigate that too if your intuition is that there is more performance to be had there! But wouldn't it be true engineering excellence if we actually made native calls faster? As it is, we have a lot of different mechanisms (code selection, intrinsifier, native methods, etc.); adding yet another one (calling into the ARM ABI) seems like spreading our logic rather thin. @jtmcdole please also pay attention to the last comment in bold above; if speeding up trailing zeros is the endgame, perhaps we should add native support in Dart, so that the compiler can pick the best instruction on any given architecture.
My code for number of trailing zeros should be pretty fast: https://gist.github.com/jtmcdole/297434f327077dbfe5fb19da3b4ef5be
But a const array whose length is the modulus still incurs a range check when indexed by that modulo result. Also, modulo is apparently slow in all arm64 and arm32 Dart VM snapshots?
https://github.com/dart-lang/sdk/blob/master/runtime/vm/cpu_arm.cc#L227
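For context, the mod-37 trick in the gist follows the classic "Bit Twiddling Hacks" formulation: `x & -x` isolates the lowest set bit, and since the 32-bit powers of two are all distinct modulo 37, a 37-entry table recovers the bit position. A sketch reconstructed from the well-known public table (not copied from the gist):

```dart
// Number of trailing zeros for 32-bit values via the mod-37 lookup table
// (Sean Eron Anderson's "Bit Twiddling Hacks"). Powers of two are distinct
// mod 37, so (x & -x) % 37 uniquely identifies the lowest set bit.
const List<int> _ntzLut32 = [
  32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4, 7, 17, 0,
  25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5, 20, 8, 19, 18,
];

int ntz32(int x) => _ntzLut32[(x & -x) % 37];

void main() {
  print(ntz32(1)); // 0
  print(ntz32(8)); // 3
  print(ntz32(0)); // 32 (the table's convention: ntz(0) == 32)
}
```

The 64-bit variant works the same way with modulus 131, since the 64-bit powers of two are distinct mod 131.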