Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VM/ia32: Bad register allocation in _StringBase._createOneByteString if it is inlined into createFromCharCodes #37800

Open
alexmarkov opened this issue Aug 9, 2019 · 1 comment
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size

Comments

@alexmarkov
Copy link
Contributor

On IA32, Base64Encoder golem benchmark:

Base64Encoder(RunTime): 47.596634102808196 us.

If _StringBase._createOneByteString is inlined into _StringBase.createFromCharCodes, this slows down benchmark:

Base64Encoder(RunTime): 63.26622740644672 us.

Hot loop in _StringBase._createOneByteString:

0xf68add7b |   | 0.95% (25) | cmp esp,[esi+0x24] |  
0xf68add7e |   |            | jna 0xf68ade4e |  
0xf68add84 |   | 1.68% (44) | cmp ebx,edi |  
0xf68add86 |   |            | jge 0xf68addaa |  
0xf68add8c |   | 0.38% (10) | mov eax,[ebp+0xc] |  
0xf68add8f |   | 0.91% (24) | add eax,ebx |  
0xf68add91 |   | 1.52% (40) | mov edi,eax |  
0xf68add93 |   | 1.68% (44) | sar edi,1 |  
0xf68add95 |   | 0.65% (17) | movzxb eax,[ecx+edi*1+0xb] |  
0xf68add9a |   | 2.48% (65) | mov edi,ebx |  
0xf68add9c |   | 0.84% (22) | sar edi,1 |  
0xf68add9e |   | 1.03% (27) | movb [edx+edi*1+0xb],eax |  
0xf68adda2 |   | 1.60% (42) | add ebx,2 |  
0xf68adda5 |   | 0.42% (11) | mov edi,[ebp+0x8] |  
0xf68adda8 |   | 0.72% (19) | jmp 0xf68add7b

The same loop if _StringBase._createOneByteString is inlined:

0xf686f001 |   | 2.86% (74)  | cmp esp,[esi+0x24] |  
0xf686f004 |   |             | jna 0xf686f4ec |  
0xf686f00a |   |             | cmp eax,ebx |  
0xf686f00c |   |             | jge 0xf686f09c |  
0xf686f012 |   |             | mov ecx,[ebp+0x10] |  
0xf686f015 |   | 0.04% (1)   | add ecx,eax |  
0xf686f017 |   | 2.74% (71)  | mov ebx,ecx |  
0xf686f019 |   |             | sar ebx,1 |  
0xf686f01b |   |             | movzxb ecx,[edi+ebx*1+0xb] |  
0xf686f020 |   | 0.23% (6)   | xchg ecx,eax |  
0xf686f022 |   | 3.98% (103) | mov ebx,ecx |  
0xf686f024 |   |             | sar ebx,1 |  
0xf686f026 |   | 2.12% (55)  | movb [edx+ebx*1+0xb],eax |  
0xf686f02a |   | 14.90% (386)| add ecx,2 |  
0xf686f02d |   | 0.08% (2)   | mov eax,ecx |  
0xf686f02f |   |             | mov ebx,[ebp-0xc] |  
0xf686f032 |   |             | jmp 0xf686f001

The following patch can be used to reproduce the problem:

diff --git a/runtime/lib/string_patch.dart b/runtime/lib/string_patch.dart
index 6bf950f911..48c53770bd 100644
--- a/runtime/lib/string_patch.dart
+++ b/runtime/lib/string_patch.dart
@@ -219,6 +219,7 @@ abstract class _StringBase implements String {
     return createFromCharCodes(charCodeList, 0, length, bits);
   }
 
+  @pragma("vm:prefer-inline")
   static String _createOneByteString(List<int> charCodes, int start, int len) {
     // It's always faster to do this in Dart than to call into the runtime.
     var s = _OneByteString._allocate(len);

Note that _StringBase._createOneByteString satisfies our inlining_callee_call_sites_threshold heuristic if all integer operations are recognized. Currently it is not inlined because not all integer operations are recognized (there is a remaining unrecognized call). However, with bytecode all integer operations are replaced with BinarySmiOp and _StringBase._createOneByteString is inlined causing performance regression.

/cc @mraleph @mkustermann @aartbik

@alexmarkov alexmarkov added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size labels Aug 9, 2019
dart-bot pushed a commit that referenced this issue Aug 12, 2019
If _StringBase._createOneByteString is inlined it becomes much slower
due to #37800.

This workaround improves Base64Encoder benchmark on ia32 in bytecode mode:

Base64Encoder(RunTime): 63.31 -> 46.7 us.

Issue: #37800
Change-Id: Ibcf5dc7bfd25bac0bd61a9dcb196c553c82f3dc1
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/112647
Reviewed-by: Aart Bik <ajcbik@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Commit-Queue: Alexander Markov <alexmarkov@google.com>
@aartbik aartbik self-assigned this Aug 12, 2019
@aartbik
Copy link
Contributor

aartbik commented Oct 25, 2019

This has been in my "icebox" for a long time. The status of (not)inlining _createOneByteString on IA32 using mainline on my desktop using ReleaseIA32 is as follows, so we still have the potential perf regression hanging above our head. Looking....

         no-inline (never pragma)  default (no pragma)  inline (always pragma)
JIT IA32     54.45 us.              54.57 us.          71.93us.

@aartbik aartbik removed their assignment Nov 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size
Projects
None yet
Development

No branches or pull requests

2 participants