VM/ia32: Bad register allocation in _StringBase._createOneByteString if it is inlined into createFromCharCodes #37800

alexmarkov · 2019-08-09T21:42:00Z

On IA32, Base64Encoder golem benchmark:

Base64Encoder(RunTime): 47.596634102808196 us.

If _StringBase._createOneByteString is inlined into _StringBase.createFromCharCodes, this slows down benchmark:

Base64Encoder(RunTime): 63.26622740644672 us.

Hot loop in _StringBase._createOneByteString:

0xf68add7b |   | 0.95% (25) | cmp esp,[esi+0x24] |  
0xf68add7e |   |            | jna 0xf68ade4e |  
0xf68add84 |   | 1.68% (44) | cmp ebx,edi |  
0xf68add86 |   |            | jge 0xf68addaa |  
0xf68add8c |   | 0.38% (10) | mov eax,[ebp+0xc] |  
0xf68add8f |   | 0.91% (24) | add eax,ebx |  
0xf68add91 |   | 1.52% (40) | mov edi,eax |  
0xf68add93 |   | 1.68% (44) | sar edi,1 |  
0xf68add95 |   | 0.65% (17) | movzxb eax,[ecx+edi*1+0xb] |  
0xf68add9a |   | 2.48% (65) | mov edi,ebx |  
0xf68add9c |   | 0.84% (22) | sar edi,1 |  
0xf68add9e |   | 1.03% (27) | movb [edx+edi*1+0xb],eax |  
0xf68adda2 |   | 1.60% (42) | add ebx,2 |  
0xf68adda5 |   | 0.42% (11) | mov edi,[ebp+0x8] |  
0xf68adda8 |   | 0.72% (19) | jmp 0xf68add7b

The same loop if _StringBase._createOneByteString is inlined:

0xf686f001 |   | 2.86% (74)  | cmp esp,[esi+0x24] |  
0xf686f004 |   |             | jna 0xf686f4ec |  
0xf686f00a |   |             | cmp eax,ebx |  
0xf686f00c |   |             | jge 0xf686f09c |  
0xf686f012 |   |             | mov ecx,[ebp+0x10] |  
0xf686f015 |   | 0.04% (1)   | add ecx,eax |  
0xf686f017 |   | 2.74% (71)  | mov ebx,ecx |  
0xf686f019 |   |             | sar ebx,1 |  
0xf686f01b |   |             | movzxb ecx,[edi+ebx*1+0xb] |  
0xf686f020 |   | 0.23% (6)   | xchg ecx,eax |  
0xf686f022 |   | 3.98% (103) | mov ebx,ecx |  
0xf686f024 |   |             | sar ebx,1 |  
0xf686f026 |   | 2.12% (55)  | movb [edx+ebx*1+0xb],eax |  
0xf686f02a |   | 14.90% (386)| add ecx,2 |  
0xf686f02d |   | 0.08% (2)   | mov eax,ecx |  
0xf686f02f |   |             | mov ebx,[ebp-0xc] |  
0xf686f032 |   |             | jmp 0xf686f001

The following patch can be used to reproduce the problem:

diff --git a/runtime/lib/string_patch.dart b/runtime/lib/string_patch.dart
index 6bf950f911..48c53770bd 100644
--- a/runtime/lib/string_patch.dart
+++ b/runtime/lib/string_patch.dart
@@ -219,6 +219,7 @@ abstract class _StringBase implements String {
     return createFromCharCodes(charCodeList, 0, length, bits);
   }
 
+  @pragma("vm:prefer-inline")
   static String _createOneByteString(List<int> charCodes, int start, int len) {
     // It's always faster to do this in Dart than to call into the runtime.
     var s = _OneByteString._allocate(len);

Note that _StringBase._createOneByteString satisfies our inlining_callee_call_sites_threshold heuristic if all integer operations are recognized. Currently it is not inlined because not all integer operations are recognized (there is a remaining unrecognized call). However, with bytecode all integer operations are replaced with BinarySmiOp and _StringBase._createOneByteString is inlined causing performance regression.

/cc @mraleph @mkustermann @aartbik

The text was updated successfully, but these errors were encountered:

If _StringBase._createOneByteString is inlined it becomes much slower due to #37800. This workaround improves Base64Encoder benchmark on ia32 in bytecode mode: Base64Encoder(RunTime): 63.31 -> 46.7 us. Issue: #37800 Change-Id: Ibcf5dc7bfd25bac0bd61a9dcb196c553c82f3dc1 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/112647 Reviewed-by: Aart Bik <ajcbik@google.com> Reviewed-by: Martin Kustermann <kustermann@google.com> Commit-Queue: Alexander Markov <alexmarkov@google.com>

aartbik · 2019-10-25T20:54:39Z

This has been in my "icebox" for a long time. The status of (not)inlining _createOneByteString on IA32 using mainline on my desktop using ReleaseIA32 is as follows, so we still have the potential perf regression hanging above our head. Looking....

         no-inline (never pragma)  default (no pragma)  inline (always pragma)
JIT IA32     54.45 us.              54.57 us.          71.93us.

alexmarkov added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size labels Aug 9, 2019

aartbik self-assigned this Aug 12, 2019

aartbik removed their assignment Nov 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VM/ia32: Bad register allocation in _StringBase._createOneByteString if it is inlined into createFromCharCodes #37800

VM/ia32: Bad register allocation in _StringBase._createOneByteString if it is inlined into createFromCharCodes #37800

alexmarkov commented Aug 9, 2019

aartbik commented Oct 25, 2019 •

edited

Loading

VM/ia32: Bad register allocation in _StringBase._createOneByteString if it is inlined into createFromCharCodes #37800

VM/ia32: Bad register allocation in _StringBase._createOneByteString if it is inlined into createFromCharCodes #37800

Comments

alexmarkov commented Aug 9, 2019

aartbik commented Oct 25, 2019 • edited Loading

aartbik commented Oct 25, 2019 •

edited

Loading