VM inlining curiosity

Before https://dart-review.googlesource.com/c/sdk/+/385722 a compile of the CFE (an aot compile of `pkg/front_end/tool/_fasta/compile.dart` is what I'll be using below) includes two scanners: The `Utf8BytesScanner` and `StringScanner`. After the CL landed it it only included one.

Before, or indeed now if I include it again, say by applying this patch (here done on top of 329760d5cf81589b27819ee3458d46320e4ad2a0):

```
diff --git a/pkg/front_end/lib/src/kernel_generator_impl.dart b/pkg/front_end/lib/src/kernel_generator_impl.dart
index 2daf9ed7002..86673e57c33 100644
--- a/pkg/front_end/lib/src/kernel_generator_impl.dart
+++ b/pkg/front_end/lib/src/kernel_generator_impl.dart
@@ -5,6 +5,7 @@
 /// Defines the front-end API for converting source code to Dart Kernel objects.
 library front_end.kernel_generator_impl;
 
+import 'package:_fe_analyzer_shared/src/scanner/scanner.dart';
 import 'package:_fe_analyzer_shared/src/messages/severity.dart' show Severity;
 import 'package:kernel/ast.dart';
 import 'package:kernel/class_hierarchy.dart';
@@ -77,6 +78,11 @@ Future<InternalCompilerResult> generateKernelInternal(
   ProcessedOptions options = compilerContext.options;
   assert(options.haveBeenValidated, "Options have not been validated");
 
+  if (1+1==3) {
+    ScannerResult scannerResult = scanString("void main(){}");
+    print(scannerResult);
+  }
+
   options.reportNullSafetyCompilationModeInfo();
   FileSystem fs = options.fileSystem;
```

looking at an AOT compile on the function `AbstractScanner.select` at what happens to `advance` we can see that it checks the class id and then inlines both of the two versions:

```
[...]

903547  v91 <- LoadClassId(v2) [2073, 2074] int64
   0x00000000006337ea <+34>:    mov    -0x1(%rdx),%ebx
   0x00000000006337ed <+37>:    shr    $0xc,%ebx

903548  ParallelMove fp[-1] <- rbx
   0x00000000006337f0 <+40>:    mov    %rbx,-0x8(%rbp)

903549  Branch if EqualityCompare:10(v91 == v52) goto (17, 27)
   0x00000000006337f4 <+44>:    cmp    $0x819,%rbx
   0x00000000006337fb <+51>:    jne    0x633855 <AbstractScanner.select+141>

903550  B17
903551  v58 <- LoadField(v2 T{Utf8BytesScanner} . byteOffset) [-9223372036854775808, 9223372036854775807] int64
   0x0000000000633801 <+57>:    mov    0x9f(%rdx),%rdi

[...]

903564  B27
903565  v74 <- LoadField(v2 T{StringScanner} . scanOffset) [-9223372036854775808, 9223372036854775807] int64
   0x0000000000633855 <+141>:   mov    0x9f(%rdx),%rdi

[...]
```

With only one included, though, it no longer inlines it, and the method disassembly looks like this:

```
[...]
901662  v6 <- StaticCall:10( advance<0> v2, using unchecked entrypoint, result_type = T{_Smi}) [-4611686018427387904, 4611686018427387903] int64
   0x0000000000631559 <+49>:    call   0x62fe60 <Utf8BytesScanner.advance>
[...]
```

I can ask the VM nicely for it to be inlined by applying `@pragma("vm:prefer-inline")`, making it in fact inline it:

```
diff --git a/pkg/_fe_analyzer_shared/lib/src/scanner/utf8_bytes_scanner.dart b/pkg/_fe_analyzer_shared/lib/src/scanner/utf8_bytes_scanner.dart
index 2ea201ed8f5..bedc0edc015 100644
--- a/pkg/_fe_analyzer_shared/lib/src/scanner/utf8_bytes_scanner.dart
+++ b/pkg/_fe_analyzer_shared/lib/src/scanner/utf8_bytes_scanner.dart
@@ -126,6 +126,7 @@ class Utf8BytesScanner extends AbstractScanner {
 
   @override
   @pragma('vm:unsafe:no-bounds-checks')
+  @pragma("vm:prefer-inline")
   int advance() {
     // Always increment so byteOffset goes past the end.
     ++byteOffset;
```

```
[...]

901884  v32 <- LoadField(v2 T{Utf8BytesScanner} . byteOffset) [-9223372036854775808, 9223372036854775807] int64
   0x00000000006318ea <+34>:    mov    0x9f(%rdx),%rbx

901885  ParallelMove rbx <- rbx
901886  v34 <- BinaryInt64Op(+ [tr], v32, v72 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
   0x00000000006318f1 <+41>:    add    $0x1,%rbx

901887  StoreField(v2 T{Utf8BytesScanner} . byteOffset = v34 T{int} <int64>)
   0x00000000006318f5 <+45>:    mov    %rbx,0x9f(%rdx)

[...]
```

In fact inlining it is (slightly) faster --- here the difference having it inlined when compiling a fixed version of the CFE with itself:
```
instructions:u: -0.1007% +/- 0.0006% (-21314554.00 +/- 122170.23)
```

Knowing that @mraleph for instance is sceptical about the instruction count, looking instead on the scanner benchmark `pkg/_fe_analyzer_shared/test/scanner_benchmark.dart` this too has two live scanners, and they're inlined by default (`.1` below). Out-commenting the string one (leaving only one) does `StaticCall`s (`.2` below). Asking nicely for it to be inlines and it is (`.3` below) compiled like `out/ReleaseX64/dart-sdk/bin/dart compile aot-snapshot pkg/_fe_analyzer_shared/test/scanner_benchmark.dart` (and moved to `.1` etc), run like e.g. `out/ReleaseX64/dart-sdk/bin/dartaotruntime pkg/_fe_analyzer_shared/test/scanner_benchmark.aot.1 pkg/kernel/lib/src/equivalence.dart --bytes` gives these runtimes:
`.1`: ~103.1 bytes per microsecond
`.2`: ~108.7 bytes per microsecond
`.3`: ~112.9 bytes per microsecond

So having only one non-inlined one processes `5.47075% +/- 1.30016%` more bytes per time unit compared to having two inlined ones. So far so good. Inlining `advance` makes it process `3.8906% +/- 0.812575%` more bytes still per time unit though.

I guess my questions are this:
1) Why did it go from being inlined to not being inlined? That's quite a surprising result and as we've seen it would have been faster had it still been inlined.
Depending on the answer to that,
2) why wasn't the before a class-id check and two (specific) `StaticCall`s instead? (probably that would be slower, which is why it depends on the answer to 1 above).
and also
3) If - when in the situation where we have two options and it inlines by default - I mark both as `@pragma("vm:never-inline")` it doesn't check class id's and instead does a `DispatchTableCall` which makes sense. But if I only mark one of them with `@pragma("vm:never-inline")` it inlines the non-marked on and does a `DispatchTableCall` on the other one --- which seems to me like it could have been a `StaticCall` considering it should now know exactly what is left. Why doesn't it?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

VM inlining curiosity #56975

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

VM inlining curiosity #56975

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions