VM pattern code less efficient than manual check #54114

lrhn · 2023-11-21T10:51:54Z

class C {
  final int? _x;
  C(this._x);

  @pragma("vm:never-inline")
  void manual() {
    var x = _x;
    if (x != null) print(x);
    else print("null");
  }

  @pragma("vm:never-inline")
  void pattern() {
    if (_x case var x?) print(x);
    else print("null");
  }

  @pragma("vm:never-inline")
  void promote() {
    if (_x != null) print(_x);
    else print("null");
  }
}

If I compile this with dart compile exe (and flags to dump disassemble), the code for the pattern method is less efficient than the other two.

The manual and promote methods both compile to function code (minus intro/stack check) of:

0x7fe5fb8b4192    488b4510               movq rax,[rbp+0x10]
0x7fe5fb8b4196    488b4807               movq rcx,[rax+0x7]
0x7fe5fb8b419a    493b4e68               cmpq rcx,[thr+0x68]   null
0x7fe5fb8b419e    0f840e000000           jz 0x00007fe5fb8b41b2
0x7fe5fb8b41a4    48890c24               movq [rsp],rcx
0x7fe5fb8b41a8    e8fbffffff             call 0x00007fe5fb8b41a8
0x7fe5fb8b41ad    e910000000             jmp 0x00007fe5fb8b41c2
0x7fe5fb8b41b2    4d8b9f17170000         movq tmp,[pp+0x1717]   null
0x7fe5fb8b41b9    4c891c24               movq [rsp],tmp
0x7fe5fb8b41bd    e8fbffffff             call 0x00007fe5fb8b41bd
0x7fe5fb8b41c2    498b4668               movq rax,[thr+0x68]   null
0x7fe5fb8b41c6    4889ec                 movq rsp,rbp
0x7fe5fb8b41c9    5d                     pop rbp
0x7fe5fb8b41ca    c3                     ret

That's basically $tmp = this._x; if ($tmp != null) print ($tmp); else print("null"); return null;

The pattern version produces:

0x7fe5fb8b4102    488b4510               movq rax,[rbp+0x10]
0x7fe5fb8b4106    488b4807               movq rcx,[rax+0x7]
0x7fe5fb8b410a    493b4e68               cmpq rcx,[thr+0x68]   null
0x7fe5fb8b410e    0f840c000000           jz 0x00007fe5fb8b4120
0x7fe5fb8b4114    4889c8                 movq rax,rcx
0x7fe5fb8b4117    498b4e70               movq rcx,[thr+0x70]   true
0x7fe5fb8b411b    e908000000             jmp 0x00007fe5fb8b4128
0x7fe5fb8b4120    498b4e78               movq rcx,[thr+0x78]   false
0x7fe5fb8b4124    498b4668               movq rax,[thr+0x68]   null
0x7fe5fb8b4128    f6c110                 testb rcx,0x10
0x7fe5fb8b412b    0f850e000000           jnz 0x00007fe5fb8b413f
0x7fe5fb8b4131    48890424               movq [rsp],rax
0x7fe5fb8b4135    e8fbffffff             call 0x00007fe5fb8b4135
0x7fe5fb8b413a    e910000000             jmp 0x00007fe5fb8b414f
0x7fe5fb8b413f    4d8b9f17170000         movq tmp,[pp+0x1717]   null
0x7fe5fb8b4146    4c891c24               movq [rsp],tmp
0x7fe5fb8b414a    e8fbffffff             call 0x00007fe5fb8b414a
0x7fe5fb8b414f    498b4668               movq rax,[thr+0x68]   null
0x7fe5fb8b4153    4889ec                 movq rsp,rbp
0x7fe5fb8b4156    5d                     pop rbp
0x7fe5fb8b4157    c3                     ret

which is closer to:

$tmp = this._x
if (tmp != null) 
  $tmp2 = $tmp1
  $tmp3 = true
else
  $tmp3 = false
  $tmp2 = null
if (tmp3 == true)
  print($tmp2)
else
  print("null")
return null;

It looks like some desugaring of patterns ends up reifying the match result as a boolean value, and introducing a temporary variable for the non-matching branch too, in a way that the VM compiler doesn't manage (or try?) to optimize afterwards.

I'd prefer to use the case version, but only if it's actually as efficient.

The text was updated successfully, but these errors were encountered:

mkustermann · 2023-11-21T11:18:17Z

/cc @alexmarkov

rakudrama · 2023-11-21T23:34:59Z

@alexmarkov I would love to learn from how you approach this - if you could cc me on changes I would be grateful.

alexmarkov · 2023-11-27T15:46:39Z

Flow graph:

  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v4 <- Constant(#false) T{bool}
      v7 <- Constant(#true) T{bool}
      v8 <- Constant(#null) T{_OneByteString}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{C}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove rax <- S+2
  6:     v3 <- LoadField(v2 . _x@13380480 {final}) T{_Smi?}
  8:     Branch if StrictCompare:16(!==, v3, v0) goto (3, 4)
 10: B3[target]:20
 12:     ParallelMove rcx <- C, rax <- rcx goto:28 B5
 14: B4[target]:24
 16:     ParallelMove rcx <- C, rax <- C goto:30 B5
 18: B5[join]:26 pred(B3, B4) {
      v5 <- phi(v7, v4) alive T{bool}
      v6 <- phi(v3 T{_Smi}, v0) alive T{_Smi?}
}
 20:     Branch if StrictCompare:36(===, v5 T{bool}, v7) goto (6, 7)
 22: B6[target]:40
 24:     MoveArgument(v6, SP+0)
 26:     StaticCall:42( print<0> v6)
 28:     ParallelMove  goto:50 B8
 30: B7[target]:44
 32:     MoveArgument(v8 T{_OneByteString}, SP+0)
 34:     StaticCall:16( printToConsole<0> v8 T{_OneByteString})
 36:     ParallelMove  goto:52 B8
 38: B8[join]:48 pred(B6, B7)
 39:     ParallelMove rax <- C
 40:     Return:56(v0)
*** END CFG

Optimizer can match an empty basic block (B5) ending with a branch which compares Phi of constants (v5) with a constant (v7), and then re-route incoming edges directly to the outgoing (B3 -> B6, B4 -> B7), eliminating the empty block. Uses of phis in this block (such as v6 <- phi(v3 T{_Smi}, v0)) should be replaced with their corresponding arguments along each outgoing edge.

However, before implementing this optimization, I'd like to figure out why CFE created such lowering at the first place.

Kernel:

        hoisted core::int x;
        final synthesized core::int? #0#0 = [@vm.direct-call.metadata=library file:///usr/local/google/home/alexmarkov/dart/sdk/foo.dart::C._x] [@vm.inferred-type.metadata=dart.core::_Smi?] this.{foo::C::_x}{core::int?};
        if(!(#0#0 == null) ?{core::bool} let final dynamic #t1 = x = #0#0{core::int} in true : false)
          core::print(x);
        else
          core::print("null");

CFE could use && let ... in true instead of ternary ? let ... true : false to make this code easier to optimize:

        if(!(#0#0 == null) && let final dynamic #t1 = x = #0#0{core::int} in true)
          core::print(x);
        else
          core::print("null");

With this lowering, right-hand side of && is placed into a separate basic block which ends with constant comparison which is replaced with an unconditional branch and basic blocks of x = #0#0{core::int} and core::print(x); are merged.

@johnniwinther @chloestefantsova Is there any reason to use ternary ?: operator in this lowering? Is it easy/feasible to change the lowering to use && for the additional code which should be executed only when pattern matches?

@rakudrama How would this change in pattern lowering affect dart2js?

lrhn added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size labels Nov 21, 2023

alexmarkov self-assigned this Nov 21, 2023

alexmarkov added P2 A bug or feature request we're likely to work on triaged Issue has been triaged by sub team labels Nov 21, 2023

rakudrama mentioned this issue Nov 22, 2023

Dart2JS pattern code less efficient than manual check. #54115

Open

copybara-service bot closed this as completed in ad625df Nov 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VM pattern code less efficient than manual check #54114

VM pattern code less efficient than manual check #54114

lrhn commented Nov 21, 2023 •

edited

Loading

mkustermann commented Nov 21, 2023

rakudrama commented Nov 21, 2023

alexmarkov commented Nov 27, 2023

VM pattern code less efficient than manual check #54114

VM pattern code less efficient than manual check #54114

Comments

lrhn commented Nov 21, 2023 • edited Loading

mkustermann commented Nov 21, 2023

rakudrama commented Nov 21, 2023

alexmarkov commented Nov 27, 2023

lrhn commented Nov 21, 2023 •

edited

Loading