System.Numerics.Vector: Recognize division by a constant and inline #43761

nietras · 2020-10-23T10:14:39Z

Sharplab

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgDkBXfGKASzFwG4BYAKD+IBmcgwB2ubADMYlAEzkAwnwDefcuvKRxGcq1E6AYtjAZo5ALzlZpFAA5O5APSOr5AHrkqVNRq24dejoAygAWrJI6ll48vBrksNgAJhCiADYAnuQAajAm0AA8gQB82bmmUEZ5UBbkojAA7qVVhfpFABSV5QCUMXE+6kJN5S0YJQAirABurIkwbTnNxeSTXRYlk05D0J3QvRp8AL5AA

Given

using System.Numerics;

public unsafe class C
{
    const int Factor = 2048; // 2 ^ 11
    const int Shift = 11;
    readonly Vector<int> VectorFactor = new Vector<int>(Factor);
    
    public Vector<int> Divide(Vector<int> v) => v / VectorFactor;   
}

Output

The assembly output for x64 is:

; Core CLR v4.700.20.41105 on amd64

C..ctor()
    L0000: vzeroupper
    L0003: mov eax, 0x800
    L0008: vmovd xmm0, eax
    L000c: vpbroadcastd ymm0, xmm0
    L0011: vmovupd [rcx+8], ymm0
    L0016: vzeroupper
    L0019: ret

C.Divide(System.Numerics.Vector`1<Int32>)
    L0000: push rsi
    L0001: sub rsp, 0x50
    L0005: vzeroupper
    L0008: mov rsi, rdx
    L000b: mov rdx, rsi
    L000e: vmovupd ymm0, [rcx+8]
    L0013: vmovupd [rsp+0x20], ymm0
    L0019: mov rcx, rdx
    L001c: mov rdx, r8
    L001f: lea r8, [rsp+0x20]
    L0024: call System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]].op_Division(System.Numerics.Vector`1<Int32>, System.Numerics.Vector`1<Int32>)
    L0029: mov rax, rsi
    L002c: vzeroupper
    L002f: add rsp, 0x50
    L0033: pop rsi
    L0034: ret

Expected

The division is inlined and division by constant is recognized and "optimal" SIMD is emitted e.g. with shifts or similar.

cc: @tannergooding

category:cq
theme:inlining
skill-level:intermediate
cost:medium

nietras · 2020-10-23T12:59:27Z

The issue created by @EgorBo is probably better and more detailed see #43759

EgorBo · 2020-10-23T13:00:46Z

I closed it in the favor of yours since you noticed the problem 🙂

EgorBo · 2020-10-23T19:28:50Z

@AndyAyersMS side issue: any idea what can be done with Vector.ScalarDivide call here (it's not inlined even with AggressiveInlining):?

Vector<int> Test(Vector<int> v1, Vector<int> v2) => v1 / v2;

Codegen:

G_M26366_IG01:
       push     rdi
       push     rsi
       sub      rsp, 136
       vzeroupper 
       mov      rsi, rdx
G_M26366_IG02:
       vmovupd  ymm0, ymmword ptr[r8]
       vmovupd  ymmword ptr[rsp+40H], ymm0
       vmovupd  ymm0, ymmword ptr[r9]
       vmovupd  ymmword ptr[rsp+20H], ymm0
       vxorps   ymm0, ymm0
       vmovupd  ymmword ptr[rsp+60H], ymm0
       xor      rdi, rdi
G_M26366_IG03:
       lea      rcx, bword ptr [rsp+40H]
       mov      ecx, dword ptr [rcx+4*rdi]
       lea      rdx, bword ptr [rsp+20H]
       mov      edx, dword ptr [rdx+4*rdi]
       call     System.Numerics.Vector`1[Int32][System.Int32]:ScalarDivide(int,int):int  ;; <====== not inlined
       lea      rdx, bword ptr [rsp+60H]
       mov      dword ptr [rdx+4*rdi], eax
       inc      rdi
       cmp      rdi, 8
       jl       SHORT G_M26366_IG03
G_M26366_IG04:
       vmovupd  ymm0, ymmword ptr[rsp+60H]
       vmovupd  ymmword ptr[rsi], ymm0
       mov      rax, rsi
G_M26366_IG05:
       vzeroupper 
       add      rsp, 136
       pop      rsi
       pop      rdi
       ret

That ScalarDivide has [MethodImpl(MethodImplOptions.AggressiveInlining)] and obviously it's inlined into a single instruction actually (despite the fact that it's a lot of IL due to box+unbox).: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Numerics/Vector_1.cs#L1229-L1231

BudgetCheck: for IL Size 627, timeDelta 1240 +  currentEstimate 162 > currentBudget 840
Inlining [000107] failed

So it exceeds the time budget but I'd love to inline it anyway to get a single division instruction.

AndyAyersMS · 2020-10-23T20:38:45Z

This is the same issue as we see over in #41692, and I think the suggestion I made there should work.

Inlines into 06000001 C:Divide(Vector`1):Vector`1:this
  [1 IL=0007 TR=000003 06001DB3] [profitable inline] Vector`1:op_Division(Vector`1,Vector`1):Vector`1
    [2 IL=0019 TR=000031 06001DE1] [aggressive inline attribute] Vector`1:GetElement(long):int:this
      [3 IL=0001 TR=000063 0600641A] [aggressive inline attribute] Unsafe:AsRef(byref):byref
      [4 IL=0006 TR=000065 06006407] [aggressive inline attribute] Unsafe:As(byref):byref
      [5 IL=0012 TR=000067 06006409] [aggressive inline attribute] Unsafe:Add(byref,long):byref
    [6 IL=0027 TR=000036 06001DE1] [aggressive inline attribute] Vector`1:GetElement(long):int:this
      [7 IL=0001 TR=000086 0600641A] [aggressive inline attribute] Unsafe:AsRef(byref):byref
      [8 IL=0006 TR=000088 06006407] [aggressive inline attribute] Unsafe:As(byref):byref
      [9 IL=0012 TR=000090 06006409] [aggressive inline attribute] Unsafe:Add(byref,long):byref
    [0 IL=0032 TR=000041 06001DDA] [FAILED: inline exceeds budget] Vector`1:ScalarDivide(int,int):int
    [10 IL=0037 TR=000043 06001DE2] [aggressive inline attribute] Vector`1:SetElement(long,int):this
      [11 IL=0001 TR=000110 0600641A] [aggressive inline attribute] Unsafe:AsRef(byref):byref
      [12 IL=0006 TR=000112 06006407] [aggressive inline attribute] Unsafe:As(byref):byref
      [13 IL=0012 TR=000114 06006409] [aggressive inline attribute] Unsafe:Add(byref,long):byref
    [0 IL=0032 TR=000041 06001DDA] [FAILED: inline exceeds budget] Vector`1:ScalarDivide(int,int):int

Basically: for an aggressive inline, we budget check incrementally as we import rather than doing the check up front, so if the method turns out to in fact be small because of importer trimming, the inline will succeed.

If the method ends up not being small, we'll reject the inline, and will have wasted some time/memory in the partial inlining attempt, but hopefully that's a fairly rare occurrence.

EgorBo · 2021-03-30T13:49:37Z

Marking as Future,
Re-using logic to optimize "x div c" (scalar) for vectors sounds like a lot of work.

… opportunities Look for IL patterns in an inlinee that are indicative of type folding. Also track when an inlinee has calls that are intrinsics. Use this to boost the inlinee profitability estimate. Also, allow an aggressive inline inline candidate method to go over the budget when it contains type folding or intrinsics (and so the method may be simpler than it appears). Remove the old criteria (top-level inline) which was harder to reason about. Closes dotnet#43761. Closes dotnet#41692. Closes dotnet#51587. Closes dotnet#47434.

nietras added the tenet-performance Performance related issue label Oct 23, 2020

Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Oct 23, 2020

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Oct 23, 2020

AndyAyersMS added this to the 6.0.0 milestone Oct 23, 2020

EgorBo mentioned this issue Oct 27, 2020

Change Task.FromResult to use same task cache as async methods #43894

Merged

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 7, 2020

JulieLeeMSFT assigned EgorBo Mar 23, 2021

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021

EgorBo modified the milestones: 6.0.0, Future Mar 30, 2021

EgorBo removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 30, 2021

AndyAyersMS mentioned this issue Apr 21, 2021

JIT: revise criteria for inline candidates with intrinsics or folding… #51634

Closed

ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 21, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 10, 2021

EgorBo mentioned this issue Jun 29, 2021

[JIT] Improve inliner: new heuristics, rely on PGO data #52708

Merged

EgorBo mentioned this issue Sep 12, 2021

Inliner: next steps #59002

Open

This was referenced Nov 21, 2022

Casting via generic math doesn't always inline as expected #78648

Closed

Allow inlining in some over-bugdet cases #78874

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 26, 2022

EgorBo modified the milestones: Future, 8.0.0 Nov 26, 2022

EgorBo closed this as completed in #78874 Dec 1, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 1, 2022

ghost locked as resolved and limited conversation to collaborators Dec 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

System.Numerics.Vector: Recognize division by a constant and inline #43761

System.Numerics.Vector: Recognize division by a constant and inline #43761

nietras commented Oct 23, 2020 •

edited by AndyAyersMS

Loading

nietras commented Oct 23, 2020

EgorBo commented Oct 23, 2020

EgorBo commented Oct 23, 2020 •

edited

Loading

AndyAyersMS commented Oct 23, 2020

EgorBo commented Mar 30, 2021

System.Numerics.Vector: Recognize division by a constant and inline #43761

System.Numerics.Vector: Recognize division by a constant and inline #43761

Comments

nietras commented Oct 23, 2020 • edited by AndyAyersMS Loading

Sharplab

Given

Output

Expected

nietras commented Oct 23, 2020

EgorBo commented Oct 23, 2020

EgorBo commented Oct 23, 2020 • edited Loading

AndyAyersMS commented Oct 23, 2020

EgorBo commented Mar 30, 2021

nietras commented Oct 23, 2020 •

edited by AndyAyersMS

Loading

EgorBo commented Oct 23, 2020 •

edited

Loading