Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Numerics.Vector: Recognize division by a constant and inline #43761

Closed
nietras opened this issue Oct 23, 2020 · 5 comments · Fixed by #78874
Closed

System.Numerics.Vector: Recognize division by a constant and inline #43761

nietras opened this issue Oct 23, 2020 · 5 comments · Fixed by #78874
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@nietras
Copy link
Contributor

nietras commented Oct 23, 2020

Sharplab

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgDkBXfGKASzFwG4BYAKD+IBmcgwB2ubADMYlAEzkAwnwDefcuvKRxGcq1E6AYtjAZo5ALzlZpFAA5O5APSOr5AHrkqVNRq24dejoAygAWrJI6ll48vBrksNgAJhCiADYAnuQAajAm0AA8gQB82bmmUEZ5UBbkojAA7qVVhfpFABSV5QCUMXE+6kJN5S0YJQAirABurIkwbTnNxeSTXRYlk05D0J3QvRp8AL5AA

Given

using System.Numerics;

public unsafe class C
{
    const int Factor = 2048; // 2 ^ 11
    const int Shift = 11;
    readonly Vector<int> VectorFactor = new Vector<int>(Factor);
    
    public Vector<int> Divide(Vector<int> v) => v / VectorFactor;   
}

Output

The assembly output for x64 is:

; Core CLR v4.700.20.41105 on amd64

C..ctor()
    L0000: vzeroupper
    L0003: mov eax, 0x800
    L0008: vmovd xmm0, eax
    L000c: vpbroadcastd ymm0, xmm0
    L0011: vmovupd [rcx+8], ymm0
    L0016: vzeroupper
    L0019: ret

C.Divide(System.Numerics.Vector`1<Int32>)
    L0000: push rsi
    L0001: sub rsp, 0x50
    L0005: vzeroupper
    L0008: mov rsi, rdx
    L000b: mov rdx, rsi
    L000e: vmovupd ymm0, [rcx+8]
    L0013: vmovupd [rsp+0x20], ymm0
    L0019: mov rcx, rdx
    L001c: mov rdx, r8
    L001f: lea r8, [rsp+0x20]
    L0024: call System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]].op_Division(System.Numerics.Vector`1<Int32>, System.Numerics.Vector`1<Int32>)
    L0029: mov rax, rsi
    L002c: vzeroupper
    L002f: add rsp, 0x50
    L0033: pop rsi
    L0034: ret

Expected

The division is inlined and division by constant is recognized and "optimal" SIMD is emitted e.g. with shifts or similar.

cc: @tannergooding

category:cq
theme:inlining
skill-level:intermediate
cost:medium

@nietras nietras added the tenet-performance Performance related issue label Oct 23, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Oct 23, 2020
@nietras
Copy link
Contributor Author

nietras commented Oct 23, 2020

The issue created by @EgorBo is probably better and more detailed see #43759

@EgorBo
Copy link
Member

EgorBo commented Oct 23, 2020

I closed it in the favor of yours since you noticed the problem 🙂

@EgorBo
Copy link
Member

EgorBo commented Oct 23, 2020

@AndyAyersMS side issue: any idea what can be done with Vector.ScalarDivide call here (it's not inlined even with AggressiveInlining):?

Vector<int> Test(Vector<int> v1, Vector<int> v2) => v1 / v2;

Codegen:

G_M26366_IG01:
       push     rdi
       push     rsi
       sub      rsp, 136
       vzeroupper 
       mov      rsi, rdx
G_M26366_IG02:
       vmovupd  ymm0, ymmword ptr[r8]
       vmovupd  ymmword ptr[rsp+40H], ymm0
       vmovupd  ymm0, ymmword ptr[r9]
       vmovupd  ymmword ptr[rsp+20H], ymm0
       vxorps   ymm0, ymm0
       vmovupd  ymmword ptr[rsp+60H], ymm0
       xor      rdi, rdi
G_M26366_IG03:
       lea      rcx, bword ptr [rsp+40H]
       mov      ecx, dword ptr [rcx+4*rdi]
       lea      rdx, bword ptr [rsp+20H]
       mov      edx, dword ptr [rdx+4*rdi]
       call     System.Numerics.Vector`1[Int32][System.Int32]:ScalarDivide(int,int):int  ;; <====== not inlined
       lea      rdx, bword ptr [rsp+60H]
       mov      dword ptr [rdx+4*rdi], eax
       inc      rdi
       cmp      rdi, 8
       jl       SHORT G_M26366_IG03
G_M26366_IG04:
       vmovupd  ymm0, ymmword ptr[rsp+60H]
       vmovupd  ymmword ptr[rsi], ymm0
       mov      rax, rsi
G_M26366_IG05:
       vzeroupper 
       add      rsp, 136
       pop      rsi
       pop      rdi
       ret

That ScalarDivide has [MethodImpl(MethodImplOptions.AggressiveInlining)] and obviously it's inlined into a single instruction actually (despite the fact that it's a lot of IL due to box+unbox).: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Numerics/Vector_1.cs#L1229-L1231

BudgetCheck: for IL Size 627, timeDelta 1240 +  currentEstimate 162 > currentBudget 840
Inlining [000107] failed

So it exceeds the time budget but I'd love to inline it anyway to get a single division instruction.

@AndyAyersMS
Copy link
Member

This is the same issue as we see over in #41692, and I think the suggestion I made there should work.

Inlines into 06000001 C:Divide(Vector`1):Vector`1:this
  [1 IL=0007 TR=000003 06001DB3] [profitable inline] Vector`1:op_Division(Vector`1,Vector`1):Vector`1
    [2 IL=0019 TR=000031 06001DE1] [aggressive inline attribute] Vector`1:GetElement(long):int:this
      [3 IL=0001 TR=000063 0600641A] [aggressive inline attribute] Unsafe:AsRef(byref):byref
      [4 IL=0006 TR=000065 06006407] [aggressive inline attribute] Unsafe:As(byref):byref
      [5 IL=0012 TR=000067 06006409] [aggressive inline attribute] Unsafe:Add(byref,long):byref
    [6 IL=0027 TR=000036 06001DE1] [aggressive inline attribute] Vector`1:GetElement(long):int:this
      [7 IL=0001 TR=000086 0600641A] [aggressive inline attribute] Unsafe:AsRef(byref):byref
      [8 IL=0006 TR=000088 06006407] [aggressive inline attribute] Unsafe:As(byref):byref
      [9 IL=0012 TR=000090 06006409] [aggressive inline attribute] Unsafe:Add(byref,long):byref
    [0 IL=0032 TR=000041 06001DDA] [FAILED: inline exceeds budget] Vector`1:ScalarDivide(int,int):int
    [10 IL=0037 TR=000043 06001DE2] [aggressive inline attribute] Vector`1:SetElement(long,int):this
      [11 IL=0001 TR=000110 0600641A] [aggressive inline attribute] Unsafe:AsRef(byref):byref
      [12 IL=0006 TR=000112 06006407] [aggressive inline attribute] Unsafe:As(byref):byref
      [13 IL=0012 TR=000114 06006409] [aggressive inline attribute] Unsafe:Add(byref,long):byref
    [0 IL=0032 TR=000041 06001DDA] [FAILED: inline exceeds budget] Vector`1:ScalarDivide(int,int):int

Basically: for an aggressive inline, we budget check incrementally as we import rather than doing the check up front, so if the method turns out to in fact be small because of importer trimming, the inline will succeed.

If the method ends up not being small, we'll reject the inline, and will have wasted some time/memory in the partial inlining attempt, but hopefully that's a fairly rare occurrence.

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Oct 23, 2020
@AndyAyersMS AndyAyersMS added this to the 6.0.0 milestone Oct 23, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 7, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@EgorBo
Copy link
Member

EgorBo commented Mar 30, 2021

Marking as Future,
Re-using logic to optimize "x div c" (scalar) for vectors sounds like a lot of work.

@EgorBo EgorBo modified the milestones: 6.0.0, Future Mar 30, 2021
@EgorBo EgorBo removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 30, 2021
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Apr 21, 2021
… opportunities

Look for IL patterns in an inlinee that are indicative of type folding. Also track
when an inlinee has calls that are intrinsics.

Use this to boost the inlinee profitability estimate.

Also, allow an aggressive inline inline candidate method to go over the budget when
it contains type folding or intrinsics (and so the method may be simpler than it appears).
Remove the old criteria (top-level inline) which was harder to reason about.

Closes dotnet#43761.
Closes dotnet#41692.
Closes dotnet#51587.
Closes dotnet#47434.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 21, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 10, 2021
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 26, 2022
@EgorBo EgorBo modified the milestones: Future, 8.0.0 Nov 26, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 1, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
6 participants