Double constants usage in a loop can be CSEed #35257

kunalspathak · 2020-04-21T20:36:12Z

Double constants used inside a loop are re-materialized on every iteration. Instead, they could be loaded into registers once before the loop and reused inside it.

private static double Process(double n)
{
  double res;
  res = 1;
  while (n > 0.0)
  {
    res *= n;
    // Unrealistic code, written only to show that the same constants are reloaded in the generated assembly.
    n -= 1.0;
    n -= 2.0;
    n -= 1.0;
    n -= 2.0;
  }
  return res;
}

G_M15653_IG03:
        1E600A10          fmul    d16, d16, d0
        1E6E1011          fmov    d17, #1.0000
        1E713800          fsub    d0, d0, d17
        1E601011          fmov    d17, #2.0000
        1E713800          fsub    d0, d0, d17
        1E6E1011          fmov    d17, #1.0000
        1E713800          fsub    d0, d0, d17
        1E601011          fmov    d17, #2.0000
        1E713800          fsub    d0, d0, d17
        4F00E411          movi    v17.16b, #0x00
        1E712000          fcmp    d0, d17
        54FFFEAC          bgt     G_M15653_IG03
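
For illustration only, the requested transformation can be written out by hand at the source level. This is a sketch of the intent, not of what the JIT does: `one` and `two` are hypothetical locals, and the JIT is free to propagate the constants back into the loop body, so this form is not guaranteed to change the generated code.

```csharp
using System;

static class HoistSketch
{
    // Original shape: the constants appear inline in the loop body.
    public static double Process(double n)
    {
        double res = 1;
        while (n > 0.0)
        {
            res *= n;
            n -= 1.0;
            n -= 2.0;
            n -= 1.0;
            n -= 2.0;
        }
        return res;
    }

    // Hand-hoisted shape: each distinct constant is named once before the loop.
    // 'one' and 'two' are hypothetical names used only for this sketch.
    public static double ProcessHoisted(double n)
    {
        double one = 1.0;
        double two = 2.0;
        double res = 1;
        while (n > 0.0)
        {
            res *= n;
            n -= one;
            n -= two;
            n -= one;
            n -= two;
        }
        return res;
    }
}
```

Both methods compute the same value; the only intended difference is where the two constants are materialized.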

category:cq
theme:cse
skill-level:expert
cost:medium
impact:medium

kunalspathak · 2020-04-21T20:36:32Z

@CarolEidt @BruceForstall

CarolEidt · 2020-04-21T20:39:59Z

As I discussed offline with @kunalspathak, doing this without unduly pessimizing cases with high register pressure may also require rematerialization (#6264).

EgorBo · 2020-04-21T20:40:45Z

I guess this one is not arm specific as I see exactly the same picture on x86
(constants are not saved to registers before the loop)

kunalspathak · 2020-04-21T20:44:29Z

I guess this one is not arm specific as I see exactly the same picture on x86
(SIMD types aren't currently hoisted out of loops)

Agree. I will update the title/label.

tannergooding · 2020-04-21T20:47:01Z

For reference on the x64 side: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHuGAHKAJYA3bBhhNySBgBMIAV2AAbcQAUoEMDFy4AFLIXKGAOwCUNTtW76l42LgDcXBnYYBeBuUeWGAdwAWAoY6RgwAfAwADAB0EWbeFjzOWgwAVO5GXgD0mQwAqrgwUAz4AgDmfhjGEJU+gmIMuHJgfgyQ0jBoDMBylX6F4gKVAri+dWIh+P2KAJ4MGBANfhA+c36iDdiTrRBGuBjYRhgj2LBJcIoQ2O3SDAIheAX4SrNtMFFO3CFw7uQxXolfdykP4fYwMb4eEHeT7goFQ7gAXycxAA7EkHDQkdQEUA===

EgorBo · 2020-04-21T20:53:53Z

For reference on the x64 side: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHuGAHKAJYA3bBhhNySBgBMIAV2AAbcQAUoEMDFy4AFLIXKGAOwCUNTtW76l42LgDcXBnYYBeBuUeWGAdwAWAoY6RgwAfAwADAB0EWbeFjzOWgwAVO5GXgD0mQwAqrgwUAz4AgDmfhjGEJU+gmIMuHJgfgyQ0jBoDMBylX6F4gKVAri+dWIh+P2KAJ4MGBANfhA+c36iDdiTrRBGuBjYRhgj2LBJcIoQ2O3SDAIheAX4SrNtMFFO3CFw7uQxXolfdykP4fYwMb4eEHeT7goFQ7gAXycxAA7EkHDQkdQEUA===

Ah, by "hoisted out of loop" I meant saving 1.0 and 2.0 in xmm registers before the loop.
Not sure it makes sense here tho.

gcc saves them: https://godbolt.org/z/gToudN

BruceForstall · 2020-04-22T00:54:01Z

fyi, for x64 it looks like we're creating multiple constant pool entries for identical values, so I added #35268 to address that.

gfoidl · 2020-11-13T15:23:02Z

(At least on x64) this can be worked around by BitConverter-tricks*
Does this work because it breaks constant propagation, so that CSE kicks in properly?

A more realistic repro is (pre-)computing** expensive values into a table, or computing results in batches, like:

double x = 0d;

for (nuint i = 0; i < N; ++i, x += 0.01)
{
    table[i] = Math.Sin(x);    // table is of type double*
}

; ...
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
; ...

It makes no difference whether the increment is explicitly defined outside the loop as const double inc = 0.01; or not.

The workaround here is

private static readonly long s_inc = BitConverter.DoubleToInt64Bits(0.01);
private static double Inc => BitConverter.Int64BitsToDouble(s_inc);
// ...
double x = 0d;

for (nuint i = 0; i < N; ++i, x += Inc)
{
    table[i] = Math.Sin(x);
}

(One needs to set TC_QuickJitForLoops=1 so that the static readonly fields are treated as "consts".)

; ...
       xor       edi,edi
       mov       rax,7AE147AE147B
       vmovq     xmm7,rax
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vmovaps   xmm0,xmm7               ; not needed
       vaddsd    xmm6,xmm6,xmm0
       cmp       rdi,3E8
       jb        short M00_L00
; ...

(note 1: at the line marked "not needed", a single vaddsd xmm6,xmm6,xmm7 would be ideal)
(note 2: could registers other than xmm6 and xmm7 be used, so that they do not need to be saved according to the calling convention?)

* BitConverter uses SSE2 as workaround
** for pre-computing as it's outside of a critical section it won't matter, but for computing
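
For anyone who wants to try the workaround, here is a self-contained sketch of the same pattern. It makes some assumptions: it uses a `Span<double>` instead of the `double*` table above to avoid unsafe code, and the class name `SinTable` and the methods `Fill` and `FillPlain` are hypothetical. Whether the increment actually stays in a register depends on the runtime version and tiering settings, so this only shows the source pattern, not a guaranteed codegen result.

```csharp
using System;

static class SinTable
{
    // The increment is stored as raw bits, so the loop does not see a literal double constant.
    private static readonly long s_inc = BitConverter.DoubleToInt64Bits(0.01);
    private static double Inc => BitConverter.Int64BitsToDouble(s_inc);

    // Workaround version: the increment comes from the static readonly field.
    public static void Fill(Span<double> table)
    {
        double x = 0d;
        for (int i = 0; i < table.Length; ++i, x += Inc)
        {
            table[i] = Math.Sin(x);
        }
    }

    // Baseline version: the increment is a literal constant in the loop.
    public static void FillPlain(Span<double> table)
    {
        double x = 0d;
        for (int i = 0; i < table.Length; ++i, x += 0.01)
        {
            table[i] = Math.Sin(x);
        }
    }
}
```

The bit round-trip is exact, so both methods fill the table with identical values; only the generated loop code is expected to differ.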

EgorBo · 2020-11-13T16:02:28Z

@gfoidl I think (hope) it will be fixed via #44419

tannergooding · 2024-02-05T16:52:02Z

We do hoist these now for x86/x64. We don't for Arm64 because these are viewed as "cheap" to materialize given they are small constants that can be embedded as an immediate.

They are given a gtCost of 1, which puts them below the MIN_CSE_COST threshold (currently 2).

We could play with increasing the gtCost such that all constants (at least floating-point ones) can be CSE'd, but that will likely also require some special handling in constant prop to ensure that special codegen opportunities are still accounted for (most notably for cases that are known to allow better instruction sequences to be emitted, such as if the constant can be contained as 0).

We could also play with allowing something like MIN_CSE_COST_IN_LOOP, so that in a loop anything can be CSE'd, or with special-casing CSE of constants in a loop.

I think either would give a good balance between ensuring loop code stays efficient and ensuring that we don't accidentally pessimize codegen for cases where hoisting a constant prevents us from optimizing.
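
The costing argument in this comment can be modeled in a few lines. This is a toy C# sketch and not JIT code (the JIT itself is written in C++): the class and method names are hypothetical, the value 0 for `MinCseCostInLoop` is an assumption because no value was proposed, and it assumes a candidate is rejected when its cost is below the threshold.

```csharp
using System;

static class CseCostModel
{
    // Threshold quoted in the comment above (MIN_CSE_COST, currently 2).
    public const int MinCseCost = 2;

    // Hypothetical: only the idea of MIN_CSE_COST_IN_LOOP was floated, with no value given.
    public const int MinCseCostInLoop = 0;

    // Current behavior as described: anything cheaper than the threshold is not a CSE candidate.
    public static bool IsCandidateToday(int gtCost) => gtCost >= MinCseCost;

    // Proposed variant: use a lower threshold when the expression sits inside a loop.
    public static bool IsCandidateWithLoopThreshold(int gtCost, bool inLoop)
        => gtCost >= (inLoop ? MinCseCostInLoop : MinCseCost);
}
```

With these numbers, a constant with a cost of 1 is rejected today, but would become a candidate inside a loop under the second rule while staying rejected outside of loops.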

tannergooding · 2024-02-05T16:58:07Z

Personally, I think the idea of allowing anything to be CSE'd is the better option. We should have already done forward sub, morph, value numbering, and CSE by the time we get to proper constant prop, so we should already have a decent view of things.

One might want to have value numbering or CSE account for cases where we would try to undo a CSE, to ensure costing remains correctly tracked, but that is likely a more complex change than a limited amount of "undo" for special constants like zero. So I think we could do a targeted fix here to improve things and work towards improving it more over time.

kunalspathak added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 21, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 21, 2020

kunalspathak changed the title ~~ARM64 : Double constants usage in a loop can be CSEed~~ Double constants usage in a loop can be CSEed Apr 21, 2020

kunalspathak added optimization and removed arch-arm64 labels Apr 21, 2020

BruceForstall added this to the Future milestone Apr 21, 2020

BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 21, 2020

BruceForstall mentioned this issue Apr 22, 2020

Constant pool should share values #35268

Closed

kunalspathak mentioned this issue May 5, 2020

Improving ARM64 Performance in .NET 5.0 – Closing the gap with x64 #35853

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

kunalspathak mentioned this issue Feb 4, 2022

Arm64: Readjust the heuristics to materialize double constants using fmov #64794

Closed

kunalspathak assigned TIHan Feb 4, 2022

kunalspathak mentioned this issue Feb 4, 2022

Improving ARM64 Performance in .NET 7.0 #64820

Closed

32 tasks

BruceForstall modified the milestones: Future, 7.0.0 Feb 8, 2022

JulieLeeMSFT removed the JitUntriaged CLR JIT issues needing additional triage label Feb 14, 2022

TIHan mentioned this issue Feb 14, 2022

ARM64 - Allow double constants to CSE more often #65176

Closed

1 task

TIHan modified the milestones: 7.0.0, 8.0.0 Jul 11, 2022

kunalspathak mentioned this issue Oct 13, 2022

Improving Arm64 Performance in .NET 8.0 #77010

Closed

28 tasks

TIHan modified the milestones: 8.0.0, Future Jun 9, 2023

BruceForstall added tenet-performance Performance related issue and removed optimization labels Oct 15, 2024

BruceForstall unassigned TIHan Oct 15, 2024
