Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double constants usage in a loop can be CSEed #35257

Open
kunalspathak opened this issue Apr 21, 2020 · 11 comments
Open

Double constants usage in a loop can be CSEed #35257

kunalspathak opened this issue Apr 21, 2020 · 11 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@kunalspathak
Copy link
Member

kunalspathak commented Apr 21, 2020

Doubles present in a loop are reloaded repeatedly. Instead they can be set just once out of loop and use it inside the loop.

private static double Process(double n)
{
  double res;
  res = 1;
  while (n > 0.0)
  {
    res *= n;
// User might not write such code, but here it is written merely to show that same constants are re-loaded in assembly code.
    n -= 1.0;
    n -= 2.0;
    n -= 1.0;
    n -= 2.0;
  }
  return res;
}
G_M15653_IG03:
        1E600A10          fmul    d16, d16, d0
        1E6E1011          fmov    d17, #1.0000
        1E713800          fsub    d0, d0, d17
        1E601011          fmov    d17, #2.0000
        1E713800          fsub    d0, d0, d17
        1E6E1011          fmov    d17, #1.0000
        1E713800          fsub    d0, d0, d17
        1E601011          fmov    d17, #2.0000
        1E713800          fsub    d0, d0, d17
        4F00E411          movi    v17.16b, #0x00
        1E712000          fcmp    d0, d17
        54FFFEAC          bgt     G_M15653_IG03

category:cq
theme:cse
skill-level:expert
cost:medium
impact:medium

@kunalspathak kunalspathak added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 21, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 21, 2020
@kunalspathak
Copy link
Member Author

@CarolEidt @BruceForstall

@CarolEidt
Copy link
Contributor

As I discussed offline with @kunalspathak , doing this without unduly pessimizing cases with high register pressure may also require rematerialization (#6264)

@EgorBo
Copy link
Member

EgorBo commented Apr 21, 2020

I guess this one is not arm specific as I see exactly the same picture on x86
(constants are not saved to registers before the loop)

@kunalspathak
Copy link
Member Author

I guess this one is not arm specific as I see exactly the same picture on x86
(SIMD types aren't currently hoisted out of loops)

Agree. I will update the title/label.

@kunalspathak kunalspathak changed the title ARM64 : Double constants usage in a loop can be CSEed Double constants usage in a loop can be CSEed Apr 21, 2020
@BruceForstall BruceForstall added this to the Future milestone Apr 21, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 21, 2020
@BruceForstall
Copy link
Member

fyi, for x64 it looks like we're creating multiple constant pool entries for identical values, so I added #35268 to address that.

@gfoidl
Copy link
Member

gfoidl commented Nov 13, 2020

(At least on x64) this can be worked around by BitConverter-tricks*
Does this work by breaking constant propagation and CSE kicks in properly?

A more realistic repro is (pre-)computing** some heavy values to a table / computing results in batches like

double x = 0d;

for (nuint i = 0; i < N; ++i, x += 0.01)
{
    table[i] = Math.Sin(x);    // table is of type double*
}
; ...
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
; ...

It makes no difference if the increment const double inc = 0.01; is defined outside the loop explicitely or not.

The workaround here is

private static readonly long s_inc = BitConverter.DoubleToInt64Bits(0.01);
private static double Inc => BitConverter.Int64BitsToDouble(s_inc);
// ...
double x = 0d;

for (nuint i = 0; i < N; ++i, x += Inc)
{
    table[i] = Math.Sin(x);
}

(one needs to set TC_QuickJitForLoops=1 to use the static readonlies as "consts").

; ...
       xor       edi,edi
       mov       rax,7AE147AE147B
       vmovq     xmm7,rax
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vmovaps   xmm0,xmm7			; not needed
       vaddsd    xmm6,xmm6,xmm0
       cmp       rdi,3E8
       jb        short M00_L00
; ...

(note 1: at the comment vaddsd xmm6,xmm6,xmm7 would be ideal)
(note 2: instead of xmm6 and xmm7 could other registers be used too? So that they must not be saved according call-convention?)


* BitConverter uses SSE2 as workaround
** for pre-computing as it's outside of a critical section it won't matter, but for computing

@EgorBo
Copy link
Member

EgorBo commented Nov 13, 2020

@gfoidl I think(hope) it will be fixed via #44419

@TIHan TIHan modified the milestones: 8.0.0, Future Jun 9, 2023
@tannergooding
Copy link
Member

We do hoist these now for x86/x64. We don't for Arm64 because these are viewed as "cheap" to materialize given they are small constants that can be embedded as an immediate.

They are given a gtCost of 1, which puts them below the MIN_CSE_COST threshold (currently 2).

We could play with increasing the gtCost such that all constants (at least floating-point ones) can be CSE'd, but that will likely also require some special handling in constant prop to ensure that special codegen opportunities are still accounted for (most notably for cases that are known to allow better instruction sequences to be emitted, such as if the constant can be contained as 0).

We could also play with allowing a something like MIN_CSE_COST_IN_LOOP, so that in a loop anything can be CSE'd, or special casing CSE of constants in a loop.

I think either would give a good balance between ensuring loop code stays efficient and ensuring that we don't accidentally pessimize codegen for cases where hoisting a constant prevents us from optimizing.

@tannergooding
Copy link
Member

Personally, I think the idea of allowing anything to be CSE'd is the better option. We should have already done forward sub, morph, value numbering, and CSE by the time we get to proper constant prop. So, we should have already a decent view of things.

One might want to have value numbering or CSE account for cases where we would try to undo a CSE, to ensure costing remains correctly tracked, but that is likely a more complex change than a limited amount of "undo" for special constants like zero. So I think we could do a targeted fix here to improve things and work towards improving it more over time.

@BruceForstall BruceForstall added tenet-performance Performance related issue and removed optimization labels Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

9 participants