JIT: mulshift optimization should happen in Lower #8668

Closed
JosephTremoulet opened this issue Aug 1, 2017 · 2 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (Product code improvement that does NOT require public API changes/additions), optimization, tenet-performance (Performance related issue)
Milestone: Future

@JosephTremoulet
Contributor

Currently, the code that rewrites multiplication by a constant into shift/lea sequences is split between Morph and CodeGen. Ideally it would all live in Lower, so that CodeGen isn't performing optimizations (which complicates its contract), the target-specific optimization is handled by the back end, and pre-Lower optimizations see the simpler multiplication tree.

See discussion here: dotnet/coreclr#12956 (review)
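
For concreteness, a minimal C# sketch of the rewrite in question (illustrative only; the helper names and the exact lowering shown are not taken from this issue):

// The "mulshift" transform turns a multiply by a suitable constant, e.g. 12,
// into a shift/lea form: MUL(x, 12) becomes MUL(MUL(x, 3), 4), i.e. an lea
// computing x*3 followed by a left shift by 2, instead of an imul.
static int MulBy12(int x) => x * 12;

// Roughly what the rewritten tree computes (same value, including int wraparound):
static int MulBy12Rewritten(int x) => (x + x * 2) << 2;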

@mikedn
Contributor

mikedn commented Sep 16, 2017

This is more complicated than it seems. What we have now in morph can't simply be moved to lower; it interferes with address mode formation and results in significant regressions.
Consider:

using System.Runtime.CompilerServices;

struct vec3 { public float x, y, z; }

[MethodImpl(MethodImplOptions.NoInlining)]
static void Test(vec3[] src, vec3[] dst, int length)
{
    for (int i = 0; i < src.Length; i++)
        dst[i].x = src[i].x;
}

This generates:

G_M1752_IG03:
       413BC1               cmp      eax, r9d
       7321                 jae      SHORT G_M1752_IG05
       4C63D0               movsxd   r10, eax
       4F8D1452             lea      r10, [r10+2*r10]
       C4A17A10449110       vmovss   xmm0, dword ptr [rcx+4*r10+16]
       C4A17A11449210       vmovss   dword ptr [rdx+4*r10+16], xmm0
       FFC0                 inc      eax
       443BC0               cmp      r8d, eax
       7FDF                 jg       SHORT G_M1752_IG03
  • The initial code contains two MUL(x, 12)
  • Morph changes these to MUL(MUL(x, 3), 4)
  • CSE picks up MUL(x, 3) and we now have STORE(cse0, MUL(x, 3)) and two MUL(LCL(cse0), 4)
  • The address mode builder uses the two MULs to build LEAs that end up as memory operands

Net result: the two MUL(x, 12) were transformed into a single MUL(x, 3) instruction (the lea above). Saves 5 cycles and 1 instruction.
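
A rough source-level picture of why this wins (hypothetical helper, not from the issue): if the vec3[] is viewed as a flat float[], the shared i*3 is the CSE'd lea above, and the remaining *4 is just sizeof(float), which folds into the 4*r10 scale of the memory operands.

static void CopyXFlat(float[] srcFloats, float[] dstFloats, int i)
{
    int t = i * 3;                  // CSE'd MUL(i, 3): the lea r10, [r10+2*r10]
    dstFloats[t] = srcFloats[t];    // the *4 byte scale lives in the addressing modes
}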

If this transform is moved to lower:

  • CSE picks up MUL(x, 12) and we now have STORE(cse0, MUL(x, 12)) and two LCL(cse0)
  • Lower transforms MUL(x, 12) into MUL(MUL(x, 3), 4)
  • The address mode builder can't see through the local var; it grabs its bags and goes on holiday

Net result: the two MUL(x, 12) were transformed into two instructions, SHL(LEA(x+x*2), 2). Saves 4 cycles and no instructions.

So by doing this we make things worse, and it doesn't seem easy to solve. Use SSA to build address modes so we can see through local vars? That's a more general problem with address mode building.

Granted, we could duplicate this optimization in lower. I tried it and found only 3 MUL(x, 10) in the entire framework that morph missed. Doesn't seem to be worth it.
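
For reference, a sketch of the kind of case such a lower-side duplicate would catch (illustrative only; the helper names and the exact rewrite shown are assumptions, not from this issue):

// MUL(x, 10) is the same kind of candidate: 10 == 5 * 2, so it could become
// an lea computing x*5 followed by a shift (or add) by 1, instead of an imul.
static int MulBy10(int x) => x * 10;

// Equivalent shift/lea form (same value, including int wraparound):
static int MulBy10Rewritten(int x) => (x + x * 4) << 1;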

@JosephTremoulet
Contributor Author

Ok. Thanks for digging in.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 21, 2020