Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Codegen: Checking for ISA support affects inline codegen #12791

Closed
john-h-k opened this issue Jun 2, 2019 · 12 comments
Closed

Codegen: Checking for ISA support affects inline codegen #12791

john-h-k opened this issue Jun 2, 2019 · 12 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@john-h-k
Copy link
Contributor

john-h-k commented Jun 2, 2019

SharpLab

Writing and comparing some intrinsic code for vectors compared to Sys.Numerics.Vector4 equivalents, I noticed a significant slowdown (~20%) in a tight loop of vectorised addition. I verified the native for the actual add methods is identical, and it is, with the if (Sse.IsSupported) check. However, when it is inlined as part of the loop method (as desired), the codegen is not identical, and

vaddps xmm0, xmm0, xmm0

is transformed to

vmovaps xmm1, xmm0
vaddps xmm0, xmm1, xmm0

, which I think is a completely pointless op, as it doesn't have an effect on the end result (and it slows it down, so clearly not for optimization or anything)

Tested on .NET Core preview 5 as well as on whatever preview version SharpLab is running

Image for easy visualisation of issue

category:cq
theme:vector-codegen
skill-level:expert
cost:large

@CarolEidt
Copy link
Contributor

@johnkellyoxford - I looked briefly at the PR from which you referenced this issue, but it's pretty large. If you were able to come up with a small example where you see this, it would be easier to analyze.

I doubt that this is directly related to the check for Sse.IsSupported, as that is eliminated by the JIT as a compile-time constant. FWIW, Sse2 is the baseline requirement for coreclr, so you might want to see what happens if you just eliminate the check (or, if you prefer, you could add an assertion in your debug build that asserts that Sse.IsSupported is true).

My suspicion is that it is a register allocator issue, which may make it tricky to address.

@CarolEidt CarolEidt self-assigned this Jun 3, 2019
@EgorBo
Copy link
Member

EgorBo commented Jun 3, 2019

@john-h-k
Copy link
Contributor Author

john-h-k commented Jun 3, 2019

Yeah, @EgorBo 's code is exactly representative of it. In a benchmark, it resulted in a slowdown from 1.4s -> 1.7s, which is not insignificant.

FWIW, Sse2 is the baseline requirement for coreclr

That's useful to know thanks 😄 - however, they function as an implicit IsWindows test too, as opposed to eventual NEON ARM pathways, currently not written (which is why software fallbacks still needed)

@CarolEidt
Copy link
Contributor

CarolEidt commented Jun 6, 2019

Quick update: looking at this further, you were indeed correct that it is fundamentally the check that causes the code to be worse. This is because, although the branch is eliminated, the blocks aren't merged yet, so the fact that the two parameters are both equal to vector is not propagated in the local optimization pass.
I still need to analyze why these copies are eliminated later, but I suspect that it is some remaining conservatism in optimizing value types.

@CarolEidt
Copy link
Contributor

The reason we don't catch this in optVnCopyProp() is that by the time we reach the uses of left and right in the add intrinsic, vector is no longer live. It is unfortunate that we are using "liveness" here rather than "availability". That is, although vector is no longer live (its last use is when it is assigned to right), it is still available (i.e. the value we're interested in propagating has not been overwritten).
It seems to me (probably missing something here; I haven't spent much time in this code) that we could avoid pruning last-use variables from the temporary liveness set used in copyprop, and check value numbers to make sure that the current value number for the source is the same as the value number for the local we want to propagate to.
However, there appears to be a further problem in that value numbering doesn't apply value numbers to defs, apparently because we don't ever consider them as CSE candidates. However, this means that when it calls GetUseAsgDefVNOrTreeVN on a def, there is no VN and it can't do the copy prop even if the variable is still live.

I've done a little experimentation with extending the local assertion prop to preserve assertions across a fall-through edge with no other incoming edge (i.e. an edge between blocks that could/will be merged), and it handles this case. But that is unlikely to be something we'd consider until post-3.0.

cc @dotnet/jit-contrib

@mikedn
Copy link
Contributor

mikedn commented Jun 7, 2019

The reason we don't catch this in optVnCopyProp() is that by the time we reach the uses of left and right in the add intrinsic, vector is no longer live. It is unfortunate that we are using "liveness" here rather than "availability". That is, although vector is no longer live (its last use is when it is assigned to right), it is still available (i.e. the value we're interested in propagating has not been overwritten).

Hmm, perhaps I should a PR to add comments to VN copyprop since I starred a lot at that in the past (and AFAIR one of the existing comments is wrong/misleading).

The short version is that:

  • it needs liveness because we only insert PHIs for live variables so you can't substitute forward past a last-user without risking also going past a missing PHI.
  • it also needs liveness as a heuristic, without it you risk extending the live range of some variables without also eliminating/shortening the live range of other variables to compensate.

@CarolEidt
Copy link
Contributor

it needs liveness because we only insert PHIs for live variables so you can't substitute forward past a last-user without risking also going past a missing PHI.

Right - since we have pruned (semi-pruned?) SSA, we need to rely on liveness at block boundaries to ensure that we don't propagate any lclVars that may no longer be valid. That said, there's no correctness requirement to remove last-uses from the live/available set within a BasicBlock (that's what I was describing above), and I'm sure there must be a better heuristic than never extending those live ranges. Especially since b = a (last use); c = b is a very common pattern that we'd like to be able to propagate.

@mikedn
Copy link
Contributor

mikedn commented Jun 7, 2019

That said, there's no correctness requirement to remove last-uses from the live/available set within a BasicBlock (that's what I was describing above), and I'm sure there must be a better heuristic than never extending those live ranges.

Yes, there are some improvements that can be made. One thing I tried was to make SSA mark single use defs (which is pretty much always the case with these pesky copy chains) and use that to override the liveness based heuristic:

x = y;
...
z = x;

If x is single use then we now that substituting x with y will result in x's live range going away, compensating for the potential extension of y's own live range.

AFAIR that resulted in a 10k diff but there were regressions too. That coupled with the fact that VN copyprop code is also pretty ugly and inefficient made me move away to other things. I ended up using the SSA single-use thing in forward substitution experiment that too can eliminate such copy chains but can do other things as well.

@mikedn
Copy link
Contributor

mikedn commented Jun 7, 2019

since we have pruned (semi-pruned?) SSA

BTW, maybe it's worth trying non pruned SSA, perhaps it's not too costly. And maybe that would also eliminate the need for the liveness pass before SSA (yes, it's pruned, not semi). Though there's still the pesky issue of the VN copyprop heuristic.

Ultimately the only good way to do copyprop is allow going into the non-conventional SSA form. But then that requires an out of SSA pass and with the current LclVar design that's probably not going to be easy.

CarolEidt referenced this issue in CarolEidt/coreclr Sep 19, 2019
- Within a block, don't remove last uses from the live set, enabling us to catch the `b = a (last use); c = b` case.
- We don't label defs VNs, so lookup the VN for DEF, not just USASG.

Fix #24912
CarolEidt referenced this issue in CarolEidt/coreclr Sep 19, 2019
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@CarolEidt CarolEidt removed their assignment Dec 4, 2020
@AndyAyersMS
Copy link
Member

Seems like this is fixed now?

Sharplab 6.0 codegen is identical between AddNoCheckInLoop and AddCheckInLoop.

@tannergooding
Copy link
Member

This is likely resolved due to the work @EgorBo had done to take advantage of the token resolution during inlining support I added.

It means we can actually see its an IsSupported call and take that into account when upping the "inline profitability".

@BruceForstall
Copy link
Member

Since it appears to be fixed, I'm closing.

@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

8 participants