
Remove GT_ASG nodes #10873

Closed
mikedn opened this issue Aug 10, 2018 · 31 comments · Fixed by #85871
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), design-discussion (Ongoing discussion about design without consensus), JitUntriaged (CLR JIT issues needing additional triage)

Comments

@mikedn
Contributor

mikedn commented Aug 10, 2018

This came up occasionally in discussions but AFAIK there's no issue associated with it, and IMO there should be one, given the significant negative impact GT_ASG nodes have on RyuJIT's IR.

IMO it's one of the worst IR "features", if not plain and simple the worst. So far I've only encountered cons and no pros:

  • It's an extra IR node - wastes memory and slows down IR traversal.
  • It imposes a model where the meaning of some nodes depends on the context - a GT_LCL_VAR on the RHS is a use but on the LHS is a def. The JIT uses various means to deal with this issue - GTF_VAR_DEF, GTF_IND_ASG_LHS, gtGetParent() etc.
  • It complicates UD chain traversal. SSA gives you the GT_LCL_VAR node on the LHS; from there you need to use gtGetParent() to get the GT_ASG node and then get the assignment's second operand. If you're lucky the LHS follows the assignment, otherwise gtGetParent() will need to traverse multiple nodes. (See the toy sketch after this list.)
  • It requires additional work to transform HIR into LIR, as the latter does not use assignment nodes.
  • It doesn't match the IL model that uses stores. While I don't think that the IR has to strictly follow the IL model this one seems like a huge and unnecessary deviation.
  • It doesn't follow the assembly model either. One could argue that mov eax, ebx is assignment but that's only true if eax and ebx happen to be enregistered variables. Otherwise you'd be looking at mov [eax], ebx which is more similar to a store.
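
A minimal toy sketch of the def/use point above, using hypothetical node types rather than RyuJIT's GenTree: under the assignment model, deciding whether a LCL_VAR is a def requires looking at its parent, while a store-form oper answers the question by itself.

#include <cstdio>

enum Oper { LCL_VAR, CNS_INT, ASG, STORE_LCL_VAR };

struct Node {
    Oper  oper;
    Node* op1; // ASG: destination; STORE_LCL_VAR: value operand
    Node* op2; // ASG: source value
};

// Assignment model: the answer depends on the parent (the real JIT caches this
// in flags like GTF_VAR_DEF or recomputes it via gtGetParent()).
bool IsDefUnderAsg(const Node* lcl, const Node* parent) {
    return (parent != nullptr) && (parent->oper == ASG) && (parent->op1 == lcl);
}

// Store model: the oper itself says it is a def.
bool IsDefUnderStore(const Node* node) { return node->oper == STORE_LCL_VAR; }

int main() {
    Node value = {CNS_INT, nullptr, nullptr};
    Node dst   = {LCL_VAR, nullptr, nullptr};
    Node asg   = {ASG, &dst, &value};              // ASG(LCL_VAR V00, CNS_INT)
    Node store = {STORE_LCL_VAR, &value, nullptr}; // STORE_LCL_VAR V00 (CNS_INT)
    std::printf("%d %d\n", IsDefUnderAsg(&dst, &asg), IsDefUnderStore(&store));
    return 0;
}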

Does anyone know of any advantages?

AFAIR Phoenix did have Assign but its IR was very different, looking more like assembly. I know next to nothing about LLVM but I don't think it has assignment.

category:implementation
theme:ir
skill-level:expert
cost:extra-large

@mikedn
Contributor Author

mikedn commented Aug 10, 2018

Unfortunately getting rid of GT_ASG requires a ton of work so it's not clear when and if that may happen.

I'm going to take a closer look to see what needs to be done, if only for the sake of curiosity.

I cleaned up some GT_ASG_op leftovers in dotnet/coreclr#18205 and in the process I got rid of OperIsAssignment, this should make searching for GT_ASG uses slightly easier. There are some 120 uses of GT_ASG in code, not including comments and asserts. And then there are some 40 gtNewAssignNode (see https://gist.github.com/mikedn/cc0756edf8ae8eab66356d4275d62c16 for a list).

Doing it all at once may be difficult; it's not even clear that it's feasible. For a while I thought that it may be good to do this first for the SSA phases, but it's not clear that it's actually simpler, because these still involve morph and the assertion propagation has shared code for local and global propagation.

Another way to split work may be scalar/struct or lclvar/indir. Dealing with scalars first may be easier because it avoids potential first-class structs complications. Dealing with indirs first may be easier because the JIT doesn't do a lot of work with indirs, compared to lclvars.

There may be complications with lclvars written to via indirs. For some reason the JIT keeps checking for such cases, instead of simply converting IND(ADDR(LCL_VAR)) to LCL_VAR.
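
For illustration, a toy sketch of that fold with hypothetical node types (not RyuJIT's), ignoring the type/size checks the real JIT would need:

enum Oper { LCL_VAR, ADDR, IND };

struct Node {
    Oper     oper;
    Node*    op1;    // single operand; null for LCL_VAR
    unsigned lclNum; // meaningful only for LCL_VAR
};

// IND(ADDR(LCL_VAR)) reads a local through its own address, so it can be
// replaced by the LCL_VAR node itself when the types match.
Node* FoldIndirOfLocalAddress(Node* node) {
    if ((node->oper == IND) && (node->op1->oper == ADDR) &&
        (node->op1->op1->oper == LCL_VAR)) {
        return node->op1->op1;
    }
    return node;
}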

And finally, I've made one bone-headed attempt to generate a GT_STOREIND in the importer to see what happens. Rather surprisingly, it didn't quite crash and burn when compiling a trivial method, but the store was lost along the way. Adding GT_STOREIND to OperRequiresAsgFlag solved the problem. So I suppose it looks somewhat encouraging; I was expecting an interminable stream of asserts. Of course, attempting to compile anything more complex does just that.

@RussKeldorph
Contributor

@dotnet/jit-contrib

@BruceForstall
Member

There's been a general desire to import "rationalized" IR and therefore remove the "rationalizer" phase (after which there is no GT_ASG). Removing GT_ASG might be the biggest part of that work. This phase was introduced to allow the new RyuJIT downstream phases to not deal with GT_ASG but leave the legacy codegen alone. Now that the legacy backend is gone, it's "just work" to move it -- likely, teaching morph, VN, SSA, etc., about rationalized IR.

@CarolEidt
Contributor

@mikedn thanks for opening this issue!

The RyuJIT overview mentions the desire to move the Rationalized IR earlier in the JIT: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md#rationalization

And actually, I think that eliminating COMMA nodes will be potentially more challenging, but I could be wrong.

@mikedn
Contributor Author

mikedn commented Aug 11, 2018

The RyuJIT overview mentions the desire to move the Rationalized IR earlier in the JIT:

Thanks, I knew I saw some discussion somewhere but couldn't remember where; I thought it was in a PR/issue. That document also mentions GT_ADDR, which has similar issues. Removing that is probably less work but it's not clear to me what is better to do first - remove GT_ASG or remove GT_ADDR, or maybe both at the same time. Or maybe remove IND(ADDR(LCL)).

And actually, I think that eliminating COMMA nodes will be potentially more challenging, but I could be wrong.

Hmm, I'd put that in a different bucket - side effects and ordering. I used to think that moving the whole JIT to LIR could be a good idea. Now I think that, while it may be better than the current implementation, it's not the best solution - LIR is good for some things and bad for others (try doing expression reassociation: you'll either battle with maintaining the LIR links or "sequence" the whole thing and take a throughput hit). So I have this vague idea of a model where only side effects are maintained in linear order while other nodes roam freely and are referenced only via side effect nodes. But that's another issue, and my current impression is that if we do any of this, removing GT_ASG should be done first, as it may simplify things.

@mikedn
Contributor Author

mikedn commented Aug 11, 2018

I was looking at lclvar nodes and noticed something that may explain how GT_ASG came to be: in the assignment model GenTreeLclVar probably was always a leaf. Now it is a unary op, to support GT_STORE_LCL_VAR. Perhaps it would be better to have a GenTreeStoreLclVar that's a special op and has a GenTree* operand in it.
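
For illustration only, a rough sketch of the shape such a node might have; the name and fields here are hypothetical, not a concrete proposal:

struct GenTree; // stands in for the real node base type

// A GenTreeStoreLclVar-like node: it names the local being defined and carries
// the stored value as its single operand, so the plain LCL_VAR node can stay a
// leaf and always mean "use".
struct StoreLclVarSketch {
    unsigned lclNum; // local being defined
    unsigned ssaNum; // SSA number of this def (see the next paragraph)
    GenTree* value;  // the value operand
};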

In general, it may be useful to revisit the lclvar node hierarchy as it may be possible to improve some SSA related stuff - either replace the SSA number with a pointer to the SSA def (problematic now because on 64 bit targets it increases the node size) or find room to store an additional SSA number for GTF_USE_ASG to avoid the need for the side table.

If something like GenTreeStoreLclVar is to be added then it's probably best done before removing GT_ASG. After that the use of GT_STORE_LCL_VAR will increase significantly and changes around it might require more work.

@mikedn
Contributor Author

mikedn commented Sep 12, 2018

Removing assignment (and GT_ADDR) would require dealing with an interesting SSA issue - assigning to a struct field that is itself a struct produces IR like:

[000013] ------------              *  STMT      void  (IL 0x002...0x00A)
[000010] -----+------              |  /--*  CNS_INT   int    0
[000012] IA---+------              \--*  ASG       struct (init)
[000011] x----+-N----                 \--*  BLK(12)   struct
[000009] -----+------                    \--*  ADDR      byref
[000006] U----+-N----                       \--*  LCL_FLD   struct V01 loc0         [+0] Fseq[s3]

Despite the fact that the assignment is "indirect" the LCL_FLD node is GTF_VAR_DEF and acts as an SSA definition. If we remove GT_ASG and GT_ADDR we should end up with something like STORE_BLK(LCL_FLD_ADDR, CNS_INT). So which node is the SSA definition? There's no way it's LCL_FLD_ADDR, so it's STORE_BLK. But making an indirect store act as an SSA definition sounds a bit weird.

Ideally there should be no indirect store needed. But how to avoid it, add STORE_LCL_FLD_BLK? I guess it would work but it has to be a large node.

@mikedn
Contributor Author

mikedn commented Oct 14, 2018

Assignment to array elements produces somewhat strange trees:

     ( 18, 21) [000022] ------------              *  STMT      void  (IL 0x006...0x00E)
N017 (  1,  1) [000019] ------------              |  /--*  LCL_VAR   int    V04 loc1         u:3 $180
N018 ( 18, 21) [000021] -A-XG-------              \--*  ASG       int    $VN.Void
N014 (  4,  4) [000020] a--XG--N----                 |     /--*  IND       int    $180
N012 (  1,  1) [000044] ------------                 |     |  |  /--*  CNS_INT   long   32 Fseq[#ConstantIndex, #FirstElem] $281
N013 (  2,  2) [000045] -------N----                 |     |  \--*  ADD       byref  $340
N011 (  1,  1) [000037] ------------                 |     |     \--*  LCL_VAR   ref    V06 tmp1         u:2 (last use) <l:$301, c:$201>
N015 ( 12, 15) [000046] ---XG-------                 |  /--*  COMMA     int    $VN.Void
N010 (  8, 11) [000040] ---X--------                 |  |  \--*  ARR_BOUNDS_CHECK_Rng void   <l:$30e, c:$30d>
N007 (  1,  1) [000018] ------------                 |  |     +--*  CNS_INT   int    4 $43
N009 (  3,  3) [000039] ---X--------                 |  |     \--*  ARR_LENGTH int    <l:$143, c:$142>
N008 (  1,  1) [000036] ------------                 |  |        \--*  LCL_VAR   ref    V06 tmp1         u:2 <l:$301, c:$201>
N016 ( 16, 19) [000047] -A-XG--N----                 \--*  COMMA     int    $VN.Void
N004 (  4,  4) [000017] ---XG-------                    |  /--*  IND       ref    <l:$304, c:$303>
N002 (  1,  1) [000048] ------------                    |  |  |  /--*  CNS_INT   long   8 field offset Fseq[a] $280
N003 (  2,  2) [000049] -------N----                    |  |  \--*  ADD       byref  $2c0
N001 (  1,  1) [000016] ------------                    |  |     \--*  LCL_VAR   ref    V02 arg2         u:1 $c0
N006 (  4,  4) [000035] -A-XG---R---                    \--*  ASG       ref    <l:$304, c:$303>
N005 (  1,  1) [000034] D------N----                       \--*  LCL_VAR   ref    V06 tmp1         d:2 <l:$301, c:$201>

The LHS of the assignment isn't an IND node, it's a COMMA. The actual IND node is hidden behind the comma. To get rid of ASG we'll need to make it so that the COMMA tree produces the address (as a byref) and then ASG + IND can be easily changed to a STOREIND.

We should probably ensure that the ASG LHS is always a "location" node (LCL_VAR, LCL_FLD, IND etc.) before attempting to remove ASG nodes.

@CarolEidt
Contributor

There are some COMMA patterns that aren't well handled by CSE, and in my (somewhat dated) experience of changing the block stores to assignments, I ran into some of those. When we transform these array element assignments we'll have to ensure that we preserve CSE's ability to reason about them.

@mikedn
Contributor Author

mikedn commented Oct 20, 2018

Speaking of block stores/assignments, assigning a struct to an array element generates a different tree. The LHS of the ASG is a BLK (good), the COMMA produces an address (good), the address is produced by taking the address of a struct IND (kind of weird), and the IND doesn't have GTF_DONT_CSE set (scary):

-A-XG-------              *  ASG       struct (copy)
n--XG-------              +--*  BLK(24)   struct
---XG-------              |  \--*  COMMA     byref  $80
---X--------              |     +--*  ARR_BOUNDS_CHECK_Rng void  
------------              |     |  +--*  LCL_VAR   int    V01 arg1         u:1
---X--------              |     |  \--*  ARR_LENGTH int   
------------              |     |     \--*  LCL_VAR   ref    V04 loc1         u:2
---XG-------              |     \--*  ADDR      byref  $80
a--XG-------              |        \--*  IND       struct
-------N----              |           \--*  ADD       byref 
------------              |              +--*  LCL_VAR   ref    V04 loc1         u:2
-------N----              |              \--*  ADD       long  
-------N----              |                 +--*  LSH       long  
------------              |                 |  +--*  MUL       long  
------------              |                 |  |  +--*  CAST      long <- int
i-----------              |                 |  |  |  \--*  LCL_VAR   int    V01 arg1         u:1
------------              |                 |  |  \--*  CNS_INT   long   3
-------N----              |                 |  \--*  CNS_INT   long   3
------------              |                 \--*  CNS_INT   long   16 Fseq[#FirstElem]
------------              \--*  LCL_VAR   struct V06 loc3         u:3 (last use)

I guess the only thing that prevents CSE from messing with the indir is the fact that it's a struct. Well, at least the assignment part is clear, it simply maps to a STORE_BLK.

@mikedn
Contributor Author

mikedn commented Nov 4, 2018

@CarolEidt What do you think about adding a GenTreeLclFldBlk node? Unfortunately I don't see any other way to deal with https://github.com/dotnet/coreclr/issues/19412#issuecomment-420736412

Ideally we'd just add the block size to GenTreeLclFld but there's no room for it on 32 bit hosts (on 64 bit there's padding between gtLclOffs and gtFieldSeq).

Potential alternatives:

  • Get rid of gtFieldSeq and recover it from offset. I find the need for FieldSeq on LCL_FLD nodes to be rather unfortunate but I'm not sure if it's feasible to remove it.
  • Limit both the offset and size to 65535 so they each fit in 16 bits. This should cover all practical needs; in extreme cases where they exceed 65535 we simply fall back to taking the variable's address, adding the offset and using an indir node.

@CarolEidt
Contributor

I'm out of the office today and it's hard to look at the sources on my phone ;-) but I'd like to see us split off the variants of LclFld: one that represents an actual field, and from which one can recover the handle info (and therefore the size), and a variant that represents an opaque or possibly overlapping field access. I'm not crazy about gtFieldSeq and would love to see that reworked, but it's been a while since I've looked at that.

If necessary I think I'd prefer to constrain size and offset over using a large node for the common case.

@mikedn
Contributor Author

mikedn commented Nov 5, 2018

I'm out of the office today and it's hard to look at the sources on my phone ;-) but I'd like to see us split off the variants of LclFld. One that represents an actual field, and from which one can recover the handle info (and therefore the size) and a variant that represents a field access for an opaque or possibly overlapping field access.

Hmm, I know you mentioned this in the past but I don't quite understand what would be the advantage of having a variant of LclFld. To me it seems simpler to do something like:

  • if the node type is not TYP_STRUCT then it's simple, the size/layout is known because it's a primitive type (or SIMD type)
  • if the node type is TYP_STRUCT then there should be another piece of information in the node that describes the layout of the struct. Not a handle, something more generic, something that can "point to" a handle, size, layout, GC info etc.

@mikedn
Contributor Author

mikedn commented Nov 7, 2018

If necessary I think I'd prefer to constrain size and offset over using a large node for the common case.

Turns out that the offset is already constrained: https://github.com/dotnet/coreclr/blob/631407852a6332a659a6a57ede670d3e94c81efb/src/jit/morph.cpp#L13142-L13145

So an unsigned short offset and an unsigned short index into a "layout table" should do.
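
A small sketch of that encoding (the names are hypothetical): both halves fit in 16 bits so the node stays small, and anything that doesn't fit falls back to the address + indir form mentioned above.

#include <cstdint>

struct Layout; // size, GC layout, class handle, etc., interned per compilation

struct LclFldSketch {
    uint16_t lclOffs;     // byte offset into the local, constrained to <= 65535
    uint16_t layoutIndex; // index into a per-compilation table of Layout objects
};

static_assert(sizeof(LclFldSketch) == 4, "packs into a single 32-bit slot");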

@CarolEidt
Contributor

@mikedn - I like the idea of using an offset and an index into a "layout table" (I'm sure there are other places where we separately allocate the layout table, and could instead index into a common table).
I think there are places that expect most struct values to return a handle (i.e. by calling gtGetStructHandle()), where all they really need to do is to get the gc layout and/or size - so there may be some places that will need to change to accommodate this, but also some opportunity to reduce the need to carry around and/or query the handles.

@CarolEidt
Contributor

Oh, and to clarify - the reason that I thought it would be useful to have two variants of lclFld would be to distinguish the variant that needs to be handled conservatively (accessing a field through a different type and/or accessing a field of a struct with overlapping fields) from the one that is non-aliasing. While it may be moot for the front-end, where we rely on value numbers, it may be useful in reducing the places (e.g. in morph) where we treat all lclFlds conservatively, as well as downstream where we no longer have valid SSA/value numbers.

@mikedn
Contributor Author

mikedn commented Nov 8, 2018

accessing a field through a different type

Hmm, some of these cases could be handled in a different manner. For example, accessing a float field as int could be morphed to BITCAST<float>(LCL_VAR<int>) instead of LCL_FLD<float>. Combined with "reg optional" we could then generate movd xmm4, edx or movss xmm4, dword ptr [rbp-n].

@mikedn
Contributor Author

mikedn commented Dec 2, 2018

I'm sure there are other places where we separately allocate the layout table, and could instead index into a common table

It looks to me that the best place to start is actually GT_BLK/OBJ. GT_OBJ in particular, because it has so much stuff in it that it ends up being a large node. Very early work, just enough to show the approach and survive some diffing (no diffs): mikedn/coreclr@086c7fa

Block size, class handle and gc pointer table are moved to a separate "layout" object and GenTreeBlk/Obj only maintains a pointer to that. The same "layout" object is shared by multiple nodes (by means of a hashtable). The "layout" object is immutable, in order to avoid nasty accidents (right now it's only partially immutable because SetGCInfo is called multiple times, it really should only be called once).
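
A rough sketch of that idea with hypothetical names (the actual experiment is the commit linked above): one immutable layout object per struct type, interned through a table, so every node that needs it only stores a pointer.

#include <cstdint>
#include <map>
#include <vector>

using ClassHandle = const void*; // stands in for CORINFO_CLASS_HANDLE

struct Layout {
    ClassHandle          handle;
    unsigned             size;
    std::vector<uint8_t> gcPtrs; // per-slot GC pointer kinds, filled in once

    Layout(ClassHandle h, unsigned s, std::vector<uint8_t> gc)
        : handle(h), size(s), gcPtrs(std::move(gc)) {}
};

// Interning table (per compilation in practice): the VM is queried at most once
// per struct type, and every BLK/OBJ/LCL_FLD-style node shares the same object.
class LayoutTable {
    std::map<ClassHandle, Layout> m_table;

public:
    const Layout* GetLayout(ClassHandle handle, unsigned size, std::vector<uint8_t> gc) {
        auto it = m_table.find(handle);
        if (it == m_table.end()) {
            it = m_table.emplace(handle, Layout(handle, size, std::move(gc))).first;
        }
        return &it->second;
    }
};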

IMO it has a bunch of advantages:

  • We can avoid querying GC info of a struct multiple times (now we do this for every node that needs it)
  • GenTreeObj becomes a small node. In fact we can probably get rid of it and GenTreeBlk and just put all the necessary data in GenTreeIndir.
  • We could probably use the layout object in LclVarDsc as well, to replace lvStructGcCount and lvGcLayout. While it doesn't make LclVarDsc smaller (it replaces one pointer with another pointer), we get a consistent representation of struct layout throughout the JIT.
  • Overall cleaner code/IR - I really don't think it's good to have a bunch of information that gets repeated in every node (I wonder if the same technique cannot be applied to calls as well, a bunch of stuff is call signature dependent but it gets repeated in every call node).

The need for a hashtable might be a disadvantage, but the number of struct types you encounter in a method is usually very small so it shouldn't be a problem. And the hashtable lookup cost is likely to be smaller than the redundant VM calls we can avoid.

Once this is done it should be pretty easy to extend it to GT_LCL_FLD; we just need to somehow get a short table index instead of a pointer and modify some code here and there to support struct typed GT_LCL_FLD nodes. Well, it's probably not that easy but it's a start.

@CarolEidt
Contributor

some of these cases could be handled in a different manner.

Yes, I think that, in general, using BITCAST (or CAST in the case of a partial reference where we need to do a zero- or sign-extend) would be a better approach. AFAIK though the front-end doesn't really know how to handle BITCAST yet, and may not expect a CAST in some places where we're currently using LCL_FLD. But I would really like to see us eliminate the use of LCL_FLD for ABI handling purposes.

@mikedn
Contributor Author

mikedn commented Dec 3, 2018

AFAIK though the front-end doesn't really know how to handle BITCAST yet, and may not expect a CAST in some places where we're currently using LCL_FLD

I'm more concerned about the backend. I'm not sure if and how we'll be able to handle something like BITCAST between long and double on x86. On x64 we have movd (and there's something similar on ARM64), but on x86 we need to pass the BITCAST node through decomposition somehow and the codegen will likely need a temporary memory location.

@mikedn
Contributor Author

mikedn commented Dec 10, 2018

Now, if we allow LCL_FLD with type struct the question is where they should be generated. The importer naturally generates BLK(FIELD(ADDR(LCL_VAR))) trees and then morph converts them to BLK(ADDR(LCL_FLD)). And we want a single LCL_FLD node having type struct.

Looks like I'll end up trying to do this kind of stuff in LocalAddressVisitor, where this transform IMO belongs. But that means that we'll have to deal with struct promotion and effectively do "morph copy block" at the same time, in LocalAddressVisitor. Or, as @sandreenko proposed in another issue, move struct promotion after LocalAddressVisitor so all the promotion details, including block copy, are handled during morph. Kind of strange how things fit together sometimes.

@mikedn
Contributor Author

mikedn commented Dec 14, 2018

@CarolEidt Since GT_BITCAST keeps coming up, here's a quick experiment that transforms *(int*)&float_var into GT_BITCAST: mikedn/coreclr@730badc

It's quite trivial. Saves ~1KB in corelib. The only problem will be adding support for long-double conversion on x86 (and ARM32 I guess).
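
The pattern being targeted, written in C++ for illustration (in managed code it shows up via pointer casts or Unsafe): reading a float local's bits as an int, which a BITCAST node lets the JIT keep in registers (a movd) or treat as a reg-optional memory load, instead of forcing the variable to the stack through a LCL_FLD.

#include <cstdint>
#include <cstring>

int32_t FloatBitsToInt(float f) {
    // *(int*)&f is the shape the JIT sees; memcpy is the well-defined way to
    // spell the same reinterpretation in portable C++.
    int32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return bits;
}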

@tannergooding
Member

It's quite trivial. Saves ~1KB in corelib

Would a similar optimization around Unsafe.As (when TFrom and TTo are the same size) be worth looking into?

@mikedn
Contributor Author

mikedn commented Dec 14, 2018

Would a similar optimization around Unsafe.As (when TFrom and TTo are the same size) be worth looking into?

For what types?

@tannergooding
Member

For what types?

I wasn't thinking about hardcoding it to particular types, but rather seeing if a more generalized option would be possible (any two structs that are both blittable and the same size is likely a good basis).

S.P.Corelib (and CoreFX) use Unsafe.As in a number of places to effectively reinterpret-cast a given ref to another type. Some examples are code in System.Enum, System.BitConverter, System.Buffer, etc.

  • If this is already being handled elsewhere, that's great. I'm just recalling a previous conversation around Unsafe.As and Unsafe.AsRef not always doing the most optimal thing

@mikedn
Contributor Author

mikedn commented Dec 14, 2018

any two structs that are both blittable and the same size is likely a good basis

Well, with some exceptions structs aren't enregistrable. Since they're already in memory, reinterpretation shouldn't have much impact on the generated code. That said, I don't remember looking at such cases so perhaps there are some improvements that could be made.

I'm just recalling a previous conversation around Unsafe.As and Unsafe.AsRef not always doing the most optimal thing

Perhaps that was about primitives and not structs? The problem with primitives is that the current handling of reinterpretation (by using GT_LCL_FLD) prevents the variable from being enregistered.

@mikedn
Contributor Author

mikedn commented Dec 23, 2018

There is an interesting case that BITCAST (well, at least in its current form) can't handle and that came up in "real world" (Decimal) code:

struct Pair { public int lo, hi; }
long l = *(long*)&pair.lo;

Today this forces the struct into memory, which is not ideal. The JIT uses a GT_LCL_FLD of type long; that avoids marking the struct var as address taken but still forces it into memory. We need to make GT_LCL_FLD (or a new oper) capable of handling the case of dependently promoted struct fields by assembling the resulting value from individual variables using shifts and masking.

Trouble is, it's not that simple to decide when it's better to spill the struct to memory so you can simply load the long value, and when it's better to assemble it from individual values. In this example, assembling from enregistered values would look something like:

mov ecx, eax ; eax contains lo
shl rdx, 32 ; edx contains hi
or rcx, rdx ; rcx contains the result

If the values aren't enregistered then the code gets a bit more messy as you may end up with 2 loads instead of the original one load:

mov ecx, dword ptr [lo]
mov edx, dword ptr [hi]
shl rdx, 32
or rcx, rdx

instead of just

mov rcx, qword ptr [lo]

There will be trade-offs to be made.

Now, all this is a bit orthogonal to this issue - GT_ASG. But you can also have:

*(long*)&pair.lo = l;

And end up with similar drawbacks:

mov dword ptr [lo], eax
shr rax, 32
mov dword ptr [hi], eax

instead of a single mov instruction.

And it gets even more complicated if you have more than 2 fields (e.g. you could read an int that spans 4 byte-sized fields).
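
A C++ sketch of the two value computations discussed above (illustrative only): assembling a long from two separately held int fields, and splitting a long back into them for the store direction.

#include <cstdint>

struct Pair { int32_t lo, hi; };

// long l = *(long*)&pair.lo, when lo and hi live as separate values:
int64_t ReadPairAsLong(const Pair& p) {
    return (int64_t)(((uint64_t)(uint32_t)p.hi << 32) | (uint32_t)p.lo);
}

// *(long*)&pair.lo = l, when lo and hi live as separate values:
void WriteLongToPair(Pair& p, int64_t l) {
    p.lo = (int32_t)(uint32_t)l;         // low half
    p.hi = (int32_t)((uint64_t)l >> 32); // high half
}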

@mikedn
Contributor Author

mikedn commented Dec 23, 2018

Going back to struct typed GT_LCL_FLD nodes - I've made an experiment and it's definitely doable. It does require a fair amount of work but it's not prohibitive. I got it to the point that it generates IR (and the associated assembly code) like STORE_LCL_FLD<struct>(LCL_VAR<struct>) or STORE_LCL_FLD<struct>(IND<struct>). It only does unrolled copies, so I don't know what impact, if any, the actual implementation will have on the code size.

The main problem is where exactly to produce such struct typed LCL_FLD nodes. I did it in LocalAddressVisitor, IMO that's where it belongs but it would likely work better once promotion is moved after LocalAddressVisitor. Attempting to handle copying from a promoted struct LCL_VAR to a struct LCL_FLD isn't fun. Generating LCL_FLD nodes and promotion should be done separately.

@pentp
Contributor

pentp commented Dec 23, 2018

The example from Decimal is probably optimal with direct memory accesses because the structs are always in memory anyway (often passed in ref or in). Enregistering a Decimal looks unlikely as it's 4 ints, unless it's enregistered as two longs (might be too complicated).

@mikedn
Contributor Author

mikedn commented Dec 23, 2018

The example from Decimal is probably optimal with direct memory accesses because the structs are always in memory anyway (often passed in ref or in).

Yes but the general case can be more complicated. You could use the lo and hi separately in 50 places and then have just one place where someone uses (long*)&lo and then performance goes down the drain. Though I suppose that then you can blame the developer, it's not the JIT's job to recover from poor optimization choices.

And then there's also the pesky problem of the SYS V calling convention that stuffs 2 int struct fields in a register. At least because of that it would be good to find a way to treat such cases better.

Enregistering a Decimal looks unlikely as it's 4 ints, unless it's enregistered as two longs (might be too complicated).

Actually a 4 int struct can be "promoted" - treated as 4 individual int variables. But if you use address tricks then those variables are no longer enregistered, even if they're still individual variables.

Hmm, that's actually something I should pay attention to. If you have 4 int fields and read only 2 as long then the other 2 shouldn't be affected. But I'm pretty sure that today they're all affected.

@mikedn
Contributor Author

mikedn commented Dec 27, 2018

And then there's also the pesky problem of the SYS V calling convention that stuffs 2 int struct fields in a register.

And of course, this isn't actually SYS V specific. It happens on Windows as well, except only for structs of size <= 8 (e.g. on Linux Decimal is passed in 2 64-bit registers while on Windows it is passed on the stack).

The not-so-fortunate effects of the JIT's limitations in this area can be seen in the recently added Range.Equals method:

       mov      qword ptr [rsp+10H], rdx
       mov      eax, dword ptr [rsp+10H] ; GT_LCL_FLD
       mov      edx, dword ptr [rcx]
       cmp      eax, edx
       jne      SHORT G_M32203_IG04
       mov      eax, dword ptr [rsp+14H] ; GT_LCL_FLD
       mov      edx, dword ptr [rcx+4]
       cmp      eax, edx

The ideal codegen would be something like:

       cmp      dword ptr [rcx], edx
       jne      SHORT G_M32203_IG04
       shr      rdx, 32
       cmp      dword ptr [rcx+4], edx

@CarolEidt Do you know if there's an open issue tracking such problems? Should there be?

This isn't directly related to GT_ASG removal, it's just another case where IR limitations cause problems so it's something to keep in mind.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@SingleAccretion SingleAccretion self-assigned this Feb 1, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 16, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 24, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 23, 2023