Bad x64 codegen in VerifyUnifierConsistency #8606

Closed
MichalStrehovsky opened this issue Jul 20, 2017 · 41 comments · Fixed by dotnet/coreclr#13016
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), bug


@MichalStrehovsky (Member)

@sergiy-k and I hit this in CoreRT, but this has a repro in CoreCLR too.

Compile the attached Program.txt with csc /noconfig /nostdlib /r:System.Private.CoreLib.dll Program.cs /define:DEBUG /O (note: we define the DEBUG symbol and enable optimizations) and run the resulting executable with CoreRun. The code hits an assert. Now drop the /O and the assert no longer fires.

@sergiy-k found some suspicious use of the rdi register in VerifyUnifierConsistency. Setting COMPlus_JitDisasm=VerifyUnifierConsistency and inspecting the use of rdi in the disassembly might be a good starting point.

@MichalStrehovsky changed the title from "Bad x64 codegen in VerifyUnifierConsistency (CoreRT)" to "Bad x64 codegen in VerifyUnifierConsistency" on Jul 20, 2017
@AndyAyersMS (Member)

I'll take a look.

@AndyAyersMS (Member)

Looks like your diagnosis is spot-on. The JIT is using r14d for walk1, but when it indexes into _entries it mysteriously switches over to using rdi:

G_M65020_IG12:
...
4C8B4610         mov      r8, gword ptr [rsi+16]      ;; get _entries from this
453B7008         cmp      r14d, dword ptr [r8+8]      ;; bounds check walk1
0F835F010000     jae      G_M65020_IG19
4D8D44F810       lea      r8, bword ptr [r8+8*rdi+16] ;; get address of _entries[walk1]...???

@mikedn (Contributor) commented Jul 20, 2017

it mysteriously switches over to using rdi

And it looks like the cast from int to long is missing too.

@AndyAyersMS (Member)

Also a regression from 1.1. Note that the bounds check and the array access will generally need two different registers, because this is an array of structs: the physical index is the logical index * 24. 1.1 has the *8 in the addressing mode and computes the *3 earlier. 2.0 seems to lose the *3 somehow; presumably that is what should have been in rdi.

G_M65020_IG12:
 ...
 4863D5               movsxd   rdx, ebp
 4C8D3C52             lea      r15, [rdx+2*rdx]
 ...
 4C8B4610             mov      r8, gword ptr [rsi+16]
 413B6808             cmp      ebp, dword ptr [r8+8]
 0F8375010000         jae      G_M65020_IG20
 4F8D44F810           lea      r8, bword ptr [r8+8*r15+16]
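Reading that 1.1 sequence off, the address being formed for _entries[walk1] is

base + 16 + 24*walk1 = base + 16 + 8*(3*walk1)

where 16 is the offset of the first array element and 24 is the element size. x64 addressing modes can only scale by 1, 2, 4, or 8, so the *3 has to live in its own register -- r15 here, and presumably what rdi was meant to hold in the 2.0 code.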

Will keep digging... seems like maybe an optimizer bug at this point.

cc @dotnet/jit-contrib

@AndyAyersMS (Member)

Best guess so far is that gtFoldExprSpecial sees a box(..) == null and dead-codes it, but hidden in the box operand is a side effect that initializes the *3 CSE (V37 here), and that gets lost in the process:

Bashing [000121] to NOP as part of dead box operation
N040 (  1,  1) [000131] ------------             /--*  CNS_INT   ref    null $VN.Null
N041 ( 55, 53) [000132] NA-XGO------             *  NE        int    $26e
N039 ( 50, 51) [000130] -A-XGO------             \--*  BOX       ref    $5d0
N037 (  1,  1) [000128] ------------                |  /--*  LCL_VAR   ref    V08 tmp3         u:3 (last use) $2d5
N038 ( 44, 47) [000129] -A-XGO------                \--*  COMMA     ref    $2d5
N035 ( 38, 41) [000116] *A-XG-------                   |  /--*  IND       int    <l:$26d, c:$609>
N033 (  1,  1) [000571] ------------                   |  |  |  /--*  CNS_INT   long   8 field offset Fseq[_key] $1c3
N034 ( 36, 39) [000572] -A-XG--N----                   |  |  \--*  ADD       byref  $215
N030 ( 23, 23) [000592] -A--G-------                   |  |     |     /--*  ADDR      byref  $82
N029 ( 12, 12) [000114] aA--G--N----                   |  |     |     |  \--*  IND       struct <l:$680, c:$6c0>
N027 (  1,  1) [000584] ------------                   |  |     |     |     |  /--*  CNS_INT   long   16 Fseq[#FirstElem] $1c2
N028 ( 11, 11) [000585] -A-----N----                   |  |     |     |     \--*  ADD       byref  $82
N024 (  1,  1) [000581] -------N----                   |  |     |     |        |     /--*  CNS_INT   long   3 $1c8
N025 (  9,  9) [000582] -A-----N----                   |  |     |     |        |  /--*  LSH       long   $545
N022 (  1,  1) [000899] ------------                   |  |     |     |        |  |  |  /--*  LCL_VAR   long   V37 cse2          $544
N023 (  8,  8) [000900] -A----------                   |  |     |     |        |  |  \--*  COMMA     long   $544
N018 (  1,  1) [000590] ------------                   |  |     |     |        |  |     |     /--*  CNS_INT   long   3 $1c8
N019 (  7,  7) [000591] ------------                   |  |     |     |        |  |     |  /--*  MUL       long   $544
N017 (  2,  3) [000580] ------------                   |  |     |     |        |  |     |  |  \--*  CAST      long <- int $543
N016 (  1,  1) [000577] i-----------                   |  |     |     |        |  |     |  |     \--*  LCL_VAR   int    V03 loc2         u:4 $484
N021 (  7,  7) [000898] -A------R---                   |  |     |     |        |  |     \--*  ASG       long   $VN.Void
N020 (  1,  1) [000897] D------N----                   |  |     |     |        |  |        \--*  LCL_VAR   long   V37 cse2          $544
N026 ( 10, 10) [000583] -A-----N----                   |  |     |     |        \--*  ADD       byref  <l:$210, c:$20f>
N015 (  1,  1) [000576] ------------                   |  |     |     |           \--*  LCL_VAR   ref    V25 tmp20        u:3 (last use) <l:$2d7, c:$5cd>
N031 ( 31, 34) [000586] -A-XG-------                   |  |     |  /--*  COMMA     byref  <l:$214, c:$213>
N014 (  8, 11) [000579] ---X--------                   |  |     |  |  \--*  ARR_BOUNDS_CHECK_Rng void   <l:$2dd, c:$2dc>
N011 (  1,  1) [000113] ------------                   |  |     |  |     +--*  LCL_VAR   int    V03 loc2         u:4 $484
N013 (  3,  3) [000578] ---X--------                   |  |     |  |     \--*  ARR_LENGTH int    <l:$448, c:$447>
N012 (  1,  1) [000575] ------------                   |  |     |  |        \--*  LCL_VAR   ref    V25 tmp20        u:3 <l:$2d7, c:$5cd>
N032 ( 35, 38) [000587] -A-XG-------                   |  |     \--*  COMMA     byref  <l:$214, c:$213>
N008 (  4,  4) [000112] ---XG-------                   |  |        |  /--*  IND       ref    <l:$2d7, c:$5cd>
N006 (  1,  1) [000588] ------------                   |  |        |  |  |  /--*  CNS_INT   long   16 field offset Fseq[_entries] $1c2
N007 (  2,  2) [000589] -------N----                   |  |        |  |  \--*  ADD       byref  $206
N005 (  1,  1) [000111] ------------                   |  |        |  |     \--*  LCL_VAR   ref    V00 this         u:2 $100
N010 (  4,  4) [000574] -A-XG---R---                   |  |        \--*  ASG       ref    <l:$2d7, c:$5cd>
N009 (  1,  1) [000573] D------N----                   |  |           \--*  LCL_VAR   ref    V25 tmp20        d:3 <l:$2d7, c:$5cd>
N036 ( 43, 46) [000127] -A-XGO------                   \--*  ASG       int    $VN.Void
N004 (  4,  4) [000126] -----O-N----                      \--*  IND       int    <l:$26c, c:$609>
N002 (  1,  1) [000124] ------------                         |  /--*  CNS_INT   long   8 $1c3
N003 (  2,  2) [000125] -------N----                         \--*  ADD       byref  $20e
N001 (  1,  1) [000123] ------------                            \--*  LCL_VAR   ref    V08 tmp3         u:3 $2d5

Most of the other cases in this method check for side effects and defer; seems like we should verify the box arg is side-effect free here too.

@AndyAyersMS (Member)

Because the key is a value type, the assert below is trivially true. But evaluating it also generates a CSE (the *3 scaled-index computation, V37 above) that is used in the subsequent statement. Should be easy enough to create a small repro...

                    Debug.Assert(_entries[walk1]._key != null);
                    int hashCode = _entries[walk1]._key.GetHashCode();

@AndyAyersMS (Member)

Somewhat simpler repro. Codegen bug is in Test as above.

using System;

public struct S<K>
{
    public int x;
    public int y;
    public K val;
}

public class X<K,V> 
{
    public X(K k)
    {
        a = new S<K>[2];
        a[1].val = k;
        a[1].x = 3;
        a[1].y = 4;
    }

    public void Assert(bool b)
    {
        if (!b) throw new Exception("bad!");
    }

    public int Test()
    {
        int r = 0;
        for (int i = 0; i < a.Length; i++)
        {
            Assert(a[i].val != null);
            r += a[i].val.GetHashCode();
        }
        return r;
    }

    S<K>[] a;
}

class B
{
    public static int Main()
    {
        var a = new X<int, string>(11);
        int z = a.Test();
        return (z == 11 ? 100 : 0);
    }
}

@AndyAyersMS (Member)

The underlying assume-no-side-effects weakness in gtFoldExprSpecial predates 2.0 -- indeed it is there in the jit32 sources as well.

Not clear yet if we can expose this bug in older jits. It is showing up in 2.0 because of more aggressive CSEs.

@AndyAyersMS (Member)

Adding a side effect check fixes the bad codegen, but is overly pessimistic. Likely need to rework the box code in the importer to forcibly move possible side effects out from under the box.

@AndyAyersMS (Member)

Or maybe not: the optimization done in gtFoldExprSpecial effectively dead-codes a struct copy because it's folded in under the box. If we split the copy out, it may become problematic.

This argues for drilling deeper into the box operand during gtFoldExprSpecial to try to sort out which side effects are "expected" and can be ignored and which aren't... not thrilled about this, but it is probably the most surgical fix.

@AndyAyersMS (Member)

Simpler repro that does not require CSE. Here the call to Get is dropped entirely, so count never increments and Test incorrectly returns false.

using System;
using System.Runtime.CompilerServices;

public class X<K> 
{
    public X(K k1)
    {
        k = k1;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public K Get()
    {
        Console.WriteLine("Get called");
        count++;
        return k;
    }

    public bool Test()
    {
        bool x = Get() != null;
        bool y = (count == 1);
        return x && y;
    }

    K k;
    int count;
}

class B
{
    public static int Main()
    {
        var a = new X<int>(11);
        bool result = a.Test();
        Console.WriteLine("Passed: {0}", result);
        return result ? 100 : 0;
    }
}

@AndyAyersMS (Member)

One more example; this time we lose an exception-causing dereference: with the bad fold, Test returns a constant without ever touching a, so the expected NullReferenceException never happens and Main returns 0.

using System;
using System.Runtime.CompilerServices;

public class X<K> 
{
    public X(K k1)
    {
        k = k1;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static bool Test(X<K> a)
    {
        return (a.k != null);
    }

    public K k;
}

class B
{
    public static int Main()
    {
        X<int> a = null;
        bool result = false;
        try 
        {
            X<int>.Test(a);
        }
        catch (Exception)
        {
            result = true;
        }
        Console.WriteLine("Passed: {0}", result);
        return result ? 100 : 0;
    }
}

@AndyAyersMS (Member)

Still searching for the right fix. Couple more notes.

  1. The initial trees produced by the box are roughly
    Stmt1: newobj into a newly live local "boxTemp"
    Stmt2: ... a tree of BOX(COMMA(boxTemp, ASG(exprToBox, boxTemp)))
  2. If a BOX of a value type appears under an EQ/NE compare against null, we optimize it in gtFoldExprSpecial. This can be called at various times and is even invoked in the importer.
  3. The optimization replaces the (EQ/NE null (BOX ...)) tree with 0/1 as appropriate. Thus all side effects under the BOX are lost. It also nulls out Stmt1.
  4. Most often the only side effect under the box is a modification of the "boxTemp" which dies at the compare so we are ok with losing those side effects.
  5. But if the evaluation of exprToBox contains other side effects we may lose them too, and this is the bug we're trying to fix. Examples above show cases where these effects include embedded assignments for CSEs, calls, and exceptions.
  6. Interestingly CSC now may use cgt.un for the compare of box vs null which breaks the pattern in gtFoldExprSpecial and prevents this optimization from happening early, in the importer -- later assertion prop changes the comparison operator to EQ/NE and it fires during optimization. So we should broaden the predicate types we look for in gtFoldExprSpecial to try and catch more cases of this earlier (see the sketch after this list). Since this can nuke relatively large trees it is worth doing as early as possible.
  7. Whether invoked early or late we run the risk of losing side effects. There is potentially some advantage to catching this early as the shape of the tree under the box is more constrained so presumably the necessary side effects could be more readily extracted.
  8. Generally we would expect almost all cases could be found early since this is largely a type based optimization -- however if the null required const-prop or similar perhaps we would miss some if we only did this early.
  9. One could imagine splitting off any side effects from exprToBox into yet another statement and (if anything is split) storing the result in another temp. Since the cases where exprToBox has side effects are rare this seems like a viable approach. However...
  10. The BOX operator IR only appears for simple inline box cases; this has two sub cases, one where exprToBox is a simple scalar type and the other where it is a struct. Both cases are vulnerable to this bug. So if we do the split as in 9, in the struct case we run the risk of creating a dead struct copy if we end up optimizing the box away, and an unnecessary struct copy if we don't optimize (say the BOX is not under a compare). I haven't gotten far enough along to see if we can optimize this away. This case may be rare enough that we'd be willing to pay this cost to avoid the bug.
  11. Ideally we'd look ahead during importation and see that the BOX was feeding a compare and only then decide to split off the effects, but that's likely not viable.
  12. But if we only did this optimization early, we might be able to look back when we did the optimization into the BOX tree and effectively split off the side effects (since as noted above the tree shapes are limited).
  13. An attempt to do this splitting within gtFoldExprSpecial foundered because JTRUE is not well prepared to handle COMMA, so there's no easy way in an expression context to hold onto the effects.
  14. This optimization is important and we need to inhibit as few cases as possible. There are many examples in corelib (presumably all correctly optimized?) so getting that down to minimal diffs is a good indicator of progress.
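For illustration, here is a hedged sketch of the two compare shapes mentioned in items 2 and 6. The class and method names are made up, and the IL noted in the comments is my reading of what current CSC emits for an unconstrained type parameter rather than output captured from the repro:

public static class NullCompares<T>
{
    public static bool IsNull(T t)  => t == null;  // IL: box, ldnull, ceq     -> GT_EQ, matched directly by gtFoldExprSpecial
    public static bool NotNull(T t) => t != null;  // IL: box, ldnull, cgt.un  -> unsigned GT_GT, only folded after the later GT_NE rewrite
}

For a value-type T both methods fold to constants (false and true respectively); the bug is about what happens to any side effects feeding the box when that folding occurs.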

@mikedn (Contributor) commented Jul 21, 2017

Interestingly CSC now may use cgt.un for the compare of box vs null which breaks the pattern in gtFoldExprSpecial and prevents this optimization from happening early, in the importer -- later assertion prop changes the comparison operator to EQ/NE and it fires during optimization. So we should broaden the predicate types we look for in gtFoldExprSpecial to try and catch more cases of this earlier. Since this can nuke relatively large trees it is worth doing as early as possible.

Hmm, when I added the GT.UN->EQ/NE transform I considered doing it right in the importer, but I ended up adding it in morph because that's where most such transforms are done. It looks like doing it in the importer might actually be a good idea; after all, it just compensates for an IL deficiency (the lack of a cne instruction) rather than being an actual optimization.

@erozenfeld (Member)

fgMorphRecognizeBoxNullable is a similar optimization, although there we have a helper call instead of GT_BOX. I recently fixed it to bail out when called after morphing the args and also to be invoked on gt.un: dotnet/coreclr#12395. It may be worth double-checking that we can't possibly remove side effects there.

@JosephTremoulet (Contributor)

Is it worth having a different form of box IR operator? I'm not following whether "boxTemp" is holding a copy of the struct (and the reason we introduce boxTemps is that our BOX operation needs an lvalue?) or whether "boxTemp" is holding a pointer to the box on the heap (in which case I'm a little confused about what the assign is doing), but presumably if it's the first case we could have e.g. a BoxValue operator that just takes the value and implicitly needs a temp (which could be manifested at morph/lower/wherever), and in the latter case we could have e.g. a NewBox operator that again just takes the value but carries an implied heap allocation.

@AndyAyersMS (Member)

The box temp is a reference to the heap instance and is the result of the box. The general sequence for the inline box is:

boxtemp = newobj
copy contents from exprToBox to payload portion of boxtemp
result is boxtemp

The newobj is in a preceding statement; the copy is in a comma under the box.

Am wondering now if we also have to consider possible side effects from the newobj, e.g. if it triggers class init. Will try and work up a repro.

@AndyAyersMS (Member)

IR example:

[000008] ------------        *  STMT      void  (IL 0x000...  ???)
[000005] --C-G-------        |  /--*  CALL help ref    HELPER.CORINFO_HELP_NEWSFAST
[000003] ------------ arg0   |  |  \--*  CNS_INT(h) long   0x7ffbd8d991c0 class
[000007] -AC-G-------        \--*  ASG       ref   
[000006] D------N----           \--*  LCL_VAR   ref    V01 tmp0        ;; tmp0 == boxTemp   

[000020] ------------        *  STMT      void  (IL   ???...  ???)
[000019] -A-XG-------        \--*  RETURN    int   
[000017] ------------           |  /--*  CNS_INT   ref    null
[000018] NA-XG----U--           \--*  GT        int   
[000016] -A-XG-------              \--*  BOX       ref   
[000014] ------------                 |  /--*  LCL_VAR   ref    V01 tmp0            ;; box result
[000015] -A-XG-------                 \--*  COMMA     ref   
[000002] ---XG-------                    |  /--*  FIELD     int    k
[000001] ------------                    |  |  \--*  LCL_VAR   ref    V00 arg0      ;; exprToBox
[000013] -A-XG-------                    \--*  ASG       int                        ;; copy
[000012] -------N----                       \--*  IND       int   
[000010] ------------                          |  /--*  CNS_INT   long   8
[000011] ------------                          \--*  ADD       byref 
[000009] ------------                             \--*  LCL_VAR   ref    V01 tmp0         

@JosephTremoulet (Contributor)

boxtemp = newobj
copy contents from exprToBox to payload portion of boxtemp
result is boxtemp

Ok, that makes sense, but then what is the point of the GT_BOX if we already have the allocation and the copy separately? Indeed, rationalize treats BOX like identity. So does it just serve as a flag to optimizations like this one? Seems a bit hokey to have an operator that means "I'm probably next to a copy and assignment that are dead if I am". Should we consider making the allocation and the copy part of the semantics of the GT_BOX itself? We'd need to manifest them, maybe at rationalize (except we may need to give morph a shot at expanding the struct copy...), or maybe at morph (if that's late enough to catch the cases this is trying to optimize)...

@AndyAyersMS (Member)

Can't really defend the current approach, just explaining what it is. The Box points back at the earlier assign statement, and this link is used in gtFoldExprSpecial to null out the assignment if the box is optimized away.

So yeah the Box flags this node as an optimization opportunity and tracks some of the context needed to pull it off. Not sure if exposing the parts and normal deadcoding could do the same. Perhaps encapsulation could reveal the parts only when we know they're needed, but that starts taking us outside the realm of a suitable fix for a 2.0 regression.

@JosephTremoulet (Contributor)

Not sure if exposing the parts and normal deadcoding could do the same

I don't think we have the facility to remove heap allocations and stores, no.

Perhaps encapsulation could reveal the parts only when we know they're needed

That's what I was trying to suggest, though I'm not sure how problematic the phase ordering issues would be (can't lower the copy until it gets revealed, don't want to reveal it until after having all chances to remove the box)

but that starts taking us outside the realm of a suitable fix for a 2.0 regression.

Ah, I'd missed that bit of context. Yes, I agree.

Box points back at the earlier assign statement

That suggests we might be able to expand the pattern match -- see what the LHS of the assign is; if the argument to the box is a comma of an assign that stores to an indir off the LHS and then returns the LHS, then only extract side-effects from the arguments to the newobj and the struct-assign RHS; otherwise, extract side-effects from the box itself?

@briansull (Contributor)

We should be able to check the side effect flags on any node that we are considering for the optimization. If we see an unsafe flag set, such as GTF_ASG, then we can decline to do the optimization.

@JosephTremoulet (Contributor) commented Jul 21, 2017

@briansull sure, the issue is (quoting Andy's example above) that if the optimization is removing the box at [000016] (and replacing the GT at [000018] with "true"), then we don't want to decline -- we want to remove [000013] even though it has GTF_ASG set (and likewise remove [000007]/[000005] despite their side-effect flags), but need to preserve side-effects in their sub-trees.

Yes it would be correct to just go conservative, but per above,

This optimization is important and we need to inhibit as few cases as possible. There are many examples in corelib

@AndyAyersMS (Member)

Indeed -- in that example (which is taken from the latest repro above) we must keep just this bit of the tree under the BOX:

[000002] ---XG-------                    |  /--*  FIELD     int    k
[000001] ------------                    |  |  \--*  LCL_VAR   ref    V00 arg0      ;; exprToBox

Otherwise we lose the NPE that is required, since arg0 is null.
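In source-level terms, a hedged sketch of what a correctly folded Test still has to do for a value-type K -- the box-vs-null compare can go away, but the dereference of a cannot (this is roughly the "minimal load" shape a fix would want to keep):

public static bool Test(X<K> a)
{
    K unused = a.k;   // preserves the NullReferenceException when a is null
    return true;      // the box-vs-null compare folds to true for value-type K
}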

@mikedn (Contributor) commented Jul 21, 2017

but need to preserve side-effects in their sub-trees

CSE does something like this: https://github.com/dotnet/coreclr/blob/master/src/jit/optcse.cpp#L2073. Though it doesn't look like that code could easily be reused.

@AndyAyersMS (Member)

All the "must keep" side effects initially come from the exprToBox tree. Since BOX could operate with address of exprToBox, conceptually we should be producing this address its own tree that is not under the BOX and we'd have all the must keep side effects walled off outside the scope of the BOX.

But later on, if we're unlucky, perhaps new side effects (e.g. a CSE def) could be introduced into the tree that still sits under the BOX (where we copy from the exprToBox to the box payload). Maybe this is rare or not actually possible, but it seems like it could happen.

If it's not possible for the tree under the BOX to get additional side effects, then the split above would conceptually solve the problem and we could zap the BOX and the ASSIGN and the side effects would stay behind.

To make this work though we'd still have to figure out where to put the address computation tree and whether it is viable to force exprToBox to memory so we can take its address.

We could form the exprToBox address in a comma above the BOX and fix gtFoldExprSpecial to tunnel via gtEffectiveVal, but that would reproduce the (JTRUE, COMMA(..., 0/1)) shape that was problematic in an earlier prototype. So perhaps it is better to put it in a prior statement and pass it to the BOX via a temp? This would be simple for the scalar case since that's close to what happens now. Not sure about the struct case just yet.

@AndyAyersMS (Member)

@erozenfeld I haven't been able to find similar issues with nullables yet, but can't rule it out either...

@AndyAyersMS (Member)

Looking a bit at why this doesn't happen in 1.1 or in Jit32, I think it is from the GT_GT -> GT_NE change we put into 2.0. Also explains why 4.7 doesn't have this issue and 4.7.1 will unless we push whatever fix we come up with here...

Odd, then, that the pattern in gtFoldExprSpecial only fires for the GT_EQ and GT_NE cases. Seems like it must be something else; maybe older CSCs used signed compares?

Adding GT_GT support to gtFoldExprSpecial helps dead-box a few more cases (19 methods impacted in the default jit-diff set).

@mikedn (Contributor) commented Jul 21, 2017

Odd, then, that the pattern in gtFoldExprSpecial only fires for the GT_EQ and GT_NE cases. Seems like it must be something else; maybe older CSCs used signed compares?

Signed compares? Can't be, only cgt.un, not cgt, can be used on object references.

@JosephTremoulet (Contributor)

BOX could operate on the address of exprToBox...

Would that be changing the semantics of GT_BOX? Another way we could shape the IR differently in the importer would be to hoist out the assign that's under the comma as well, so instead of

assign_stmt:  Assign x, (call newobj())
box_stmt:  box (comma (assign *x, exprToBox), x)

we could have

assign_stmt1: Assign x, (call newobj())
assign_stmt2: Assign *x, exprToBox
box_stmt:  box x

The rules for the transform would be: nuke assign_stmt1, and from assign_stmt2 preserve only the side effects of exprToBox.

If we want to do something in that vein but avoid putting another pointer on GenTreeBox, we could possibly do something like:

assign_stmt: comma (Assign x, (call newobj())), (Assign *x, exprToBox)
box_stmt:  box x

and the optimization could verify that the statement the box points to has the expected form (comma of two assigns)...

I wouldn't say it's an elegant suggestion, and I don't know if we're using commas of assigns elsewhere or if those might cause trouble, but it would be a way to shuffle the handshake w/o changing the semantics of the box operator (except that the meaning of the link to the statement on GenTreeBox would change, but this optimization appears to be the only consumer of that)

@AndyAyersMS (Member)

Yeah, using a 2-statement preamble occurred to me too. There is some variability in how the copy is represented, but perhaps it is easy enough to see how to preserve the necessary side effects. A second pointer on the Box is probably ok.

@AndyAyersMS (Member)

Hmm, looking at what gtCloneExpr already does for BOX makes me wonder about the viability of having it point at other statements. Seems like there is potential for real trouble here. I suppose adding a second statement shouldn't get us any deeper into trouble than we already are.

@AndyAyersMS (Member)

Incomplete fix attempt: https://github.com/AndyAyersMS/coreclr/tree/FixDeadBox. The copy is now walled off in a second statement, but it is still not clear how to extract just the "source" side effects of the copy.

@AndyAyersMS (Member)

Added some tweaks to the fix attempt. It can now get through corelib without blowing up, though the preservation of source side effects for struct copies is ad-hoc and ugly.

511 methods are impacted, with an overall net size change around zero but some big regressions. One such case is the generic ObjectEqualityComparer IndexOf on value types. Here the IL contains the following (CSC looks like it already clones the IL in various places to specialize the "looking for null" case):

IL_002f  03                ldarg.1     
IL_0030  08                ldloc.2     
IL_0031  a3 a5 00 00 1b    ldelem       0x1B0000A5    ;; fetch a[i]
IL_0036  8c a5 00 00 1b    box          0x1B0000A5    ;; result is non-null for value types
IL_003b  2c 19             brfalse.s    25 (IL_0056)  ;; will never branch for value types
IL_003d  03                ldarg.1     
IL_003e  08                ldloc.2     
IL_003f  fe 1e             readonly.   
IL_0041  8f a5 00 00 1b    ldelema      0x1B0000A5    ;;  fetch &a[i]
IL_0046  04                ldarg.2     
IL_0047  fe 16 a5 00 00 1b constrained. 0x1B0000A5
IL_004d  6f 80 07 00 0a    callvirt     0xA000780
IL_0052  2c 02             brfalse.s    2 (IL_0056)

The current jit's dead-box opt will toss the whole subtree under the box, which includes the ldelem. So it loses potential null and bounds checks. But the subsequent ldelema does the same checks, which masks the fact that the box-opt transformation is unsound.

My prototype preserves the side effects of the ldelem and so there are now two index computations. For some reason I have yet to drill into, this now triggers loop cloning. So these methods see large size increases.
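For context, a rough sketch (from memory, not the verbatim corelib source) of the C# shape that produces IL like the above. Note how the looking-for-null case is split out, and how, for a value-type T, the remaining box-vs-null test on array[i] can never take the branch even though the ldelem feeding the box carries the array null and bounds checks:

internal int IndexOf(T[] array, T value, int startIndex, int count)
{
    int endIndex = startIndex + count;
    if (value == null)
    {
        for (int i = startIndex; i < endIndex; i++)
            if (array[i] == null) return i;                              // the specialized "looking for null" clone
    }
    else
    {
        for (int i = startIndex; i < endIndex; i++)
            if (array[i] != null && array[i].Equals(value)) return i;    // box-vs-null test, then the constrained Equals call
    }
    return -1;
}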

@AndyAyersMS (Member)

Ah, it seems loop cloning can't spot array indices in the &a[i] case. So the current jit doesn't notice that, by cloning, it could remove the range check (probably a good thing...). But now the preserved side effects from the box create something like a[i] as well, which loop cloning spots and decides to optimize.

We don't need an explicit load from the source of the box; we just need a null check on the address. However, GT_NULLCHECK has limitations and expects its operand to be side-effect free. So while it would make sense to use it, doing so would require a new temp.

@mikedn (Contributor) commented Jul 23, 2017

Looking a bit at why this doesn't happen in 1.1 or in Jit32, I think it is from the GT_GT -> GT_NE change we put into 2.0. Also explains why 4.7 doesn't have this issue and 4.7.1 will unless we push whatever fix we come up with here...

To be clear, this bug exists in both JIT32 and the RyuJIT that ships with .NET Framework 4.7. You just need to use == null instead of != null. The GT_GT -> GT_NE change just enables this broken transform to also run for != null.
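Concretely, a sketch adapted from the exception repro above -- flipping the comparison to == should reproduce the lost dereference on JIT32 and on the 4.7 RyuJIT, since only the EQ/NE pattern is matched there:

[MethodImpl(MethodImplOptions.NoInlining)]
public static bool Test(X<K> a)
{
    return (a.k == null);   // folds to false for value-type K; the a.k dereference should still throw when a is null
}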

@AndyAyersMS (Member)

I suspected JIT32 and older RyuJits had the same bug. Thanks for pointing out an example. Would be interesting to know if JIT64 has it too (not as likely) and how far back it goes.

@AndyAyersMS (Member)

Looks like JIT64 is ok. The bug is there in JIT32 in v2.0.

@mikedn (Contributor) commented Jul 24, 2017

Would be interesting to know if JIT64 has it too (not as likely)

JIT64 works fine. It does the transform and keeps the relevant side effects. For return (a.k == null); it generates the following (the otherwise-dead load of a.k is what preserves the NullReferenceException when a is null):

 mov         eax,dword ptr [rcx+8] 
 xor         eax,eax 
 ret 

how far back it goes.

Hmm, all my machines already have .NET 4.7, so I don't know about other 4.x versions. But I tried with the .NET 2.0 x86 runtime and the bug is there too. It's probably safe to say that the bug has existed since this optimization was introduced.

@mikedn (Contributor) commented Jul 24, 2017

It's kind of funny that if a struct is used instead of int, JIT64 keeps a copy of a.k in a temporary. Loading a single byte from a.k would have been enough to preserve the side effect.

@AndyAyersMS (Member)

My prototype just tries to read the first byte, though it would be somewhat less hacky to just copy the struct into a temporary.

I had hoped to use GT_NULLCHECK but it has limitations.

AndyAyersMS referenced this issue in AndyAyersMS/coreclr Jul 25, 2017
Boxing a value type produces a non-null result. If the result of the
box is only used to feed a compare against null, the jit tries to
optimize the box away entirely since the result of the comparison is
known. Such idiomatic expressions arise fairly often in generics
instantiated over value types.

In the current implementation the box expands into two parts: an
upstream statement to allocate heap space, and then an expression
tree containing an encapsulated copy from the value being boxed to
the payload of the new heap object. Wrapping around that is a
reference to the new object, which is the result of the box.

When the optimization fires, the upstream allocation is removed,
and the current implementation also removes the entire box expression
tree. However this tree can include important side effects from
the evaluation of the value being boxed that must be preserved.

For instance the value might come from an array, in which case
the box expression tree would contain the array null check
and bounds check. So removing the entire tree can alter behavior.

This fix attempts to carefully preserve the important side
effects by moving the copy into a second statement upstream from the
box. The box itself is then just a trivial side-effect-free reference
to the box temp local.

When the optimization fires the jit removes the upstream heap allocation
as before, as well as the now-trivial box tree. It analyzes the source
side of the upstream copy. If it is side effect free the copy is removed
entirely. If not, the jit modifies the copy into a minimal load of the
boxed value, and this load should reproduce the necessary side effects.

Fixes #12949.
AndyAyersMS referenced this issue in AndyAyersMS/coreclr Jul 29, 2017
Boxing a value type produces a non-null result. If the result of the box is
only used to feed a compare against null, the jit tries to optimize the box
away entirely since the result of the comparison is known. Such idiomatic
expressions arise fairly often in generics instantiated over value types.

In the current implementation the box expands into two parts. The first is
an upstream statement to allocate a boxed object and assign a reference to
the boxed object to a local var known as the "box temp". The second is an
expression tree whose value is the box temp that also contains an
encapsulated copy from the value being boxed to the payload section of the
boxed object. The box node also contains a pointer back to the first
statement (more on this later).

In the examples being discussed here this second tree is a child of a compare
node whose other child is a null pointer. When the optimization fires, the
upstream allocation statement is located via the pointer in the box node and
removed, and the entire compare is replaced with a constant 0 or 1 as
appropriate. Unfortunately the encapsulated copy in the box subtree may
include side effects that should be preserved, and so this transformation is
unsafe.

Note that the copy subtree as a whole will always contain side effects, since
the copy is storing values into the heap, and that copy now will not happen.
But the side effects that happen when producing the value to box must remain.

In the initial example from #12949 the side effects in question were
introduced by the jit's optimizer to capture a CSE definition. dotnet#13016 gives
several other examples where the side effects are present in the initial user
code. For instance the value being boxed might come from an array, in which
case the encapsulated copy in the box expression tree would contain the array
null check and bounds check. So removing the entire tree can alter behavior.

This fix attempts to carefully preserve the important side effects by
reworking how a box is imported. The copy is now moved out from under the box
into a second upstream statement. The box itself is then just a trivial
side-effect-free reference to the box temp. To ensure proper ordering of side
effects the jit spills the evaluation stack before appending the copy
statement.

When the optimization fires the jit removes the upstream heap allocation
as before, as well as the now-trivial compare tree. It analyzes the source
side of the upstream copy. If it is side effect free, the copy is removed
entirely. If not, the jit modifies the copy into a minimal load of the
boxed value, and this load should reproduce the necessary side effects.

The optimization is only performed when the tree shape of the copy matches
expected patterns.

There are some expected cases where the tree won't match, for instance if the
optimization is invoked while the jit is inlining. Because this optimization
runs at several points the jit can catch these cases once inlining completes.
There is one case that is not handled that could be -- if the assignment part
of the copy is itself a subtree of a comma. This doesn't happen often.

The optimization is now also extended to handle the case where the comparison
operation is `cgt.un`. This doesn't catch any new cases but causes the
optimization to happen earlier, typically during importation, which should
reduce jit time slightly.

Generally the split of the box into two upstream statements reduces code size,
especially when the box expression is incorporated into a larger tree -- for
example a call. However in some cases where the value being boxed comes from
an array, preserving the array bounds check now causes loop cloning to kick
in and increase code size. Hence the overall size impact on the jit-diff set is
essentially zero.

Added a number of new test cases showing the variety of situations that must
be handled and the need to spill before appending the copy statement.

Fixes #12949.
AndyAyersMS referenced this issue in dotnet/coreclr Jul 31, 2017
RussKeldorph referenced this issue in dotnet/coreclr Aug 18, 2017
@msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
@dotnet locked this issue as resolved and limited conversation to collaborators on Dec 21, 2020