Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

JIT: Fix value type box optimization #13016

Merged
merged 1 commit into from
Jul 31, 2017
Merged

Conversation

AndyAyersMS
Copy link
Member

Boxing a value type produces a non-null result. If the result of the
box is only used to feed a compare against null, the jit tries to
optimize the box away entirely since the result of the comparison is
known. Such idiomatic expressions arise fairly often in generics
instantiated over value types.

In the current implementation the box expands into two parts: an
upstream statement to allocate heap space, and then an expression
tree containing an encapsulated copy from the value being boxed to
the payload of the new heap object. Wrapping around that is a
reference to the new object, which is the result of the box.

When the optimization fires, the upstream allocation is removed,
and the current implementation also removes the entire box expression
tree. Howver this tree can include important side effects from
the evaluation of the value being boxed that must be preserved.

For instance the value might come from an array, in which case
and the box expression tree would contain the array null check
and bounds check. So removing the entire tree can alter behavior.

This fix attempts to carefull preserve the important side
effects by moving the copy into a second statement upstream from the
box. The box itself is then just a trivial side-effect-free reference
to the box temp local.

When the optimization fires the jit removes the upstream heap allocation
as before, as well as the now-trivial box tree. It analyzes the source
side of the upstream copy. If it is side effect free the copy is removed
entirely. If not, the jit modifies the copy into a minimal load of the
boxed value, and this load should reproduce the necessary side effects.

Fixes #12949.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Jul 25, 2017

@erozenfeld @JosephTremoulet PTAL
cc @dotnet/jit-contrib

This is still somewhat preliminary, curious if anyone knows a better way to extract just the source side effects from an assignment. If so I am happy to modify this.

Also I have a number of test cases to add.

Code size impact on jit-diffs FX is a small net win, but there are some sizable regressions. I haven't looked at them all but the biggest ones come from loop cloning kicking in because we now preserve ldelems which were being lost in methods like ObjectEqualityComparer'1:IndexOf(ref,short,int,int):int:this

Interestingly enough we got "lucky" before with these methods because the ldelem was followed by a ldelema with the same array and index, so any side effects lost by discarding the former were recreated by the latter. So the bug was masked. And I got "unlucky" with loop cloning on these methods because it's unable to spot opportunities from ldelema and so didn't clone these loops before.

Was hoping to use GT_NULLCHECK for the residual bit of copy (which would have forestalled loop cloning) but that can't handle side effecting operands. So am just loading a scalar or byte depending on the type of the value being boxed.

The pattern match tries to conservative and bail if it doesn't recognize the IR shapes. Currently it bails out on 8 cases in the jit-diff set, all ones where the copy statement assign is embedded in a comma.

case GT_EQ:
case GT_NE:
case GT_GT:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's the check for GTF_UNSIGNED?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: need to account for operand order...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't get CSC to emit the ldnull first, but it could happen in IL. And we shouldn't see signed compares but to be on the safe side I'll add a bail out for that too.

// For struct types read the first byte of the
// source struct; there's no need to read the
// entire thing, and no place to put it.
assert(copySrc->gtOper == GT_OBJ || copySrc->gtOper == GT_IND || copySrc->gtOper == GT_FIELD);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you take the logic you're using for scalars and apply it to structs as well? Does it produce something incorrect, or just inefficient? And if inefficient, is it common?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that doing this on x86 will have a negative impact on register allocation, we need to generate something like mov reg, [mem] and reg has to be a "byteable" register.

Copy link

@briansull briansull Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we are just reading and throwing away the result here, and
usually EAX or ECX will be available, so it shouldn't be much of an issue for x86.
If AsmDiffs do show some issue, then you coudl read 4 bytes when the struct is at least 4 bytes in size.
That allows ESI, EDI or EBP to be used, however typically those registers are much more likely to be used to
enregister LclVars than EAX or ECX.
Thus I doubt that there will be a case where we we lose because of this byte sized read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as treating structs like scalars goes -- because we're removing the destination of the copy entirely (since it was the box payload in the newobj which we've zapped) we either need to suppress/alter the copy to something the jit can handle as a standalone source, or need to add a suitable destination to ensure the copy ends up somewhere.

GT_OBJ can't be at the root of a tree, there is code in morph (and possibly elsewhere) that assumes it has a parent. So for that case at least we'd need to allocate a new temp to be the LHS. Not sure about GT_IND and GT_FIELD cases -- perhaps if they are at the root of a tree the jit will be able to create the right kind of temp to hold the result.

Copying the entire struct just to cause the side effects from address formation and null checking is indeed inefficient, though as @mikedn notes this evidently is what JIT64 does.

The struct case is fairly common.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am going to change the struct case to read a 32 bit unsigned value, so we end up with slightly smaller encodings on x86/x64; should also mitigate any worries about byteable dests.

      458B11               mov      r10d, dword ptr [r9]

instead of

       4D0FBE11             movsx    r10, byte  ptr [r9]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still will have to handle the smaller struct case (such as a three byte struct) differently. Since you are not allowed to read beyond the end of the struct (as that may be an unmapped page)

@RussKeldorph RussKeldorph added this to the 2.0.0 milestone Jul 25, 2017
@AndyAyersMS
Copy link
Member Author

Arm64 failure Assertion failed 'tosIndex > idx' means there is some node that can't handle being moved to root. Need to look at the jit dump to see more. @jashook what is the process for reproing these?

@AndyAyersMS
Copy link
Member Author

Fixed the arm64 issue, GT_RET_VAL can't stand on its own either, as there is some magic parent-aware stuff that happens for successful inlines.

Going back to just reading one byte as doing the extra analysis to figure out if we could safely read more didn't seem worth the trouble.

@AndyAyersMS
Copy link
Member Author

I think this Tizen failure is unrelated, as the test case doesn't appear to box anything. Will retry...

@dotnet-bot retest Tizen armel Cross Debug Build

@AndyAyersMS
Copy link
Member Author

Got a different Tizen failure this time. Leg seems flaky in some other recent runs too.

Will give it one more shot...

@dotnet-bot retest Tizen armel Cross Debug Buil

@AndyAyersMS
Copy link
Member Author

MIssed a letter.

@dotnet-bot retest Tizen armel Cross Debug Build

@AndyAyersMS
Copy link
Member Author

@dotnet/jit-contrib this is about as pretty as it's going to get, though it still reads too much from some struct types. For instance, given

    public static bool Test(X<K> a)
    {
        return (a.k != null);
    }

with K == Vector3:

       488D4108             lea      rax, bword ptr [rcx+8]       ;; &a.k
       C4E17A104008         vmovss   xmm0, dword ptr [rax+8]
       C4E17B1008           vmovsd   xmm1, qword ptr [rax]
       C4E170C6C844         vshufps  xmm1, xmm0, 68               ;; assemble the vector
       B801000000           mov      eax, 1
       C3                   ret

Without this fix we lose the potential null deref entirely:

       ;; code from current jit
       50                   push     rax
       B801000000           mov      eax, 1
       4883C408             add      rsp, 8
       C3                   ret

I could look again at trying to make this part work more like a null check...

Current jit-diff results:

Total bytes of diff: -1880 (-0.01 % of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
        1405 : System.Private.CoreLib.dasm (0.04 % of base)
           7 : System.Private.Uri.dasm (0.01 % of base)
           5 : System.Text.RegularExpressions.dasm (0.01 % of base)
           1 : System.Linq.dasm (0.00 % of base)
Top file improvements by size (bytes):
       -1849 : Microsoft.CodeAnalysis.dasm (-0.24 % of base)
        -259 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.01 % of base)
        -244 : Microsoft.CodeAnalysis.CSharp.dasm (-0.01 % of base)
        -175 : Microsoft.VisualBasic.dasm (-0.11 % of base)
        -119 : System.Linq.Expressions.dasm (-0.02 % of base)
40 total files with size differences (36 improved, 4 regressed), 39 unchanged.
Top method regessions by size (bytes):
         257 : System.Private.CoreLib.dasm - ObjectEqualityComparer`1:IndexOf(ref,int,int,int):int:this (3 methods)
         240 : System.Private.CoreLib.dasm - GenericEqualityComparer`1:IndexOf(ref,long,int,int):int:this (3 methods)
         237 : System.Private.CoreLib.dasm - ObjectEqualityComparer`1:IndexOf(ref,short,int,int):int:this
         234 : System.Private.CoreLib.dasm - ObjectEqualityComparer`1:IndexOf(ref,long,int,int):int:this
         173 : System.Private.CoreLib.dasm - GenericEqualityComparer`1:IndexOf(ref,double,int,int):int:this
Top method improvements by size (bytes):
       -1554 : Microsoft.CodeAnalysis.dasm - DesktopAssemblyIdentityComparer:.cctor()
        -186 : System.Private.CoreLib.dasm - ValueTuple`8:System.IComparable.CompareTo(ref):int:this (15 methods)
        -104 : Microsoft.CodeAnalysis.dasm - MetadataDecoder`5:DecodeCustomAttributePrimitiveElementOrThrow(byref,ubyte,ref):struct:this
        -102 : System.Linq.Expressions.dasm - DebugView:.ctor(struct):this
         -85 : System.Private.CoreLib.dasm - CustomAttributeData:Init(ref):this (5 methods)
647 total methods with size differences (400 improved, 247 regressed), 65437 unchanged.

As noted above, the size regressions in the various IndexOf methods are caused by loop cloning.

@mikedn
Copy link

mikedn commented Jul 27, 2017

-1849 : Microsoft.CodeAnalysis.dasm (-0.24 % of base)

What causes these improvements?!

@cmckinsey
Copy link

Thanks for adding the regression tests.

@AndyAyersMS
Copy link
Member Author

Moving the copy to an earlier statement apparently has some advantages when the box is part of some bigger computation, in particular if the box feeds a call with a lot of other arguments. For example (distilled from the Roslyn code):

using System;
using System.Collections.Generic;

public struct Av
{
    public readonly int Q;
    public Av(int q)
    {
        Q = q;
    }
}

public struct V
{
    public readonly int Token;
    public readonly Av A;

    public V(int t, Av a)
    {
        Token = t;
        A = a;
    }
}

public class F : Dictionary<string, List<V>>
{
    public void Add(string s, int t, Av a, object b)
    {
        List<V> values;
        if (!TryGetValue(s, out values))
        {
            values = new List<V>();
            Add(s, values);
        }
        values.Add(new V(t, a));
    }
}

class M
{
    private static readonly F s_F = new F()
    {
        {"abc", 33, new Av(11), new Av(12) }    // calls Add and boxes the last arg
    };

    public static int Main()
    {
        List<V> v = s_F["abc"];
        return v[0].Token + 67;
    }
}

The code for M::.cctor is much better with my changes:

;;; before
       48B9905AF72FFB7F0000 mov      rcx, 0x7FFB2FF75A90
       E80921835F           call     CORINFO_HELP_NEWSFAST
       488945F0             mov      gword ptr [rbp-10H], rax
       488B4DF0             mov      rcx, gword ptr [rbp-10H]
       E86CFBFFFF           call     F:.ctor():this
       33C9                 xor      ecx, ecx
       894DE8               mov      dword ptr [rbp-18H], ecx
       488D4DE8             lea      rcx, bword ptr [rbp-18H]
       BA0B000000           mov      edx, 11
       E869FBFFFF           call     Av:.ctor(int):this
       33C9                 xor      ecx, ecx
       894DE0               mov      dword ptr [rbp-20H], ecx
       488D4DE0             lea      rcx, bword ptr [rbp-20H]
       BA0C000000           mov      edx, 12
       E856FBFFFF           call     Av:.ctor(int):this
       48B9D85DF72FFB7F0000 mov      rcx, 0x7FFB2FF75DD8
       E8C720835F           call     CORINFO_HELP_NEWSFAST
       488945D8             mov      gword ptr [rbp-28H], rax
       488B4DF0             mov      rcx, gword ptr [rbp-10H]
       48894DD0             mov      gword ptr [rbp-30H], rcx
       48B968302720F7010000 mov      rcx, 0x1F720273068
       488B09               mov      rcx, gword ptr [rcx]
       48894DC8             mov      gword ptr [rbp-38H], rcx
       8B4DE8               mov      ecx, dword ptr [rbp-18H]
       894DBC               mov      dword ptr [rbp-44H], ecx
       8B4DE0               mov      ecx, dword ptr [rbp-20H]
       488B55D8             mov      rdx, gword ptr [rbp-28H]
       894A08               mov      dword ptr [rdx+8], ecx
       488B4DD8             mov      rcx, gword ptr [rbp-28H]
       48894DC0             mov      gword ptr [rbp-40H], rcx
       488B4DD0             mov      rcx, gword ptr [rbp-30H]
       488B55C8             mov      rdx, gword ptr [rbp-38H]
       4C8B4DC0             mov      r9, gword ptr [rbp-40H]
       4C894C2420           mov      gword ptr [rsp+20H], r9
       448B4DBC             mov      r9d, dword ptr [rbp-44H]
       41B821000000         mov      r8d, 33
       3909                 cmp      dword ptr [rcx], ecx
       E8D8FAFFFF           call     F:Add(ref,int,struct,ref):this

;;; after
       48B9905AF62FFB7F0000 mov      rcx, 0x7FFB2FF65A90
       E80D21845F           call     CORINFO_HELP_NEWSFAST
       488945F8             mov      gword ptr [rbp-08H], rax
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       E870FBFFFF           call     F:.ctor():this
       33C9                 xor      ecx, ecx
       894DF0               mov      dword ptr [rbp-10H], ecx
       488D4DF0             lea      rcx, bword ptr [rbp-10H]
       BA0B000000           mov      edx, 11
       E86DFBFFFF           call     Av:.ctor(int):this
       33C9                 xor      ecx, ecx
       894DE8               mov      dword ptr [rbp-18H], ecx
       488D4DE8             lea      rcx, bword ptr [rbp-18H]
       BA0C000000           mov      edx, 12
       E85AFBFFFF           call     Av:.ctor(int):this
       48B9D85DF62FFB7F0000 mov      rcx, 0x7FFB2FF65DD8
       E8CB20845F           call     CORINFO_HELP_NEWSFAST
       488945E0             mov      gword ptr [rbp-20H], rax
       8B55E8               mov      edx, dword ptr [rbp-18H]
       488B4DE0             mov      rcx, gword ptr [rbp-20H]
       895108               mov      dword ptr [rcx+8], edx
       488B55E0             mov      rdx, gword ptr [rbp-20H]
       4889542420           mov      gword ptr [rsp+20H], rdx
       48BA68300010EF010000 mov      rdx, 0x1EF10003068
       488B12               mov      rdx, gword ptr [rdx]
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       448B4DF0             mov      r9d, dword ptr [rbp-10H]
       41B821000000         mov      r8d, 33
       3909                 cmp      dword ptr [rcx], ecx
       E8FAFAFFFF           call     F:Add(ref,int,struct,ref):this

Haven't yet drilled into exactly what causes the current version to generate so much code.

That being said I probably need to ensure that the value being boxed can't be modified by some part of the tree that contains the box. I can probably rework the above to illustrate what can go wrong and put the appropriate spill check into the importer.

@AndyAyersMS
Copy link
Member Author

Yeah, we need to spill. About to push an update to the code and a new test case.

@AndyAyersMS
Copy link
Member Author

Jit-diffs with spilling, a few more methods impacted, now a tiny size increase overall.

Total bytes of diff: 327 (0.00 % of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
        2852 : System.Private.CoreLib.dasm (0.09 % of base)
          99 : System.Net.Http.dasm (0.04 % of base)
          38 : System.Numerics.Vectors.dasm (0.07 % of base)
          30 : System.Net.WebHeaderCollection.dasm (0.20 % of base)
          22 : System.Private.Uri.dasm (0.03 % of base)
Top file improvements by size (bytes):
       -1721 : Microsoft.CodeAnalysis.dasm (-0.22 % of base)
        -251 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.01 % of base)
        -240 : Microsoft.CodeAnalysis.CSharp.dasm (-0.01 % of base)
        -119 : System.Linq.Expressions.dasm (-0.02 % of base)
        -118 : System.Linq.Parallel.dasm (-0.02 % of base)
40 total files with size differences (24 improved, 16 regressed), 39 unchanged.
Top method regessions by size (bytes):
         263 : System.Private.CoreLib.dasm - ObjectEqualityComparer`1:IndexOf(ref,int,int,int):int:this (3 methods)
         240 : System.Private.CoreLib.dasm - GenericEqualityComparer`1:IndexOf(ref,long,int,int):int:this (3 methods)
         237 : System.Private.CoreLib.dasm - ObjectEqualityComparer`1:IndexOf(ref,short,int,int):int:this
         234 : System.Private.CoreLib.dasm - ObjectEqualityComparer`1:IndexOf(ref,long,int,int):int:this
         173 : System.Private.CoreLib.dasm - GenericEqualityComparer`1:IndexOf(ref,double,int,int):int:this
Top method improvements by size (bytes):
       -1554 : Microsoft.CodeAnalysis.dasm - DesktopAssemblyIdentityComparer:.cctor()
        -104 : Microsoft.CodeAnalysis.dasm - MetadataDecoder`5:DecodeCustomAttributePrimitiveElementOrThrow(byref,ubyte,ref):struct:this
        -102 : System.Linq.Expressions.dasm - DebugView:.ctor(struct):this
         -96 : System.Private.CoreLib.dasm - ValueTuple`8:System.IComparable.CompareTo(ref):int:this (15 methods)
         -78 : System.Private.CoreLib.dasm - Dictionary`2:OnDeserialization(ref):this (29 methods)
797 total methods with size differences (321 improved, 476 regressed), 65287 unchanged.

Copy link

@JosephTremoulet JosephTremoulet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AndyAyersMS
Copy link
Member Author

Am going to squash locally, rebase, and force push, so one more "update".

Boxing a value type produces a non-null result. If the result of the box is
only used to feed a compare against null, the jit tries to optimize the box
away entirely since the result of the comparison is known. Such idiomatic
expressions arise fairly often in generics instantiated over value types.

In the current implementation the box expands into two parts. The first is
an upstream statement to allocate a boxed object and assign a reference to
the boxed object to a local var known as the "box temp". The second is an
expression tree whose value is the box temp that also contains an an
encapsulated copy from the value being boxed to the payload section of the
boxed object. The box node also contains a pointer back to the first
statement (more on this later).

In the examples being discussed here this second tree is a child of a compare
node whose other child is a null pointer. When the optimization fires, the
upstream allocation statement is located via the pointer in the box node and
removed, and the entire compare is replaced with a constant 0 or 1 as
appropriate. Unfortunately the encapsulated copy in the box subtree may
include side effects that should be preserved, and so this transformation is
unsafe.

Note that the copy subtree as a whole will always contain side effects, since
the copy is storing values into the heap, and that copy now will not happen.
But the side effects that happen when producing the value to box must remain.

In the initial example from #12949 the side effects in question were
introduced by the jit's optimizer to capure a CSE definition. dotnet#13016 gives
several other examples where the side effects are present in the initial user
code. For instance the value being boxed might come from an array, in which
case the encapsulated copy in the box expression tree would contain the array
null check and bounds check. So removing the entire tree can alter behavior.

This fix attempts to carefully preserve the important side effects by
reworking how a box is imported. The copy is now moved out from under the box
into a second upstream statement. The box itself is then just a trivial
side-effect-free reference to the box temp. To ensure proper ordering of side
effects the jit spills the evaluation stack before appending the copy
statement.

When the optimization fires the jit removes the upstream heap allocation
as before, as well as the now-trivial compare tree. It analyzes the source
side of the upstream copy. If it is side effect free, the copy is removed
entirely. If not, the jit modifies the copy into a minimal load of the
boxed value, and this load should reproduce the necessary side effects.

The optimization is only performed when the tree shape of the copy matches
expected patterns.

There are some expected cases where the tree won't match, for instance if the
optimization is invoked while the jit is inlining. Because this optimization
runs at several points the jit can catch these cases once inlining completes.
There is one case that is not handled that could be -- if the assignment part
of the copy is itself a subtree of a comma. This doesn't happen often.

The optimization is now also extended to handle the case where the comparision
operation is `cgt.un`. This doesn't catch any new cases but causes the
optimization to happen earlier, typically during importation, which should
reduce jit time slightly.

Generally the split of the box into two upstream statements reduces code size,
especially when the box expression is incorporated into a larger tree -- for
example a call. However in some cases where the value being boxed comes from
an array, preserving the array bounds check now causes loop cloning to kick
in and increase code size. Hence the overall size impact on the jit-diff set is
essentially zero.

Added a number of new test cases showing the variety of situations that must
be handled and the need to spill before appending the copy statement.

Fixes #12949.
@AndyAyersMS
Copy link
Member Author

Looked at desktop SPMI diffs (and ran full DDRs, which passed). Diffs are net favorable, but with some regressions. Sampled some of these and it looks like some missed CSEs of array bounds checks and index computations.

Tizen failure almost certainly unrelated, but will give it one more shot.

@dotnet-bot test Tizen armel Cross Debug Build

@AndyAyersMS AndyAyersMS merged commit 9c844a9 into dotnet:master Jul 31, 2017
@AndyAyersMS AndyAyersMS deleted the FixDeadBox branch July 31, 2017 20:01
@RussKeldorph RussKeldorph modified the milestones: 2.0.x, 2.0.0 Aug 10, 2017
MichalStrehovsky added a commit to MichalStrehovsky/corert that referenced this pull request Aug 13, 2017
jkotas pushed a commit to dotnet/corert that referenced this pull request Aug 13, 2017
RussKeldorph pushed a commit that referenced this pull request Aug 18, 2017
Boxing a value type produces a non-null result. If the result of the box is
only used to feed a compare against null, the jit tries to optimize the box
away entirely since the result of the comparison is known. Such idiomatic
expressions arise fairly often in generics instantiated over value types.

In the current implementation the box expands into two parts. The first is
an upstream statement to allocate a boxed object and assign a reference to
the boxed object to a local var known as the "box temp". The second is an
expression tree whose value is the box temp that also contains an an
encapsulated copy from the value being boxed to the payload section of the
boxed object. The box node also contains a pointer back to the first
statement (more on this later).

In the examples being discussed here this second tree is a child of a compare
node whose other child is a null pointer. When the optimization fires, the
upstream allocation statement is located via the pointer in the box node and
removed, and the entire compare is replaced with a constant 0 or 1 as
appropriate. Unfortunately the encapsulated copy in the box subtree may
include side effects that should be preserved, and so this transformation is
unsafe.

Note that the copy subtree as a whole will always contain side effects, since
the copy is storing values into the heap, and that copy now will not happen.
But the side effects that happen when producing the value to box must remain.

In the initial example from #12949 the side effects in question were
introduced by the jit's optimizer to capure a CSE definition. #13016 gives
several other examples where the side effects are present in the initial user
code. For instance the value being boxed might come from an array, in which
case the encapsulated copy in the box expression tree would contain the array
null check and bounds check. So removing the entire tree can alter behavior.

This fix attempts to carefully preserve the important side effects by
reworking how a box is imported. The copy is now moved out from under the box
into a second upstream statement. The box itself is then just a trivial
side-effect-free reference to the box temp. To ensure proper ordering of side
effects the jit spills the evaluation stack before appending the copy
statement.

When the optimization fires the jit removes the upstream heap allocation
as before, as well as the now-trivial compare tree. It analyzes the source
side of the upstream copy. If it is side effect free, the copy is removed
entirely. If not, the jit modifies the copy into a minimal load of the
boxed value, and this load should reproduce the necessary side effects.

The optimization is only performed when the tree shape of the copy matches
expected patterns.

There are some expected cases where the tree won't match, for instance if the
optimization is invoked while the jit is inlining. Because this optimization
runs at several points the jit can catch these cases once inlining completes.
There is one case that is not handled that could be -- if the assignment part
of the copy is itself a subtree of a comma. This doesn't happen often.

The optimization is now also extended to handle the case where the comparision
operation is `cgt.un`. This doesn't catch any new cases but causes the
optimization to happen earlier, typically during importation, which should
reduce jit time slightly.

Generally the split of the box into two upstream statements reduces code size,
especially when the box expression is incorporated into a larger tree -- for
example a call. However in some cases where the value being boxed comes from
an array, preserving the array bounds check now causes loop cloning to kick
in and increase code size. Hence the overall size impact on the jit-diff set is
essentially zero.

Added a number of new test cases showing the variety of situations that must
be handled and the need to spill before appending the copy statement.

Fixes #12949.
@karelz karelz modified the milestones: 2.1.0, 2.0.x Aug 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bad x64 codegen in VerifyUnifierConsistency
8 participants