This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Initial implementation of object stack allocation #20814

Merged
merged 4 commits into dotnet:master on Nov 9, 2018

Conversation

@erozenfeld (Member) commented Nov 5, 2018

f4ed491: Add JitObjectStackAllocation config option.

b19ec61: Allow creation of variables of TYP_STRUCT with non-value class handles.

Variables of TYP_STRUCT with non-value class handles represent stack-allocated objects.
Temporarily disable promotion of fields of stack-allocated objects.

323d3b3: Make getClassGClayout work with class types.

Also add an assert to getHeapClassSize to ensure it's not
called in R2R cross-version-bubble.

d7e47b1: Implement escape analysis and stack allocation of non-box objects without gc fields.

This change implements a conservative flow-insensitive escape analysis and stack allocation
of non-box objects without gc fields.

Handling of objects with gc fields, box objects, and fixed-size arrays is future work.

Escape analysis is based on the one described here: https://www.cc.gatech.edu/~harrold/6340/cs6340_fall2009/Readings/choi99escape.pdf

Main limitations of this version of the escape analysis:

  1. The analysis is flow-insensitive.
  2. The analysis is intra-procedural and only sees the current method and the inlined methods.
  3. The analysis assumes that references passed to non-pure-helper calls escape.
  4. The analysis assumes that any references assigned to fields of objects escape.

Some of these limitations will be removed in future work.

I started with prior prototypes from @echesakovMSFT and @AndyAyersMS and extended and refactored
parts of them.
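The connection-graph approach from the paper above can be sketched in a few lines. This is a hypothetical, heavily simplified illustration, not the JIT's actual ObjectAllocator code: the names (EscapeAnalysis, recordAssignment, markEscaping) are invented here, and it is limited to 64 locals for brevity. The key ideas it shows are the flow-insensitive edge recording (every assignment adds an edge, regardless of where it occurs) and the transitive closure over escaping locals.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a conservative, flow-insensitive escape analysis.
// Supports up to 64 locals so a uint64_t can serve as a bit vector.
struct EscapeAnalysis {
    int numLocals;
    std::vector<uint64_t> connGraph; // bit i of connGraph[v]: v may point where local i points
    uint64_t escaping = 0;           // bit i set: local i escapes

    explicit EscapeAnalysis(int n) : numLocals(n), connGraph(n, 0) {}

    // Flow-insensitive: "dst = src" just adds an edge, no matter where
    // in the method the assignment occurs.
    void recordAssignment(int dst, int src) { connGraph[dst] |= uint64_t(1) << src; }

    // Conservative escape sources: returned values, arguments to
    // non-pure-helper calls, references stored to object fields, etc.
    void markEscaping(int v) { escaping |= uint64_t(1) << v; }

    // Transitive closure: anything an escaping local may point to escapes too.
    void computeClosure() {
        bool changed = true;
        while (changed) {
            changed = false;
            for (int v = 0; v < numLocals; v++) {
                if ((escaping >> v) & 1) {
                    uint64_t widened = escaping | connGraph[v];
                    if (widened != escaping) { escaping = widened; changed = true; }
                }
            }
        }
    }

    bool canStackAllocate(int v) const { return ((escaping >> v) & 1) == 0; }
};
```

For example, if local 1 is assigned from local 0 and local 1 is returned, both escape, while an unrelated local 2 remains stack-allocatable.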

@erozenfeld (Member, Author):
The feature is off by default; I will verify that there are no diffs with it off.
I will share the diff results with COMPlus_JitObjectStackAllocation=1 later.

@erozenfeld (Member, Author):

@AndyAyersMS @echesakovMSFT @dotnet/jit-contrib PTAL

@erozenfeld (Member, Author):

@jkotas PTAL at the change in jitinterface.cpp

return false;
}

if (comp->info.compCompHnd->isValueClass(clsHnd))
Member:

Since you've already paid the cost to fetch the class attributes above, you can just check for CORINFO_FLG_VALUECLASS here.

Member Author:

Done.

if (comp->lvaCount > 0)
{
m_EscapingPointers = BitVecOps::MakeEmpty(&m_bitVecTraits);
m_ConnGraphAdjacencyMatrix = new (comp->getAllocator()) BitSetShortLongRep[comp->lvaCount];

It would be good to add a new CMK_ value for ObjectAllocator.

Member Author:

Done.

const bool lclVarsOnly = true;
const bool computeStack = true;

comp->fgWalkTreePre(&stmt->gtStmtExpr, BuildConnGraphVisitor, &callbackData, lclVarsOnly, computeStack);

I'd recommend using GenTreeVisitor instead of fgWalk functions (which are wrappers around GenTreeVisitor). It tends to be slightly faster and avoids the hassle with extra static functions, casts to walk data and whatnot.

Member Author:

Thanks for the suggestion. Done.
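As a rough illustration of why a CRTP visitor reads better than a static-callback walker, here is a minimal sketch outside the JIT. Node, TreeVisitor, and SumVisitor are invented names for this example, not the actual GenTreeVisitor machinery: the point is that the derived class's visit method is resolved statically, so there are no function-pointer indirections and no casts from a void* walk-data pointer.

```cpp
#include <cassert>
#include <vector>

// A stand-in for a tree node (the real code walks GenTree nodes).
struct Node { int value; std::vector<Node*> kids; };

// CRTP visitor: the derived class supplies PreOrderVisit; calls are
// bound at compile time, so the compiler can inline them.
template <typename TDerived>
struct TreeVisitor {
    void WalkTree(Node* node) {
        static_cast<TDerived*>(this)->PreOrderVisit(node);
        for (Node* kid : node->kids) {
            WalkTree(kid);
        }
    }
};

// State lives in the visitor itself, no separate callback-data struct.
struct SumVisitor : TreeVisitor<SumVisitor> {
    int sum = 0;
    void PreOrderVisit(Node* node) { sum += node->value; }
};
```

Usage is just constructing the visitor and walking: the accumulated state is read directly from the visitor object afterwards.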

{
bool spcOptimizationEnabled = SPCOptimizationsEnabled();

CallTestAndVerifyAllocation(AllocateSimpleClassAndAddFields, 12, !spcOptimizationEnabled);
Member:

These allocation amounts are in bytes? If so they may vary by architecture...

Member Author:

No, these are return values from the methods. I'm not checking the exact allocation sizes in bytes, just whether the amount of heap allocation changed.

@AndyAyersMS (Member) left a comment:

Glad to see this moving along. Left some notes but nothing that would block merging this.

assert(parentStack != nullptr);
int parentIndex = 1;

bool done = false;
Member:

Just a thought: I think this code would read a bit better if we inverted the sense of this and had a keepChecking flag instead of a done flag.

Member Author:

Done.

{
if (!objectAllocator->IsLclVarEscaping(lclNum))
{
JITDUMP("V%02u first escapes (2) via [%06u]\n", lclNum, tree->gtTreeID);
Member:

Probably don't need the (2) here. Also we seem to use dspTreeID to dump the IDs.

Member Author:

Done.

@@ -2298,7 +2300,8 @@ unsigned CEEInfo::getClassGClayout (CORINFO_CLASS_HANDLE clsHnd, BYTE* gcPtrs)
{
// Get offset into the value class of the first pointer field (includes a +Object)
size_t cbSeriesSize = pByValueSeries->GetSeriesSize() + pMT->GetBaseSize();
size_t cbOffset = pByValueSeries->GetSeriesOffset() - OBJECT_SIZE;
size_t cbSeriesOffset = pByValueSeries->GetSeriesOffset();
Member:

This layout is not version resilient. This should have an assert and/or throw PlatformNotSupported to make sure that the JIT never gets the non-resilient layout when compiling for R2R.

Member Author:

I added an assert here and also in getHeapClassSize.

@jkotas (Member) left a comment:

JIT/EE changes look good to me.

@erozenfeld (Member, Author):

I disabled the feature for x86, which uses the different JIT32_GCENCODER, and disabled the new tests on x86. I don't want to make changes to JIT32_GCENCODER. My plan is to implement retyping of the trees involving pointers to stack-allocated objects so that no special handling in GC encoders is needed.

@erozenfeld (Member, Author):

@AndyAyersMS @echesakov @mikedn @jkotas I addressed all review feedback, PTAL.

@AndyAyersMS (Member) left a comment:

LGTM.

Are you going to post the enable diff summary here?

@erozenfeld (Member, Author):

Yes, I'm running the diffs on the latest version and will post when I have the results.

@erozenfeld (Member, Author):

Diffs with optimization enabled:

Found 238 files with textual diffs.
PMI Diffs for System.Private.CoreLib.dll, framework assemblies, assemblies in f:\coreclr6\bin\tests\Windows_NT.x64.Release for  default jit
Summary:
(Lower is better)
Total bytes of diff: 168097 (0.07% of base)
    diff is a regression.
Top file regressions by size (bytes):
       10402 : JIT\Regression\VS-ia64-JIT\V1.2-M01\b10827\b10827\b10827.dasm (5.11% of base)
        8070 : JIT\Directed\cmov\Double_And_Op_cs_do\Double_And_Op_cs_do.dasm (2.74% of base)
        8064 : JIT\Directed\cmov\Double_Or_Op_cs_do\Double_Or_Op_cs_do.dasm (2.74% of base)
        8000 : JIT\Directed\cmov\Double_Xor_Op_cs_do\Double_Xor_Op_cs_do.dasm (2.41% of base)
        7948 : JIT\Directed\cmov\Double_And_Op_cs_ro\Double_And_Op_cs_ro.dasm (2.71% of base)
Top file improvements by size (bytes):
       -1719 : JIT\Methodical\explicit\coverage\seq_gc_int_1_r\seq_gc_int_1_r.dasm (-10.51% of base)
       -1615 : JIT\Methodical\explicit\coverage\seq_gc_int_1_d\seq_gc_int_1_d.dasm (-9.65% of base)
       -1425 : JIT\Methodical\explicit\coverage\expl_gc_short_1_r\expl_gc_short_1_r.dasm (-8.37% of base)
       -1393 : JIT\Methodical\explicit\coverage\expl_gc_long_1_r\expl_gc_long_1_r.dasm (-8.35% of base)
       -1371 : JIT\Methodical\explicit\coverage\seq_gc_byte_1_r\seq_gc_byte_1_r.dasm (-8.15% of base)
207 total files with size differences (25 improved, 182 regressed), 8381 unchanged.
Top method regressions by size (bytes):
         484 (13.47% of base) : JIT\jit64\opt\cse\fieldExprUnchecked1\fieldExprUnchecked1.dasm - Test_Main:Main():int
         216 ( 3.12% of base) : JIT\Methodical\explicit\coverage\expl_gc_int_1_d\expl_gc_int_1_d.dasm - TestApp:Main():int
         216 ( 3.16% of base) : JIT\Methodical\explicit\coverage\expl_gc_int_1_r\expl_gc_int_1_r.dasm - TestApp:Main():int
         111 ( 1.61% of base) : JIT\Methodical\explicit\coverage\expl_gc_byte_1_d\expl_gc_byte_1_d.dasm - TestApp:Main():int
         111 ( 1.63% of base) : JIT\Methodical\explicit\coverage\expl_gc_byte_1_r\expl_gc_byte_1_r.dasm - TestApp:Main():int
Top method improvements by size (bytes):
       -1719 (-24.37% of base) : JIT\Methodical\explicit\coverage\seq_gc_int_1_r\seq_gc_int_1_r.dasm - TestApp:Main():int
       -1615 (-22.63% of base) : JIT\Methodical\explicit\coverage\seq_gc_int_1_d\seq_gc_int_1_d.dasm - TestApp:Main():int
       -1425 (-19.77% of base) : JIT\Methodical\explicit\coverage\expl_gc_short_1_r\expl_gc_short_1_r.dasm - TestApp:Main():int
       -1393 (-19.79% of base) : JIT\Methodical\explicit\coverage\expl_gc_long_1_r\expl_gc_long_1_r.dasm - TestApp:Main():int
       -1371 (-19.59% of base) : JIT\Methodical\explicit\coverage\seq_gc_byte_1_r\seq_gc_byte_1_r.dasm - TestApp:Main():int
Top method regressions by size (percentage):
          30 (500.00% of base) : JIT\Regression\CLR-x86-JIT\V1-M09.5-PDC\b28790\b28790\b28790.dasm - AA:Static4(byref)
          34 (309.09% of base) : JIT\Regression\CLR-x86-JIT\V1-M13-RTM\b91859\b91859\b91859.dasm - AA:Main():int
          20 (181.82% of base) : JIT\Methodical\eh\interactions\ehso\ehso.dasm - Form1:Main():int
          20 (181.82% of base) : JIT\Regression\CLR-x86-JIT\V1-M13-RTM\b91867\b91867\b91867.dasm - CC:Main():int
          22 (146.67% of base) : JIT\Regression\JitBlue\GitHub_9891\GitHub_9891\GitHub_9891.dasm - B:Main(ref):int
Top method improvements by size (percentage):
        -177 (-83.49% of base) : JIT\Regression\CLR-x86-JIT\V1-M09.5-PDC\b16238\b16238\b16238.dasm - BB:Method1(bool):int:this
         -19 (-40.43% of base) : JIT\opt\Inline\tests\ifelse\ifelse.dasm - IfElse:Main():int
        -168 (-29.12% of base) : JIT\Generics\Fields\instance_assignment_class01\instance_assignment_class01.dasm - Test:Main():int
         -31 (-26.72% of base) : JIT\Directed\cmov\Bool_No_Op_cs_ro\Bool_No_Op_cs_ro.dasm - testout1:Sub_Funclet_3():int
          -4 (-26.67% of base) : JIT\opt\Devirtualization\simple\simple.dasm - Z:Main():int
12318 total methods with size differences (148 improved, 12170 regressed), 498668 unchanged.
31 files had text diffs but not size diffs.
JIT\Generics\Typeof\dynamicTypes\dynamicTypes.dasm had 450 diffs
JIT\Generics\Typeof\refTypesdynamic\refTypesdynamic.dasm had 450 diffs
JIT\Directed\UnrollLoop\loop2_cs_ro\loop2_cs_ro.dasm had 154 diffs
JIT\opt\Inline\tests\Inline_Vars\Inline_Vars.dasm had 128 diffs
JIT\Generics\Typeof\objectBoxing\objectBoxing.dasm had 108 diffs

@erozenfeld (Member, Author):

A typical diff at allocation site:

       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rdx, rax

becomes

       mov      rdx, 0xD1FFAB1E
       mov      qword ptr [rsp+28H], rdx
       lea      rdx, bword ptr [rsp+28H]
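A hypothetical C++ analogue of this rewrite: the heap allocation and its returned pointer are replaced by a zero-initialized stack slot whose address is used wherever the reference was (the lea in the diff). The type and function names below are illustrative only, not anything from the runtime.

```cpp
#include <cassert>

// Stand-in for a small reference type without GC fields.
struct SimpleObject { long header; int field; };

static int consume(SimpleObject* obj) { return obj->field; }

// Before: heap allocation, analogous to the CORINFO_HELP_NEWSFAST call.
static int heapVersion() {
    SimpleObject* obj = new SimpleObject{0, 42};
    int result = consume(obj);
    delete obj; // in managed code the GC would reclaim this
    return result;
}

// After: the object lives in a zero-initialized stack slot; the
// "reference" is just the slot's address.
static int stackVersion() {
    SimpleObject obj{}; // prologue-style zero init
    obj.field = 42;
    return consume(&obj);
}
```

Both versions observably compute the same result; only the allocation's lifetime and location change.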

@erozenfeld (Member, Author):

The size of the additional initialization in the prologue depends on whether there are other initializations and on the shape of the object.
A new initialization may look like this:

       xor      rax, rax
       mov      qword ptr [rsp+28H], rax

or

       lea      rdi, [rsp+20H]
       mov      ecx, 8
       xor      rax, rax
       rep stosd 

@erozenfeld (Member, Author):

@AndyAyersMS I pushed a commit where I changed the type of the tree used to rewrite uses from TYP_BYREF to TYP_I_IMPL. I was hitting this assert with TYP_BYREF when expanding QMARK:

assert(!"Incompatible types for gtNewTempAssign");

PTAL.

@AndyAyersMS (Member):

Seems reasonable.

I also added tests for cases that are currently handled or will be handled soon.

@erozenfeld (Member, Author):

I verified no diffs with the optimization off.

@erozenfeld (Member, Author):

@dotnet-bot test this

@erozenfeld (Member, Author):

@dotnet-bot test Windows_NT arm64 Cross Debug Innerloop Build

@erozenfeld erozenfeld merged commit 00f5934 into dotnet:master Nov 9, 2018
@omariom commented Nov 9, 2018:

@erozenfeld

What characteristics must a reference type and its instance have to be allocated on the stack?

@erozenfeld (Member, Author):

@omariom Currently any of the following will block stack allocation:

  1. The allocation is an array.
  2. The allocation is a string.
  3. The class has gc fields.
  4. The allocation is a boxed struct.
  5. Class size is larger than 8Kb.
  6. Under ReadyToRun the class or any of its base classes are in a different versioning bubble.
  7. The object escapes the allocating method according to the current (very conservative) escape analysis.
  8. The object is allocated in a loop.
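These conditions amount to a single gating predicate over the allocation site. The sketch below is illustrative (the struct, its field names, and the check order are invented for this example, not the JIT's actual code), using the 8KB placeholder limit from item 5:

```cpp
#include <cassert>

// Invented description of one allocation site; each field mirrors one
// of the blocking conditions listed above.
struct AllocationSite {
    bool isArray;            // 1. arrays not yet handled
    bool isString;           // 2. strings not yet handled
    bool hasGcFields;        // 3. GC fields not yet handled
    bool isBoxedStruct;      // 4. boxes not yet handled
    unsigned classSize;      // 5. size limit
    bool crossVersionBubble; // 6. ReadyToRun versioning restriction
    bool escapes;            // 7. per the conservative escape analysis
    bool insideLoop;         // 8. loop allocations not yet handled
};

constexpr unsigned kMaxStackAllocSize = 8192; // the 8KB placeholder limit

static bool canStackAllocate(const AllocationSite& site) {
    return !site.isArray && !site.isString && !site.hasGcFields &&
           !site.isBoxedStruct && site.classSize <= kMaxStackAllocSize &&
           !site.crossVersionBubble && !site.escapes && !site.insideLoop;
}
```

Any single condition failing blocks the optimization, which is why the initial diffs show relatively few sites converted.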

@fiigii commented Nov 13, 2018:

Class size is larger than 8Kb.

May I ask why it is 8KB?

@erozenfeld (Member, Author):

It's a placeholder number for now; we will tune this when we are able to handle more cases.

@fiigii commented Nov 13, 2018:

@erozenfeld Thanks for the quick reply. Just a reminder: struct promotion currently can unwrap struct objects <= 128 bytes on x64. While tuning the escape analysis, we may be able to improve the size limitation for struct promotion as well. But I am not sure if there are other concerns (e.g., register pressure).

@CarolEidt:

I would suggest that the tuning of struct promotion heuristics should be considered relatively independent of object stack allocation. And there are certainly other concerns, not limited to register pressure. Struct promotion currently creates a local for each field of a struct, even if only one of them is frequently accessed, and the JIT suffers - both throughput and code quality - when there are too many locals. Prior to bumping the limit we may want to look into the feasibility of independently promoting just the frequently referenced fields. I had thought there was an issue for that, but I can't find it at the moment.

@fiigii commented Nov 13, 2018:

we may want to look into the feasibility of independently promoting just the frequently referenced fields.

Oh, that sounds pretty cool, thanks!

@ayende commented Nov 13, 2018:

Why can't a string be stack allocated, given that strings are usually the most commonly allocated object type?

@erozenfeld (Member, Author):

@ayende Stack allocation of constant-sized strings and constant-sized arrays is in the plan in #20253 (although we have to be careful with strings since they may be interned). Array and string allocation representation in the jit is different from the representation of allocations of other objects so this initial implementation doesn't include them. For example, the size of the strings and arrays is determined from constructor arguments and not from the type itself.

@YairHalberstadt:

Given that an object allocated in a loop is where this would be of most benefit, is there any work planned to allow that?

@erozenfeld (Member, Author):

@YairHalberstadt This is on the list but it's not the highest priority at the moment. If you have examples from real-world code where allocations performed in a loop can and should be moved to the stack please share them.

@YairHalberstadt:

@erozenfeld
I'm thinking of lambdas specifically at the moment. When they are used in a loop and capture a variable declared inside the loop, but never escape the loop, they allocate on every single iteration.

This is quite common in LINQ code, where I will often write something like:

using System.Collections.Generic;
using System.Linq;

public class C
{
    public void M(List<int[]> arrays, List<int[]> newArrays)
    {
        foreach(var array in arrays)
        {
            int i = 42;
            newArrays.Add(array.Where(x => x > i).ToArray());
        }
    }
}

This generates the following:

public class C
{
    [CompilerGenerated]
    private sealed class <>c__DisplayClass0_0
    {
        public int i;

        internal bool <M>b__0(int x)
        {
            return x > i;
        }
    }

    public void M(List<int[]> arrays, List<int[]> newArrays)
    {
        List<int[]>.Enumerator enumerator = arrays.GetEnumerator();
        try
        {
            while (enumerator.MoveNext())
            {
                int[] current = enumerator.Current;
                <>c__DisplayClass0_0 <>c__DisplayClass0_ = new <>c__DisplayClass0_0();
                <>c__DisplayClass0_.i = 42;
                newArrays.Add(current.Where(<>c__DisplayClass0_.<M>b__0).ToArray());
            }
        }
        finally
        {
            ((IDisposable)enumerator).Dispose();
        }
    }
}

Note that whilst the same display class could theoretically be reused on each iteration of the loop, Roslyn can't know this since it doesn't do escape analysis, and by the time it gets to the JIT, it is probably too late to do that.

}

//------------------------------------------------------------------------
// CanLclVarEscape: Returns true iff local variable can


Typo in "iff", same with other "ifs" in this file :)


"iff" means "if and only if".


oh didn't know that. nevermind then.
