Optimistic allocation of objects on the stack #1661

Closed
AartBluestoke opened this issue Jan 13, 2020 · 16 comments
Labels
area-VM-coreclr design-discussion Ongoing discussion about design without consensus

Comments

@AartBluestoke

Background:

It is always interesting to see the two points of view on a problem like this: "Roslyn can optimize, and everyone would benefit" vs. "we need a language guarantee, because compiler optimisations are flaky and sometimes cannot make the assumption we would want to force on them (eg: nobody would argue that hints to force in-lining are not important)."

Variations of the "what things can safely be left on the stack to make an operation non-allocating" discussion seem to be the primary place this occurs, and it often gets bogged down in escape analysis (eg: we can't do the above (stack alloc for local functions) for lambdas because we can't force a compliant implementation).

This is especially true when many of the things we take as part of the language itself are merely an implementation detail (eg: async x.ConfigureAwait(false) ).

Originally posted by @AartBluestoke in dotnet/csharplang#1862 (comment)

This is also partially a language request (we would need either a new keyword or attributes on local variables; some way to hint to the compiler that we'd like to "memory-inline" this 'new'). This issue is also vague about whether the actual allocation is on the stack or in a local memory pool.

Feature request:

With cooperation from the runtime, allow a request that objects be optimistically stored in the local memory pool.

For now, this is a new keyword (compiled into an attribute) on a local variable:

localalloc var localList = new List<int>();
This is a request that the entire object for a local variable be stored either on the stack or in the local memory pool. This isn't a recursive request; it applies only to that specific 'new' call.

Once an object has been allocated locally it can be passed into any function. The issue is how to deal with "escaping" stack references.
"escaping" in this context is defined as copying the reference to any heap allocated location.

class C
{
    static IEnumerable<int> sv;

    void M()
    {
        localalloc var localList = new List<int>(); // proposed keyword: allocate locally
        var x = localList;  // non-allocating, because x is a local
        DoStuff(x);         // the *function call* itself is non-allocating, although any
                            // escaping assignment inside DoStuff would need to allocate
        sv = localList;     // escaping assignment: an expensive heap allocation and object relocation
        DoOtherStuff(x);    // must receive the newly relocated memory location, not the original
    }
}

This is where the "Optimistic" part of this feature comes in. In a pay-for-play manner, escaping stack references are heap allocated on demand.

I think there are five places this could be done:

  1. Promptly on object escape, as part of the execution of the line of code x = y.
    -- Perhaps done by a memory trap, but I'm not sure how to achieve this without increasing the cost of many other things.

  2. On return from the function, when the local memory would be unwound; potentially significantly later than the escape itself.
    -- This could be done by keeping a bitmap of localalloc objects that have 'escaped', and prior to function return requesting that the GC relocate the escapees. This would hook into the concurrent mark+sweep component.

  3. On the next GC pass.
    -- On return from a function that uses localalloc (and from which an object escaped?), the local memory section is not considered free until after the next GC gen0 pass. GC gen0 also considers any preserved local memory for object promotion.

  4. Keep a separate section of the heap as a stack for the purpose of holding localalloc objects. This would minimise the 'stack overflow' problem, as expanding the 'localalloc memspace' could be done as part of a standard GC operation.
    -- I think this would be hard to implement in a backwards-compatible manner, as the generated IL would have to change.

  5. No language change, just runtime optimisation.
    If this turns out to be cheap enough, then this could be a pure runtime issue, essentially making a new GC gen of -1 where all objects "known" not to survive can be optimistically allocated. I would propose that in this case all local variables are optimistically assumed to be localalloc and to never escape.
    On a gen0 allocation, the variables that escape are flagged and their functions flagged for rejit (yes, I know we currently never rejit, but here I want to; there are other use cases for this: optimising based on inlining of children recursively, devirtualization after rejit, code elimination within child calls, etc.), removing the localalloc hint.

--
Should this be over at csharplang first, as they would say 'can't do this without runtime assistance'?

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jan 13, 2020
@jkotas
Member

jkotas commented Jan 13, 2020

The first thing to do on this would be an in-depth analysis, collect data or run some experiments to estimate benefits and costs to prove that this is worth doing.

FWIW, my gut feel is that it would be really hard to turn this into a measurable net win on real workloads. The observation from our experiments with allocating objects on stack is that the GC is really good at getting rid of short-term garbage and it is hard to come up with something significantly better.

@jkotas jkotas added design-discussion Ongoing discussion about design without consensus area-VM-coreclr labels Jan 13, 2020
@AartBluestoke
Author

@jkotas I've seen various places where the GC cost of small allocations was a significant issue; here's one quoted below. Given the existence of these situations, where people switch from class to struct just to avoid the allocation, I would expect the ability to (automatically?) stackify objects to be of benefit:

I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?

There is an analysis in dotnet/coreclr#19663, which manually changes class to struct to get the similar effect of escape analysis. The data shows, for the PacketTracer benchmark, 31% execution time improvement, ~16% code size shrink, and significant GC overhead reduction (~33% -> ~11%).

Originally posted by @fiigii in dotnet/coreclr#20251 (comment)
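
For illustration, a minimal sketch of that class-to-struct pattern (hypothetical Vec3 types, not the actual PacketTracer change):

// Heap-allocated: every `new Vec3Class()` is a gen0 object the GC must track.
class Vec3Class { public float X, Y, Z; }

// Stack-allocated: lives in the caller's frame (often in registers), no GC involvement.
struct Vec3Struct { public float X, Y, Z; }

static float LengthSquared()
{
    var v = new Vec3Struct { X = 1, Y = 2, Z = 3 }; // no heap allocation
    return v.X * v.X + v.Y * v.Y + v.Z * v.Z;
}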

@jkotas
Member

jkotas commented Jan 13, 2020

Right, structs give you a lot of control over doing manual stack allocation today, as the PacketTracer example shows. BTW: the PacketTracer fix was a one-line change: dotnet/coreclr#19662.

You are proposing more controls over manual stack allocation. So the thing to prove is that there are common situations where these new manual controls help, and that those situations are hard or impossible to fix with what is available today.
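
For reference, a sketch of the manual controls that exist today: structs, plus stackalloc buffers exposed through Span<T> (this assumes n <= 128 for the sake of the example):

static int SumFirstN(int n)
{
    // Small buffers stay in the current frame; no GC allocation, no collection cost.
    Span<int> buffer = stackalloc int[128];
    buffer = buffer.Slice(0, n); // sketch assumes n <= 128

    for (int i = 0; i < n; i++) buffer[i] = i;

    int sum = 0;
    foreach (var v in buffer) sum += v;
    return sum;
}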

@AartBluestoke
Author

I was proposing either some controls over stack allocation, or that the runtime detect and resolve the common instances itself.

Yes, and there have been several other instances that I'm aware of where the solution to the performance bottleneck was "you can't use a class here, don't allocate, use a struct". I would anticipate that there are other undiagnosed issues caused by many small allocations (eg: function parameters, lambdas, synchronously resolved async tasks (optimisable with ValueTask)).

That we need dedicated Value* types just to avoid the overhead of allocations on a common path is suggestive of a need for runtime improvements.

I don't know how to get more data on this, but other use cases (eg: struct params, lambdas, local lists and arrays of all types) appear reasonably often. A demo switching something like ToString to be non-allocating would be interesting, as it occurs in many places, but the overall cost of that wouldn't be visible on many code paths.

I would like to suggest that it would be nice if these storage-allocation-strategy problems were not something a developer had to face to begin with, because the runtime handled them automatically. A language hint would be nice, as the developer often knows which things are likely not to escape (just like they know which functions are good to inline).

@jkotas,
Out of the strategies above, do you believe any of them would be viable, or impossible?
My thoughts are that 1 and 4 would be the least likely to work, with the on-GC style of 3 and 5 being the most likely to be implementable.

@jkotas
Member

jkotas commented Jan 13, 2020

We have spent quite a lot of time on automatically detecting good candidates for stack allocation in the JIT. https://github.com/dotnet/coreclr/issues/20253 is the uber issue for it, and it has a link to the design doc. The code for the prototype is in master. It is disabled because it did not show enough benefits.

@Suchiman
Contributor

Oh, so is the feature dead for now? I thought it was disabled because it was incomplete.
It seems counterintuitive that code which automatically solves the problem of heap allocations, which (ASP).NET tries so desperately to avoid for performance's sake, doesn't yield enough benefit. Or was the benchmark too biased because the tested workloads were already heavily GC-optimized?

@jkotas
Member

jkotas commented Jan 13, 2020

Yes, this feature is not currently being worked on by the Microsoft .NET team.

or was the benchmark too biased due to the tested workloads already being heavily GC optimized?

We looked at all places where the prototype kicks in or where it may potentially kick in once it is more complete, across the framework. It did not show enough hits. One observation was that you often need to analyze a very large amount of code to prove that it is ok to allocate on stack. This analysis takes time and the JIT is not designed to do a (global) analysis like this.

@Trayani

Trayani commented Jan 13, 2020

I understand that escape analysis is not happening, but we are also talking about the possibility of stack-allocating imperatively, which is arguably requested more often.
Is it on hiatus too? I think Span showed nicely that objects can be handled while being prevented from escaping.
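
Concretely (a sketch; StackOnly is a hypothetical type): Span<T> is a ref struct, so the compiler statically rejects any assignment that would let it reach the heap.

ref struct StackOnly   // stack-only by construction, like Span<T> itself
{
    public Span<int> Buffer;
}

class Holder
{
    // static StackOnly s;  // error CS8345: a ref struct cannot be a field of a class,
                            // so it can never be stored on the heap, boxed, or captured
}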

I don't know how to get more data on this, but other use cases (eg: struct params, lambdas, local lists and arrays of all types) appear reasonably often.

So true

@AartBluestoke
Author

I understand that escape analysis is hard. That is why I'm proposing an optimistic approach where allocations are promoted on demand. No JIT-time proof of non-escape is needed.

@jkotas
Member

jkotas commented Jan 13, 2020

I agree that there may be an opportunity, but we do not know how to turn it into a significant win.

We had a team at Microsoft experimenting with the optimistic approach. The implementation was similar to what you described in option 4:

keep a separate section of the heap as a stack for the purpose of holding localAlloc objects - this would minimise the 'stack overflow' problem

We tested this on a specific server workload. All per-request objects were allocated in the separate section of the heap (ie all per-request allocations were "stack allocated"), and the occasional allocation that survived the request got promoted on demand. We were not able to get significant (>10%) benefits out of it, even after spending a lot of time tweaking both the workload and the implementation. One of the observations is that the occasional on-demand promotion is expensive: you have to save a lot elsewhere to pay for it and still come out with a net win.
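
A rough sketch of that allocator shape (hypothetical RequestArena API, illustrative only; the real experiment lived inside the runtime, and bounds/alignment checks are omitted here):

// Bump-allocate all per-request data in one region; the rare survivor is
// promoted (copied) to the normal heap, and everything else dies at once.
class RequestArena
{
    private readonly byte[] _region = new byte[1 << 20]; // fixed 1 MB region
    private int _offset;

    public Span<byte> Allocate(int size)
    {
        var span = _region.AsSpan(_offset, size); // pointer bump, no GC work
        _offset += size;
        return span;
    }

    // The expensive escape path described above: copy out of the arena.
    public byte[] Promote(ReadOnlySpan<byte> escaping) => escaping.ToArray();

    public void Reset() => _offset = 0; // request finished: free everything in O(1)
}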

@masonwheeler
Contributor

@jkotas

we do not know how to turn it into a significant win.

Sure you do. You said so above:

This analysis takes time and the JIT is not designed to do a (global) analysis like this.

There's your answer right there: move to an AOT model for deployment.

JIT is a great productivity improvement during development, but when it comes to your production codebase, there's no good reason to use it over AOT and plenty of good reasons not to. (Edge cases involving runtime code generation can be handled either with an interpreter or by including a JIT in the runtime to deal with such cases, but the main body of the code should be AOT compiled.) This is just the latest such reason to be discovered.

@jkotas
Member

jkotas commented Jan 14, 2020

Ok, I should have said: We do not have a high-confidence plan for how to turn it into a significant win.

This analysis takes time and the JIT is not designed to do a (global) analysis like this.

There's your answer right there: move to an AOT model for deployment.

That was just one of the observations, and the solution for it does not require AOT. It can be done as a high-tier JIT too. The hardest problem is building a code generator that can do this extensive analysis. Introducing another JIT tier or enabling this in AOT is the easier part.

@AartBluestoke
Author

AartBluestoke commented Jan 15, 2020

I find it surprising that a use case with only a 10% performance gain was all that could be found, given that entire language features have been written to minimize allocations and copies:

  • Assuming that we are able to identify variables that escape less than 1% of the time (or never, except when exceptions are thrown),
  • not getting a 10% performance increase means that:
    stack_alloc (**not** basically free) + 0.01 × promote_to_heap > 0.9 × heap_allocation.
  • This implies that either the candidates for stack allocation were poorly chosen, heap promotion was more than 1000× as expensive as a heap allocation on an amortised per-object basis, or my assumption about how cheap allocation is in different areas was incorrect.

edit:
Thanks for the explanation @jkotas , edited the above.

@jkotas
Member

jkotas commented Jan 15, 2020

I do not think that stack alloc is basically free (when everything else is equal).

Here are some things that are the same between heap and stack allocations:

  • Both heap and stack allocations are just a pointer bump in the simple case.
  • Both heap and stack allocations need to zero fill the memory (unless you use unsafe code)
  • When you allocate memory (regardless of whether it is on stack or on heap), you typically need to fill it with something useful. The cost of filling it with something useful is typically more than the cost of allocation.

Here are some of the reasons why stack allocations or structs tend to be cheaper:

  • The structs are two pointers smaller (no vtable and no sync blocks) and often enregistered
  • The stack allocations have perfectly tracked lifetime and utilize/reuse the processor cache better

Once you start converting arbitrary heap allocations to stack allocations over large code spans, you will start to lose these benefits:

  • You may not be able to prove that it is safe to omit sync block and vtable, or put everything into registers, due to code complexity
  • The lifetime of the allocations is longer and overlapping. You either need to start wasting space or spend extra cycles on reusing freed space (ie implement C-style malloc/free, or implement a GC for the local stack).
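
To make the "two pointers smaller" point concrete (illustrative declarations; exact header sizes are a 64-bit CLR implementation detail):

struct PointStruct { public long X, Y; } // 16 bytes of fields, can be enregistered
class  PointClass  { public long X, Y; } // 16 bytes of fields + method-table pointer
                                         // and sync block (~16 bytes of header),
                                         // and always heap-resident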

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 23, 2020
@jeffschwMSFT jeffschwMSFT added this to the Future milestone Jan 23, 2020
@ygc369

ygc369 commented Feb 9, 2020

@jkotas

Introducing another JIT tier or enable this in AOT is the easier part.

I think that enabling this in AOT is good enough. Is there an official Microsoft AOT compiler for C# that supports this now?

@masonwheeler
Contributor

@ygc369 Not as official as it should be, but have a look at CoreRT.

@dotnet-policy-service dotnet-policy-service bot removed the backlog-cleanup-candidate An inactive issue that has been marked for automated closure. label Aug 24, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2024