Optimistic allocation of objects on the stack #1661
Comments
The first thing to do on this would be an in-depth analysis: collect data or run some experiments to estimate benefits and costs, to prove that this is worth doing. FWIW, my gut feel is that it would be really hard to turn this into a measurable net win on real workloads. The observation from our experiments with allocating objects on the stack is that the GC is really good at getting rid of short-term garbage, and it is hard to come up with something significantly better.
@jkotas I've seen various places where the GC effort of small allocations was a significant issue; here's one quoted below. Given the existence of these situations, where people switch from object to struct just to avoid the allocation, I would expect the ability to (automatically?) stackify objects to be of benefit:
Originally posted by @fiigii in dotnet/coreclr#20251 (comment)
Right, structs give you a lot of control over doing manual stack allocation today, as the PacketTracker example shows. BTW: the PacketTracker fix was a one-line change: dotnet/coreclr#19662. You are proposing more controls over manual stack allocation. So the thing to prove is that there are common situations where these new manual controls help, and that are hard or impossible to fix with what is available today.
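[Editorial note: a minimal sketch of the manual pattern being discussed here; the types are invented stand-ins, not the actual PacketTracker code. Converting a small, short-lived class to a struct removes the per-call GC allocation.]

```csharp
// Before: every call allocates a tracker object on the GC heap.
class TrackerClass
{
    public int Count;
    public long Bytes;
}

// After: the same data as a struct lives in the caller's stack frame.
struct TrackerStruct
{
    public int Count;
    public long Bytes;
}

static class Demo
{
    static long Process(int count, long bytes)
    {
        var t = new TrackerStruct { Count = count, Bytes = bytes }; // no heap allocation
        return t.Count * t.Bytes;
    }
}
```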
I was proposing either some controls over stack allocation, or that the runtime detects and resolves the common instances itself. Yes, and there have been several other instances that I'm aware of where the solution to the performance bottleneck was "you can't use a class here, don't allocate, use a struct". I would anticipate that there were other undiagnosed similar issues caused by many small allocations (eg: function parameters, lambdas, synchronously resolved async tasks (optimisable with ValueTask, as sketched below)). That we need dedicated value-X types just to avoid the overhead of allocations on a common path is suggestive of a need for runtime improvements.

I don't know how to get more data on this, but other use cases (eg: struct params, lambdas, local lists and arrays of all types) appear reasonably often.

A demo: switching something like ToString to be non-allocating would be interesting, as that occurs in many places, but the overall cost of that wouldn't be visible on many code paths.

I would like to suggest that it would be nice if these (storage allocation strategy) were not problems a developer had to face to begin with, because the runtime handled this automatically. A language hint would be nice, as the developer often knows which things are likely to not escape (just like they know which functions are good to inline).
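[Editorial note: a hedged sketch of the ValueTask pattern mentioned above; the cache and names are invented. When the result is usually available synchronously, ValueTask<T> avoids allocating a Task<T> on the hot path.]

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

class NameCache
{
    private readonly Dictionary<int, string> _cache = new Dictionary<int, string>();

    public ValueTask<string> GetNameAsync(int id)
    {
        if (_cache.TryGetValue(id, out var name))
            return new ValueTask<string>(name);          // synchronous hit: no Task allocation

        return new ValueTask<string>(LoadNameAsync(id)); // miss: falls back to a real Task
    }

    private async Task<string> LoadNameAsync(int id)
    {
        await Task.Delay(10);                            // stand-in for real I/O
        var name = "user-" + id;
        _cache[id] = name;
        return name;
    }
}
```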
We have spent quite a lot of time on automatically detecting good candidates for stack allocation in the JIT. https://github.com/dotnet/coreclr/issues/20253 is the uber issue for it; it has a link to the design doc. The code with the prototype is in master. It is disabled because it did not show enough benefits.
Oh, so is the feature dead for now? I thought it was disabled because it was incomplete. |
Yes, this feature is not being worked on by the Microsoft .NET team right now.
We looked at all places where the prototype kicks in, or where it may potentially kick in once it is more complete, across the framework. It did not show enough hits. One observation was that you often need to analyze a very large amount of code to prove that it is ok to allocate on the stack. This analysis takes time, and the JIT is not designed to do a (global) analysis like this.
I understand that escape analysis is not happening, but we are also talking about the possibility of stack-allocating imperatively, which is arguably requested more often.
So true |
I understand that escape analysis is hard. That is why I'm proposing an optimistic approach where allocations are promoted on demand; no JIT-time proof of non-escape is needed.
I agree that there may be an opportunity, but we do not know how to turn it into a significant win. We had a team at Microsoft experimenting with the optimistic approach. The implementation was like what you have described in option 4:
We have been testing this on a specific server workload. All per-request objects were allocated in a separate section of the heap (i.e. all per-request allocations were "stack allocated"), and occasional allocations that survived the request got promoted on demand. We were not able to get significant (>10%) benefits out of it, even after spending a lot of time tweaking both the workload and the implementation. One of the observations is that the occasional on-demand promotion is expensive: you have to save a lot to pay for it, so that it is a net win at the end.
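[Editorial note: not actual CoreCLR code, but a conceptual sketch of why the on-demand promotion described above is expensive. Every reference store into the heap has to go through a check like the following, and the rare escaping store pays for a copy plus reference fix-ups; IsInLocalRegion and PromoteToHeap are invented names for runtime services.]

```csharp
static class PromotionBarrierSketch
{
    // Conceptually runs on every "heapObject.Field = value" store.
    static void WriteBarrier(ref object heapField, object value)
    {
        if (value != null && IsInLocalRegion(value))
        {
            // The object is escaping its per-request/local region: copy it to
            // the normal GC heap and leave a forwarding pointer so existing
            // local references can be fixed up. This is the expensive rare path.
            value = PromoteToHeap(value);
        }
        heapField = value; // the common path still pays for the region check
    }

    static bool IsInLocalRegion(object o) => false; // placeholder
    static object PromoteToHeap(object o) => o;     // placeholder
}
```

Even when promotion never fires, the extra check on the common path is part of what the savings have to pay back.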
Sure you do. You said so above:
There's your answer right there: move to an AOT model for deployment. JIT is a great productivity improvement during development, but when it comes to your production codebase, there's no good reason to use it over AOT and plenty of good reasons not to. (Edge cases involving runtime code generation can be handled either with an interpreter or by including a JIT in the runtime to deal with such cases, but the main body of the code should be AOT compiled.) This is just the latest such reason to be discovered. |
OK, I should have said: we do not have a high-confidence plan for how to turn it into a significant win.
That was just one of the observations, and the solution for it does not require AOT; it can be done as a high-tier JIT too. The hardest problem is building a code generator that can do this extensive analysis. Introducing another JIT tier or enabling this in AOT is the easier part.
I find it surprising that the best use case found yielded only a 10% performance win, given that entire language features have been written to minimize allocations and copies.
I do not think that stack alloc is basically free (when everything else is equal). Here are some things that are the same between heap and stack allocations:
Here are some of the reasons why stack allocations or structs tend to be cheaper:
Once you start converting arbitrary heap allocations to stack allocations over large code spans, you will start to lose these benefits:
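[Editorial note: one way to put numbers on these trade-offs is a micro-benchmark. A minimal sketch using BenchmarkDotNet, assuming the package is referenced; the types are illustrative.]

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // reports bytes allocated per operation
public class AllocCost
{
    class PointClass { public int X, Y; }
    struct PointStruct { public int X, Y; }

    [Benchmark(Baseline = true)]
    public int HeapAllocated()
    {
        var p = new PointClass { X = 1, Y = 2 }; // GC heap allocation
        return p.X + p.Y;
    }

    [Benchmark]
    public int FrameAllocated()
    {
        var p = new PointStruct { X = 1, Y = 2 }; // lives in the stack frame
        return p.X + p.Y;
    }

    public static void Main() => BenchmarkRunner.Run<AllocCost>();
}
```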
I think that enabling this in AOT is good enough. Is there an official Microsoft AOT compiler for C# that supports this now?
Background:
It is always interesting to see the two points of view on a problem like this: "Roslyn can optimize, and everyone would benefit" vs "we need a language guarantee, because compiler optimisations are flaky and sometimes cannot make the assumption we would want to force them to make (eg: nobody would argue that hints to force inlining are not important)."
Variations of all the "what things can safely be left on the stack to make an operation non-allocating" discussions seem to be the primary place this occurs, and they often get bogged down in escape analysis (eg: we can't do the above (stack alloc for local functions) for lambdas because we can't force a compliant implementation).
This is especially true when many of the things we take as part of the language itself are merely an implementation detail (eg: async x.ConfigureAwait(false)).
Originally posted by @AartBluestoke in dotnet/csharplang#1862 (comment)
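[Editorial note: to make the lambda point in the quote above concrete, a small sketch of why escape analysis is hard there: capturing a local forces a heap-allocated closure, because the compiler cannot generally prove the delegate will not outlive the frame.]

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ClosureDemo
{
    static Func<int, bool> MakePredicate(int threshold)
    {
        // 'threshold' is captured; the compiler lifts it into a heap-allocated
        // closure object, since the returned delegate escapes this method.
        return x => x > threshold;
    }

    static int CountAbove(List<int> data, int threshold)
    {
        // Here the lambda never outlives the call, but the compiler cannot
        // prove that Count() won't store the delegate, so a closure is still
        // allocated on every call.
        return data.Count(x => x > threshold);
    }
}
```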
This is also partially a language request (as we would need either a new keyword, or attributes on local variables; some way to hint to the compiler that we'd like to "memory-inline" this 'new'). This issue is also vague about whether the actual allocation is on the stack or in the local memory pool.
Feature request:
In cooperation with the runtime, request that objects be optimistically stored in the local memory pool.
For now, this is a new keyword (compiled into an attribute) on a local variable:
localalloc var localList = new List<int>();
This is a request that the entire object for a local variable be stored either on the stack or in the local memory pool. This isn't a recursive request; it applies just to that specific 'new' call.
Once an object has been allocated locally it can be passed into any function. The issue is how to deal with "escaping" stack references.
"escaping" in this context is defined as copying the reference to any heap allocated location.
This is where the "Optimistic" part of this feature comes in. In a pay-for-play manner, escaping stack references are heap allocated on demand.
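[Editorial note: a hedged sketch of the intended semantics. The localalloc keyword does not exist, so it appears only in a comment, and the escaping store is shown commented out.]

```csharp
using System.Collections.Generic;
using System.Linq;

class LocalAllocSketch
{
    static List<int> _cache; // a heap-reachable location

    // Proposed: "localalloc var buffer = new List<int>();" would place the
    // List itself (non-recursively: not its internal array) in the local pool.
    static int Sum(IEnumerable<int> source)
    {
        var buffer = new List<int>(); // today: always a GC heap allocation
        foreach (var x in source)
            buffer.Add(x);            // local use: passing it around is fine

        // _cache = buffer;           // an "escape": copying the reference to a
                                      // heap location would trigger on-demand
                                      // heap promotion of 'buffer'
        return buffer.Sum();
    }
}
```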
I think there are four places this could be done:
1. Promptly on object escape, as part of the execution of the line of code x = y.
-- Perhaps done by a memory trap, but I'm not sure how to achieve this without increasing the cost of many different things.
2. On return from the function with localalloc, when the local memory would be unwound; potentially significantly after the escape itself.
-- This could be done by keeping a bitmap of localalloc objects that have 'escaped', and prior to function return requesting that the GC relocate the escapees. This would hook into the concurrent mark+sweep component (a sketch of this bookkeeping follows the list).
3. On the next GC pass.
-- On return from a function that uses localalloc (and an object escaped?), the local memory section is not considered free until after the next GC gen0 pass. GC gen0 also considers any preserved local memory for object promotion.
4. Keep a separate section of the heap as a stack for the purpose of holding localalloc objects. This would minimise the 'stack overflow' problem, as expanding the 'localalloc memspace' could be done as part of a standard GC operation.
-- I think this would be hard to implement in a backwards-compatible manner, as the generated IL would have to change.
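[Editorial note: a rough sketch of the bookkeeping option 2 describes, with invented names and layout; the real mechanism would live inside the runtime, not in C#.]

```csharp
// Hypothetical per-frame bookkeeping for deferred promotion (option 2).
struct LocalAllocFrame
{
    public ulong EscapedBitmap;   // bit i set => local object i has escaped
    public object[] LocalObjects; // objects placed in this frame's local pool

    // Called by the (hypothetical) write barrier when a store escapes.
    public void MarkEscaped(int index) => EscapedBitmap |= 1UL << index;

    // Called just before the frame unwinds: ask the GC to relocate escapees,
    // hooking into the concurrent mark+sweep machinery.
    public void PromoteEscapeesBeforeReturn()
    {
        for (int i = 0; i < LocalObjects.Length; i++)
            if ((EscapedBitmap & (1UL << i)) != 0)
                GcRelocateToHeap(LocalObjects[i]);
    }

    static void GcRelocateToHeap(object o)
    {
        // Placeholder: the actual copy + reference fix-up is runtime magic
        // that cannot be expressed in C#.
    }
}
```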
No language change, just runtime optimisation
If this turns out to be cheap enough, then this could be a pure runtime issue, essentially making a new GC gen of -1 where all objects "known" to not survive can be optimistically allocated. I would propose that in this case all local variables are optimistically assumed to be localalloc and to never escape.
On a gen0 allocation, those variables that escape are flagged, and the functions are flagged for rejit (yes, I know we currently never rejit, but here I want to; there are other use cases for this: optimising based on inlining of children recursively, devirtualization after rejit, code elimination within child calls, etc), removing the localalloc hint.
--
Should this be over at csharplang first? Though presumably they would say 'can't do this without runtime assistance'.