Add a managed API to enumerate and mutate the heap #12628
Comments
A lot (most?) of your requirements, in particular those for debugging/profiling, are already supported by the Microsoft.Diagnostics.Runtime library, which allows read-only access to the managed heap, including enumeration of objects, references, roots, etc. I don't know if it contains any support for letting debuggers write into the heap.
@weltkante That is good to know. But is this API made for "regular use"?
If the scenarios outlined in my opening post are to be achieved, the API must be reliable and "production quality". It is my impression that an API based on debugging support or on the profiler API would not deliver that. What I envision is a light-weight, reliable and fully supported managed API.
Depends on what you mean by "regular use", the target audience is debuggers/profilers. Anyways, it's clear that it doesn't solve the problem you want solved, I just wanted to point at prior art which does implement some/most of the requirements.
Some of your requirements are incompatible with "light-weight". To look at the stack and enumerate its root references into the heap reliably you need to halt all threads except your own, and that's the opposite of lightweight. If you don't do that, the heap will change between every instruction your algorithm takes while "looking at the heap". To minimize downtime you could take snapshots, but that's not exactly lightweight either. You also need to solve the not-so-easy problem of managed code looking at itself while it's running. If you have managed code enumerating the heap, you'll most likely be modifying the heap while you are running your algorithms. Usually that problem is avoided by running this kind of code outside of the runtime it's examining (i.e. either unmanaged code or a separate process), or again by using snapshots instead of live data.
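The hazard of code modifying the heap it is enumerating can be shown with a toy model. This is a Python sketch, not .NET code: the `heap` dictionary and the `walk_live` / `walk_snapshot` functions are hypothetical stand-ins for a managed heap and an enumeration API, and the dictionary's iteration check only loosely mirrors what a real runtime would have to guard against.

```python
# Toy model (assumption): the "heap" is a dict from object id to payload.
# The walker allocates bookkeeping entries on the same heap it enumerates,
# mimicking managed code that inspects its own runtime.

def walk_live(heap):
    """Enumerate the live heap while allocating on it."""
    seen = []
    for obj_id in heap:                          # iterates the live structure
        seen.append(obj_id)
        heap[f"note-{obj_id}"] = "bookkeeping"   # allocation during the walk
    return seen


def walk_snapshot(heap):
    """Enumerate a point-in-time copy of the ids, then allocate freely."""
    seen = []
    for obj_id in list(heap):                    # snapshot costs O(n) extra memory
        seen.append(obj_id)
        heap[f"note-{obj_id}"] = "bookkeeping"
    return seen
```

In this model `walk_live` fails with a `RuntimeError` because the dictionary changes size during iteration, while `walk_snapshot` completes but pays for the copy, which is the "not exactly lightweight" trade-off mentioned above.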
The enumeration can be achieved using the Profiler APIs in .NET Core 3.0+. Take a look at a WIP I'm doing here: https://github.com/mjsabby/StringDedupingProfiler. It dedupes all right, but causes the GC to deadlock in a complicated application I'm testing against.
I do not think you can implement a reliable string deduping via profiler APIs in 3.0+:
You can work around all of these by hacking on the runtime's internal structures or using other fragile techniques, but then you are not really achieving it via the Profiler APIs.
I should clarify what I meant ... enumeration is possible via the profiling APIs, as is whatever level of deduping is possible with just enumeration and fixing up references. And yes, it requires runtime internals to be exposed too, so I should say that as well. If you're doing it within one generation and while mutation of the heap is stopped, does that still require such updates of write barriers and other structures? Which ones? For your second point, if you only enumerate references from the gen2 segment, I think the fully-constructed problem isn't there, because you'll never reach the string since it won't have a heap reference to it. The issue of sync blocks and hash codes can be fixed up by playing with the header or, as mentioned in those comments, by making such objects ineligible for deduping. All that said, this is WIP and I think with restrictions it can work; if not, I'd love to know why it won't work even with the caveats added.
That is a very good point. It's a killer for some deduping use cases if we don't find a solution to it. There must be a per-object way to tell when the object is in its final state. Well, this could exist for things like immutable collections and tasks. The last field to be assigned could act like a sentinel. The class owner would have to guarantee this, as you point out. For custom application objects, the application can assign a sentinel boolean field to true as the last step of construction.
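A minimal sketch of the sentinel idea, written in Python as a toy model rather than as .NET code. The `Order` class, the `is_constructed` field and the `eligible_for_dedup` helper are hypothetical names invented for illustration; a real runtime would presumably also need the sentinel write to be ordered after all earlier field writes, which this sketch does not model.

```python
class Order:
    """Hypothetical application object that publishes a 'fully constructed' sentinel."""

    def __init__(self, customer, items):
        self.is_constructed = False   # sentinel starts cleared
        self.customer = customer
        self.items = tuple(items)
        self.is_constructed = True    # assigned last, by convention of the class owner


def eligible_for_dedup(obj):
    """A heap walker would skip any object whose sentinel is missing or still cleared."""
    return getattr(obj, "is_constructed", False) is True
```

An object observed halfway through construction has no sentinel set, so the walker leaves it alone; the guarantee is only as strong as the class owner's discipline in assigning the field last.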
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process. This process is part of our issue cleanup automation.
In https://github.com/dotnet/coreclr/issues/14208 we are discussing deduplicating identical string objects for performance reasons. It was mentioned by @maoni that this would not have to be implemented in the GC. It could be done in managed code if the runtime provided an API to enumerate heap objects and mutate references to those objects.
Such an API would have multiple use cases:
Task instances
All of this could be done in a library with no further runtime complication. There could be a variety of libraries in competition. Libraries could create custom policies for what to deduplicate (e.g. based on string length).
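To make the library idea concrete, here is a toy model of a deduplication pass with a length-based policy. It is a Python sketch under stated assumptions, not a proposal for the actual API shape: `Obj`, `Str`, `dedupe_strings` and the `min_length` parameter are all hypothetical, and the "heap" is just a list of objects with named reference fields.

```python
class Str:
    """Stand-in for a heap string whose identity is distinct from its value."""

    def __init__(self, value):
        self.value = value


class Obj:
    """Hypothetical heap object holding named reference fields."""

    def __init__(self, fields):
        self.fields = dict(fields)


def dedupe_strings(heap, min_length=0):
    """Rewrite references so that equal strings share one instance.

    Strings shorter than min_length are left alone (the library's policy).
    Returns the number of references rewritten.
    """
    canonical = {}   # string value -> first instance seen
    rewritten = 0
    for obj in heap:
        for name, ref in list(obj.fields.items()):
            if not isinstance(ref, Str) or len(ref.value) < min_length:
                continue
            keep = canonical.setdefault(ref.value, ref)
            if keep is not ref:
                obj.fields[name] = keep   # mutate the reference in place
                rewritten += 1
    return rewritten
```

After the pass, code that compared the two strings by reference would see a different result than before, which is why breaking object identity has to be opt-in.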
Also, all of this would be opt-in. This is important because breaking object identity is a fundamental change. Not only is there compatibility impact, it fundamentally makes managed code more brittle.
There is a risk that such a powerful API would be abused. Reaching into 3rd party code and mutating internal references can break their assumptions. It can also lead to reliance on library internals. This could lead to an increased compatibility burden and more breakage when upgrading to new library versions.
The API could support the following features: