Add a managed API to enumerate and mutate the heap #12628

GSPP · 2019-05-04T18:10:13Z

In https://github.com/dotnet/coreclr/issues/14208 we are discussing deduplicating identical string objects for performance reasons. It was mentioned by @maoni that this would not have to be implemented in the GC. It could be done in managed code if the runtime provided an API to enumerate heap objects and mutate references to those objects.

Such an API would have multiple use cases:

- Deduplication
  - Strings
  - Boxed primitives and boxed custom structs
  - Empty arrays
  - Immutable collections
  - Custom application objects
  - Completed Task instances
- Debugging
  - Finding memory leaks and reference chains
- Profiling
  - You could have an entire profiler as a library
  - A web page could have a button "display allocations for this request" similar to SQL profilers
- Cache eviction (determine that no user references to a cache item exist; no need for GC handles or resurrection)
- Logging
- Validation of data structure invariants

All of this could be done in a library with no further runtime complication. There could be a variety of libraries in competition. Libraries could create custom policies for what to deduplicate (e.g. based on string length).
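
As a rough illustration of what a length-based policy could look like in library code, here is a minimal sketch. `StringDeduper` and its `minLength` parameter are hypothetical names made up for this example. The sketch only canonicalizes strings that the caller passes in explicitly; it cannot discover objects on the heap or rewrite existing references, which is exactly the part that would need the runtime API requested here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an existing .NET API. It canonicalizes equal
// strings handed to it; it does not enumerate or mutate the heap.
public sealed class StringDeduper
{
    private readonly Dictionary<string, string> _canonical =
        new Dictionary<string, string>(StringComparer.Ordinal);
    private readonly int _minLength;

    public StringDeduper(int minLength)
    {
        _minLength = minLength;
    }

    // Number of distinct strings currently held as canonical instances.
    public int PoolSize => _canonical.Count;

    // Returns the canonical instance for 'value', or 'value' itself when the
    // policy considers the string too short to be worth pooling.
    public string Canonicalize(string value)
    {
        if (value == null || value.Length < _minLength)
            return value;

        if (_canonical.TryGetValue(value, out string existing))
            return existing;

        _canonical.Add(value, value);
        return value;
    }
}
```

With `var deduper = new StringDeduper(minLength: 8)`, two separately allocated but equal 16-character strings come back from `Canonicalize` as the same reference, while strings shorter than 8 characters are returned untouched. Note that the pool keeps its strings alive; a real library would need an eviction strategy, which this sketch leaves out.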

Also, all of this would be opt-in. This is important because breaking object identity is a fundamental change. Not only is there compatibility impact, it fundamentally makes managed code more brittle.

There is a risk that such a powerful API would be abused. Reaching into 3rd party code and mutating internal references can break their assumptions. It can also lead to reliance on library internals. This could lead to an increased compatibility burden and more breakage when upgrading to new library versions.

The API could support the following features:

- Enumeration of heap objects
- Enumeration of references to objects
- Mutation of references. This could be done through reflection but surely it can be done faster as part of the enumeration API.
- Inspection of the object header for hash code information. This can be used to not deduplicate objects that look like they are being used in a hash table.
- Finding roots (thread stacks, statics, etc.)


weltkante · 2019-05-04T19:53:11Z

A lot (most?) of your requirements, in particular those for debugging/profiling, are already supported by the Microsoft.Diagnostics.Runtime library, which allows read-only access to the managed heap, including enumeration of objects, references, roots etc. I don't know if it contains any support for letting debuggers write into the heap.

GSPP · 2019-05-05T09:43:31Z

@weltkante That is good to know. But is this API made for "regular use"?

> There is experimental support for attaching to a live process (without suspending that process) to collect live samples of the heap. In a future release, I will be adding support for suspending a live process to collect data.
>
> This library works by loading the private CLR debugging library (mscordacwks.dll) into the process.

If the scenarios outlined in my opening post are to be achieved the API must be reliable and "production quality". It is my impression that an API based on debugging support or based on the profiler API would not deliver that.

What I envision is a light-weight, reliable and fully supported managed API.

weltkante · 2019-05-05T11:51:51Z

> But is this API made for "regular use"?

Depends on what you mean by "regular use"; the target audience is debuggers/profilers. Anyway, it's clear that it doesn't solve the problem you want solved. I just wanted to point at prior art which does implement some/most of the requirements.

> What I envision is a light-weight, reliable and fully supported managed API.

Some of your requirements are incompatible with "light-weight". To look at the stack and enumerate its root references into the heap reliably you need to halt all threads except your own, and that's the opposite of lightweight. If you don't do that then the heap will change between every instruction your algorithm takes while "looking at the heap". To minimize downtime you could take snapshots, but that's not exactly lightweight either.

You also need to solve the not-so-easy problem of managed code looking at itself while it's running. If you have managed code enumerating the heap then you'll most likely be modifying the heap while you are running your algorithms. Usually that problem is avoided by running this kind of code outside of the runtime it's examining (i.e. either unmanaged code or a separate process), or again by using snapshots instead of live data.

mjsabby · 2019-05-06T12:01:05Z

The enumeration can be achieved using the Profiler APIs in .NET Core 3.0+

Take a look at a WIP I'm doing here: https://github.com/mjsabby/StringDedupingProfiler. It dedupes all right, but it causes the GC to deadlock in a complicated application I'm testing against.

jkotas · 2019-05-06T15:48:15Z

> This can be achieved using the Profiler APIs in .NET Core 3.0+

I do not think you can implement reliable string deduping via profiler APIs in 3.0+:

- The profiler APIs are not meant to be used to mutate the heap. Mutating the heap requires the GC book-keeping (write barriers and friends) to be updated. There is no profiler API for updating the heap.
- You need to know when the immutable object is fully constructed and becomes a candidate for de-duping. It is hard to tell without cooperation from the class library.
- Then there is the issue of syncblocks, hashcodes and one string reference becoming two, discussed in https://github.com/dotnet/coreclr/issues/14208.

You can work around all of these by hacking on the runtime's internal structures or using other fragile techniques, but then you are not really achieving it via Profiler APIs.

mjsabby · 2019-05-06T17:22:02Z

I should clarify what I meant ... enumeration is possible via the profiling APIs, along with whatever level of deduping is possible with just enumeration and fixing up references. And yes, it requires runtime internals to be exposed too, so I should say that as well.

If you’re doing it within one generation and while mutation of the heap is stopped, does that still require such updates of write barriers and other structures? Which ones?

For your second point, if you only enumerate references from the gen2 segment, I think the fully constructed problem isn’t there, because you’ll never reach a string that is still being constructed: nothing on the heap references it yet.

The issue of sync blocks and hash codes can be fixed up by playing with the header or, as mentioned in those comments, by making such objects ineligible for deduping.

All that said, this is WIP and I think with restrictions it can work. If not, I’d love to know why it won’t work even with these caveats added.

GSPP · 2019-05-06T17:53:26Z

> You need to know when the immutable object is fully constructed and becomes a candidate for de-duping. It is hard to tell without cooperation from the class library.

That is a very good point. It's a killer for some deduping use cases if we don't find a solution to it.

There must be a per-object way to tell when the object is in its final state. Well, this could exist for things like immutable collections and tasks. The last field to be assigned could act like a sentinel. The class owner would have to guarantee this as you point out.

For custom application objects the application can assign a sentinel boolean field to true as the last step of construction.
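
A minimal sketch of that sentinel idea for an application type follows. `CustomerRecord`, its fields and `IsFullyConstructed` are hypothetical names invented for this example, and the assumption is that a deduplication library would agree on some such convention with the class owner. `Volatile.Write` is used so that the earlier field writes cannot be reordered after the sentinel write.

```csharp
using System.Threading;

// Hypothetical application type. '_constructed' is written last so that a
// heap scanner could skip instances whose constructor has not finished.
public sealed class CustomerRecord
{
    private readonly string _name;
    private readonly int _regionId;
    private bool _constructed; // sentinel: stays false while the constructor runs

    public CustomerRecord(string name, int regionId)
    {
        _name = name;
        _regionId = regionId;

        // Last step of construction. The release semantics of Volatile.Write
        // keep the field writes above from moving past this point.
        Volatile.Write(ref _constructed, true);
    }

    public string Name => _name;
    public int RegionId => _regionId;

    // What a (hypothetical) deduplication pass would check before treating
    // this instance as a candidate.
    public bool IsFullyConstructed => Volatile.Read(ref _constructed);
}
```

This only covers the constructor itself. Objects that are mutated after construction (for example through object initializers or a builder) would need the sentinel to be set by whatever code performs the final mutation.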

dotnet-policy-service · 2024-08-09T06:01:43Z

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

jkotas added area-GC-coreclr and removed area-Diagnostics-coreclr labels Feb 8, 2020

jkotas mentioned this issue Feb 8, 2020 in "String Deduplication Design Doc #31971" (merged)

dotnet-policy-service bot added the backlog-cleanup-candidate and no-recent-activity labels Aug 9, 2024
