Add a managed API to enumerate and mutate the heap #12628

Open
GSPP opened this issue May 4, 2019 · 8 comments
Labels
area-GC-coreclr, backlog-cleanup-candidate, enhancement, no-recent-activity

GSPP commented May 4, 2019

In https://github.com/dotnet/coreclr/issues/14208 we are discussing deduplicating identical string objects for performance reasons. It was mentioned by @maoni that this would not have to be implemented in the GC. It could be done in managed code if the runtime provided an API to enumerate heap objects and mutate references to those objects.

Such an API would have multiple use cases:

  1. Deduplication
    • Strings
    • Boxed primitives and boxed custom structs
    • Empty arrays
    • Immutable collections
    • Custom application objects
    • Completed Task instances
  2. Debugging
    • Finding memory leaks and reference chains
  3. Profiling
    • You could have an entire profiler as a library
    • A web page could have a button "display allocations for this request" similar to SQL profilers
  4. Cache eviction (determine that no user references to a cache item exist; no need for GC handles or resurrection)
  5. Logging
  6. Validation of data structure invariants

All of this could be done in a library with no further runtime complication. There could be a variety of libraries in competition. Libraries could create custom policies for what to deduplicate (e.g. based on string length).

Also, all of this would be opt-in. This is important because breaking object identity is a fundamental change. Not only is there compatibility impact, it fundamentally makes managed code more brittle.

There is a risk that such a powerful API would be abused. Reaching into 3rd party code and mutating internal references can break their assumptions. It can also lead to reliance on library internals. This could lead to an increased compatibility burden and more breakage when upgrading to new library versions.

The API could support the following features:

  1. Enumeration of heap objects
  2. Enumeration of references to objects
  3. Mutation of references. This could be done through reflection but surely it can be done faster as part of the enumeration API
  4. Inspection of the object header for hash code information. This can be used to not deduplicate objects that look like they are being used in a hash table.
  5. Finding roots (thread stacks, statics, etc.)
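
To make features 1 to 3 concrete, here is a minimal sketch of a library-side deduplication policy. It is a toy model in Python, not a proposed .NET API: the "heap" is a plain list of dicts standing in for objects with fields, and `dedupe_strings` and `min_length` are hypothetical names chosen for illustration.

```python
def dedupe_strings(holders, min_length=8):
    """Point equal strings in `holders` at one canonical instance.

    `holders` is a list of dicts standing in for heap objects with fields
    (an assumption of this sketch, not a real runtime structure).
    Returns the number of references that were rewritten.
    """
    canonical = {}  # string value -> first instance seen
    rewritten = 0
    for obj in holders:                           # 1. enumerate objects
        for field, value in list(obj.items()):    # 2. enumerate references
            # Policy: only strings at or above a length threshold.
            if not isinstance(value, str) or len(value) < min_length:
                continue
            keep = canonical.setdefault(value, value)
            if keep is not value:
                obj[field] = keep                 # 3. mutate the reference
                rewritten += 1
    return rewritten


# Two equal strings built at run time are distinct objects.
a = "".join(["connection-", "string"])
b = "".join(["connection-", "string"])
heap = [{"name": a}, {"name": b}, {"name": "short"}]
dedupe_strings(heap)
```

The policy (here a length threshold) lives entirely in library code, which is the point of the proposal. The sketch ignores everything the runtime would have to guarantee, such as stopping other threads and handling hash codes.
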
@weltkante
Contributor

A lot (most?) of your requirements, in particular those for debugging/profiling, are already supported by the Microsoft.Diagnostics.Runtime library, which allows read-only access to the managed heap, including enumeration of objects, references, roots etc. I don't know if it contains any support for letting debuggers write into the heap.

@GSPP
Author

GSPP commented May 5, 2019

@weltkante That is good to know. But is this API made for "regular use"?

> There is experimental support for attaching to a live process (without suspending that process) to collect live samples of the heap. In a future release, I will be adding support for suspending a live process to collect data.
>
> This library works by loading the private CLR debugging library (mscordacwks.dll) into the process.

If the scenarios outlined in my opening post are to be achieved, the API must be reliable and "production quality". It is my impression that an API based on debugging support or on the profiler API would not deliver that.

What I envision is a light-weight, reliable and fully supported managed API.

@weltkante
Contributor

weltkante commented May 5, 2019

> But is this API made for "regular use"?

Depends on what you mean by "regular use"; the target audience is debuggers and profilers. Anyway, it's clear that it doesn't solve the problem you want solved. I just wanted to point at prior art which does implement some/most of the requirements.

> What I envision is a light-weight, reliable and fully supported managed API.

Some of your requirements are incompatible with "light-weight". To look at the stack and enumerate its root references into the heap reliably, you need to halt all threads except your own, and that's the opposite of lightweight. If you don't do that, the heap will change between every instruction your algorithm takes while "looking at the heap". To minimize downtime you could take snapshots, but that's not exactly lightweight either.

You also need to solve the not-quite-so-easy problem of managed code looking at itself while it's running. If you have managed code enumerating the heap, then you'll most likely be modifying the heap while you are running your algorithms. Usually that problem is avoided by running this kind of code outside of the runtime it's examining (i.e. either unmanaged code or a separate process), or again by using snapshots instead of live data.
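
The self-observation problem can be shown with a small analogy. This is a sketch under loud assumptions: the "heap" is a Python dict keyed by made-up addresses, and the walker's own bookkeeping is allocated into that same dict. `walk_live` and `walk_snapshot` are hypothetical names, not part of any runtime API.

```python
def walk_live(heap):
    """Walk the live heap while allocating bookkeeping entries into it.

    Raises RuntimeError, because the dict changes size during iteration.
    """
    for address in heap:
        heap[("note", address)] = "seen"  # the walker's own allocation


def walk_snapshot(heap):
    """Walk a frozen copy of the addresses instead of the live heap.

    The walker still allocates into `heap`, but the walk itself is stable
    and sees only what existed when the snapshot was taken.
    """
    visited = []
    for address in list(heap):            # snapshot taken up front
        heap[("note", address)] = "seen"
        visited.append(address)
    return visited
```

Python happens to detect the conflict and raise. A real runtime has no such safety net, which is why the snapshot or out-of-process approach is the usual choice, and the snapshot copy is the cost that makes it less than lightweight.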

@mjsabby
Contributor

mjsabby commented May 6, 2019

The enumeration can be achieved using the Profiler APIs in .NET Core 3.0+

Take a look at a WIP I'm doing here, https://github.com/mjsabby/StringDedupingProfiler -- dedupes alright but causes GC to deadlock in a complicated application I'm testing against.

@jkotas
Member

jkotas commented May 6, 2019

> This can be achieved using the Profiler APIs in .NET Core 3.0+

I do not think you can implement reliable string deduping via profiler APIs in 3.0+:

  • The profiler APIs are not meant to be used to mutate the heap. Mutating the heap requires the GC book-keeping (write barriers and friends) to be updated. There is no profiler API for updating the heap.
  • You need to know when the immutable object is fully constructed and becomes a candidate for de-duping. It is hard to tell without cooperation from the class library.
  • Then there is the issue of syncblocks, hashcodes and one string reference becoming two discussed in https://github.com/dotnet/coreclr/issues/14208.

You can work around all these by hacking on the runtime internal structures or using other fragile techniques, but then you are not really achieving it via Profiler APIs.

@mjsabby
Contributor

mjsabby commented May 6, 2019

I should clarify what I meant ... enumeration is possible via profiling APIs, along with whatever level of deduping is possible with just enumeration and fixing up references. And yes, it requires runtime internals to be exposed too, so I should say that as well.

If you’re doing it within one generation and when mutation of the heap is stopped, does that still require such updates of write barriers and other structures? Which ones?

For your second point, if you only enumerate references from the gen2 segment, I think the fully constructed problem isn’t there because you’ll never reach the string since it won’t have a heap reference to it.

The issue of sync blocks and hash codes can be fixed up by playing with the header or, as mentioned in those comments, by making such objects ineligible for deduping.

All that said, this is WIP and I think with restrictions it can work, but if not, I’d love to know why it won’t work even with these caveats.

@GSPP
Author

GSPP commented May 6, 2019

> You need to know when the immutable object is fully constructed and becomes a candidate for de-duping. It is hard to tell without cooperation from the class library.

That is a very good point. It's a killer for some deduping use cases if we don't find a solution to it.

There must be a per-object way to tell when the object is in its final state. Well, this could exist for things like immutable collections and tasks. The last field to be assigned could act like a sentinel. The class owner would have to guarantee this as you point out.

For custom application objects, the application can set a sentinel boolean field to true as the last step of construction.
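
A minimal sketch of that sentinel idea, written in Python as a toy model rather than as .NET code. The `Money` class, the `fully_constructed` field, and `is_dedupe_candidate` are hypothetical names for illustration only.

```python
class Money:
    """Hypothetical immutable application object."""

    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency
        # Sentinel: assigned last, so a True value implies that every
        # other field has already been written.
        self.fully_constructed = True


def is_dedupe_candidate(obj):
    """A deduper would skip any object whose sentinel is not yet set."""
    return getattr(obj, "fully_constructed", False)


# An object whose constructor has not run (or has not finished) is skipped.
half_built = Money.__new__(Money)
finished = Money(10, "EUR")
candidates = [o for o in (half_built, finished) if is_dedupe_candidate(o)]
```

In a multithreaded runtime the sentinel write would also have to be ordered after the other field writes as seen by the deduper. This sketch ignores that.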

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
Contributor

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

@dotnet-policy-service dotnet-policy-service bot added the backlog-cleanup-candidate and no-recent-activity labels Aug 9, 2024