Custom allocators - (size, disposal, pools, etc). #1235

Open · ayende opened this issue Jul 14, 2015 · 77 comments

@ayende (Contributor) commented Jul 14, 2015

One of the hardest things that we have to handle when writing server applications or system software in .NET is the fact that we don't have good control over memory.
This ranges from the simple inability to ask "how big is this thing?" to controlling how much memory we'll use for certain operations.

In my case, working on RavenDB, there are a lot of operations that require extensive memory usage, over which we have little control. The user can specify any indexing function they want, and we'll have to respect that.
Heavy indexing can cause several issues. In particular, it means that we generate a LOT of relatively short-term data, while other operations also run. Because we are system software, we are doing a lot of I/O, which requires pinning memory.

The end result is that we may have memory with the following layout.

[useful objects] [ indexing garbage ] [pinned buffers doing i/o] [ indexing garbage] [ pinned buffers ]

That results in high memory fragmentation (in Gen0, mostly), which is hard to deal with.

It also means that when indexing is done, we have to clean up quite a lot of garbage, and because the indexing garbage is mixed with data that is in use right now, the memory either cannot be readily reclaimed or requires a big compaction.

It would be great if we had greater control over memory usage. Being able to define a heap and instruct the CLR to allocate objects from it would be wonderful. We wouldn't have request-processing memory intermixed with background-ops memory, and we'd have a good way to free a lot of memory all at once.

One option would be to do something like this:

using(var heap = Heap.Create(HeapOptions.None, 
    1024 * 1024, // initial size
    512 * 1024 * 1024)) // max size
{
  using(heap.AllocateFromMe())
  {
     var sb = new StringBuilder();
     for(var i = 0; i < 100; i++)
           sb.AppendLine(i.ToString()); // AppendLine takes a string, not an int
     Console.WriteLine(sb.ToString());
  }
}

This will ensure that all allocations inside the scope are allocated on the new heap. The GC isn't involved in collecting items from this heap at all; it is the responsibility of the user to take care of that, either by explicitly freeing objects (rare) or by disposing the heap.

Usage of references to the heap after it is destroyed won't be allowed.

Alternatively, because that might be too hard / complex, just having a way to do something like:

 heap.Allocate<MyItem>();

Would be great. Note that we can do the same right now by allocating native memory and using unsafe code to get a struct pointer back. This works, but very common types like arrays or strings cannot be allocated in this manner.
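As a rough illustration of that existing workaround, here is a minimal sketch (`MyItem` and `NativeAlloc` are hypothetical names; this requires compiling with /unsafe):

```csharp
using System;
using System.Runtime.InteropServices;

// MyItem stands in for the kind of blittable struct you might want
// to allocate outside the GC heap via heap.Allocate<MyItem>().
struct MyItem
{
    public int Id;
    public double Value;
}

static class NativeAlloc
{
    // Allocate a MyItem in native memory; the GC never sees this object.
    public static unsafe MyItem* Allocate()
    {
        return (MyItem*)Marshal.AllocHGlobal(sizeof(MyItem));
    }

    // The caller is fully responsible for freeing the memory.
    public static unsafe void Free(MyItem* item)
    {
        Marshal.FreeHGlobal((IntPtr)item);
    }
}
```

As noted above, this trick only works for structs accessed through pointers; managed arrays and strings cannot be placed in native memory this way.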

Having support for explicit usage like this would greatly alleviate the kind of gymnastics that we have to go through to manage memory.

@gistofj commented Jul 14, 2015

👍 to more memory control for those that need it. I'd be happy with a way to manage memory very coarsely, at the heap level. Not as useful as destructible or owned values, but it is something.

@omariom (Contributor) commented Jul 14, 2015

What if objects on the normal heap have references to objects in a user-controlled heap? What will happen after the user heap is freed?

@ayende (Contributor, Author) commented Jul 14, 2015

Crash? Null reference?

This isn't meant to be something you would do lightly, and you take responsibility for that.

@redknightlois commented Jul 14, 2015

This was also mentioned by Miguel de Icaza a couple of years ago: http://tirania.org/blog/archive/2012/Apr-04.html

@omariom We already have some sort of similar mechanism with weak references: they are nullified. This is certainly not for everyone, but it does have its uses in certain niches (even at the peril of introducing subtle bugs).

@OtherCrashOverride commented Jul 15, 2015

In C++, this is called "placement new".

In C#, the main solutions to this are object pools, IDisposable, and IEnumerable. Object pools prevent the allocation of new memory and so also avoid the fragmentation. IDisposable allows explicit control over the release of non-managed resources or pinned memory. Finally, IEnumerable allows you to allocate/deallocate on demand instead of all at once.

As an alternative, you can create a native library to handle resources from a non-managed heap rather than pinning. The library would then have a C# facade over it that wraps P/Invoke calls.
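For reference, the object-pool approach can be sketched minimally like this (`BufferPool` is an illustrative name; later framework versions ship System.Buffers.ArrayPool&lt;T&gt; for exactly this purpose):

```csharp
using System;
using System.Collections.Concurrent;

// A minimal buffer pool: rent a byte[] instead of allocating a fresh one,
// so long-lived (possibly pinned) buffers get reused instead of churned
// through Gen0, which avoids the fragmentation described above.
class BufferPool
{
    private readonly ConcurrentBag<byte[]> _buffers = new ConcurrentBag<byte[]>();
    private readonly int _bufferSize;

    public BufferPool(int bufferSize) { _bufferSize = bufferSize; }

    public byte[] Rent()
    {
        byte[] buffer;
        // Reuse a returned buffer if one is available, else allocate.
        return _buffers.TryTake(out buffer) ? buffer : new byte[_bufferSize];
    }

    public void Return(byte[] buffer)
    {
        // Only accept buffers of the size this pool manages.
        if (buffer.Length == _bufferSize)
            _buffers.Add(buffer);
    }
}
```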

@ayende (Contributor, Author) commented Jul 15, 2015

Yes, I'm aware of all of those, and none of them really work for these cases.

Consider a highly simplified case of needing to read records from a CSV file and do some work on them. Each line we read becomes garbage very quickly. Now, assume that the file is large, processing time is long, and while some data can be discarded immediately (the line we just read), some data is kept for the duration of the entire file run.

Now, assume that you have to process many such files concurrently.

There is no way for us to control the memory usage. I would like to say "this process cannot take more than 1GB", and I would like to actually get an error if we exceed that, because this gives me more predictable behavior.


@ayende (Contributor, Author) commented Jul 15, 2015

And while I can do this in native code, the work I'm doing is primarily managed stuff. The only "resource" that I want to manage is memory itself.


@davidfowl (Contributor) commented Jul 15, 2015

Sounds effectively like Java NIO's ByteBuffer: http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html. We've also been looking into something like this for ASP.NET 5. The current plan is to allocate native memory and memcpy user bytes into it.

@ayende (Contributor, Author) commented Jul 15, 2015

How are you going to make that work with Stream? It accepts byte[], not byte*.

For specific things, we can use direct methods (ReadFile, WriteFile) that can accept it. But for many things, that isn't an option.


@davidfowl (Contributor) commented Jul 15, 2015

Stream.Write will be backed by native memory. So when the user passes the managed byte[], it'll be copied into the native buffer.
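A minimal sketch of what such a native-backed write might look like (`NativeBuffer` is a hypothetical type, not the actual ASP.NET implementation):

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper: a fixed native allocation that managed byte[]
// data is copied into, so the GC never needs to pin the caller's array.
sealed class NativeBuffer : IDisposable
{
    public IntPtr Pointer { get; }
    public int Capacity { get; }
    private int _length;

    public NativeBuffer(int capacity)
    {
        Pointer = Marshal.AllocHGlobal(capacity);
        Capacity = capacity;
    }

    public int Length => _length;

    // The rough equivalent of Stream.Write: memcpy the managed bytes
    // into native memory instead of pinning the source array.
    public void Write(byte[] source, int offset, int count)
    {
        if (_length + count > Capacity)
            throw new InvalidOperationException("buffer full");
        Marshal.Copy(source, offset, Pointer + _length, count);
        _length += count;
    }

    public void Dispose() => Marshal.FreeHGlobal(Pointer);
}
```

The trade-off raised later in this thread applies: every write pays for an extra copy in exchange for never holding a managed buffer pinned.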

@OtherCrashOverride commented Jul 15, 2015

@davidfowl How is that different from MemoryStream? Should we just expose the IntPtr (byte*) of the storage the class uses? Can it all be done with Marshal.AllocHGlobal in managed code?

https://msdn.microsoft.com/en-us/library/system.io.memorystream%28v=vs.110%29.aspx

@OtherCrashOverride commented Jul 15, 2015

Now, assume that the file is large, processing time is long, and while some data can be discarded immediately (the line we just read), some data is kept for the duration of the entire file run.

You may also want to consider MemoryMappedFile:
https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile_methods%28v=vs.110%29.aspx

@ayende (Contributor, Author) commented Jul 15, 2015

Imagine that I'm doing:

new GZipStream(new SslStream(new NetworkStream(socket)), CompressionMode.Decompress).Read(...)

What happens now? Also note that if you have a NativeMemoryStream or something like that, it would still need to allocate large byte arrays for the rest of the system.

We run into this frequently when using web sockets and long-running requests.


@ayende (Contributor, Author) commented Jul 15, 2015

@OtherCrashOverride

There is already UnmanagedMemoryStream (https://msdn.microsoft.com/en-us/library/system.io.unmanagedmemorystream(v=vs.110).aspx).

That doesn't help when your destination is an I/O stream.


@ayende (Contributor, Author) commented Jul 15, 2015

I intentionally gave a simple CSV file example, because it is easy. The real example is getting data from users and indexing it.

But even with a simple CSV file, I cannot allocate a string in the memory-mapped file. So I need to read a bunch of bytes, then create a new string. I have no control over the size, where it is located, etc.


@redknightlois commented Jul 15, 2015

@OtherCrashOverride Having been on both sides of the fence, doing GPU computing and working inside a database engine with @ayende, I can confirm first hand that they are two very different beasts (what works in one won't work in the other).

In GPU land you can be very liberal about pinning memory. With a bus throughput in excess of 5Gb/s, your pinning time will be measured in microseconds. In database and/or web server land that is not true; we are speaking of at least two orders of magnitude difference when you go to the extreme case, as pointed out with the network transfer. Mind you, even going to disk will be measured in high-milliseconds land for big buffers (which I have seen in the wild ;) ).

This is an example of how Gen0 looks when such things happen. Since then we have been able to confirm what was a hypothesis at the time of writing: http://ayende.com/blog/170243/long-running-async-and-memory-fragmentation

This will become even worse with HTTP/2, where connections will be reused and facilities for long-lived connections are going to be common (not a hack like now). I couldn't find it for illustration purposes, but I had seen code in Katana aimed at promoting buffers to Gen2 by repeatedly executing GC.Collect() at startup, to at least force that memory to go up in generations. The problem is that then you have a fixed supply of memory, and if you consume it all, you are done. So you have to get memory in advance that you will either not use, or you won't have enough of if the work pattern changes.

@redknightlois commented Jul 15, 2015

@davidfowl Constantly copying from the managed buffer to the native one, only to then do what the stream has to do, will introduce extra memory-bus pressure. Wouldn't that introduce a perf regression when you are dealing with big streams/source arrays?

@OtherCrashOverride commented Jul 15, 2015

http://ayende.com/blog/170243/long-running-async-and-memory-fragmentation

That link and the comments were very helpful. From what has been stated so far, the issue is that the framework is doing the pinning during the async socket operation, not the user. The solution to this may be to modify the socket API in some way, or to tune its internal operation to prevent it from keeping the memory pinned.

(Is the System.Net.Sockets API even available in CoreCLR at this point? https://github.com/dotnet/corefx-progress/blob/master/src-diff/README.md)

Wouldn't that introduce a perf regression when you are dealing with big streams/source arrays?

I was wondering the same thing; however, the basis for measurement is going to be whether doing multiple copies with short pin times is faster overall than doing no copies with long pin times.

@ayende (Contributor, Author) commented Jul 15, 2015

@OtherCrashOverride

I don't see a way for the framework to avoid pinning. Ideally, you are doing DMA by letting the hardware do the operations on a specific memory location.

Even if it isn't actually DMA, it behaves very much the same in the sense that eventually you are down to some native function that takes a pointer, and you need that to remain fixed while the operation is running.


@OtherCrashOverride commented Jul 15, 2015

I don't see a way for the framework to avoid pinning.

I have some theories that would require testing and benchmarking. Principally, introducing an intermediary native buffer owned by the socket and doing a memcpy to the destination buffer when it fills. No matter how much network I/O you have, there are natural breaks in the data (MTU) and memcpy will be faster than wirespeed in many cases. So there are lots of areas to explore to see what turns out to be performant and what is not. Of course, there actually needs to be a System.Net.Sockets before any experiments can be done.

@ayende (Contributor, Author) commented Jul 15, 2015

That would be extremely costly. And it would only be relevant for sockets; we have a lot of I/O work that might have this issue.


@OtherCrashOverride commented Jul 15, 2015

Another possibility (from the GPU world) is to 'double/triple buffer'. In this scheme, the socket would be filling one array (pinned for the duration of the operation) while a previously filled array (not pinned) is used by the app. Alternating the pinned and non-pinned arrays gives the GC the opportunity to move them. Since an array is a reference type, you are simply changing the reference used, not copying or moving any data.
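A minimal sketch of that double-buffering idea (illustrative names only):

```csharp
using System;

// Double-buffering sketch: while I/O fills one buffer (pinned only for
// that single operation), the application processes the other. Swapping
// each cycle means neither array stays pinned long enough to block the
// GC from relocating it between operations.
class DoubleBuffer
{
    private byte[] _io;   // currently being filled by I/O
    private byte[] _app;  // currently being processed by the app

    public DoubleBuffer(int size)
    {
        _io = new byte[size];
        _app = new byte[size];
    }

    public byte[] Io => _io;
    public byte[] App => _app;

    // Called when an I/O operation completes: the freshly filled buffer
    // becomes the app's buffer, and the old app buffer is reused for I/O.
    public void Swap()
    {
        var tmp = _io;
        _io = _app;
        _app = tmp;
    }
}
```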

@gistofj commented Jul 15, 2015

Another possibility (from the GPU world) is to 'double/triple buffer'.

Multi-buffering is done to avoid the use of multi-processor locks and to assist with latency hiding. The basics are: consume enormous amounts of resources, because the user only cares about speed and fidelity.

I'm not wholly sure how it applies here.

@OtherCrashOverride commented Jul 15, 2015

I'm not wholly sure how it applies here.

Because, as mentioned, it allows the GC to relocate one buffer during the time the other is pinned and waiting to be filled. This solves the long-term pinning issue and applies not only to network sockets but to any type of I/O. Alternatively, the OP can wait for Microsoft to approve and implement the change that allows CoreCLR to allocate managed objects from an arbitrary heap space.

consume enormous amounts of resource

Are people really allocating terabyte buffers for networking?

because the user only cares about speed and fidelity.

I believe that is the reason we are having this discussion to begin with.

@redknightlois commented Jul 15, 2015

@OtherCrashOverride Probably not terabytes, but small buffers add up pretty fast if the API is not implemented properly. Case in point: dotnet/corefx#1991

@ayende (Contributor, Author) commented Jul 15, 2015

@OtherCrashOverride That would only work if the I/O operations are short. In practice, it is perfectly normal for an I/O operation to take many seconds. For example, if we are just listening on a socket.


@ayende (Contributor, Author) commented Jul 15, 2015

@OtherCrashOverride Also note that this isn't just about buffers. OverlappedData is also a common issue here.


@OtherCrashOverride commented Jul 15, 2015

That would only work if the I/O operations are short. In practice, it is perfectly normal for an I/O operation to take many seconds.

The longer the operation takes to complete, the more opportunity the GC has to relocate the other buffer. When the buffers switch places being locked, the GC then has the opportunity to relocate the other buffer too. The result is that over time, both buffers are moved and no longer cause a fragmentation issue.

The suggestions were offered in the hope they would be helpful. If they are not of benefit, you may simply disregard them.

@jkotas (Member) commented Jul 21, 2015

BTW: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/Common/PinnableBufferCache.cs is a helper designed to deal with the buffer-pinning problem. It has methods to explicitly allocate and free buffers. Internally, it manages free buffers in a GC-friendly way to avoid the problems with pinned buffers described above.

The Socket implementation uses it as well.

@denisvlah commented Jan 5, 2018

I didn't find any public announcement regarding Project Snowflake since the white paper was released.
If anyone can point me to a page where I can add a like to move the project forward, I would be very happy, and I would even ask my friends to do the same.

Thanks.

@jkotas (Member) commented Jan 5, 2018

@mjp41 @dimitriv Anything you can share about progress of Project Snowflake?

@mjp41 (Member) commented Jan 8, 2018

@masonwheeler, @denisvlah thanks for the interest in Project Snowflake. We are still working on the project, improving the design (hopefully we will write another paper/blog post soon). Our main focus now is finding real-world .NET workloads that benefit from this design point, but we don't have significant evidence yet.

@ayende (Contributor, Author) commented Jan 8, 2018

@mjp41 If you are looking for something where it would be useful, RavenDB has several such cases.
We index a lot of data, and it would be really useful to have a scope for the index and recycle the entire thing in one shot.

This is a wishlist from 2013: https://ayende.com/blog/161889/my-passover-project-introducing-rattlesnake-clr

@masonwheeler commented Jan 8, 2018

@mjp41

Our main focus now is finding real world .NET workloads that benefit from this design point, but don't have significant evidence yet.

Wanna find plenty of them really quickly? Put Snowflake up in its own repo, with clear notes that it's an alpha release that's not production-ready yet, and let Linus's Law do the hard work for you. You know there are plenty of devs out there who would love to be able to use manual memory management in various places in their CLR projects. You'll get far more feedback (and some better feedback, if you don't mind digging through a bunch of crap) from the community than you ever will from a closed research group.

@mjp41 (Member) commented Jan 9, 2018

@ayende thanks for the link. At least based on our implementation, jemalloc cannot compete with Gen0 collections, so whatever we migrate really needs to be hitting Gen2 to be beneficial. This generally works well for periodically dumped logs and data that has been cached. I imagine there are cases where this occurs in RavenDB.

@masonwheeler we are considering this.

@ayende (Contributor, Author) commented Jan 9, 2018

@mjp41 The mere fact that I can control the lifetime of critical objects is huge for us.

@clrjunkie commented Jan 9, 2018

I think the primary reason one would care about manual memory management in .NET is to avoid high GC pauses in processes that hold large tree structures, which I believe is “the scenario”. I would also consider a trimmed-down C# dialect, “Unmanaged C#” (think C but with C# syntax, having “free” and no libs), with such code packed in a special-purpose assembly and accessed via a much-simplified P/Invoke-style mechanism.

@ayende (Contributor, Author) commented Jan 9, 2018

From my point of view, having a few hotspots (typically very small pieces of the code) where I can manually control things means huge perf wins without having to deal with unmanaged everywhere.

@4creators (Collaborator) commented Jan 9, 2018

Actually, it seems that a lot of media applications would hugely benefit from manual memory management, and in particular from a guarantee of no stop-the-world events during processing on time-critical threads. I have not dug into Project Snowflake deep enough to check whether this is possible; however, if it is achievable, then the whole world of real-time media processing will open up for C#.

@4creators (Collaborator) commented Jan 9, 2018

Ahh, the very same features would be of great value to game developers.

@masonwheeler commented Jan 9, 2018

Agreeing with @ayende and @4creators here. Having specific things that could be isolated from the GC would be a massive advantage for the game engine I'm working on.

@mjp41 (Member) commented Jan 9, 2018

@clrjunkie

I think the primary reason one would care for manual memory management in .NET is to avoid high GC pauses in processes that hold large tree structures which I believe is “the scenario”.

That is precisely the kind of scenario where we see benefit.

@ayende we have very much focused on requiring minimal code changes to use the API: very much "adopt in small places". Most of the time the GC performs really well and massively improves productivity, so we want that to be the default.

@4creators with Project Snowflake, we don't have any mechanism to stop the GC; we use the standard (server/workstation) GC plus our runtime extensions. There is already GC.TryStartNoGCRegion.
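For reference, a sketch of how GC.TryStartNoGCRegion is typically used (the requested budget must fit in the ephemeral segment, and the region can end early if the budget is exceeded):

```csharp
using System;

static class NoGcDemo
{
    // Run a latency-critical action with garbage collection suspended,
    // if the runtime can grant the requested allocation budget.
    public static void RunCritical(Action work)
    {
        if (GC.TryStartNoGCRegion(16 * 1024 * 1024)) // 16 MB allocation budget
        {
            try
            {
                work(); // must allocate less than the budget
            }
            finally
            {
                // Ends the region; throws if a GC already ended it
                // (e.g. because the budget was exceeded).
                GC.EndNoGCRegion();
            }
        }
        else
        {
            work(); // fall back to running with the GC enabled
        }
    }
}
```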

@verelpode commented Aug 10, 2018

@ayende wrote:

Note that what I would really like is to define a custom heap, and just drop the whole thing in one shot.

@masonwheeler wrote:

... or it gets dropped when you call the .Drop() method, in which case you've just introduced the concept of dangling references into what used to be a memory-safe environment.

What if the code runs in a separate AppDomain? However, currently GC is done per-process not per-AppDomain, so this idea would require an option that enables separate GC or heap for an AppDomain. Ideally a lightweight kind of AppDomain that can be frequently created and unloaded/dropped. This would be safe because the objects in one AppDomain cannot contain references to objects in a different AppDomain. Such an AppDomain could also have an option to entirely disable GC in the AppDomain, meaning all objects in the AppDomain remain alive until the AppDomain is unloaded/dropped.

Alternatively, my experimental benchmark in corefxlab/#2417 recorded 3x to 4x faster performance when using references to structs stored in arrays instead of class instances. There I describe an ability to make a field in a normal struct or class that safely points to a struct in an element of an array.
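The structs-in-arrays pattern being referenced can be sketched with C# 7 ref locals (`Record` and `BumpScore` are illustrative names, not the benchmark's actual code):

```csharp
using System;

// One contiguous allocation holds all records; the GC sees a single
// array object instead of N separate class instances.
struct Record
{
    public int Id;
    public double Score;
}

static class RefDemo
{
    public static void BumpScore(Record[] records, int index)
    {
        // 'ref' gives direct access to the element in place:
        // no copy, no boxed or heap-allocated wrapper object.
        ref Record r = ref records[index];
        r.Score += 1.0;
    }
}
```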

See also the message where I describe an idea where a class could have an attribute applied that says that the class should use automatic-reference-counting instead of the normal CLR GC.

In the same message, I also describe an idea for a read-only array of class instances where each element of the array is immediately non-null and cannot be changed to any other reference, and the array is GC'd as a single object not individual class instances.

@svick (Contributor) commented Aug 10, 2018

@verelpode How does that make sense in the context of .NET Core, which, as I understand it, doesn't have AppDomains?

@verelpode commented Aug 14, 2018

@svick -- Good question. I didn't know that (no time to read everything), but here is my suggested solution: slightly change the way we think of this idea. Instead of calling it "AppDomain", let's call it something else, such as "Memory Domain" or "GC Domain". It may make sense to rename the idea to "Memory Domain" because what I described is indeed not the same as the existing AppDomain feature; rather, it has similarities and differences. So let's say that a "Memory Domain" would be similar to an AppDomain except that a "Memory Domain" would have a separate heap/allocator/GC, whereas AppDomains use the per-process GC and LOH. The part where "Memory Domains" and AppDomains share the same idea is that the objects in one domain cannot contain references to objects in another domain, thereby solving the problem that @masonwheeler mentioned.

I like the name "Memory Domain" better than "GC Domain" because GC might be entirely disabled within a "Memory Domain". Maybe GC is always disabled inside a "Memory Domain", or maybe it is optionally disabled. @ayende would like no GC in a "Memory Domain", instead he would like to drop the entire "Memory Domain" when he's finished using it, and I desire this also. A "Memory Domain" itself would probably be garbage-collected via a finalizer outside of the Memory Domain, but no GC inside the domain.

Re multi-threading, ideally a "Memory Domain" would not require use of threads, but would be thread-safe to support the cases where multi-threading is desired.

Does that sound good to you?

@ayende

Contributor Author

commented Aug 14, 2018

Something like that can be pretty nice, yes. Even just being able to have separate GC for parts of the app would be great.
My user-facing code could then use a dedicated thread/heap that isn't going to freeze because of a long GC cycle in the backend code.

@verelpode

commented Aug 14, 2018

@ayende -- I agree, even if GC cannot be disabled in the domain, the separated GC would still be helpful. However, ideally, I'd like to disable GC inside the domain because, for example, when I ran my benchmark over in #2417, I observed that GC collections wasted a lot of processing time because GC collections ran 996 times during the time when I only needed ONE garbage collection to run (at the end). The other 995 collections were unnecessary and wasted processing time, making my program run slower. (And System.GC.TryStartNoGCRegion doesn't work because it is limited to ephemeral segment size, and because we shouldn't stop GC for the entire process including every thread.)
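For comparison, this is roughly how `System.GC.TryStartNoGCRegion` has to be used today; it is process-wide and capped by the ephemeral segment size, which is exactly why it doesn't fit this scenario. A sketch (the 16 MB budget is an arbitrary example):

```csharp
using System;
using System.Runtime;

static class Program
{
    static void Main()
    {
        // Process-wide and budget-limited: the requested size must fit
        // in the ephemeral segment, or TryStartNoGCRegion throws/fails.
        if (GC.TryStartNoGCRegion(16 * 1024 * 1024))
        {
            try
            {
                // Allocate within the pre-reserved budget; no GC runs here.
                var batches = new byte[64][];
                for (int i = 0; i < batches.Length; i++)
                    batches[i] = new byte[1024];
            }
            finally
            {
                // If allocations blew the budget, the runtime already ended
                // the region, and calling EndNoGCRegion would throw.
                if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                    GC.EndNoGCRegion();
            }
        }
        Console.WriteLine("done");
    }
}
```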

Large workloads would benefit from a fine-grained ability to disable (or manually start+stop) GC for all objects in a "domain", and it's simpler and faster to discard an entire domain rather than determine which individual objects inside it can be garbage-collected.

@ayende

Contributor Author

commented Aug 14, 2018

I like this idea much better than dotnet/corefx#31643
It works very cleanly with existing infrastructure and concepts. We don't need to deal with cross-domain references without a proxy, for example.

@verelpode

commented Aug 14, 2018

Can you clarify/elaborate on the proxy topic? How would you like to communicate with (or control) the object(s) in the other "Memory Domain"? With a proxy like in AppDomains or serialization+unserialization or a different way?

@ayende

Contributor Author

commented Aug 14, 2018

@verelpode I would expect to work with them via proxies, like we used to have with AppDomains.
That way, you have explicit separation between which memory resides in which domain. You can also have cross-references between domains, but only through explicit proxies, never by sharing objects directly.

@verelpode

commented Aug 15, 2018

I would love to have the "Memory Domains" feature (including safe proxies similar to AppDomains), but my idea needs input/critique from CLR experts and/or MS engineers. CLR internals are not my area of expertise, so I can't say exactly how it would be implemented internally.

@akutruff

commented Mar 18, 2019

This all comes down to some first-class support for object pooling. The strongest evidence: Roslyn, the C# compiler itself, ended up implementing multiple object pools to achieve its performance goals. See here:

Pools from Roslyn source

ASP.NET Core

Discussion
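The pattern those pools implement can be sketched minimally as rent / reset / return. This is not Roslyn's actual `ObjectPool<T>` (which adds size caps and fast-path slots), just an illustrative reduction:

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;

// Minimal sketch of the rent/reset/return pooling pattern: keep hot,
// short-lived objects alive and reused instead of feeding them to the GC.
sealed class SimplePool<T> where T : class
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();
    private readonly Func<T> _factory;

    public SimplePool(Func<T> factory) => _factory = factory;

    // Reuse a returned instance if one is available, else allocate.
    public T Rent() => _items.TryTake(out var item) ? item : _factory();

    public void Return(T item) => _items.Add(item);
}

static class Program
{
    static readonly SimplePool<StringBuilder> Pool =
        new SimplePool<StringBuilder>(() => new StringBuilder());

    static void Main()
    {
        var sb = Pool.Rent();
        sb.Append("hello");
        Console.WriteLine(sb.ToString());
        sb.Clear();          // resetting before returning is the caller's job
        Pool.Return(sb);
    }
}
```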

In today's world of increasing concurrency, shared objects that are treated as readonly/immutable during parallel processing always need deterministic cleanup when the last task completes.

Arrays/Buffers - We end up pooling these every single time.
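For the buffer case specifically, the BCL's `ArrayPool<T>` is the standard tool; everything beyond buffers still has to be hand-rolled. A minimal sketch of its contract:

```csharp
using System;
using System.Buffers;

static class Program
{
    static void Main()
    {
        // Rent may return a larger array than requested; the caller is
        // responsible for returning it, ideally in a finally block.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            Console.WriteLine(buffer.Length >= 4096); // True
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```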

POCOs - Small message-like objects that hover dangerously in Gen 1 are used over and over again, yet are treated no differently by the GC. (POCOs need to be reference types for polymorphism/pattern matching without boxing when they are put in a queue.) Readonly structs, ref returns, Span, and stackalloc are great steps for processing on the stack, but do not address the inevitable need to call some form of .ReturnToPool(). Things will need to get buffered and will end up stored off the stack; it's unavoidable in queued scenarios. Value types are not your friends here either, as you're going to be boxing and unboxing like crazy. The actor model is alive and well, and happening more and more with pattern matching and increased parallelism.

There needs to be some way to achieve pause-free memory control in our ecosystem that involves first-class support for reference counting as well as custom allocators. It's not just one feature or keyword that will solve this. Further, let's tell the GC to treat certain objects as memory-critical, meaning they must be treated differently. This includes banning certain object instances from being promoted to Gen 1/Gen 2/the LOH during garbage collection. (We should be able to make this a type-based policy, but let us make it apply to only a subset of instances as well.) Even defining our own "phase" with a GC-as-a-service paradigm would be lovely. In other words, let's write code to help the GC, not replace or fight it. This is not a place for being declarative.

Destructor-like behavior - (not finalizers, as we need to access managed memory) We need tight, deterministic cleanup if we're working with pools, and we need to be able to guarantee when these destructors run.

GC Policy - Set a GC config on an object that says: "Do not promote this object to Gen 2, ever. Run a delegate/destructor as either a callback or a Task/ValueTask on the ThreadPool, or on a thread reserved by the application. Do not pause the world for this object; it's the application's job to return it to a pool. The pool is marked as not to be compacted, and is not to be put in the LOH." This will not be solved by new keywords similar to using() blocks; it likely won't be able to be declarative like other solutions.

MemoryPool<T> - Doesn't get the job done, unfortunately. IMemoryOwner<T> is a reference type. The owner objects themselves need to be pooled if we have frequent acquire and release of our objects, and we have to roll our own reference counting on top of it. Looking at the implementations, the best shot is a heuristic to avoid CPU-cache thrashing with thread-local storage, which ends up in local cache starvation in pipelined producer/consumer scenarios. We can try to wrap the IMemoryOwner in a value type/struct to avoid further allocations, yet we end up treating that as a mutable handle. (Mutable value types are evil, yet looking at the pooling implementations above, you see handles that are stateful structs with comments warning you.)
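To illustrate the complaint, the usage pattern below shows that even with pooled memory, the `IMemoryOwner<T>` handle itself is a heap allocation with single-owner `Dispose` semantics and no built-in sharing:

```csharp
using System;
using System.Buffers;

static class Program
{
    static void Main()
    {
        // The owner ties the buffer's lifetime to Dispose(), but the owner
        // itself is a heap-allocated reference type -- the very allocation
        // the comment above objects to.
        using (IMemoryOwner<byte> owner = MemoryPool<byte>.Shared.Rent(4096))
        {
            Memory<byte> memory = owner.Memory;
            Console.WriteLine(memory.Length >= 4096); // True
        }   // Dispose returns the buffer; there is no reference counting.
    }
}
```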

When the ever-looming day comes that we hit a pause from Gen 2, there is absolutely no good solution given the current runtime or language support. WeakReference does not get it done. ConditionalWeakTable still needs our own form of dirty GC or pooling of the WeakReferences themselves, because as you add WeakReferences, you end up with a ton of them in the finalizer queue.

Snowflake - This is a great step in the right direction. Obviously smart people are making strides, and there's greatness there. But there is one absolutely huge issue that kicks the can too far down the road:

Finally, we have also built shareable reference counted objects, RefCount, but we are considering API extensions in this space as important future work.

Reference counting needs to be solved at the same time, to support queuing scenarios and immediate release of scarce buffers. What we have right now with pooled objects on the stack is at least manageable; it all really breaks when we go to shared pointers. Having an immutable, pooled object in both a logging queue and a network queue immediately sends us back to square one. There is an argument to be made for incremental improvement and doing deterministic cleanup later, but for such a fundamental change to memory management, leaving clean reference counting as a TODO has not historically worked out well.
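The kind of hand-rolled reference counting this forces today can be sketched as follows. `RefCountedBuffer` is an illustrative type, not a library API; a real version would also guard against use-after-release:

```csharp
using System;
using System.Buffers;
using System.Threading;

// Sketch of manual reference counting over a pooled buffer: the buffer
// shared by several consumers (logging queue, network queue) goes back
// to the pool only when the last consumer releases it.
sealed class RefCountedBuffer
{
    private readonly byte[] _buffer;
    private int _refCount;

    public RefCountedBuffer(int size)
    {
        _buffer = ArrayPool<byte>.Shared.Rent(size);
        _refCount = 1;
    }

    public byte[] Buffer => _buffer;

    public void AddRef() => Interlocked.Increment(ref _refCount);

    public void Release()
    {
        // Return to the pool exactly once, when the count hits zero.
        if (Interlocked.Decrement(ref _refCount) == 0)
            ArrayPool<byte>.Shared.Return(_buffer);
    }
}

static class Program
{
    static void Main()
    {
        var shared = new RefCountedBuffer(1024);
        shared.AddRef();      // hand a second reference to another queue
        shared.Release();     // first consumer done; buffer still alive
        shared.Release();     // last consumer done; buffer returns to pool
        Console.WriteLine("released");
    }
}
```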

For writing pools and factories, we really need support for treating constructors as general delegates. We need and use this pattern every day: list.Select(x => new Foo(x)), Factory.Create(x => new Foo(x)); or we learn the hard way that the new() generic constraint uses Activator.CreateInstance and can't take any constructor arguments. I wish I could write Factory.Create(Foo.constructor) and have the constructor converted to an open delegate. Most importantly, you end up having to give pooled instances an Initialize(x, y) method for when they are recycled; otherwise they have no way to be stateful. Let me call a constructor on an existing object as many times as I like, in the same memory location, without invalidating references to that object (foo.constructor(x)). Last I checked, we could hack this through IL if we wanted to. (The memory layout is deterministic after all, right?)
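What is possible today is compiling a constructor into a reusable delegate once via expression trees, avoiding both `Activator.CreateInstance` and the parameterless `new()` constraint; re-running a constructor on an existing instance has no supported equivalent. `Foo` here is a hypothetical example type:

```csharp
using System;
using System.Linq.Expressions;

sealed class Foo
{
    public int Value { get; }
    public Foo(int value) => Value = value;
}

static class Program
{
    static void Main()
    {
        // Build Func<int, Foo> equivalent to x => new Foo(x), compiled
        // once and reusable by a factory or pool with no reflection
        // cost per call.
        var arg = Expression.Parameter(typeof(int), "x");
        var ctor = typeof(Foo).GetConstructor(new[] { typeof(int) });
        Func<int, Foo> create =
            Expression.Lambda<Func<int, Foo>>(Expression.New(ctor, arg), arg)
                      .Compile();

        Console.WriteLine(create(42).Value); // 42
    }
}
```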

Lastly, over almost 16 years of working in .NET, every single project has needed an object pool at some point. It's no longer premature optimization, but an inevitability. In financial software, you end up with a gazillion tiny objects for orders and quotes that love ending up in Gen 2. For video, audio, and big unmanaged resources, you will end up pooling, and it's going to be after you've written a lot of great code, resulting in value types turning into classes and making things worse. (But hey, it's not a buffer!) For gaming, you'd better hope your unpredictable GC pause finishes while also leaving time for your physics processing. (You're just going to drop frames, because you only have 15 ms that you're already squeezing as much work into as you can.)

For C# language feature discussions - "The runtime doesn't support IL for ___" seems to come up constantly in this area. It's why I wrote it all here; it's too big not to bring it all together, because that's how we write programs: runtime and language.

I can't express enough how much fixing this area will benefit the community. It's been my priority 0 since 2008.

Edit: I tried opening a new issue specifically focused on object pooling and it got closed. I'll leave this here, but that doesn't bode well for this ever being looked at holistically in a public fashion.

@Peperud

commented Mar 19, 2019

This all comes down to, some first class support of object pooling. The strongest evidence: Roslyn, the C# compiler itself, had to end up implementing multiple object pools to achieve performance goals....

Amen!

@jkotas jkotas changed the title Custom allocators - (size, disposal, etc). Custom allocators - (size, disposal, pools, etc). Mar 19, 2019
