
Real time 'GC' #39

Closed · schets opened this issue Jan 29, 2016 · 3 comments

schets (Member) commented Jan 29, 2016

It would be pretty neat if users could specify strict limits on GC - I have an exploratory implementation in this branch that lets users disable/enable the GC in a given scope, and I think more could be added, mainly two things (a rough API sketch follows the list):

  • Users can specify the maximum number of items to collect within a limited scope
  • Users can specify the maximum time that can be spent collecting in a given scope
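
To make this concrete, here's a rough sketch of the kind of scoped API I'm imagining - `GcLimits` and `with_gc_limits` are made-up names for illustration, not anything that exists in crossbeam today:

```rust
// Hypothetical sketch only: nothing here exists in crossbeam yet.
use std::time::Duration;

/// Limits applied to collection work inside a scope.
pub struct GcLimits {
    /// Maximum number of garbage items to reclaim inside the scope.
    pub max_items: Option<usize>,
    /// Maximum wall-clock time allowed for collection inside the scope.
    pub max_time: Option<Duration>,
}

/// Run `f` with collection constrained for the current thread.
pub fn with_gc_limits<F, R>(limits: GcLimits, f: F) -> R
where
    F: FnOnce() -> R,
{
    // A real implementation would stash `limits` in the thread-local
    // participant and consult them before doing any collection work;
    // this sketch just runs the closure.
    let _ = &limits;
    f()
}

// Usage: allow at most 64 reclaimed items and 50µs of collection here.
fn latency_sensitive_work() {
    let limits = GcLimits {
        max_items: Some(64),
        max_time: Some(Duration::from_micros(50)),
    };
    with_gc_limits(limits, || {
        // operate on crossbeam data structures without long GC pauses
    });
}
```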

Since collections in one thread don't block collections in another, there are also some really cool things that could be done here in combination with the above. An idea I had is a system where GC work is offloaded to separate worker threads, or at least non-real-time threads - this would let crossbeam behave like a pauseless GC that quickly reclaims memory and doesn't impose big throughput penalties on the real-time threads. I'm sure there are other things that could be done.
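
As a very rough illustration of the offloading idea (this is not how crossbeam's internals work; `GarbageBag` and the channel are just stand-ins), retired objects could be handed to a non-real-time worker that pays the deallocation cost:

```rust
// Sketch only: writer threads hand retired objects to a channel instead of
// freeing them, and a dedicated non-real-time worker does the deallocation.
use std::any::Any;
use std::sync::mpsc;
use std::thread;

// Stand-in for a bag of retired, type-erased nodes.
type GarbageBag = Vec<Box<dyn Any + Send>>;

fn spawn_gc_worker() -> mpsc::Sender<GarbageBag> {
    let (tx, rx) = mpsc::channel::<GarbageBag>();
    thread::spawn(move || {
        // Dropping the bag frees the memory; only this thread touches the
        // allocator, so the sending threads never pause for reclamation.
        for bag in rx {
            drop(bag);
        }
    });
    tx
}

fn main() {
    let gc = spawn_gc_worker();
    // A writer thread would retire nodes like this instead of dropping them:
    let bag: GarbageBag = vec![
        Box::new(42u64) as Box<dyn Any + Send>,
        Box::new(String::from("retired node")) as Box<dyn Any + Send>,
    ];
    gc.send(bag).expect("GC worker has exited");
}
```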

I'm not sure what a good, or even idiomatic, Rust API for something like this looks like, and I doubt what I tried is it. Maybe the scope API for scoped threads could also handle other scope-based attributes of crossbeam, like real-time regions or limiting GC in a given scope.

Of course, crossbeam and the associated data structures probably won't be suitable for hard real-time work, simply because jemalloc isn't hard real-time afaik. But it would be nice to get something close, at the very least something effective for soft real-time and latency-sensitive programs.

aturon (Collaborator) commented Jan 30, 2016

FWIW, I'd love to expose functionality along these lines, as well as instrumentation for tracking GC, etc. Crossbeam is still quite experimental, so I'm open to landing early-stage APIs (perhaps under a feature flag) to gain experience.

The current heuristics around when to do a GC and so on are pretty arbitrary, though I did some basic testing with benchmarks to find the smallest local threshold that didn't impose a performance penalty. I suspect there's a lot more room for experimentation there, as well.

I've never dug deep into GC/allocator customization myself, so maybe a good starting place is just to collect some of the useful ideas that have cropped up elsewhere, and see what makes sense to offer in crossbeam?

schets (Member Author) commented Feb 1, 2016

Now that I think about it more, I really don't like the API I put in that PR. Adding new features would probably require more, or at least more complicated, macros, which are kinda opaque and weird compared to normal code. I think something more along the lines of the scope API would be better and safer while avoiding macro shenanigans.

I think the smallest thresholds possible without negatively affecting performance are a good default - being able to get simple concurrent memory management while avoiding GC-like latency spikes is a huge advantage. Also, if these thresholds are adjustable, then users can change them if the default actually hurts their use case.
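
For what it's worth, the knob itself could be as simple as a process-wide atomic that the collection path consults. The name, default value, and API below are all placeholders, not something crossbeam exposes:

```rust
// Hypothetical tuning knob: collect once a participant's local bag holds
// this many retired items. Everything here is a placeholder name/value.
use std::sync::atomic::{AtomicUsize, Ordering};

static LOCAL_GC_THRESHOLD: AtomicUsize = AtomicUsize::new(32);

/// Let users raise or lower the threshold if the default hurts their workload.
pub fn set_local_gc_threshold(items: usize) {
    LOCAL_GC_THRESHOLD.store(items, Ordering::Relaxed);
}

/// Read by the collection path before deciding whether to run a GC pass.
pub fn local_gc_threshold() -> usize {
    LOCAL_GC_THRESHOLD.load(Ordering::Relaxed)
}
```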

I've looked at some literature (JVM options, RTJS, the Azul JVM, library/software sources) and have some ideas for what could go in (not that all of this needs to) and what the upsides/downsides of such features are. Below, I use application threads to mean threads that are doing useful work for the program and operating on data structures with crossbeam, real-time application threads to mean application threads with strict latency requirements, and GC threads to mean threads that purely do work for the garbage collector.

In a given scope, one might be able to control:

  • GC prevention: prevent (and re-enable) the GC. This can take many forms - one could simply skip all GC-related activity, or migrate garbage to global lists. There would need to be an option to force GC disabling (but not force enabling!) so that a poorly behaved third-party library couldn't re-enable it and cause big latency spikes.
  • Dedicated GC threads: Crossbeam would create GC threads which take work directly from application threads and the global bag. This would lower overall throughput due to worse cache locality and hitting the multithreaded part of the allocator hard, but would allow real-time application threads to never have to run the GC. While needing more than one GC thread seems extreme, in this day and age of 40+ core x86 servers and 100+ core Arm/PowerPC servers, it's entirely plausible.
  • GC limits: Limit the amount of time spent in the garbage collector, or the number of items collected. Timing restrictions, as opposed to completely disabling the GC, would allow some of the work to be distributed into application threads as long as it didn't break time/usage limits.
  • Skipping the allocator: For a limited subset of cases, it might make sense to allow the GC to send freed memory chunks directly back to certain writer threads, skipping the allocator and getting better lock-freedom properties (see the sketch after this list). There are ways of doing this with lock-free freelists/queues that are wait-free for consumers and involve few to zero atomics. There could also be GC threads dedicated to this.
  • Multiple epochs/GCs: There could be separate epochs and global GCs which a data structure may register with (as opposed to the default one). This could be useful for separating real-time threads from the rest of the application, i.e. from threads that may block epoch advancement. If there's a sort of freelist scheme in place, this may be really useful for ensuring that a set of real-time threads has enough GC worker time to keep the freelists populated.
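
To illustrate the allocator-skipping bullet a bit more (all names invented, and a real version would use a lock-free queue rather than std's mpsc channel): the collector pushes recycled nodes onto a per-writer queue, and the writer drains it before falling back to the allocator.

```rust
// Sketch of the "skip the allocator" idea: the GC sends reusable node
// allocations back to a specific writer thread, which checks its local
// freelist before allocating. std's mpsc channel stands in for a proper
// lock-free/wait-free queue just to keep the sketch self-contained.
use std::collections::VecDeque;
use std::sync::mpsc;

struct Node<T> {
    value: T,
}

struct NodeRecycler<T> {
    incoming: mpsc::Receiver<Box<Node<T>>>, // filled by the GC thread
    local: VecDeque<Box<Node<T>>>,          // drained with no synchronization
}

impl<T> NodeRecycler<T> {
    /// Reuse a recycled node if one is available, otherwise hit the allocator.
    fn alloc(&mut self, value: T) -> Box<Node<T>> {
        // Pull in anything the collector sent back since the last call.
        while let Ok(node) = self.incoming.try_recv() {
            self.local.push_back(node);
        }
        match self.local.pop_front() {
            Some(mut node) => {
                node.value = value; // overwrites (and drops) the stale value
                node
            }
            None => Box::new(Node { value }),
        }
    }
}
```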

Also, in a given scope, users could optionally collect stats (global and local) on the following (a rough sketch of a stats record follows the list):

  • local GC frequency
  • number of GC calls
  • statistics on the number of operations per call
  • total time spent in GC
  • portion of Crossbeam time (time between participant enter/exit calls) spent in GC
  • latency statistics
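
A rough sketch of what such a per-scope stats record could look like (every name here is invented for illustration):

```rust
use std::time::Duration;

/// Hypothetical per-scope GC statistics matching the bullets above.
#[derive(Default)]
pub struct GcStats {
    pub gc_calls: u64,               // number of GC calls
    pub items_collected: u64,        // total items reclaimed across calls
    pub total_gc_time: Duration,     // total time spent in GC
    pub total_active_time: Duration, // time between participant enter/exit
    pub max_gc_pause: Duration,      // worst single-collection latency
}

impl GcStats {
    /// Mean number of items reclaimed per GC call.
    pub fn items_per_call(&self) -> f64 {
        if self.gc_calls == 0 {
            0.0
        } else {
            self.items_collected as f64 / self.gc_calls as f64
        }
    }

    /// Portion of crossbeam time (participant enter to exit) spent in GC.
    pub fn gc_fraction(&self) -> f64 {
        if self.total_active_time.as_nanos() == 0 {
            0.0
        } else {
            self.total_gc_time.as_nanos() as f64 / self.total_active_time.as_nanos() as f64
        }
    }
}
```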

And if we really want to get fancy, we could enable some forms of logging, although that seems a bit out of scope for this project.

This is more of a brain dump than a claim that Crossbeam needs all of these features, and some of them are fairly involved. Still, I have a lot of reason to believe they'd be very attractive to people who want the benefits of a GC when writing high-performance multicore data structures, but don't want to shell out tremendous amounts of money for specialized JVMs and highly specialized hardware (plus the questionable throughput of most real-time JVMs) and then still have to bend over backwards to get the GC to cooperate.

schets (Member Author) commented Feb 9, 2016

I'm going to break this out into a few separate issues, since it's kinda hard to meaningfully discuss such a giant blob of different ideas.
