Improving the garbage collector #134
Ignore that first comment, I got accounts mixed up. I've deleted it from the thread. Awesome work! Crossbeam has been in need of a lot of this for a while. I think many, if not all, of these will eventually be mergeable into crossbeam (I have quite a few old branches with the incremental …). I'll look into the code more this weekend, but I have questions and comments upfront:
I think PRs for some of the more direct changes would be gladly accepted by the team, and we should get going on some of the more involved stuff, especially performance items. |
@schets I believe you wanted to quote parts of my comment, but the quotes are not visible :) |
That I did |
@schets Just to elaborate a bit on your comments:
```rust
#[bench]
fn pin_coco(b: &mut Bencher) {
    b.iter(|| coco::epoch::pin(|_| ()))
}

#[bench]
fn pin_crossbeam(b: &mut Bencher) {
    b.iter(|| crossbeam::mem::epoch::pin())
}
```
Thanks for taking a look! :) Criticism is welcome and I'll be happy to answer any further questions. |
On the second bullet point, I made a PR to coco: https://github.com/stjepang/coco/pull/1 |
Can you take another look at the first comment I posted? GitHub formatting completely changed the numbers I had written, and I just fixed it. For 1: how does performance compare when actually using the data structures? On Intel, there are a lot of strange interactions with the memory subsystem that don't manifest without multiple threads, and running pin in a tight loop will miss all of them. I have some specific ideas but haven't had time to test them yet. For 4: I remember having some code that advanced the epoch after some number of bytes were waiting to be freed, and code that aggressively pushed large objects into the global queue. This doesn't solve your problem, though, of a thread avoiding crossbeam and never advancing the epoch. Also, the global queue mechanics were quite different for this. At a higher level, it looks like coco focuses on aggressively moving garbage to the global queue instead of each thread freeing the garbage it creates, and this seems to underlie a lot of the major design decisions. If so, wouldn't that lead to performance issues in the allocator? |
I benchmarked skip list iteration where each step pins the current thread and increments/decrements a reference counter. Performance is quite good with this mechanism, but this was benchmarked with a single thread only. I'll try again with multiple threads...
My strategy is to flush large objects so that they immediately become globally available for garbage collection. Any thread is able to collect such garbage. Moreover, every thread collects some garbage just after it flushes a bag, and also once every 128 pinnings. If all threads become inactive with respect to crossbeam/coco and stop pinning, the garbage is left in the queue forever, yeah. I don't see how to solve this problem without spinning up a background thread that constantly keeps vacuuming garbage. But I don't think this is a big problem either. :) This is what the chronology of a deque might look like:
There might be hiccups if another thread gets pinned and holds up epoch advancement, though.
Well, it depends on what you mean by "aggressively", but yes. Whenever a thread-local bag becomes full, it is pushed into the global queue. This doesn't happen often enough to cause too much contention, so it's okay. I benchmarked stacks and queues backed by crossbeam and coco, and the result was that both GCs have very similar overhead in practice. I didn't test the performance of garbage collection alone, though... Does this answer your questions? |
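For illustration, the flushing policy described above might look roughly like this. This is a toy sketch, not coco's real API: all names (`LocalHandle`, `defer`, `flush`) and constants (`BAG_CAPACITY`, `PINNINGS_BETWEEN_COLLECT`) are invented for the example, and epoch tracking is omitted entirely.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

const BAG_CAPACITY: usize = 64; // assumed; full bags get flushed
const PINNINGS_BETWEEN_COLLECT: usize = 128; // "once every 128 pinnings"

struct Garbage(usize); // stand-in for a (pointer, destructor) pair

struct Global {
    // Any thread may pop bags from here and free their contents.
    queue: Mutex<VecDeque<Vec<Garbage>>>,
}

struct LocalHandle<'a> {
    global: &'a Global,
    bag: Vec<Garbage>,
    pin_count: usize,
}

impl<'a> LocalHandle<'a> {
    // Stash a retired object; flush the bag once it fills up.
    fn defer(&mut self, g: Garbage) {
        self.bag.push(g);
        if self.bag.len() >= BAG_CAPACITY {
            self.flush();
        }
    }

    // Push the local bag into the global queue so any thread can
    // collect it, then collect a little right after flushing.
    fn flush(&mut self) {
        let bag = std::mem::take(&mut self.bag);
        self.global.queue.lock().unwrap().push_back(bag);
        self.collect(1);
    }

    // Every 128th pinning also collects some garbage.
    fn pin(&mut self) {
        self.pin_count += 1;
        if self.pin_count % PINNINGS_BETWEEN_COLLECT == 0 {
            self.collect(1);
        }
    }

    // Pop and drop up to `steps` bags from the global queue.
    fn collect(&mut self, steps: usize) {
        let mut q = self.global.queue.lock().unwrap();
        for _ in 0..steps {
            if q.pop_front().is_none() {
                break;
            }
            // dropping the bag frees its garbage
        }
    }
}
```

This also makes the failure mode visible: if no thread ever calls `pin` or `flush` again, whatever sits in `queue` stays there forever.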
For your last two questions - I'm worried how well this will work with many cores. Epoch advancement still requires iterating over all active threads in the set, but I'm wary of anything else that introduces global sharing/contention points. What you say about contention on the global queue may be true for 4 cores, but it may be much less true on a 48- or 64-core machine. I can spin up a VM on a 96-core ARM machine really cheaply, and Power8 machines can have almost 200 hardware threads. If all of those threads are generating a lot of garbage, there might be quite a bit of contention on the garbage queue. Deallocation only from the global queue gives me some similar worries as well. You have a data dependency plus an unpredictable control dependency on the (pointer, free_fnc) pair loaded from the bag, and freeing pointers from another core means the allocator will soon be returning pointers that are taking up space in a remote cache and aren't in the local cache - basically a type of false sharing. There might be some allocator-internal sharing problems as well. All of the above are problems that won't really show up in microbenchmarks because they only matter at whole-application scale, and thread-local bags go a long way towards solving them. Having a more real-world benchmark would help a lot with this - possibly some fake web-service-like thing that shares data in a global crossbeam map, communicates with crossbeam queues, etc., but does enough work to give a meaningful idea of how crossbeam interacts with normal running programs. I have some ideas for this. Having said that, I really like the idea of being able to push garbage directly into the global queue before the epoch is up and being able to incrementally work through the global queue. That's a great way of dealing with large allocations. |
That would be great! Do you have links to websites that provide such machines?
Hmm, it's probably possible to mitigate the contention in several ways (multiple queues, maybe even a work-stealing mechanism, etc.). The library is at the moment not too optimized for massive numbers of cores, but this is something I'd like to do once I get access to an adequate machine.
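One of those mitigations - multiple queues - could be sketched as a sharded garbage queue, where each thread pushes to a shard picked by its ID and collectors steal from other shards when their own is empty. This is purely illustrative (a `usize` stands in for a bag, and `SHARDS` is an assumed constant), not anything coco implements:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

const SHARDS: usize = 8; // assumed shard count

struct ShardedQueue {
    shards: Vec<Mutex<VecDeque<usize>>>,
}

impl ShardedQueue {
    fn new() -> Self {
        ShardedQueue {
            shards: (0..SHARDS).map(|_| Mutex::new(VecDeque::new())).collect(),
        }
    }

    // Threads on different shards never touch the same lock on push.
    fn push(&self, thread_id: usize, bag: usize) {
        self.shards[thread_id % SHARDS].lock().unwrap().push_back(bag);
    }

    // Collect from our own shard first; steal from others if it's empty,
    // so garbage can't get stranded in an idle shard.
    fn pop(&self, thread_id: usize) -> Option<usize> {
        for i in 0..SHARDS {
            let shard = &self.shards[(thread_id + i) % SHARDS];
            if let Some(bag) = shard.lock().unwrap().pop_front() {
                return Some(bag);
            }
        }
        None
    }
}
```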
Yeah, this is a bit worrying. Might be a very difficult problem to solve, but I'll think about it...
Awesome, I'd love to get help with benchmarking. Let me know what ideas you come up with. |
packet.net has the 96 core arm machine |
My thoughts were that threads would default to thread-local garbage bags except for exceptional circumstances like:
I think this would get the benefit of both cases. It prevents an inactive thread from holding on to too much garbage but in normal cases keeps thread-local bags.
I think using thread-local bags when appropriate would help a lot here.
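The hybrid rule - thread-local by default, global only in exceptional circumstances - could be expressed as a small decision function. All names and thresholds here are invented for illustration; they are not coco's or crossbeam's actual constants:

```rust
// Where should a newly retired object go?
enum Destination {
    LocalBag,    // the default: keep garbage with the thread that made it
    GlobalQueue, // exceptional: thread exiting, huge object, or overwhelmed
}

struct ThreadState {
    local_garbage_bytes: usize, // garbage currently held in local bags
    exiting: bool,              // thread is shutting down
}

const LOCAL_LIMIT_BYTES: usize = 1 << 20;   // assumed 1 MiB local cap
const LARGE_OBJECT_BYTES: usize = 64 << 10; // assumed "large" threshold

fn destination(state: &ThreadState, object_bytes: usize) -> Destination {
    if state.exiting
        || object_bytes >= LARGE_OBJECT_BYTES
        || state.local_garbage_bytes + object_bytes > LOCAL_LIMIT_BYTES
    {
        Destination::GlobalQueue
    } else {
        Destination::LocalBag
    }
}
```

The point of the rule is exactly what's described above: an inactive or exiting thread can't sit on unbounded garbage, but the common case stays local.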
I was going to write an actor-based fake market-data system, since it's the perfect storm for testing these cases: lots of message passing, each actor can maintain state in some global map, and each actor has a lot of local state where performance depends heavily on the cache (maintaining and reading a poorly written order book). |
Sorry for the delay, I noticed this discussion just now (from the mention in the dropable PR). I'm going to do a deeper study later, but one thing to consider is that the memory reclamation system should try to be friendly with the allocator's thread cache. Otherwise it's just going to push contention there. |
In jemalloc, the fast-path allocation for reasonably small objects looks something like:

```c
/* malloc fast path: */
thread_cache *tc = get_thread_cache(alloc_size);
return tc->cache_stack.pop();
```

And free:

```c
/* prior data dependency on value of free_ptr */
thread_cache *tc = get_thread_cache(free_size);
tc->cache_stack.push(free_ptr);
/* GC slow path occasionally */
```

So yeah, freeing pointers which aren't in the cache and whose values aren't in the cache will cause contention in the allocator and on allocated values. The stack operations don't write to the freed pointer, but the freeing process probably will soon, by virtue of reallocating it. |
Threads should, whenever possible, clear their own garbage. I haven't really read into the code yet to see how aggressively coco pushes stuff into the global garbage. Most memory allocators benefit from some sort of allocation/thread affinity. That should probably help reduce contention in any sort of global garbage pool anyway. |
I'm fairly certain all collection happens from the global queue in coco, and bags are put in the global queue based on size and epoch advancement. |
One of the biggest problems at the moment is that garbage objects are not segregated by which thread allocated those objects. The GC is totally oblivious to that. Perhaps all allocated objects should be tagged with current thread ID, and when they become garbage they should be promoted into that thread's local garbage bag? Does that sound reasonable? Can we do better?
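That thread-ID routing idea could be sketched like this. It is only an illustration of the proposal, not existing coco code; `Tagged`, `route`, and the `usize` stand-ins for thread IDs and pointers are all invented here:

```rust
use std::collections::HashMap;

type ThreadId = usize;

// Each retired object carries the ID of the thread that allocated it.
struct Tagged {
    owner: ThreadId,
    payload: usize, // stand-in for the retired pointer
}

// Sort a batch of global garbage back into per-owner bags, so each
// object is freed by the thread whose allocator cache most likely
// still holds it.
fn route(garbage: Vec<Tagged>) -> HashMap<ThreadId, Vec<usize>> {
    let mut bags: HashMap<ThreadId, Vec<usize>> = HashMap::new();
    for g in garbage {
        bags.entry(g.owner).or_default().push(g.payload);
    }
    bags
}
```

Whether the allocating thread is really the right destination is exactly the open question discussed below - the last reader of the object is another candidate.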
Correct. Bags are put into the global queue as soon as they get full, or when they get explicitly flushed. @schets The actor-based market data system sounds great. It's going to be very helpful if you build that! Btw, one thing I really wish for is a real-world use case for a concurrent ordered map. It'd be interesting to put such data structures to a real test, for sure. |
It's a hard question whether you care about the allocating thread more than the deallocating thread. I believe jemalloc will eventually return pointers to the allocating thread, but only after internal GC cycles. It also depends on the allocation pattern: the items may have been alive for so long that the cache lines in question no longer have any association with the original thread. I expect that freeing garbage in the thread which last read it will provide the most benefit, since the freeing thread has almost certainly read the pointer recently.
I wouldn't say this will use an ordered map in a 'real' way, but it could certainly be made to use it. |
Can we keep (all/some) bags in the current thread until thread exit, while also eventually trying to collect the global bags? Similar to the current crossbeam implementation. |
I'm working on exactly that right now |
Can you elaborate on why you want to do that? The idea is the following: if the thread-local bag gets full, that means the current thread is producing tons of garbage and needs help from other threads. The bag capacity should be configured so that this happens rarely enough that contention isn't a problem. Or maybe you just insist on some limited number of bags staying in the thread-local staging area? |
This. When a thread is overwhelmed with garbage it can push into the global queue, but otherwise it keeps things local and benefits from doing so |
The point is avoiding contention in the global bag list and potentially in the allocator (by trying to be sympathetic to its thread cache). There could be L1/L2/L3 cache benefits from doing so as well. Reading the code now, I'm concerned that if the average number of bags per thread per epoch exceeds COLLECT_STEPS, collection might never catch up. Does that make sense, or am I missing something? |
If you enforce that a thread must do more incremental deallocations than allocations, eventually it will catch up, I believe. @stjepang I benchmarked the x86 cmpxchg change you made on current master, and it is faster in some cases and within benchmark error for a few. |
Yeah, this is something that can certainly be tweaked.
Correct. But this just won't happen in any realistic scenario. I intentionally chose a reasonably large
Great! It's good that we can replicate benchmark results. |
I'll leave the optimization things for another discussion.
I assume this is to discourage the user from holding up the epoch, but on the other hand it's less ergonomic, and if the user really needs to hold the pin they'll be forced to use the same closure pattern downstream. |
You mean, we must put a bound on the amount of accumulated garbage in the GC? If so, that's going to be tricky without blocking threads too much and killing the throughput of concurrent collections. Honestly, I'm not aware of any good solutions to this problem. Hazard pointers are a possibility, but they're terribly slow. Any ideas?
I haven't thought this through, but... two things:
To elaborate on the second point, consider the following:

```rust
thread_local! {
    static GUARD: RefCell<Option<Guard>> = RefCell::new(None);
}
```

Then suppose that someone stores a guard there. When does that guard get dropped? Well, it's possible that it gets dropped on thread exit, while all thread-local storage is being destructed - at which point the thread's own epoch entry may already have been torn down. So, in order to make guards sound from the safety point of view, when dropped they must also check whether the current thread is still registered with the epoch system. But I agree that closures are less ergonomic. |
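A minimal sketch of that drop-time check, assuming invented names throughout (`ENTRY_ALIVE` stands in for the thread's epoch entry; this is not coco's actual `Guard`):

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for this thread's epoch entry; destroyed on thread exit.
    static ENTRY_ALIVE: Cell<bool> = Cell::new(true);
    // Counts successful unpins, just to make the effect observable.
    static UNPINS: Cell<usize> = Cell::new(0);
}

struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        // `LocalKey::try_with` fails once this thread's TLS slot has been
        // destroyed, so the guard must not assume the entry still exists.
        let alive = ENTRY_ALIVE.try_with(|e| e.get()).unwrap_or(false);
        if alive {
            // Normal unpin path: decrement the pin count, maybe collect.
            UNPINS.with(|u| u.set(u.get() + 1));
        }
        // Otherwise the thread is already unregistered: skip the unpin.
    }
}
```

The closure-based `pin(|scope| …)` API sidesteps all of this because the guard can never outlive the pinned scope.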
Maybe I'm being paranoid :) Regarding the Pin interface: I'm fine with the closure interface, it's a trade-off I guess, ergonomics for "safety". -- |
Does anyone have experience with the `sys_membarrier` syscall? I've just discovered that it might eliminate costly memory fences on the hot path. Interesting facts about it:
I'm curious whether this syscall could be a viable synchronization method for crossbeam. Before I dive in and try to integrate it into the new GC, I'd like to check whether there are any caveats to watch out for first. Pinging @joshtriplett because he's probably knowledgeable about all this. |
@stjepang As long as you can cope with running on a system without it available, then using it when available helps hugely. One caveat: when you can successfully use it, you really want readers compiled to expect it so that they don't need their own barriers; you don't want a runtime conditional on that. So you may have to have a separate compile-time-specialized version of some reader routines to get maximum performance in that case. (On writers, a runtime conditional won't add significant overhead, but as long as you have the infrastructure...) Also, as of commit 907565337ebf998a68cb5c5b2174ce5e5da065eb ("Fix: Disable sys_membarrier when nohz_full is enabled"), enabling `nohz_full` disables `sys_membarrier`. If you plan to use it, I would highly recommend chatting with Mathieu Desnoyers, who created it. He has some other future ideas for how to improve it, such as having a process-local version that doesn't affect other processes on the system. I'd love to see those enacted, but he needs more of a use case to make the case for them. |
@joshtriplett Thank you for your response, this is very helpful! There are two more things I was wondering:
|
@stjepang RCU does support stashing garbage away; take a look at `call_rcu`. I think that code in coreclr relies on an implementation detail of cross-CPU shootdown handling. I don't know how safe that is, and it's definitely not portable. |
We have |
Hi folks! :)
I'd like to draw attention to this rayon issue: nikomatsakis/rayon#195
Here we see rayon suffering from bad performance after switching from crate `deque` to `crossbeam`. I've summarized in this comment what's wrong with crossbeam's epoch system and deque implementation. It suffers from multiple problems, but all of those are fixable. Some are easy, and some will require a lot of effort.
While crossbeam has recently been going through a refactoring phase, I've designed a new epoch system from scratch. I could write a lot about its inner workings and the decisions that were made. The short story is that it's superior to crossbeam's in pretty much all areas:
- The `Atomic` API is more ergonomic.
- `Atomic`s can be tagged with several bits.
- Large garbage can be made immediately available for collection via `flush()`.

To solve all those problems, a complete rewrite was needed. Unfortunately, I don't know how to patch crossbeam's existing epoch system implementation to fix all of those - it really needs a fresh start.
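As an aside, tagging an `Atomic` with several bits works because heap objects are aligned, so the low bits of a pointer are always zero and can carry a small tag. A rough sketch of the idea on raw `usize` values (illustrative only, not coco's actual representation):

```rust
// Pack a tag into the low bits of an aligned address. For a type with
// alignment `align` (a power of two), log2(align) low bits are free.
fn pack(ptr: usize, tag: usize, align: usize) -> usize {
    assert!(align.is_power_of_two());
    assert!(ptr % align == 0, "pointer must be aligned");
    assert!(tag < align, "tag must fit in the low bits");
    ptr | tag
}

// Recover the (address, tag) pair by masking the low bits back off.
fn unpack(packed: usize, align: usize) -> (usize, usize) {
    let mask = align - 1;
    (packed & !mask, packed & mask)
}
```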
You can check out my new crate here: https://github.com/stjepang/coco
There's a finished deque implementation that fixes all the problems rayon had with crossbeam. And it's much faster than crossbeam! :)
Here's what I'm aiming to implement in the very near future:
- A channel with a `select` macro: https://github.com/stjepang/channel (WIP)

By the way, I've also developed a faster equivalent to `MsQueue` that also supports the pop operation with timeouts. In the end I deleted it because I want to write an even better queue. :)

After that I'd like to explore several other types of maps, including hash maps.
In order to implement all this, a solid and very performant epoch system is required - and now I have it! The design was evolving for some time but now it has finally settled on one that seems just right.
I'm writing all this to ask the crossbeam team how they feel about it. @aturon has politely asked me not to split the crate ecosystem if possible, and I'd like to respect that. Ideally, it'd be best to merge the improvements into crossbeam. And for that to happen we need to reach a consensus as a team.
In any case, I'll keep using `coco` to develop new data structures until crossbeam gets a better epoch system. Then we can pretty easily port its data structures to crossbeam, as far as I'm concerned.

Please don't get me wrong, I don't want to be too harsh on crossbeam. This is a magnificent project, a huge inspiration for my work, and I believe @aturon did a fantastic job designing it! Thank you for that! :)
That said, my position is that 90% of it has to be scrapped and reimplemented if we want to solve existing problems and move on. If you're skeptical about a big rewrite, perhaps I could simply keep developing `coco` until we see if my ideas turn out to be any good? At least it's something the crossbeam team should keep an eye on...

What do you think?