
Summary

This is a proposal for an improved epoch GC implementation. It aims to solve weaknesses in the current implementation and add new features that are necessary for building more complex data structures.

Motivation

Data structures pin the current thread before using Atomics, and unpin it when they're finished. Pinning effectively stalls garbage collection - it prevents destruction of any new garbage before the thread gets unpinned.

The two core operations of epoch GCs are:

  1. Pinning: we pin the current thread, execute a block of code that uses Atomics, and finally unpin the thread. Any garbage created while the thread is pinned will not be destroyed before it gets unpinned.

  2. Adding garbage: a heap-allocated object is added to the GC. As soon as all pinned threads get unpinned, it will become unreachable and therefore safe for destruction.

Garbage collection is occasionally automatically triggered by both of these operations.

The following subsections analyse the problems with Crossbeam's current epoch GC and offer potential solutions.

Slow thread pinning

To pin and unpin a thread, the current implementation executes 6 atomic operations and a full fence. Coco proves that pinning can be much more efficient: it executes 3 atomic operations and a full fence.

Benchmark (on a modern Intel x86-64):

test pin_coco      ... bench:          11 ns/iter (+/- 0)
test pin_crossbeam ... bench:          27 ns/iter (+/- 1)

Long GC pauses

Crossbeam collects garbage just after pinning the current thread, if it notices that the global epoch and thread-local epoch differ. Then it destroys all garbage stored in one of the three thread-local garbage vectors.

A lot of garbage (hundreds of thousands of items) might accumulate in thread-local vectors. Pauses due to garbage destruction might take tens or hundreds of milliseconds. Long pauses are undesirable, so it'd be better to collect garbage incrementally.

But when and how much garbage should be collected, exactly? A good time to collect some garbage is just after producing garbage. By collecting more garbage than was produced we can make sure that the speed of garbage collection is higher than the speed of garbage production.
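
As a sketch (everything here is illustrative, not part of the proposed API), a global queue of epoch-tagged garbage might collect a small fixed batch every time one object is added:

use std::collections::VecDeque;
use std::sync::Mutex;

// A piece of garbage is a deferred destruction function.
type Garbage = Box<dyn FnOnce() + Send>;

struct GlobalQueue {
    queue: Mutex<VecDeque<(usize, Garbage)>>, // (epoch tag, garbage)
}

const COLLECT_RATIO: usize = 4; // tuning knob: destroy up to 4 objects per 1 produced

impl GlobalQueue {
    fn add(&self, tag: usize, object: Garbage, global_epoch: usize) {
        let mut queue = self.queue.lock().unwrap();
        queue.push_back((tag, object));
        // Destroying more objects than we just produced keeps the collection
        // rate above the production rate, so garbage cannot grow without bound.
        for _ in 0..COLLECT_RATIO {
            match queue.front() {
                // Garbage is safe once the global epoch has advanced by two.
                Some(&(t, _)) if global_epoch.wrapping_sub(t) >= 2 => {
                    let (_, garbage) = queue.pop_front().unwrap();
                    garbage(); // run the deferred destruction
                }
                _ => break, // no garbage is old enough yet
            }
        }
    }
}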

Garbage stuck thread-locally

Another problem is that garbage that gets stored in thread-local vectors will never get collected unless the owning thread exits or gets pinned.

A better idea is to keep only a bounded number of items in thread-local vectors and spill the excess over into globally shared vectors.

Very large objects (e.g. arrays backing Chase-Lev deques or hash tables) should probably avoid thread-local vectors altogether and go straight into a global vector, where any thread can collect them. Large objects must be collected more eagerly than small objects in order to reclaim wasted memory faster.

Destructors

At the moment, Crossbeam supports only deferred memory deallocation - it has no support for running destructors. But sometimes it makes sense to run destructors, or even completely arbitrary functions.

For example, skiplist nodes might use custom allocation/deallocation functions if they pack towers directly inside nodes (rather than keeping a pointer to a separately allocated tower). Then we might want to destroy nodes like this:

impl<K, V> Skiplist<K, V> {
    fn remove(&self, key: K) {
        epoch::pin(|scope| {
            let node: Ptr<Node<K, V>>;
            // ...

            let raw = node.as_raw() as usize; // Necessary to make the closure Send.
            unsafe {
                scope.defer(move || {
                    Node::destroy(raw as *mut Node<K, V>);
                });
            }
        });
    }
}

Note how Scope::defer is very similar to call_rcu in the Linux kernel.

Detailed design

Thread registration

Threads are registered the first time they get pinned. There is a global lock-free singly-linked list of thread entries. On registration a thread adds a new entry to the list, which contains its thread-local data.

When a thread exits, its thread-local storage is destroyed, triggering drop code that removes the thread's entry from the list and moves all of its thread-local garbage into the global garbage queue.
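
As a rough sketch (field names and layout are assumptions, not the final design), an entry in the global list might look like this:

use std::sync::atomic::{AtomicPtr, AtomicUsize};

// Illustrative sketch of an entry in the global registration list.
struct ThreadEntry {
    // The global epoch at the time this thread was pinned,
    // or a sentinel value if the thread is not pinned.
    epoch: AtomicUsize,
    // The next entry in the global singly-linked list of registered threads.
    next: AtomicPtr<ThreadEntry>,
    // ... thread-local garbage buffer, pin counter, and so on.
}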

Tracking epochs

There is a globally shared AtomicUsize representing the global epoch. When a thread gets pinned, it loads the global epoch and stores it into its thread-local AtomicUsize, indicating what the global epoch was when it got pinned. To unpin a thread, we just set the thread-local atomic to a sentinel value.

Occasionally, we check whether all currently pinned threads have seen the latest global epoch - if so, it is incremented. When an object becomes garbage, it is tagged with the current global epoch. As soon as the global epoch advances twice (differs from the tag by two), it will be safe for destruction.

Pinning is fast, and the memory overhead of garbage tagging is minimal. This approach is used in SQL Server (pdf, section 5.2) and in Coco with great success.
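
A condensed sketch of this scheme (names are illustrative; the real implementation also counts nested pins and occasionally tries to advance the global epoch):

use std::sync::atomic::{fence, AtomicUsize, Ordering};

const UNPINNED: usize = usize::MAX; // sentinel: thread is not pinned

static GLOBAL_EPOCH: AtomicUsize = AtomicUsize::new(0);

fn pin(local_epoch: &AtomicUsize) {
    // Announce the epoch this thread is pinned in...
    let epoch = GLOBAL_EPOCH.load(Ordering::Relaxed);
    local_epoch.store(epoch, Ordering::Relaxed);
    // ...and make the announcement visible to all threads before touching
    // any shared objects. This is the single full fence per pin operation.
    fence(Ordering::SeqCst);
}

fn unpin(local_epoch: &AtomicUsize) {
    // The sentinel tells other threads we no longer hold the epoch back.
    local_epoch.store(UNPINNED, Ordering::Release);
}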

Flushing garbage

Pretty much every garbage object falls into one of these three categories:

  1. Small objects. These are e.g. nodes in a queue or a skiplist. Such objects are generally quick to destroy and consume small amounts of memory.

  2. Medium objects. These are nodes in a B-tree, radix tree, or perhaps hashtable buckets. They consist of dozens of smaller objects.

  3. Large objects. These are usually so large that they hold the whole data structure - e.g. arrays backing Chase-Lev deques. If they run destructors, they may be slow to destroy. Moreover, since they consume a lot of memory, we really want them destroyed as soon as possible.

Garbage objects are buffered in thread-local storage and won't get destroyed before the storage fills up. After adding a large garbage object it is a good idea to completely flush the storage so that garbage doesn't get stuck indefinitely.

For that reason there is the Scope::flush function.

The interface

// Pins the current thread, executes a function, and unpins the thread.
fn pin<F, T>(f: F) -> T where F: FnOnce(&Scope) -> T;

// Returns true if the current thread is pinned.
fn is_pinned() -> bool;

// Represents a scope in which the current thread is pinned.
struct Scope;

impl Scope {
    // Deferred deallocation of heap-allocated object `ptr`.
    unsafe fn defer_free<T>(&self, ptr: Ptr<T>);

    // Deferred destruction and deallocation of heap-allocated object `ptr`.
    unsafe fn defer_drop<T: Send + 'static>(&self, ptr: Ptr<T>);

    // Deferred execution of arbitrary function `f`.
    unsafe fn defer<F: FnOnce() + Send + 'static>(&self, f: F);

    // Flushes all garbage in the thread-local storage into the global garbage
    // queue, attempts to advance the epoch, and collects some garbage.
    // 
    // Even though flushing can be explicitly called, it is also automatically
    // triggered when the thread-local storage fills up or when we pin the
    // current thread a specific number of times.
    fn flush(&self);
}
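
For illustration, removing a large array through this interface might look as follows (`array`, `Array::destroy`, and the surrounding data structure are hypothetical):

// Hypothetical usage of the proposed interface.
epoch::pin(|scope| {
    // Unlink the large array from the data structure.
    // ...

    // Defer its destruction with an arbitrary function.
    let raw = array.as_raw() as usize; // make the closure Send
    unsafe {
        scope.defer(move || Array::destroy(raw as *mut Array<u64>));
    }

    // A large object shouldn't sit in thread-local storage, so migrate
    // the buffered garbage into the global queue right away.
    scope.flush();
});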

Drawbacks

Breaks compatibility with the current API.

Alternatives

Thread entries

Thread entries don't have to form a linked list - other data structures may be used as well. Linked lists are known to suffer from bad memory locality.

An alternative could be an array of thread entries. On thread registration we simply iterate through the array and reserve a vacant entry. The main benefit of this approach is that iterating over thread entries in order to check which threads are currently pinned is probably much faster due to better memory locality. However, this method might be a bit more difficult to implement.
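
A sketch of array-based registration (capacity and names are illustrative):

use std::sync::atomic::{AtomicBool, Ordering};

const MAX_THREADS: usize = 256; // illustrative fixed capacity

// Entries live side by side in one array, so a scan over them touches
// contiguous memory instead of chasing linked-list pointers.
struct Entry {
    occupied: AtomicBool,
    // ... epoch and garbage fields as before
}

// Reserve the first vacant entry, returning its index.
fn register(entries: &[Entry; MAX_THREADS]) -> Option<usize> {
    for (index, entry) in entries.iter().enumerate() {
        if entry
            .occupied
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return Some(index);
        }
    }
    None // the array is full; a real implementation must handle this case
}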

In any case, this is something that needs experimentation and benchmarking.

Epoch tracking

Alternative epoch tracking schemes exist. Perhaps Crossbeam's current one (with three rotating thread-local vectors) could be improved and optimized just as well.

Again, this is something that needs to be tried out.

System-wide fences

Pinning a thread requires issuing a SeqCst fence, which is very costly. The reason is that we must announce that the current thread is pinned, make sure that the announcement is visible to all other threads (by issuing a fence), and only then continue.

An alternative might be to attack the problem from the other side: when a thread wants to check which threads are pinned, it could issue a system-wide barrier that essentially executes a SeqCst fence on all CPU cores. This way we don't pay for a fence on every pin operation but only on every epoch advancement.

A system-wide barrier can be issued using the sys_membarrier syscall on relatively recent Linux kernels and FlushProcessWriteBuffers on Windows. The main problem with those is portability. Moreover, to harness the full potential of this optimization, we mustn't check whether a system-wide barrier is available on every pin operation, but should instead compile multiple specialized versions of the code.
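
For example, on Linux the barrier might be issued like this (a sketch assuming the libc crate; the command constant is taken from linux/membarrier.h):

// Sketch for Linux: sys_membarrier exists on kernels since 4.3.
const MEMBARRIER_CMD_SHARED: libc::c_long = 1; // from <linux/membarrier.h>

fn system_wide_barrier() -> bool {
    // Roughly equivalent to executing a SeqCst fence on every core.
    // Returns false (e.g. ENOSYS) on kernels without the syscall, in which
    // case we must fall back to issuing a fence on every pin operation.
    unsafe { libc::syscall(libc::SYS_membarrier, MEMBARRIER_CMD_SHARED, 0) == 0 }
}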

Flushing garbage

After adding a very large garbage object, instead of flushing we could perhaps specify a number signifying just how "urgent" the object is. For example, we might say that an array of length 1 has urgency 1 and an array of length 1000 has urgency 1000.

Then, we could sum the urgencies of all garbage in thread-local storage, and if it is greater than some threshold, migrate all of it into the global garbage queue.
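
A sketch of the urgency accounting (all names are hypothetical; Garbage and GlobalQueue are the illustrative stand-ins from the earlier sketches, and migrate is an assumed method):

const URGENCY_THRESHOLD: usize = 1024; // tuning knob

struct LocalGarbage {
    items: Vec<Garbage>,
    total_urgency: usize,
}

impl LocalGarbage {
    fn defer(&mut self, item: Garbage, urgency: usize, global: &GlobalQueue) {
        self.items.push(item);
        self.total_urgency += urgency;
        // Once the accumulated urgency crosses the threshold, migrate all
        // buffered garbage into the global queue, where any thread can
        // collect it.
        if self.total_urgency > URGENCY_THRESHOLD {
            global.migrate(self.items.drain(..));
            self.total_urgency = 0;
        }
    }
}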

This mechanism would add some precision over the simple flush-based approach. It is definitely an interesting idea that needs to be explored, but we're postponing it for later as this RFC is already complex enough.

Dedicated GC threads

In Crossbeam threads cooperatively collect garbage - each of them does some work. An alternative way of collecting garbage could be to spawn a dedicated thread (or multiple threads) whose only job is to collect and destroy garbage, advance epochs, and so on.

Unresolved questions

Extreme use cases

In order to be truly scalable, Crossbeam must perform well in scenarios like:

  1. Latency critical code.
  2. Very high thread counts (hundreds, maybe thousands).

Optimizing for those will require a lot of care and tuning, but is not a high priority at the moment.

Linking multiple crate versions

If a program links in multiple versions of crossbeam-epoch, there will be multiple epoch GC runtimes. This is wasteful because now multiple runtimes accumulate garbage.

Rayon had a similar issue and decided to factor out its core into a separate crate called rayon-core that is rarely updated and basically never breaks backwards compatibility.

Perhaps adopting a similar scheme for crossbeam-epoch would be a good idea.

Destructors and local GCs

There are two kinds of concurrent data structures:

  1. Those that disallow borrowing elements. Examples are queues, deques, and stacks. Popping an element takes full ownership of it. There is no way to just "peek inside" an element (i.e. to borrow it temporarily).

  2. Those that allow borrowing elements. Examples are sets and maps. A find operation simply returns a reference to the found element. Removing an element involves finding it and marking it as removed.

Whether a data structure falls into the first or the second category has deep consequences for deferred destruction. Consider a hash set: if an element is removed from it, it must be added to the GC for deferred destruction. For example, a hash set might implement remove like this:

impl<T> HashSet<T> {
    fn remove(&self, value: &T) {
        epoch::pin(|scope| {
            let elem: Ptr<T>;

            // Find the element and mark it as removed.
            // ...

            // Destroy it later.
            scope.defer_drop(elem);
        });
    }
}

The epoch GC will destroy garbage at some time in the future, but it doesn't guarantee when exactly. Since T might hold non-static references, we mustn't destroy Ts after those references expire. There are three solutions to the problem:

  1. Restrict the type so that it either doesn't have a destructor or it doesn't have non-static references. Here's a sketch (note that, as written, these two blanket impls overlap; they only illustrate the intended bounds):
pub unsafe trait DeferredDropSafe {}
unsafe impl<T: Send + Copy> DeferredDropSafe for T {}
unsafe impl<T: Send + 'static> DeferredDropSafe for T {}

impl<T: DeferredDropSafe> HashSet<T> {
    // ...
}
  2. Introduce local garbage storage for each instance of HashSet. When the set gets dropped, it destroys all the garbage it contains. For example:
struct HashSet<T> {
    // ...
    garbage: Garbage,
}

impl<T> HashSet<T> {
    fn remove(&self, value: &T) {
        epoch::pin(|scope| {
            let elem: Ptr<T>;

            // Find the element and mark it as removed.
            // ...

            // Destroy it later.
            self.garbage.defer_drop(elem, scope);
        });
    }
}
  3. Within the destructor of HashSet, flush all threads' thread-local garbage, wait for the global epoch to advance, and then wait until all garbage is destroyed. The main problem with this solution is that it is slow and makes the algorithm blocking.

Another question we might ask is: how exactly is HashSet::find going to hand out references? A possible solution is to provide a callback function that executes while the thread is still pinned:

impl<T> HashSet<T> {
    fn find<R, F: Fn(Option<&T>) -> R>(&self, value: &T, f: F) -> R {
        epoch::pin(|scope| {
            let elem: Ptr<T>;
            // Find the element.
            // ...

            // Call the callback function.
            unsafe {
                f(elem.as_ref())
            }
        })
    }
}

An alternative could be to return a custom wrapper that increments the element's reference count and then escapes the pinned scope:

struct Entry<'a, T: 'a> {
    // ...
}

impl<'a, T: 'a> Drop for Entry<'a, T> {
    fn drop(&mut self) {
        if self.inner.decrement() == 0 {
            epoch::pin(|scope| unsafe {
                scope.defer_free(Ptr::from_raw(self.inner.as_raw()));
            })
        }
    }
}

impl<T> HashSet<T> {
    fn find<'a>(&'a self, value: &T) -> Option<Entry<'a, T>> {
        epoch::pin(|scope| {
            let elem: Ptr<Element<T>>;
            // Find the element.
            // ...

            // Attempt to increment the reference count and escape the scope.
            // If the current reference count is positive, incrementing
            // will succeed, otherwise it will fail.
            match unsafe { elem.as_ref() } {
                Some(e) if e.increment() > 0 => Some(Entry::new(self, e)),
                _ => None,
            }
        })
    }
}

Finally, it's worth noting that HashSet must be invariant with respect to T. Consider the following snippet:

impl<T> HashSet<T> {
    fn insert(&self, t: T) {
        // ...
    }
}

fn main() {
    let set: HashSet<&str> = HashSet::new();
  
    let s = "hello".to_string();
    let s: &str = &s;

    // `set` lives longer than the inserted reference `s` is valid!
    // This is bad. In order to prevent insertion in such cases,
    // `T` in `HashSet<T>` must be invariant.
    set.insert(s);
}
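
One way to force that invariance is a PhantomData marker: a function pointer type fn(T) -> T is contravariant in its argument and covariant in its return type, which combine to invariance. A minimal sketch:

use std::marker::PhantomData;

struct HashSet<T> {
    // ...

    // `fn(T) -> T` makes `T` invariant, so the bad insertion in the
    // snippet above fails to compile.
    _marker: PhantomData<fn(T) -> T>,
}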

Oversubscription

If we have more threads than available processors, the epoch GC accumulates a lot of garbage. This is understandable: there are more threads producing garbage, and some threads may even get preempted while pinned, thus slowing the GC down.

Overall, simple tests indicate that the maximum amount of accumulated garbage probably scales linearly with the number of running threads. That is not too bad, but one should keep in mind that it's best to have about the same number of threads using the epoch GC as there are processors.

Other memory management schemes like hazard pointers don't suffer as much because they accumulate less (and a bounded amount of) garbage, but their memory consumption scales linearly with the number of threads as well.

This problem could be alleviated by encouraging the use of a common thread pool abstraction in Rust. Tokio/futures and Rayon already have sophisticated and generic thread pools. This is still an evolving area in Rust's concurrency story, but it's possible that manually spawning unpredictable numbers of OS threads will become a thing of the past.