
Optimize unbounded channels #279

Merged
merged 2 commits into crossbeam-rs:master from stjepang:optimize-unbounded on Dec 28, 2018

Conversation

@stjepang (Member) commented Dec 28, 2018

Use a different queue for unbounded channels.

This gives us:

  • Performance improvements (see benchmarks below).
  • Lower memory consumption (memory reclamation is not deferred, it's eager).
  • Fewer dependencies (no more crossbeam-epoch).

Before:

```
unbounded_mpmc            Rust crossbeam-channel   0.392 sec
unbounded_mpsc            Rust crossbeam-channel   0.373 sec
unbounded_select_both     Rust crossbeam-channel   0.536 sec
unbounded_select_rx       Rust crossbeam-channel   0.589 sec
unbounded_seq             Rust crossbeam-channel   0.462 sec
unbounded_spsc            Rust crossbeam-channel   0.235 sec
```

After:

```
unbounded_mpmc            Rust crossbeam-channel   0.266 sec
unbounded_mpsc            Rust crossbeam-channel   0.250 sec
unbounded_select_both     Rust crossbeam-channel   0.449 sec
unbounded_select_rx       Rust crossbeam-channel   0.438 sec
unbounded_seq             Rust crossbeam-channel   0.333 sec
unbounded_spsc            Rust crossbeam-channel   0.210 sec
```
@stjepang force-pushed the stjepang:optimize-unbounded branch from c5135d4 to 05554e6 on Dec 28, 2018
@stjepang (Member Author) commented Dec 28, 2018

bors r+

bors bot added a commit that referenced this pull request Dec 28, 2018
Merge #279
279: Optimize unbounded channels r=stjepang a=stjepang

bors bot merged commit c2d595c into crossbeam-rs:master on Dec 28, 2018
2 checks passed: bors (Build succeeded), continuous-integration/travis-ci/pr (The Travis CI build passed)
@stjepang deleted the stjepang:optimize-unbounded branch on Dec 28, 2018
@stjepang referenced this pull request on Dec 28, 2018
@jeehoonkang (Contributor) left a comment

Hi @stjepang! I'm curious how this PR succeeded in implementing a channel without depending on crossbeam-epoch. May I ask if there is any good reference for this implementation?

@stjepang (Member Author) commented Dec 29, 2018

@jeehoonkang Here's a reference implementation: https://github.com/stjepang/queue

This unbounded MPMC queue is very similar to Dmitry Vyukov's bounded MPMC queue - it's the same idea, except we have a linked list of blocks rather than one big circular array.

This queue is not 100% lock-free (but neither is SegQueue) - there are a few locking sections of code, but they're very small, so the chances of hitting them are slim. They don't seem to have a big effect on scalability, throughput, or latency, which is good.

In my benchmarks (those in crossbeam-channel/benchmarks), this queue seems to be faster than MsQueue, SegQueue, and crossbeam-deque. I'll benchmark with crossbeam-circbuf, too - should be interesting. Eliminating crossbeam-epoch is a noticeable performance win because pinning incurs one TLS access and one SeqCst fence per operation. Tokio benchmarks show better results with this queue than with SegQueue and crossbeam-channel, too. Finally, the latency distribution of operations on this queue is better than with other queues.

Another benefit of this queue is that memory reclamation is not deferred at all - it's fully eager and there is no concept of garbage! As soon as the last operation using a block is done, it gets destroyed. I did some tests and measured lower memory overhead in high concurrency scenarios than with queues using epochs. Hopefully this fixes the problem where Firecracker tests were failing on Amazon's internal CI due to excessive memory use by crossbeam-channel.

So how does this avoid epochs? The queue has head and tail indices of type AtomicUsize, as well as head and tail pointers to blocks of type AtomicPtr<Block<T>>. Each block holds an array of slots where messages can be stored, and each slot has a state with three bits: READ, WRITE, and DESTROY. All bits are initially zero.
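
A minimal Rust sketch of that layout (the names `Position` and `Queue` and the exact `BLOCK_CAP` value are illustrative, not the crate's actual definitions):

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicPtr, AtomicUsize};

const BLOCK_CAP: usize = 32; // illustrative block size

// Slot state bits, all initially zero.
const WRITE: usize = 1; // a message has been written into the slot
const READ: usize = 2; // the message has been read out of the slot
const DESTROY: usize = 4; // the block is being destroyed

struct Slot<T> {
    msg: UnsafeCell<MaybeUninit<T>>,
    state: AtomicUsize,
}

// Blocks form a linked list; each one holds BLOCK_CAP slots.
struct Block<T> {
    next: AtomicPtr<Block<T>>,
    slots: [Slot<T>; BLOCK_CAP],
}

// One end of the queue: an index plus a pointer to the current block.
struct Position<T> {
    index: AtomicUsize,
    block: AtomicPtr<Block<T>>,
}

struct Queue<T> {
    head: Position<T>,
    tail: Position<T>,
}
```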

To send a message, load tail.index and tail.block, then do a CAS to advance tail.index forward. If we succeed, we now "own" the slot at that index in that block. It is then safe to dereference the block pointer and write the message into the slot. When done, we do a fetch_or to set the WRITE bit in the slot's state.
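
Continuing that sketch, the send path might look like this (installing the next block when one fills up, index packing, and backoff are all omitted):

```rust
use std::sync::atomic::Ordering;

impl<T> Queue<T> {
    fn send(&self, msg: T) {
        loop {
            let index = self.tail.index.load(Ordering::Acquire);
            let block = self.tail.block.load(Ordering::Acquire);
            let offset = index % BLOCK_CAP;
            // (The real code also handles the case where the block is full
            // and the next block must be allocated and installed.)

            // Try to claim the slot at `index` by advancing the tail.
            if self
                .tail
                .index
                .compare_exchange(index, index + 1, Ordering::SeqCst, Ordering::Acquire)
                .is_ok()
            {
                unsafe {
                    // We now own the slot: write the message...
                    let slot = &(*block).slots[offset];
                    slot.msg.get().write(MaybeUninit::new(msg));
                    // ...and publish it by setting the WRITE bit.
                    slot.state.fetch_or(WRITE, Ordering::Release);
                }
                return;
            }
        }
    }
}
```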

To receive a message, similarly load head.index and head.block, then do a CAS to advance head.index forward. If we succeed, we now "own" the slot at that index in that block. It is then safe to dereference the block pointer and read the message from the slot (but first we wait until the WRITE bit is set). When done, we do a fetch_or to set the READ bit in the slot's state. If we notice that the DESTROY bit is set too, we must call destroy(block).
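
The receive path mirrors it. This sketch also folds in the destruction triggers described here and in the next paragraph; `destroy` itself is sketched below, and advancing `head.block` to the next block is omitted:

```rust
impl<T> Queue<T> {
    fn recv(&self) -> Option<T> {
        loop {
            let index = self.head.index.load(Ordering::Acquire);
            let block = self.head.block.load(Ordering::Acquire);
            let offset = index % BLOCK_CAP;

            // If head has caught up with tail, the queue is empty.
            if index == self.tail.index.load(Ordering::Acquire) {
                return None;
            }

            // Try to claim the slot at `index` by advancing the head.
            if self
                .head
                .index
                .compare_exchange(index, index + 1, Ordering::SeqCst, Ordering::Acquire)
                .is_ok()
            {
                unsafe {
                    let slot = &(*block).slots[offset];
                    // Wait until the sender has set the WRITE bit.
                    while slot.state.load(Ordering::Acquire) & WRITE == 0 {
                        std::hint::spin_loop();
                    }
                    let msg = slot.msg.get().read().assume_init();

                    // Mark the slot as read. If this was the last slot in the
                    // block, or if DESTROY was already set, destroy the block.
                    let state = slot.state.fetch_or(READ, Ordering::AcqRel);
                    if offset + 1 == BLOCK_CAP || state & DESTROY != 0 {
                        Self::destroy(block);
                    }
                    return Some(msg);
                }
            }
        }
    }
}
```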

Here comes block destruction. If our receive operation got the last slot in the block, then after reading the message and setting the READ bit we must call destroy(block). The destroy function iterates over the slots in the block and sets the DESTROY bit in each slot's state. If we notice during the fetch_or that a slot's READ bit is already set, nobody is using that slot anymore. However, if the READ bit is not set, some other thread is still reading from the slot; that thread has now become responsible for destroying the block, so we just return.
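
And the cooperative destruction itself, sketched naively (the real code takes a start offset instead of rescanning the whole block):

```rust
impl<T> Queue<T> {
    // Safety: `block` must have been allocated with `Box::into_raw`.
    unsafe fn destroy(block: *mut Block<T>) {
        for slot in &(*block).slots {
            // Mark the slot as being destroyed.
            if slot.state.fetch_or(DESTROY, Ordering::AcqRel) & READ == 0 {
                // READ is not yet set: another thread is still using this
                // slot. It will see DESTROY and take over destruction.
                return;
            }
        }
        // Every slot has been read; nobody is using the block anymore.
        drop(Box::from_raw(block));
    }
}
```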

That's the gist of it. I'm omitting a few less important remaining details, but hopefully they make sense when reading the code.

I'm very excited about this queue. Although it doesn't seem super interesting from the CS perspective, in practice it seems to be outperforming pretty much everything else as a general-purpose unbounded MPMC queue.

@jeehoonkang (Contributor) commented Dec 29, 2018

Your queue seems very interesting. I'll look into it real soon.

> in practice it seems to be outperforming pretty much everything else as a general-purpose unbounded MPMC queue.

I wonder whether this queue will beat other unbounded MPMC queues on many-core systems with more than 64 cores. Do you have benchmark data on such systems? I'm asking because in such highly parallel systems, locking or the lack of a progress guarantee usually translates to low performance.

> Although it doesn't seem super interesting from the CS perspective,

AFAICT, everything practical is interesting from the CS perspective :) To quote Bertrand Meyer:

> The outcomes that matter in research are not numerous publications, best-paper awards, completed PhD theses, keynote invitations, software tools, citations and other measurable signs of progress. I was after real success, in the sense of changing the way the IT industry develops software.

@stjepang (Member Author) commented Dec 29, 2018

I haven't tried with that many cores yet - let's do this!

So I'm thinking about writing a comprehensive benchmark suite for queues similar to crossbeam-channel/benchmarks, except better. The benchmarks would be organized as follows.

There are scenarios with non-blocking operations:

  • sequential (send N messages, receive N messages)
  • mp (send N messages with T threads)
  • mc (receive N messages with T threads)

There are scenarios with blocking operations, where we spin+yield when the queue is full or empty (see the sketch after this list):

  • spsc (send and receive N messages with 2 threads)
  • spmc (send and receive N messages with T+1 threads)
  • mpsc (send and receive N messages with T+1 threads)
  • mpmc (send and receive N messages with 2T threads)
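
A minimal sketch of that spin+yield strategy, assuming a hypothetical non-blocking `try_send` that hands the message back on failure:

```rust
use std::{hint, thread};

// Hypothetical non-blocking interface; any bounded queue whose
// `try_send` fails when full fits this shape.
trait TrySend<T> {
    fn try_send(&self, msg: T) -> Result<(), T>;
}

fn send_blocking<Q: TrySend<T>, T>(q: &Q, mut msg: T) {
    let mut spins = 0u32;
    loop {
        match q.try_send(msg) {
            Ok(()) => return,
            Err(m) => msg = m, // queue is full; keep the message and retry
        }
        spins += 1;
        if spins < 100 {
            hint::spin_loop(); // spin briefly while contention is short-lived
        } else {
            thread::yield_now(); // then yield so we don't burn a core
        }
    }
}
```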

We measure:

  • Total time to complete all operations - that's throughput.
  • The time to complete each operation. Then sort the times and print the 50th, 90th, 99th, and 99.9th percentiles - that's latency.

We do this for a bunch of different values of T (the number of threads), ranging from 1 to 100 or so.
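
For the latency side, a tiny sketch of the percentile computation (reading 50/90/99/999 as the 50th, 90th, 99th, and 99.9th percentiles):

```rust
use std::time::Duration;

// Given one recorded duration per operation, sort and print percentiles.
fn print_percentiles(mut samples: Vec<Duration>) {
    if samples.is_empty() {
        return;
    }
    samples.sort();
    for p in [500usize, 900, 990, 999] {
        // `p` is in tenths of a percent, so 999 = 99.9th percentile.
        let idx = (samples.len() * p / 1000).min(samples.len() - 1);
        println!("p{}: {:?}", p, samples[idx]);
    }
}
```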

A few notes about bounded queues:

  • In scenarios with non-blocking operations, we set capacity to N so that the queue never gets full.
  • In scenarios with blocking operations, we set capacity to 1, 10, 100, and N.

How does this plan sound? What do benchmarks in published papers on concurrent queues do? Should we do something differently?

Also, what should the message type be? I'm thinking maybe it should be of type [usize; S], where S = 0, 1, 3, 10, 50, 100.

@stjepang (Member Author) commented Dec 30, 2018

> I wonder whether this queue will beat other unbounded MPMC queues in many-core systems with more than 64 cores.

I just played with benchmarks on a 24-core machine for a few hours. Here are some conclusions:

  • SegQueue and my queue crate seem to be roughly equal. One is somewhat faster than the other depending on the scenario.

  • MsQueue is very slow due to high pressure on the allocator. I think MsQueue is practically useless because other queues are almost strictly better than it.

  • Queues can't scale much because they update head/tail pointers (contention on the same cache lines), but they can be "resilient" to contention in the sense that they don't get worse as the number of threads increases. Both SegQueue and my queue seem to be good at it.

  • Go channels are quite a bit slower than Rust queues.

  • `queue` uses less memory than other queues due to the lack of GC.

bors bot added a commit that referenced this pull request Jan 21, 2019
Merge #291
291: Rewrite SegQueue for better performance r=stjepang a=stjepang

The implementation of `SegQueue<T>` is completely rewritten and is based on https://github.com/stjepang/queue, which provides notably better performance. This one doesn't use `crossbeam-epoch` for memory reclamation, which means we don't have to pin and execute a full fence on every operation.

For more information on how this queue works, see:
* stjepang/queue#1
* #279 (comment)

One new addition in this PR is the `SegQueue::len()` method.

Benchmarks before:

```
unbounded_mpmc            Rust segqueue     0.336 sec
unbounded_mpsc            Rust segqueue     0.261 sec
unbounded_seq             Rust segqueue     0.306 sec
unbounded_spsc            Rust segqueue     0.201 sec
```

Benchmarks after:

```
unbounded_mpmc            Rust segqueue     0.186 sec
unbounded_mpsc            Rust segqueue     0.206 sec
unbounded_seq             Rust segqueue     0.241 sec
unbounded_spsc            Rust segqueue     0.115 sec
```

Co-authored-by: Stjepan Glavina <stjepang@gmail.com>