Add a bounded SPSC queue #338

Open: wants to merge 10 commits into master

Conversation


@ghost ghost commented Feb 28, 2019

Inspired by the Rust 2019: Rust Audio thread, I decided to add a high-performance wait-free bounded SPSC queue. This is the most common queue type in audio programming, and it's very important that it be wait-free and as fast as possible.

cc @Engid @raphlinus Pinging you if interested.

Simple benchmarks (ArrayQueue vs SegQueue vs spsc, smaller is better):

bounded_seq               Rust ArrayQueue   0.176 sec
bounded_spsc              Rust ArrayQueue   0.104 sec

unbounded_seq             Rust SegQueue     0.236 sec
unbounded_spsc            Rust SegQueue     0.141 sec

bounded_seq               Rust spsc         0.031 sec
bounded_spsc              Rust spsc         0.029 sec
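
For illustration, here is a minimal usage sketch of the proposed API (the spin loops are only for demonstration; real code would use a smarter retry strategy):

use std::thread;

let (p, c) = spsc::<i32>(16);

let t = thread::spawn(move || {
    // The producer half is moved to its own thread.
    while p.push(1).is_err() {} // Err(PushError) means the queue is full; retry
});

while c.pop().is_err() {}       // Err(PopError) means the queue is empty; retry
t.join().unwrap();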


Engid commented Feb 28, 2019

I cannot say thank you enough, but I'll try: Thank you!!!


cianoc commented Mar 1, 2019

This is fantastic. Thanks!

@vertexclique

Thanks, that would be a way to go now 🎉

///
/// let (p, c) = spsc::<i32>(100);
/// ```
pub fn spsc<T>(cap: usize) -> (Producer<T>, Consumer<T>) {

It looks like the input argument really is a NonZeroUsize.

Author

True, but NonZeroUsize would be more appropriate for memory layout optimizations rather than ad-hoc uses like this one. There's little advantage of spsc(NonZeroUsize::new(n).unwrap()) over spsc(n) :)


For completeness, the rationale for my comment above was to remove the runtime-panic behavior for something that can be encoded at type-level. I'm fine with your call if you think it is not worth it for ergonomics.
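
For reference, a hedged sketch of that type-level alternative (Producer and Consumer are the types from this PR; the inner constructor is hypothetical):

use std::num::NonZeroUsize;

pub fn spsc<T>(cap: NonZeroUsize) -> (Producer<T>, Consumer<T>) {
    // `cap.get()` is statically guaranteed to be non-zero, so the
    // runtime capacity check and its panic path disappear.
    spsc_with_capacity(cap.get()) // hypothetical internal constructor
}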

/// assert_eq!(c.pop(), Ok(10));
/// assert_eq!(c.pop(), Err(PopError));
/// ```
pub fn pop(&self) -> Result<T, PopError> {

From a consumer point of view, why not an Option<T>?

Author

Because push returns a Result<(), PushError<T>> so I'm returning a Result here for consistency. But otherwise I agree, Option<T> would be fine as well.

impl<T> Producer<T> {
/// Attempts to push an element into the queue.
///
/// If the queue is full, the element is returned back as an error.

This may sound silly, but I'd be happy if this behavior-when-full would be stated a bit more prominently in the module docs.

Author

Sorry, not sure what you mean exactly. Would you be happy if the first sentence said the following?

/// Attempts to push an element into the queue, returning it back on failure.


Sorry if I was unclear, I meant for the whole spsc module (rustdoc and/or README). Something like "a bounded SPSC queue that allocates a fixed-capacity buffer on construction, preventing insertions when full".

tail: Cell<usize>,
}

unsafe impl<T: Send> Send for Producer<T> {}
Contributor

Would making just the Inner itself Send + Sync work? That way, you'll need only one impl.

@ghost ghost Mar 10, 2019 (Author)

I think that would work too, yeah. Although, I personally prefer specifying these kinds of invariants on the API boundaries rather than inside the implementation and hoping auto traits work as intended... :)

Contributor

Leveraging auto-traits is in theory a good thing because it helps catch bugs. Like, if you add an Rc<T> to Producer in the future, the auto-trait approach will catch this (as in, Producer wouldn't be Send), while a manual unsafe impl will lead to unsound code. For simple cases it doesn't matter, though.

Author

That's a good point, but it's a double-edged sword. Relying on auto-traits can be risky when all of the following is true:

  • Auto-traits are manually implemented for private types.
  • Auto-traits are automatically inferred for public types.
  • There's a layer of unsafe code between private and public types.

The problem is that the layer of unsafe code can break invariants inferred from auto-traits implemented on private types.

Here's an example. If we relied on Inner: Send + Sync and replaced all Cell<usize>s with AtomicUsizes, then Sync would be inferred for Producer and Consumer, which would be incorrect and could cause data races. That's because the unsafe code in Producer and Consumer already assumes that they don't implement Sync.

So maybe auto-traits are good only when manually implemented on the innermost layer of safe code? But in any case, I think this queue is simple enough that we'd be fine either way.
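
To make the hazard concrete, here is a hedged sketch with the types heavily simplified relative to the PR:

use std::sync::Arc;
use std::sync::atomic::AtomicUsize;

struct Inner {
    head: AtomicUsize, // with atomics, `Inner: Send + Sync` holds automatically
    tail: AtomicUsize,
}

pub struct Producer {
    inner: Arc<Inner>,
}

// With no manual impls, `Producer: Send + Sync` would now be inferred, yet the
// unsafe push code assumes a single producer thread, so the inferred `Sync`
// would be unsound. Manually writing `unsafe impl Send for Producer {}` and
// deliberately *not* implementing `Sync` pins the invariant at the API boundary.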

/// assert_eq!(p.push(10), Ok(()));
/// assert_eq!(p.push(20), Err(PushError(20)));
/// ```
pub fn push(&self, value: T) -> Result<(), PushError<T>> {
Contributor

Given the SP nature of this thing, I'd expect &mut self here, but I guess it doesn't really matter and makes the API more flexible?

Author

Exactly - I just made an effort to avoid requiring &mut self since mutability can be annoying to deal with at times.

@alexbool (Contributor)

Great work! What's the status on this?


mgeier commented Apr 13, 2019

The two SPSC ringbuffer implementations I know have a way to read/write multiple "items" at a time (and a way of checking how much space is there to read/write). Would it be feasible to add such a feature to this implementation?

The main use case would be if you have a ringbuffer of audio samples (e.g. T = f32) you can read/write blocks of different lengths from/to the ringbuffer (and you don't need a separate pop()/push() operation for each single audio sample).

Here are two examples:

The JACK ringbuffer (docs, code) has jack_ringbuffer_read_space() and jack_ringbuffer_write_space() to check the available size and jack_ringbuffer_get_read_vector() and jack_ringbuffer_get_write_vector() to get pointers to memory where data can be read from or written into.

The PortAudio ringbuffer (docs, code) has PaUtil_GetRingBufferReadAvailable(), PaUtil_GetRingBufferWriteAvailable(), PaUtil_GetRingBufferReadRegions() and PaUtil_GetRingBufferWriteRegions() with pretty much the same functionality.

I guess instead of (unsafe) raw pointers it would make more sense to provide (safe) &[T] slices for reading and &mut [T] for writing.

ghost commented Apr 20, 2019

@alexbool The PR is blocked mostly by the fact that I'm not sure what changes to anticipate in the future and how to structure the modules.

Right now I'm thinking we should perhaps move the SPSC queue (types Producer and Consumer) into a submodule array::spsc. Then, in the future, we might add SPMC and MPSC versions of array-based and segment-based queues, too.

So the idea is to structure the crate like this:

mod array {
    mod spsc {
        struct Producer<T>;
        struct Consumer<T>;
        fn new(cap: usize) -> (Producer<T>, Consumer<T>);
    }

    mod mpsc {
        // ...
    }

    mod spmc {
        // ...
    }

    struct ArrayQueue<T>; // mpmc version
}

mod seg {
    mod spsc {
        struct Producer<T>;
        struct Consumer<T>;
        fn new() -> (Producer<T>, Consumer<T>);
    }

    mod mpsc {
        // ...
    }

    mod spmc {
        // ...
    }

    struct SegQueue<T>; // mpmc version
}

use array::ArrayQueue;
use seg::SegQueue;

How does that look?

ghost commented Apr 20, 2019

@mgeier Interesting!

I wonder how we should expose a safe interface for that in Rust. Note that we can't just return &[T] and &mut [T] for reading and writing - there are multiple problems with that:

  • Reading and writing from/into uninitialized memory is possible.
  • There is a risk of data races because we could read from &[T] while the producer thread is writing into the slice.

It's a bit tricky, but we can probably figure out a reasonable interface...

Another option is to expose methods for batched reads and writes, e.g.:

impl<T> Consumer<T> {
    fn read_batch(&self, dest: &mut Vec<T>);
}

impl<T> Producer<T> {
    fn write_batch(&self, src: &mut Vec<T>);
}


mgeier commented Apr 22, 2019

@stjepang I'm a total Rust beginner and I'm not at all an expert on lock-free data structures. Please bear that in mind!

I didn't think about uninitialized memory. The implementations I was mentioning just operate on plain bytes of memory, I guess that's not a serious approach for Rust, right?

Would it be possible to somehow default-initialize the whole underlying Vec<T>?

I guess it would be fine if this feature were only available for a somewhat "trivial" subset of types T. Mostly types like f32 would be interesting.

There is a risk of data races because we could read from &[T] while the producer thread is writing into the slice.

If a user requests a "read slice", this memory area must of course be "blocked" for writing (and for other readers as well, but it is a "single reader" queue anyway).

In the JACK example, the function jack_ringbuffer_get_read_vector() provides access to all "items" that can currently be read from the ringbuffer. But until jack_ringbuffer_read_advance() is called, no writes are possible to this area of memory (but there may still be some other memory left to write to).

In the PortAudio example, callers of the function PaUtil_GetRingBufferReadRegions() can specify how many "items" they want to be reading (but they may actually get less). Those "items" are not available for writing until PaUtil_AdvanceRingBufferReadIndex() is called.

I would hope that with the lifetimes of Rust slices, it would not be necessary to call a separate function when finished accessing the data. But of course I have no idea if this would actually work.

Another option is to expose methods for batched reads and writes

Do read_batch() and write_batch() really need Vec in their API?
Wouldn't that also work with slices?

Anyway, this may force an unnecessary copy of data in some cases, because you need the data to live in some contiguous memory before/after the write/read operation.

If you have e.g. strided data (e.g. interleaved audio channels), you'll have to make an additional copy.
OTOH, if you get a &mut [T], you could directly write into it from your strided values.

UPDATE:

The Rust bindings for JACK provide access to its ring buffer. They use this API for writing multiple "items" at once:

fn get_vector<'a>(&'a mut self) -> (&'a mut [u8], &'a mut [u8])

See https://docs.rs/jack/0.6.0/jack/struct.RingBufferWriter.html#method.get_vector

ghost commented Apr 22, 2019

@mgeier Here's another try - how about the following interface?

impl<T> Consumer<T> {
    fn pop_batch(&mut self) -> PopBatch<'_, T>;
}

impl<T> Producer<T> {
    fn push_batch(&mut self) -> PushBatch<'_, T>;
}

struct PopBatch<'a, T>;

impl<T> PopBatch<'_, T> {
    fn len(&self) -> usize;
    fn pop(&self) -> Result<T, PopError>;
}

struct PushBatch<'a, T>;

impl<T> PushBatch<'_, T> {
    fn len(&self) -> usize;
    fn push(&self, value: T) -> Result<(), PushError<T>>;
}

The idea is that we start a batch by calling pop_batch() or push_batch(), then do a number of pop/push operations, and finally drop the PopBatch/PushBatch type to update the atomic indices inside the queue.

This way each individual pop/push operation in a batch will only do a single memcpy to transfer the value, whereas each regular non-batched pop/push operation updates the atomic indices every single time.

Would this work for you? Or is it really important that you're able to directly access contiguous slices of memory inside the queue?

ghost commented Apr 22, 2019

Actually, we could then also add the ability for the batches to access direct slices of memory when T is a trivial type (i.e. it implements Copy) and zero-initializable:

impl<T: Copy + Zeroable> PopBatch<'_, T> {
    fn as_slice(&self) -> &[T];
    fn advance(&mut self, n: usize);
}

impl<T: Copy + Zeroable> PushBatch<'_, T> {
    fn as_slice(&mut self) -> &mut [T];
    fn advance(&mut self, n: usize);
}

For example, in order to push a bunch of elements, we would call push_batch(), grab a slice into the queue with as_slice(), then write data into the slice, and call advance(n) to move the write index by n positions forward.
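
A hedged usage sketch of that flow, assuming T = f32 and a producer p as above:

let mut batch = p.push_batch();
let n = {
    let slice = batch.as_slice();   // &mut [f32] directly into the queue's buffer
    for slot in slice.iter_mut() {
        *slot = 0.0;                // write in place, no intermediate copy
    }
    slice.len()
};
batch.advance(n);                   // move the write index forward by n positions
drop(batch);                        // publish the updated index to the consumer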


mgeier commented Apr 23, 2019

Thanks @stjepang, this looks very promising!

But we would still need two slices, right?

I guess something like (probably using the plural as_slices()?):

fn as_slice(&mut self) -> (&mut [T], &mut [T]);

I don't think that the additional push() and pop() methods of PushBatch and PopBatch objects are really needed. At least I cannot see a real use case. I think such a thing should only be provided if there is need for it in the community.

If anything, I think some kind of a "peek" functionality would be more interesting. But for the cases where the "slices" functionality is available, a separate "peek" method is not necessary, because we can simply "peek" into the slices.

and finally drop the PopBatch/PushBatch type to update the atomic indices inside the queue.

Assuming that len(), push() and pop() are not really needed, do we actually need those separate types?

What if advance() directly updates the atomic indices?

I guess it wouldn't make too much sense to call advance() multiple times (without updating the indices) and then update the indices when dropping PopBatch/PushBatch. Also, calling advance() (without updating the indices) and afterwards as_slice() again doesn't make much sense, either, right?

When I hear advance(), I expect some atomic operations to happen ...

If this works:

{
    let mut mybatch = myqueue.push_batch();
    let (slice1, slice2) = mybatch.as_slice();
    // write to slices
    mybatch.advance(some_amount);
}

... shouldn't something like this also work?

let (slice1, slice2) = myqueue.write_slices();
// write to slices
myqueue.advance_write_index(some_amount);

(I probably missed a mut somewhere)

Or is it really important that you're able to directly access contiguous slices of memory inside the queue?

Well I don't know if it would be that important for me personally. But if we want to convince low-level audio programmers to switch to Rust, we should try to provide library constructs with as little overhead as possible. Having to make unnecessary copies of audio data is not a good selling point!

ghost commented Apr 23, 2019

@mgeier Thanks for your patience, I really appreciate it!

After thinking a bit more about it, I believe we could implement the following API:

impl<T: Copy + Default> Producer<T> {
    fn write_slices(&mut self) -> (&mut [T], &mut [T]);
    fn advance(&self, n: usize);
}

impl<T: Copy> Consumer<T> {
    fn read_slices(&self) -> (&[T], &[T]);
    fn advance(&self, n: usize);
}

Two tricks that are necessary for ensuring safety:

  1. The two advance() functions will check whether n is valid, i.e. they will make sure the index doesn't go too far (otherwise they panic).

  2. Function write_slices() will initialize the slices with T::default() values if they would otherwise hold uninitialized data. Note that if they hold some data that has been written and subsequently read, we don't have to do anything.

Neither of these two should really be a performance hit since they can be optimized away very well.

Would you be happy with this API?


Alternative 1: write_slices() could return slices of type &mut [MaybeUninit<T>], which would totally sidestep the problem of uninitialized memory. But then users of the API would have to deal with MaybeUninit, which could be a little annoying.

Alternative 2: It would also be possible for write_slices() to return ((*mut T, usize), (*mut T, usize)), but I suppose you'd rather not have the users of the API be forced to write unsafe code.
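
To preview how the main proposal would be used (a hedged sketch, assuming T = f32 and the signatures above):

// Producer side:
let (a, b) = p.write_slices();       // two halves of the free region
for slot in a.iter_mut().chain(b.iter_mut()) {
    *slot = 0.5;                     // write samples directly into the buffer
}
let n = a.len() + b.len();
p.advance(n);                        // publish the written samples

// Consumer side:
let (a, b) = c.read_slices();        // two halves of the readable region
let sum: f32 = a.iter().chain(b.iter()).sum();
c.advance(a.len() + b.len());        // mark those samples as consumed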


mgeier commented Apr 23, 2019

Would you be happy with this API?

Yes, thanks, that looks very nice!

Please excuse my ignorance, but I have several more questions:

One open question for me is whether there should be additional convenience functions. Getting those pairs of slices is the most important functionality because it is the lowest-level interface. But probably there should also be convenience functions like the read_batch() and write_batch() functions you suggested above?
I'm definitely biased, because the JACK and PortAudio ringbuffers have such functions as well.
Probably it's not necessary to add more functions until somebody requests them?

As I said, I'm a Rust beginner, so I cannot really tell which interface makes sense and which doesn't.
Would it be feasible to use iterators on top of the suggested API?
Something like creating a mutable iterator for each of the returned slices, chaining the two and then assigning values via iteration?
Or should there be a separate API for iterators?

I'm also not sure if the functions read_available() and write_available() are missing or not.
Of course, one could call read_slices() and check their combined length, but probably a separate function for that would simplify things?
I guess it is a common use case to regularly check if the ringbuffer has enough data available and only get the slices when there's enough data.

Another open question is the T::default() thing you described above.
It seems strange to me that write_slices() would sometimes call T::default() and sometimes it wouldn't.
I'd like to believe you when you say that this can be optimized away, but it makes me feel a bit uneasy.
Wouldn't it be possible to initialize the whole memory in the constructor?
I guess if only push() is used, the memory doesn't have to be initialized and default-initialization would cause unwarranted overhead in that case, right?

TBH, I didn't know MaybeUninit and still don't really understand it. But I don't think it would simplify the usage, right? And it must have a performance penalty, right?

Finally, I'm wondering whether the push()/pop() and the write_slices()/read_slices() APIs will ever be used on the same object. It seems to me that I would want to use either one or the other API, depending on whether I want to use chunks of (audio) data or some higher-level user-defined structs.
Would it make sense to have two different spsc types (which would share a large part of the implementation)?

ghost commented Apr 23, 2019

Would it be feasible to use iterators on top of the suggested API?
Something like creating a mutable iterator for each of the returned slices, chaining the two and then assigning values via iteration?

You could do that with the standard iterator API:

let (s1, s2) = p.write_slices();
let iter = s1.iter_mut().chain(s2.iter_mut());

for slot in iter {
    // do something
}

p.advance(/* some number */);

Or with a slightly different API, similar to one of those I proposed before:

let mut batch = p.batch();

for v in my_values_to_push {
    batch.push(v);
}

drop(batch); // advance the write_index by the number of pushed items

I'm also not sure if the functions read_available() and write_available() are missing or not.
Of course, one could call read_slices() and check their combined length, but probably a separate function for that would simplify things?

I just assumed those functions are not needed because you can compute them from .len() and .capacity(). But then again, it doesn't hurt to also have .read_available() and .write_available(). So maybe we should add them, too. Good idea!
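
A hedged sketch of how those helpers could be derived, assuming the .len() and .capacity() accessors mentioned above:

impl<T> Producer<T> {
    /// Free slots, i.e. how many elements can currently be pushed.
    pub fn write_available(&self) -> usize {
        self.capacity() - self.len()
    }
}

impl<T> Consumer<T> {
    /// Occupied slots, i.e. how many elements can currently be popped.
    pub fn read_available(&self) -> usize {
        self.len()
    }
}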

It seems strange to me that write_slices() would sometimes call T::default() and sometimes it wouldn't.
I'd like to believe you when you say that this can be optimized away, but it makes me feel a bit uneasy.
Wouldn't it be possible to initialize the whole memory in the constructor?

We would zero-initialize the whole buffer inside the constructor. Note that we can't initialize with T::default() because the constructor doesn't require T: Default. In order to be able to initialize with the default value, we'd need a special constructor that does require T: Default.

Fortunately, almost every piece of data that implements Copy is zero-initializable. We can check whether T::default() really does return a zeroed value of type T, and if so, write_slices() doesn't have to do anything because the whole buffer was zeroed in the constructor. This check will happen at compile time because the compiler is sufficiently smart.

If T::default() doesn't return a zeroed value (and we don't have the special constructor that requires T: Default), then it gets tricky. We only need to set the slots in slices (returned by write_slices()) to T::default() if they've never been written to before. You can think of this as lazy initialization of the buffer - instead of doing it during construction, we do it lazily on write_slices(). Also note that if you call write_slices() twice in succession, only the first call initializes memory.

So yeah, it's really just a bunch of boring and pedantic safety stuff. But the overall effect on performance is zero in 99% cases and minimal in the remaining 1% cases. So no big deal.
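
A hedged sketch of that lazy initialization (watermark is a hypothetical field recording how many leading slots have ever been written):

fn init_fresh_slots<T: Default>(buf: &mut [T], watermark: &mut usize, end: usize) {
    // Slots below `watermark` hold valid (possibly stale) T values and can be
    // handed out as-is; only never-written slots need T::default().
    // Assumes end >= *watermark.
    for slot in &mut buf[*watermark..end] {
        *slot = T::default();
    }
    *watermark = end;
}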

TBH, I didn't know MaybeUninit and still don't really understand it. But I don't think it would simplify the usage, right? And it must have a performance penalty, right?

MaybeUninit<T> has literally zero cost and is just a simple wrapper type indicating that the T inside may be uninitialized. So if you want to read from it, you must open an unsafe {} block and call .read() on it, promising you're 100% sure the data inside is initialized.

Fortunately, writing into a MaybeUninit<T> is perfectly safe. And that's great because we don't want to read anything from the [MaybeUninit<T>] slices returned by write_slices() anyway. We only want to write into them!

Again, more boring pedantic stuff. We're just making sure that if the user really attempts to do something silly like read from slices returned by write_slices(), nothing explodes. :)
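
A minimal illustration, using today's stable method names (the .read() mentioned above is now called assume_init_read()):

use std::mem::MaybeUninit;

let mut slot: MaybeUninit<f32> = MaybeUninit::uninit();
slot.write(1.0);                        // writing is always safe
let x = unsafe { slot.assume_init() };  // reading requires the unsafe promise
assert_eq!(x, 1.0);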

Finally, I'm wondering whether the push()/pop() and the write_slices()/read_slices() APIs will ever be used on the same object. It seems to me that I would want to use either one or the other API, depending on whether I want to use chunks of (audio) data or some higher-level user-defined structs.
Would it make sense to have two different spsc types (which would share a large part of the implementation)?

It's probably true that one would either use single-value or batched operations. But does it hurt if we support both? So if we support both, push()/pop() will always be available, while write_slices()/read_slices() will be available only when T: Copy.

My feeling is that it makes sense to have different types only if supporting both types of operations simultaneously somehow hurts performance or increases the risk of bugs. But that doesn't seem true to me. Is my thinking correct here?


mgeier commented Apr 26, 2019

we [...] don't have write_slices() due to problems with uninitialized memory

Based on your comments above, I thought those problems could be solved!

I was expecting something like this (which you suggested in #338 (comment)):

impl<T: Copy + Default> Producer<T> {
    fn write_slices(&mut self) -> (&mut [T], &mut [T]);
    fn advance(&self, n: usize);
}

And I'm missing advance() in the Consumer:

impl<T: Copy> Consumer<T> {
    fn advance(&self, n: usize);
}

The big (and I guess only) difference from your original suggestion is that Consumer::read_slices() doesn't need T: Copy.

I have no opinion on that, but your push()/pop() signatures seem to have changed from &self to &mut self, see also https://github.com/crossbeam-rs/crossbeam/pull/338/files#r264038883.

I was a bit confused by the seemingly unnecessary duplication of Consumer::read_available() and PopBatch::read_available(), but I think I understand it now.

Just to make sure my understanding is correct:

  • Between calling Consumer::pop_batch() and dropping the returned PopBatch object, it is not possible to call Consumer::read_available(). Therefore, we need an additional PopBatch::read_available().
  • For each successful call to PopBatch::pop(), the value returned from PopBatch::read_available() decreases, even though the atomic read index hasn't been changed yet.
  • When Consumer::pop_batch() is called, the atomic write index is cached and not checked anymore for subsequent PopBatch::pop() calls.
  • Having a single PopBatch object, I can switch between calling .pop() and iterating over it and calling .pop() again. The value returned from PopBatch::read_available() decreases each time.
  • In summary, a sequence of statements like this would work (and the commented ones wouldn't compile), assuming there are enough items available for reading:
    let n1 = consumer.read_available();
    let (slice1, slice2) = consumer.read_slices();
    // read some values from the slices (or not)
    // don't call consumer.advance()!
    let batch = consumer.read_batch();
    let n2 = batch.read_available();  // n1 == n2, unless someone has written to the Producer
    // wait a bit
    let n3 = batch.read_available();  // n2 == n3, even if someone has written to the Producer
    let x = batch.pop();
    let n4 = batch.read_available();  // n4 == (n3 - 1), even if someone has written to the Producer
    let x = batch.next();
    let n5 = batch.read_available();  // n5 == (n4 - 1), even if someone has written to the Producer
    
    //let n = consumer.read_available();  // Compiler error
    //let (slice1, slice2) = consumer.read_slices();  // Compiler error
    //let x = consumer.pop();  // Compiler error

Finally, I think your Iterator implementation above is a bit off. If I'm not mistaken, it should be something like this:

impl<T> Iterator for PopBatch<'_, T> {
    type Item = T;
    fn next(&mut self) -> Option<Self::Item>;
}

ghost commented Apr 26, 2019

Based on your comments above, I thought those problems could be solved!

Ugh, sorry! Yeah, with Copy and Default we can have write_slices().

I've just created a new document where we can write the code together and add comments. Might be easier than going back and forth on GitHub.

In this PR, I'll just add the minimal possible interface (push/pop only) and then we can follow up with a new PR that adds more features to it.

Finally, I think your Iterator implementation above is a bit off.

I'm just omitting fn next for brevity because it doesn't add any information ;)

ghost commented Apr 26, 2019

I have no opinion on that, but your push()/pop() signatures seem to have changed from &self to &mut self, see also https://github.com/crossbeam-rs/crossbeam/pull/338/files#r264038883.

If we don't make pop() take &mut self, then one could do the following:

  1. Grab slices using read_slices().
  2. Call pop() a couple times.
  3. Read invalid data from the slices retrieved in the first step.

But if we make pop() take a &mut self, the borrow checker will complain.

If method read_slices() didn't exist, then it'd be totally okay to take a &self. Alternatively, we can make read_slices() take a &mut self so that pop() doesn't have to. But I find the inverse a cleaner solution.
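
Concretely, a hedged sketch of what the borrow checker rejects under this scheme:

let (s1, _s2) = c.read_slices();  // shared borrows of `c`
// let x = c.pop();               // error[E0502]: cannot borrow `c` as mutable
//                                // because `s1` and `_s2` are still in use
let _first = s1.first();          // the slices remain usable up to here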

Between calling Consumer::pop_batch() and dropping the returned PopBatch object, it is not possible to call Consumer::read_available(). Therefore, we need an additional PopBatch::read_available().

I agree.

For each successful call to PopBatch::pop(), the value returned from PopBatch::read_available() decreases, even though the atomic read index hasn't been changed yet.

Correct.

When Consumer::pop_batch() is called, the atomic write index is cached and not checked anymore for subsequent PopBatch::pop() calls.

Yes.

Having a single PopBatch object, I can switch between calling .pop() and iterating over it and calling .pop() again. The value returned from PopBatch::read_available() decreases each time.

Yes. By the way, it's worth mentioning that PopBatch mutably borrows Consumer so you can't call pop() while there's an active PopBatch.


mgeier commented Apr 26, 2019

I've just created a new document

Thanks, it looks great!

I can't wait to try out this API!

In this PR, I'll just add the minimal possible interface (push/pop only) and then we can follow up with a new PR that adds more features to it.

OK, cool.

And thanks for the explanation about &mut self, that makes sense!

ghost commented Apr 27, 2019

All right, I have slimmed down the API (removed is_empty(), is_full(), len()) and will leave the PR open so that everyone can take one final look.

After that I'll follow up with several PRs introducing the new methods we have discussed.

@mgeier By the way, in case you live in Berlin, I'll be at the co.up coworking space in Adalbertstraße tomorrow. :)

@ghost ghost mentioned this pull request Apr 30, 2019
@jeehoonkang jeehoonkang (Contributor) left a comment

@stjepang It seems awesome. Thanks.

/// The queue capacity.
cap: usize,

/// Indicates that dropping a `Buffer<T>` may drop elements of type `T`.

Buffer -> Inner


mgeier commented Jul 31, 2019

Sorry if that's a naive question: Is it possible to make the ring buffer UnwindSafe (or RefUnwindSafe?)?

I would like to use it in a FFI function within catch_unwind() and I get this error:

error[E0277]: the type `std::cell::UnsafeCell<usize>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary


mgeier commented Sep 12, 2019

In case somebody is interested in a use case for the SPSC ring buffer, I'm using it in https://github.com/AudioSceneDescriptionFormat/asdf-rust/blob/master/src/streamer.rs.

In this project I need ring buffers for both use cases:

  • plain old data (with Copy), in my case f32 audio samples
  • non-trivial objects (without Copy), in my case the consumer-end of another ring buffer

Since the "slices" API is not yet available, I couldn't directly implement the first case and had to come up with a somewhat complicated work-around.

What about merging this PR and moving forward with the "slices" API?

What is the timeline of the previously discussed re-structuring of the modules/structs/functions in crossbeam::queue?

If the re-structuring takes some more time, what about merging this PR now and doing the re-structuring later together with everything else?

AFAICT we've reached consensus regarding the names of most functions and structs in this PR, the "only" open questions are with regards to the module hierarchy and the "constructor" functions, right?


UPDATE (April 2021): In the meantime, I've updated the code (AudioSceneDescriptionFormat/asdf-rust@cd5fc8d) to use the new API I've introduced in https://crates.io/crates/rtrb, which simplifies things a lot.

@uklotzde

An extension of this work enhanced with batch operations on slices can be found here:

https://github.com/uklotzde/crossbeam/tree/spsc2

Some background: I'm a developer on the Mixxx team. Recently I came up with a new SPSC queue implementation to replace our naive C++ implementation and the error-prone PortAudio C code. The enthusiasm for my initiative was limited, so I decided to abandon the C++ work. Since I didn't find a versatile Rust implementation, I ported and extended my code.

Publishing another implementation in a separate crate would further fragment the ecosystem, and crossbeam seems to be the right place where those concurrency tools should reside.

I've tried to closely follow the existing design in this PR with some notable exceptions:

  • The names of operations where an error result is not considered exceptional behavior start with try_, i.e. try_push() and try_pop(). This is consistent with the mpsc facilities in std.
  • The error results should explicitly express their intention by using an enum with corresponding (tuple) variants. Even if only a single variant exists.
  • Completely hiding mutation behind interior mutability is disputable. Modifying the contents of the queue should require exclusive access, i.e. &mut self.
  • The idea to add cache pads to the buffer upon allocation is borrowed from here: https://github.com/rigtorp/SPSCQueue (see the sketch after this list). Possible extension: instead of only requiring a single empty slot between writer and reader, one could reserve multiple slots occupying a whole cache line. This would prevent false sharing in situations when the queue is almost full. But it cannot prevent false sharing in the opposite situation, when the queue is almost empty. Might not be worth the effort.
  • Both an allocating constructor with_capacity() (renamed) and a generic constructor new() are provided. The latter works for no_std and allows providing a static or externally allocated buffer that is managed by the owner of the queue.
  • The raw batch operations in the generic producer/consumer are declared as unsafe. Imposing additional trait bounds on T as suggested in the preceding comments removes this burden.
  • I'm a strong advocate of extensively using debug assertions in code that cannot be tested exhaustively, for detecting violations of invariants and catching wrong implicit assumptions early and confidently during development.
  • This more versatile implementation is slightly slower when running the benchmarks, expect about +10% total time. The benefits of the batch operations on slices are not reflected in these numbers yet.
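
As promised above, a minimal sketch of the index-padding part of that idea, assuming crossbeam_utils::CachePadded (the field names are illustrative, not the actual layout of either implementation):

use std::sync::atomic::AtomicUsize;
use crossbeam_utils::CachePadded;

struct Inner<T> {
    head: CachePadded<AtomicUsize>, // consumer index on its own cache line
    tail: CachePadded<AtomicUsize>, // producer index on its own cache line
    buffer: *mut T,                 // could be over-allocated with slack slots so
    cap: usize,                     // hot slots don't share lines with neighbors
}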

I didn't want to open a competing PR and instead am asking for your feedback here. It is still work in progress.

@@ -142,7 +142,7 @@ unsafe impl<T: Send> Send for SegQueue<T> {}
 unsafe impl<T: Send> Sync for SegQueue<T> {}

 impl<T> SegQueue<T> {
-    /// Creates a new unbounded queue.
+    /// Creates a unbounded queue.


Suggested change
/// Creates a unbounded queue.
/// Creates an unbounded queue.

@GeorgeScrivener

Any update on this PR?
Looks like it'll be really useful for audio work. I'm happy to help if there's anything still required to get it in.


xacrimon commented Feb 8, 2020

Yeah I really want this in. Updates would be nice.


irh commented Mar 23, 2020

Looks like it'll be really useful for audio work. I'm happy to help if there's anything still required to get it in

I'd also be happy to help out here, what needs to be done?

@dnrusakov

@stjepang are you still planning to land this PR?

@vgatherps

@stjepang I'm happy to help as well.

This design is very close to two other super useful designs, and the wrappers could all share the same data structure:

  • A single-producer broadcast queue. Each write will be seen by multiple viewers. You can't move out of the queue but can operate on const references to Sync data in the queue.
  • A queue where the writer can overwrite slow readers. In this, the writer will never block and instead will invalidate readers that it passes, while the readers detect this failure. This is useful for ensuring that slow readers will fail instead of blocking others, but it prevents operations that operate on a view of the queue.

I'm happy to help with both of these features.

@nibrivia

Also interested in seeing this land in crossbeam and can give time to make it happen...


mgeier commented Oct 25, 2020

I just discovered an unrelated (AFAIK) implementation of a lock-free SPSC ringbuffer: https://crates.io/crates/ringbuf

Any opinions on that?

[UPDATE: I tried a few quick benchmarks and this PR is significantly faster than ringbuf, but a more thorough analysis would of course be interesting.]


mgeier commented Nov 5, 2020

OK, since there doesn't seem to be much movement here, I decided to take things into my own hands ...

With the permission of @stjepang I grabbed the code from this PR and shoved it into a new repo: https://github.com/mgeier/rtrb.

I then implemented some of the things discussed in the comments above, some of the things discussed in a few Dropbox Paper documents, and I came up with a few additional API tweaks.

I was trying not to cause any performance regressions compared to the code in this PR; I hope I was successful.

If you are interested, please have a look: https://github.com/mgeier/rtrb.

The code in this PR is of course very good; I hope I didn't butcher it too much, but I've certainly introduced some bugs.

I'm very much open to further suggestions regarding the API and/or implementation. Please feel free to create issues and PRs on my aforementioned repo.

If there are not too many objections, I will publish this in the next few days as a package on https://crates.io.


mgeier commented Nov 18, 2020

It took me a few more days than expected, but finally I've released my new ring buffer (based on this very PR):

https://crates.io/crates/rtrb

The API documentation, as usual, is available at:

https://docs.rs/rtrb

I'd love to hear some feedback, but please use https://github.com/mgeier/rtrb in order to not spam the comments of this PR.

@likeabbas

Hi, is there any reason why this has stalled?

@xacrimon

Crossbeam is largely stalled.


likeabbas commented Apr 25, 2021

Crossbeam is largely stalled.

That is unfortunate. Is there any insight as to why it's stalled, and if it will eventually be revitalized?

@PieterPenninckx

Hi, is there any reason why this has stalled?

@likeabbas , the work in this PR lives in a separate crate: rtrb, see this comment above. See also this discussion about some upcoming changes.
