A bounded multi-producer multi-consumer (MPMC) queue implemented in Rust using only std, with no external crates for the queue itself. The queue supports multiple threads pushing and popping simultaneously without data races, backed by a fixed-size ring buffer.
Two implementations are provided: one using standard locking primitives and one fully lock-free, so their performance characteristics can be directly compared under varying contention.
```
.
├── benches
│   └── throughput.rs         # Criterion benchmarks
├── src
│   ├── queue
│   │   ├── blocking.rs       # Mutex + Condvar implementation
│   │   ├── lockfree.rs       # Atomic sequence number implementation
│   │   ├── ring_buffer.rs    # Underlying ring buffer
│   │   └── slot.rs           # Per-slot data + sequence atomic
│   ├── sync
│   │   └── backoff.rs        # Exponential backoff strategy
│   ├── traits
│   │   └── bounded_queue.rs  # BoundedQueue<T> trait
│   └── utils
│       └── cache_pad.rs      # Cache line padding wrapper
└── tests
    ├── blocking.rs
    ├── lockfree.rs
    ├── fifo.rs
    ├── stress.rs
    └── asymemetrics.rs
```
Wraps a RingBuffer<T> behind a Mutex, with two Condvars called not_full and not_empty for producer and consumer coordination. The push operation acquires the lock and waits on not_full if the buffer is full, while pop waits on not_empty if the buffer is empty. Each successful push signals not_empty, and each successful pop signals not_full. The implementation is simple and correct, but the single global lock serializes all producers and consumers, which becomes a bottleneck as the number of threads grows.
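The push/pop protocol above can be sketched as follows. This is a simplified illustration, not the repository's actual code: a `VecDeque` stands in for the real `RingBuffer<T>`, and the type and field names (`BlockingQueue`, `not_full`, `not_empty`) follow the description rather than the source.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

// Sketch of the blocking queue: one Mutex around the buffer, plus one
// Condvar per direction so producers only wake consumers and vice versa.
pub struct BlockingQueue<T> {
    inner: Mutex<VecDeque<T>>,
    capacity: usize,
    not_full: Condvar,  // producers wait here when the buffer is full
    not_empty: Condvar, // consumers wait here when the buffer is empty
}

impl<T> BlockingQueue<T> {
    pub fn new(capacity: usize) -> Self {
        Self {
            inner: Mutex::new(VecDeque::with_capacity(capacity)),
            capacity,
            not_full: Condvar::new(),
            not_empty: Condvar::new(),
        }
    }

    pub fn push(&self, item: T) {
        let mut buf = self.inner.lock().unwrap();
        // Wait (releasing the lock while parked) until a slot frees up.
        while buf.len() == self.capacity {
            buf = self.not_full.wait(buf).unwrap();
        }
        buf.push_back(item);
        // Wake exactly one consumer, never other producers.
        self.not_empty.notify_one();
    }

    pub fn pop(&self) -> T {
        let mut buf = self.inner.lock().unwrap();
        while buf.is_empty() {
            buf = self.not_empty.wait(buf).unwrap();
        }
        let item = buf.pop_front().unwrap();
        self.not_full.notify_one();
        item
    }
}

fn main() {
    let q = Arc::new(BlockingQueue::new(2));
    let p = Arc::clone(&q);
    let h = std::thread::spawn(move || {
        for i in 0..4 {
            p.push(i); // blocks once the 2-slot buffer fills
        }
    });
    let mut out = Vec::new();
    for _ in 0..4 {
        out.push(q.pop());
    }
    h.join().unwrap();
    assert_eq!(out, vec![0, 1, 2, 3]); // single producer => strict FIFO
}
```

Note the `while` loops around `wait`: condition variables permit spurious wakeups, so the predicate must be rechecked after every wakeup.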
Uses per-slot atomic sequence numbers to coordinate producers and consumers without any global lock. Each slot holds a sequence: AtomicUsize alongside its data. A producer claims a slot by reading the tail position and verifying seq == pos (slot is empty), then doing a CAS on tail to reserve it. After writing, it sets seq = pos + 1 to signal the slot is ready. A consumer claims a slot by verifying seq == pos + 1 (slot is full), CASing head, reading the data, and then setting seq = pos + capacity to recycle the slot for the next lap. Head and tail are wrapped in CachePadded to prevent false sharing, and push/pop use an exponential Backoff on contention rather than spinning in a tight loop.
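The sequence-number protocol can be condensed into a minimal sketch (a simplified Vyukov-style queue, not the repository's actual code: `CachePadded` and `Backoff` are omitted, and the fail-fast `Result`/`Option` return style is an assumption):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Slot<T> {
    sequence: AtomicUsize,
    data: UnsafeCell<Option<T>>,
}

// Sound because the CAS on head/tail gives exactly one thread access
// to a slot's data for any given position (see the safety notes below).
unsafe impl<T: Send> Sync for Slot<T> {}

pub struct LockFreeQueue<T> {
    slots: Box<[Slot<T>]>,
    mask: usize, // capacity - 1; capacity is a power of two
    head: AtomicUsize,
    tail: AtomicUsize,
}

impl<T> LockFreeQueue<T> {
    pub fn new(capacity: usize) -> Self {
        let capacity = capacity.next_power_of_two();
        let slots: Box<[Slot<T>]> = (0..capacity)
            .map(|i| Slot {
                sequence: AtomicUsize::new(i), // slot i starts "empty for lap 0"
                data: UnsafeCell::new(None),
            })
            .collect();
        Self { slots, mask: capacity - 1, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    pub fn push(&self, item: T) -> Result<(), T> {
        let mut pos = self.tail.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let seq = slot.sequence.load(Ordering::Acquire);
            if seq == pos {
                // Slot is empty for this lap: try to claim it.
                match self.tail.compare_exchange_weak(pos, pos + 1, Ordering::AcqRel, Ordering::Relaxed) {
                    Ok(_) => {
                        unsafe { *slot.data.get() = Some(item) };
                        // Publish: a consumer that sees pos + 1 also sees the data.
                        slot.sequence.store(pos + 1, Ordering::Release);
                        return Ok(());
                    }
                    Err(current) => pos = current, // lost the race, retry
                }
            } else if seq < pos {
                return Err(item); // slot not yet recycled: queue is full
            } else {
                pos = self.tail.load(Ordering::Relaxed); // another producer moved tail
            }
        }
    }

    pub fn pop(&self) -> Option<T> {
        let mut pos = self.head.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let seq = slot.sequence.load(Ordering::Acquire);
            if seq == pos + 1 {
                // Slot is full for this lap: try to claim it.
                match self.head.compare_exchange_weak(pos, pos + 1, Ordering::AcqRel, Ordering::Relaxed) {
                    Ok(_) => {
                        let item = unsafe { (*slot.data.get()).take() };
                        // Recycle the slot for the next lap.
                        slot.sequence.store(pos + self.mask + 1, Ordering::Release);
                        return item;
                    }
                    Err(current) => pos = current,
                }
            } else if seq < pos + 1 {
                return None; // slot not yet written: queue is empty
            } else {
                pos = self.head.load(Ordering::Relaxed);
            }
        }
    }
}

fn main() {
    let q = LockFreeQueue::new(2);
    assert!(q.push("a").is_ok());
    assert!(q.push("b").is_ok());
    assert!(q.push("c").is_err()); // full: slot 0 not recycled yet
    assert_eq!(q.pop(), Some("a"));
    assert_eq!(q.pop(), Some("b"));
    assert_eq!(q.pop(), None); // empty
}
```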
Per-slot sequence numbers instead of a global version counter — a single shared counter would reintroduce serialization. Per-slot sequences mean a producer and a consumer can be active on different slots simultaneously with no interference between them.
CachePadded on head and tail — head and tail are written frequently by different threads. Without padding they would share a cache line, causing every write to invalidate the other thread's cached copy (false sharing). Padding each to 64 bytes gives them independent cache lines.
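A cache-line padding wrapper of this kind can be sketched with `#[repr(align(64))]` (illustrative, not the contents of `cache_pad.rs`; note that some aarch64 cores, including Apple Silicon, use 128-byte lines, so production wrappers often pad per-architecture):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The alignment attribute forces each wrapped value onto its own
// 64-byte cache line, so writes to `head` never invalidate the line
// holding `tail` (no false sharing).
#[repr(align(64))]
pub struct CachePadded<T>(pub T);

struct Positions {
    head: CachePadded<AtomicUsize>,
    tail: CachePadded<AtomicUsize>,
}

fn main() {
    // Size is rounded up to the alignment: each atomic occupies a full line.
    assert_eq!(std::mem::size_of::<CachePadded<AtomicUsize>>(), 64);
    assert_eq!(std::mem::align_of::<CachePadded<AtomicUsize>>(), 64);

    let p = Positions {
        head: CachePadded(AtomicUsize::new(0)),
        tail: CachePadded(AtomicUsize::new(0)),
    };
    p.tail.0.store(1, Ordering::Relaxed);
    assert_eq!(p.head.0.load(Ordering::Relaxed), 0);
}
```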
Exponential backoff in push/pop — a tight CAS retry loop wastes memory bus bandwidth and starves other threads. The Backoff type increases the delay between retries, giving competing threads room to make progress and reducing overall cache coherence traffic.
Backoff reset after yield — after calling yield_now(), the backoff step resets to 0. Since the OS scheduler has already given other threads time to run, starting back at short spins is more responsive than yielding again immediately.
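The backoff behavior described in the two points above can be sketched like this (names and the `SPIN_LIMIT` value are illustrative, not the contents of `backoff.rs`):

```rust
use std::thread;

const SPIN_LIMIT: u32 = 6; // spin up to 2^6 = 64 iterations per step (assumed value)

pub struct Backoff {
    step: u32,
}

impl Backoff {
    pub fn new() -> Self {
        Backoff { step: 0 }
    }

    /// Called after a failed CAS: spin briefly, doubling the interval each
    /// time; once past the spin limit, yield to the OS scheduler and reset
    /// the step, since other threads have just had time to make progress.
    pub fn snooze(&mut self) {
        if self.step <= SPIN_LIMIT {
            for _ in 0..(1u32 << self.step) {
                std::hint::spin_loop(); // hint the CPU we are busy-waiting
            }
            self.step += 1;
        } else {
            thread::yield_now();
            self.step = 0; // reset: restart with short spins, not another yield
        }
    }
}

fn main() {
    let mut b = Backoff::new();
    for _ in 0..=SPIN_LIMIT {
        b.snooze(); // spinning phase: step climbs 0 -> 7
    }
    assert_eq!(b.step, SPIN_LIMIT + 1);
    b.snooze(); // past the limit: yields, then resets
    assert_eq!(b.step, 0);
}
```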
Power-of-two capacity with bitwise AND indexing — capacity is rounded up to the next power of two at construction time. This allows slot indexing via pos & (capacity - 1) instead of pos % capacity, replacing a division instruction (~20–90 cycles) with a single AND (~1 cycle) on the hot path.
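A quick demonstration of why the AND substitution is valid — for power-of-two capacities, `pos & (capacity - 1)` and `pos % capacity` are identical for every position:

```rust
fn main() {
    // Round up at construction time, as described above.
    let capacity = 100usize.next_power_of_two();
    assert_eq!(capacity, 128);
    let mask = capacity - 1; // 0b0111_1111: the low 7 bits

    // The AND keeps exactly the low bits, which is what modulo by a
    // power of two computes — but without a division instruction.
    for pos in [0usize, 1, 127, 128, 129, 5000] {
        assert_eq!(pos & mask, pos % capacity);
    }
}
```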
AcqRel on CAS instead of SeqCst — the safety guarantee comes from per-slot sequence atomics, not from a global total order between head and tail. AcqRel is sufficient and avoids the MFENCE/LOCK XCHG overhead of SeqCst, which is measurable on ARM (Apple Silicon).
Separate Condvar per direction in the blocking queue — using not_full and not_empty separately means a producer only wakes consumers (not other producers), and a consumer only wakes producers. A single condvar would cause unnecessary thundering herd wakeups on every operation.
1. Writing through UnsafeCell in slot.rs
```rust
pub data: UnsafeCell<Option<T>>,
```

`data` is wrapped in `UnsafeCell` to allow interior mutability without a lock. The write (`*slot.data.get() = Some(item)`) and read (`(*slot.data.get()).take()`) are safe because the CAS on `slot.sequence` acts as the exclusive access gate — only one thread can win the CAS for a given slot position, so no two threads ever touch `data` for the same slot concurrently.
2. unsafe impl Sync for Slot<T> in slot.rs
```rust
unsafe impl<T: Send> Send for Slot<T> {}
unsafe impl<T: Send> Sync for Slot<T> {}
```

The compiler refuses to derive `Sync` for `Slot` because `UnsafeCell` is not `Sync`. The manual impl is sound because the sequence number protocol enforces a strict happens-before relationship: the producer's `store(pos + 1, Release)` after writing data synchronizes with the consumer's `load(Acquire)` that observes `seq == pos + 1` before reading. This guarantees the consumer always sees a fully written value and no data race is possible.
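The Release/Acquire pairing that makes this sound can be isolated into a minimal standalone example (illustrative names, not the repository's code):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// One non-atomic payload guarded by one atomic sequence number.
struct Cell {
    data: UnsafeCell<u64>,
    seq: AtomicUsize,
}
// Sound only because readers gate access on the Acquire load below.
unsafe impl Sync for Cell {}

fn demo() -> u64 {
    let cell = Arc::new(Cell { data: UnsafeCell::new(0), seq: AtomicUsize::new(0) });
    let writer = Arc::clone(&cell);
    let t = thread::spawn(move || {
        unsafe { *writer.data.get() = 42 }; // write the payload...
        writer.seq.store(1, Ordering::Release); // ...then publish it
    });
    // Spin until the Acquire load observes the published sequence number.
    while cell.seq.load(Ordering::Acquire) != 1 {
        std::hint::spin_loop();
    }
    // The Release store synchronizes-with the Acquire load that saw 1,
    // so the payload write is guaranteed visible here: no data race.
    let value = unsafe { *cell.data.get() };
    t.join().unwrap();
    value
}

fn main() {
    assert_eq!(demo(), 42);
}
```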
```
cargo test
```

Tests cover sequential correctness, FIFO ordering, concurrent contention, boundary conditions, stress scenarios, and asymmetric producer/consumer ratios for both implementations.
```
cargo bench
```

Criterion generates an HTML report under `target/criterion/`. Four benchmark groups are included:
| Group | What it measures |
|---|---|
| `blocking` | Throughput at 1/2/4/8/16 threads × 64/256/1024 capacity |
| `lockfree` | Same grid for the lock-free implementation |
| `scaling` | Both implementations side by side across thread counts at cap=1024 |
| `asymmetric` | 8 producers / 2 consumers and 1 producer / 8 consumers |
To view the full HTML report with graphs:

```
open target/criterion/report/index.html
```

Benchmarks run on a 10-core machine. Each cell shows the median time to complete 100 push + 100 pop operations per thread.
| Threads (prod + cons) | Blocking (µs) | Lock-free (µs) | Winner |
|---|---|---|---|
| 1 + 1 | 22.3 | 22.7 | Tied |
| 2 + 2 | 36.5 | 37.6 | Tied |
| 4 + 4 | 68.2 | 74.5 | Blocking |
| 8 + 8 | 139.0 | 138.2 | Tied |
| 16 + 16 | 353.6 | 304.2 | Lock-free |
| Threads | Cap | Blocking (µs) | Lock-free (µs) |
|---|---|---|---|
| 1 | 64 | 27.0 | 21.6 |
| 1 | 256 | 22.5 | 21.0 |
| 1 | 1024 | 24.1 | 22.9 |
| 4 | 64 | 146.9 | 82.2 |
| 4 | 256 | 81.5 | 77.8 |
| 4 | 1024 | 73.7 | 63.2 |
| 8 | 64 | 429.2 | 183.2 |
| 8 | 256 | 196.7 | 161.3 |
| 8 | 1024 | 135.4 | 132.0 |
| 16 | 64 | 1089.9 | 396.0 |
| 16 | 256 | 513.2 | 369.0 |
| 16 | 1024 | 354.2 | 287.9 |
| Workload | Median (µs) |
|---|---|
| 8 producers / 2 consumers | 79.5 |
| 1 producer / 8 consumers | 73.2 |
Three incremental optimisations were applied to the lock-free queue after the initial implementation. The table shows median time (µs) at each stage.
| Threads | Baseline | + Backoff reset | + AcqRel + bitwise AND | Improvement |
|---|---|---|---|---|
| 1 + 1 | 24.6 | 21.9 | 22.7 | -8% |
| 2 + 2 | 38.9 | 51.1 | 37.6 | -3% |
| 4 + 4 | 99.8 | 96.7 | 74.5 | -25% |
| 8 + 8 | 133.3 | 146.0 | 138.2 | +4% |
| 16 + 16 | 388.6 | 456.7 | 304.2 | -22% |
The biggest gains are at 4 and 16 threads where the bitwise AND indexing eliminates repeated division on the hot path. The 2-thread case is volatile due to OS scheduling noise at low contention.
In the scaling graph (red = blocking, green = lock-free), the lock-free queue pulls ahead at 16 threads, where core saturation makes the mutex the dominant bottleneck.
1 producer + 8 consumers (73 µs) is faster, with a tighter distribution, than 8 producers + 2 consumers (79 µs).
Additional benchmark snapshots from earlier optimization stages are included below for comparison.
At low thread counts (1–4 threads), blocking and lock-free performance are comparable, and the blocking queue occasionally wins: mutex acquisition is cheap when contention is low, and there is no CAS retry overhead.
The lock-free queue pulls clearly ahead at high thread counts. At 8 threads with cap=64 it is more than 2× faster (183 µs vs 429 µs), and at 16 threads with cap=64 it is 2.7× faster (396 µs vs 1090 µs). This is where the global mutex in the blocking queue becomes the dominant bottleneck: every push and pop serializes through it, and thread scheduling overhead for waking or sleeping on Condvar compounds the cost.
Larger capacity helps the blocking queue disproportionately at higher thread counts (cap=1024 at 8 threads: 135 µs vs cap=64: 429 µs) because threads block less often — the buffer rarely fills or empties completely, so Condvar wakeups are less frequent.
At 16 threads on a 10-core machine the lock-free queue wins clearly (304 µs vs 354 µs at cap=1024). Unlike the original implementation, the AcqRel + bitwise AND optimisations give it enough headroom to stay ahead even when the OS scheduler is under pressure.
For asymmetric workloads, 1 producer / 8 consumers (73 µs) is faster than 8 producers / 2 consumers (79 µs). With a single producer there is zero contention on the tail pointer — the bottleneck is purely on the consumer side where 8 threads compete on head, which scales better than having 8 threads hammering tail into a buffer that fills quickly.