Lamport's queue

I was inspired to do this small project after doing a workshop on lock-free and high-performance concurrent algorithms held by Martin Thompson at the 2013 San Francisco QCon. The workshop was focused on Java, but it inspired me to try it out in C++ and some inline assembly.

What is it?

This project is simply an implementation of two bounded-buffer queues for the producer-consumer problem, one which uses POSIX mutexes and one implemented as a Lamport's queue using an atomic compare-and-swap subroutine. I'm doing a performance test by having multiple producers compete over the queue (contention). The design of the test is the same as it was in the workshop, but the implementation is entirely written by me.

How can I run it?

To run this you need an 64-bits Intel x86, an OS that supports POSIX threads and be able to build and compile using GNU Make and GNU C++11 (i.e. use Linux + GCC).

You should also tweak the number of concurrent threads to how many CPUs you have, alternatively have a CPU that supports hyperthreading. If you don't do this, running the performance test will be pointless -- on a Intel Core 2 machine I actually got worse performance for the lock-free queue. I'm defaulting to 4 threads, since this can run on a dual-core machine that supports hyperthreading, but this is easily changed at compile time by editing the Makefile.

Limitations

From a code perspective, I haven't really had time to look into a decent testing framework. I will probably separate and create better tests for this in the future, but this was really about doing a contention queue without locks more than anything else.

From a concurrency point of view, I haven't tested it with more than one consumer. By scoping it to just one consumer, I've done an "optimized" version of the queue using mutexes, by only locking the tail pointer.

Other optimisations

It's pretty apparent that doing mod and div has a high performance impact, because of the architecture's inability to pipeline these instructions and that these instructions demand a lot of CPU cycles. For the two "optimised" queues, I have therefore rounded the buffer size up to the nearest power of two and use a bitwise mask instead to calculate the buffer index by division. This alone improves performance drastically. I haven't done any other optimizations, but of course inlining functions, reducing memory access and references (and trying to hit the cache more often), tuning and tweaking compiler flags etc would also be fun to do at some point.

Note: If you turn on too aggressive compiler optimizations (like -O3), weird things start to happen.

Future work

At some point I would like to test multiple consumers, e.g. a scenario with 5-6 producers and 2-3 consumers. This would involve a minor redesign as the consumer thread is now the main thread. It would also involve dropping the "optimized" queue as I would need to safe-guard the head pointer as well. Another future experiment would be to test the impact of (atomic) memory loads.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
src		src
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lamport's queue

What is it?

How can I run it?

Limitations

Other optimisations

Future work

About

Releases

Packages

Languages

enfiskutensykkel/lamport

Folders and files

Latest commit

History

Repository files navigation

Lamport's queue

What is it?

How can I run it?

Limitations

Other optimisations

Future work

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages