Deep dive into pthread internals, memory architecture, and hardware-level synchronization.
- Thread mechanics: clone() syscall, stack allocation, TLS, futex, scheduling
- Synchronization internals: Futex-based mutexes, spinlock implementations, barriers
- Memory architecture: Cache coherency (MESI), false sharing, cache line effects
- Memory ordering: Acquire/release semantics, compiler/CPU reordering, barriers
- Performance analysis: perf, ThreadSanitizer, assembly inspection
- Hardware atomics: x86 LOCK prefix, ARM LL/SC, CAS operations
sudo apt-get install build-essential linux-tools-generic valgrind
Assumes: Strong C, familiarity with threading concepts (you know what a mutex is, now learn how it works).
- 01_atomics - Memory ordering deep dive: seq_cst vs relaxed vs acquire-release, message passing patterns
- 02_rwlock - Reader-writer locks, scalable concurrent reads
- 03_cache_effects - False sharing, cache coherency (MESI), 64-byte cache lines (2-10x impact!)
- 04_memory_ordering - x86 TSO vs ARM weak ordering, acquire/release patterns, ThreadSanitizer essential
- 05_spinlock_internals - TAS → TTAS → TTAS+PAUSE → Exponential backoff, CPU PAUSE instruction
- 06_barriers - Phase synchronization, manual implementation, epoch pattern
- 07_lockfree_queue - Lock-free SPSC queue, cache alignment, acquire-release synchronization
- 08_summary - Summary capstone: compare mutex vs per-thread padded counters and an acquire/release SPSC path
make all # Build all exercises
make run-01 # Run exercise 01
make asm-05 # View assembly (see LOCK prefix)
make tsan-04 # ThreadSanitizer race detection
make perf-03 # Cache misses, coherency traffic
make objdump-05 # Disassemble binary
For experienced developers (Rust/Node.js background):
- Skim Exercise 00 (reference), skip Exercise 02 (you know RwLock)
- Core path: 01 → 03 → 04 → 05 → 07 (4-5 hours)
- Use analysis tools liberally: `make asm-XX`, `make tsan-XX`, `make perf-XX`
- Read SYSTEMS_GUIDE.md for deeper theory
- See IMPROVEMENTS.md for detailed enhancement guide
Traditional path:
- Read SYSTEMS_GUIDE.md (theory: futex, cache coherency, memory models)
- Work through exercises 01-07 with assembly/perf inspection
- Reference API.md for pthread details
- `clone()` syscall flags (CLONE_VM, CLONE_THREAD)
- Stack allocation via `mmap`, guard pages
- TLS and `%fs` register (x86)
- Context switch cost (~1-3µs)
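The stack and TLS points can be observed from plain pthreads code. The following is a minimal sketch, assuming a 256 KiB stack and a 4 KiB guard region (both arbitrary values chosen for illustration); `tls_counter`, `tls_worker`, and `tls_demo` are hypothetical names, not part of the exercises.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Each thread gets its own copy, initialized to 0 when the thread starts. */
static _Thread_local int tls_counter = 0;

static void *tls_worker(void *arg) {
    int *slot = arg;
    int bumps = *slot;
    for (int i = 0; i < bumps; i++)
        tls_counter++;          /* touches only this thread's copy */
    *slot = tls_counter;        /* report the private value to the creator */
    return NULL;
}

/* Returns 1 if every thread saw an independent tls_counter. */
int tls_demo(void) {
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    assert(rc == 0);
    /* Request an explicit stack size and guard region for the new threads. */
    rc = pthread_attr_setstacksize(&attr, 256 * 1024);
    assert(rc == 0);
    rc = pthread_attr_setguardsize(&attr, 4096);
    assert(rc == 0);

    tls_counter = 100;          /* the calling thread's copy */
    int a = 3, b = 5;
    pthread_t t1, t2;
    rc = pthread_create(&t1, &attr, tls_worker, &a);
    assert(rc == 0);
    rc = pthread_create(&t2, &attr, tls_worker, &b);
    assert(rc == 0);
    (void)rc;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_attr_destroy(&attr);

    /* Workers started from 0, and the caller's value is untouched. */
    return a == 3 && b == 5 && tls_counter == 100;
}
```

On Linux, running a program that calls `tls_demo()` under `strace -f` should show the `mmap` and `clone` (or `clone3`) calls behind each `pthread_create`.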
- Mutex: Futex fast path (atomic CAS) + slow path (syscall)
- Spinlock: Test-and-set, TTAS, exponential backoff, PAUSE
- Atomics: `LOCK` prefix (x86), LL/SC (ARM)
- Condvars: Spurious wakeups, predicate loops, futex queues
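The spinlock progression above can be sketched with C11 atomics. This is a test-and-test-and-set (TTAS) lock with a PAUSE hint and no backoff, written for illustration only; `ttas_lock`, `cpu_relax`, and `ttas_demo` are hypothetical names, and the thread and iteration counts are arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* PAUSE on x86; a no-op elsewhere in this sketch. */
#if defined(__x86_64__) || defined(__i386__)
#  define cpu_relax() __builtin_ia32_pause()
#else
#  define cpu_relax() ((void)0)
#endif

#define SPIN_THREADS 4
#define SPIN_ITERS   50000L

typedef struct { atomic_bool locked; } ttas_lock;

static void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* Test: spin on a read-only load so waiters do not fight over the line. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
        /* Test-and-set: a single atomic exchange attempts to take the lock. */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

static void ttas_release(ttas_lock *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

static ttas_lock demo_lock;     /* zero-initialized: unlocked */
static long shared_count;       /* plain variable, protected by demo_lock */

static void *spin_worker(void *arg) {
    (void)arg;
    for (long i = 0; i < SPIN_ITERS; i++) {
        ttas_acquire(&demo_lock);
        shared_count++;
        ttas_release(&demo_lock);
    }
    return NULL;
}

/* Returns the final counter; SPIN_THREADS * SPIN_ITERS if the lock works. */
long ttas_demo(void) {
    pthread_t tid[SPIN_THREADS];
    shared_count = 0;
    for (int i = 0; i < SPIN_THREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, spin_worker, NULL);
        assert(rc == 0); (void)rc;
    }
    for (int i = 0; i < SPIN_THREADS; i++)
        pthread_join(tid[i], NULL);
    return shared_count;
}
```

Compiling this with `-O2 -S` on x86 is a quick way to look for the exchange and `pause` instructions that exercise 05 discusses.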
- Cache: L1/L2/L3 latency (roughly 4/12/40 cycles, CPU-dependent), 64-byte lines
- MESI: Modified, Exclusive, Shared, Invalid states
- False sharing: Independent vars on same cache line
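A common way to avoid false sharing is to give each thread's hot variable its own cache line. The sketch below assumes 64-byte lines; `padded_counter` and `padded_counters_demo` are hypothetical names. It only shows the layout and checks correctness, it does not measure anything.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE 64       /* assumed line size; check your CPU */
#define NTHREADS   4
#define ITERS      1000000L

/* Alignment forces sizeof(padded_counter) up to a full cache line. */
typedef struct {
    _Alignas(CACHE_LINE) volatile long value;  /* volatile keeps the stores in the loop */
} padded_counter;

_Static_assert(sizeof(padded_counter) == CACHE_LINE,
               "each counter should occupy exactly one cache line");

static padded_counter counters[NTHREADS];

static void *bump(void *arg) {
    padded_counter *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->value++;             /* each thread writes only its own line */
    return NULL;
}

/* Returns the sum of all per-thread counters. */
long padded_counters_demo(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        counters[i].value = 0;
        int rc = pthread_create(&tid[i], NULL, bump, &counters[i]);
        assert(rc == 0); (void)rc;
    }
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);     /* join before reading the counter */
        total += counters[i].value;
    }
    return total;
}
```

Removing `_Alignas(CACHE_LINE)` (and the static assertion) packs the counters into adjacent bytes of the same line, which is the false-sharing case; comparing the two variants under `perf stat` is one way to see the difference on your own hardware.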
- Compiler reordering: Optimizer rearranges code
- CPU reordering: Store buffers, load speculation
- Barriers: Compiler (`asm volatile`), hardware (`MFENCE`, `DMB`)
- C11 orders: relaxed, acquire, release, seq_cst
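The acquire/release pair is easiest to see in the message-passing pattern: a writer fills in plain data, then publishes it through an atomic flag. A minimal sketch follows; `payload`, `ready`, and `message_passing_demo` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;         /* plain, non-atomic data */
static atomic_int ready;    /* flag that publishes the payload */

static void *writer(void *arg) {
    (void)arg;
    payload = 42;
    /* Release: the payload write cannot be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Returns the payload value observed by the reading thread. */
int message_passing_demo(void) {
    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);

    pthread_t t;
    int rc = pthread_create(&t, NULL, writer, NULL);
    assert(rc == 0); (void)rc;

    /* Acquire: once the flag reads 1, the payload write is visible too. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                   /* spin until published */
    int seen = payload;

    pthread_join(t, NULL);
    return seen;
}
```

If both orderings are weakened to `memory_order_relaxed`, the read of `payload` becomes a data race under the C11 model, which ThreadSanitizer is designed to report.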
Cache effects:
./exercises/03_cache_effects/03_cache_effects
perf stat -e cache-misses,LLC-loads ./exercises/03_cache_effects/03_cache_effects
make asm-03
Spinlock internals:
./exercises/05_spinlock_internals/05_spinlock_internals
make asm-05 # See LOCK XCHG, PAUSE instructions
Memory ordering:
make tsan-04 # Detect broken synchronization
macOS:
- Equivalent
  - Build: `xcode-select --install` then `brew install llvm binutils`; run `make all`. Use Apple Clang by default (`gcc` is Clang on macOS).
  - TSan: `TSAN_CC=clang make tsan-04` (ThreadSanitizer supported on Apple Clang).
  - Disassembly: `llvm-objdump -d --x86-asm-syntax=intel -S <bin> | less`.
  - Debugging: `lldb <bin>` (instead of `gdb`).
- Same
  - All exercises build on Intel and Apple Silicon; pthreads + C11 atomics work unchanged.
  - Assembly generation via `clang -S -O2 -fverbose-asm` works; if `-masm=intel` is rejected, omit it.
- Different
  - `perf-*` targets are Linux-only. Use sampling instead: `xcrun xctrace record --template 'Time Profiler' --output trace.trace --launch -- ./<bin>` or a quick `sample <pid> 5 -mayDie`.
  - `objdump-%` uses GNU objdump; prefer `llvm-objdump` on macOS (see Equivalent above).
  - Internals differ: no `futex`/`clone` on macOS; mutexes/condvars are built on Mach primitives (e.g., `ulock`), but concepts map 1:1.
Windows:
- Equivalent
  - Recommended: WSL2 + Ubuntu. Install deps (`sudo apt-get install build-essential linux-tools-generic valgrind`) and use `make`/`perf` as-is.
  - Native (MSYS2/MinGW): Install MSYS2, then in the UCRT64 shell: `pacman -S --needed base-devel mingw-w64-ucrt-x86_64-toolchain`. Build with `make all` (links `libwinpthread` via `-pthread`).
  - Disassembly: `llvm-objdump -d -S <bin>` or `objdump -d -S <bin>` from MSYS2 binutils.
- Same
  - Exercises compile with GCC/Clang in MSYS2 using pthreads and C11 atomics.
  - Spin/wait semantics and memory-ordering exercises apply unchanged conceptually.
- Different
  - `perf-*` targets are not available natively; use Windows Performance Recorder/Analyzer (WPR/WPA) or Visual Studio Profiler. Prefer WSL2 for parity.
  - ThreadSanitizer is limited on native Windows toolchains; use Clang+WSL2 for TSan.
  - Under the hood, locks use Win32 primitives (e.g., SRWLOCK) rather than Linux `futex`.
- Linux `pthread_mutex_t`/`pthread_cond_t` rely on `futex()`: user-mode fast path, `FUTEX_WAIT`/`FUTEX_WAKE` slow path, optional priority inheritance (`FUTEX_LOCK_PI`).
- Windows `SRWLOCK`/`CRITICAL_SECTION` spin in user mode, then park on kernel waits (push locks or keyed events). Windows exposes the futex-like primitive as `WaitOnAddress`/`WakeByAddress*`, but high-level locks call it internally.
- Priority inheritance & robustness: POSIX has `PTHREAD_PRIO_INHERIT` and `PTHREAD_MUTEX_ROBUST`; Windows relies on priority boosting heuristics and `WAIT_ABANDONED` (for kernel `Mutex` objects only).
- Cross-process: POSIX allows `PTHREAD_PROCESS_SHARED` in shared memory; Windows requires named kernel objects (`Mutex`/`Semaphore`/`Event`) because `SRWLOCK`/`CRITICAL_SECTION` are process-local.
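The POSIX side of the cross-process point can be sketched as follows: a mutex marked `PTHREAD_PROCESS_SHARED` lives in an anonymous shared mapping and is used by a parent and a forked child. This is an illustration for Linux-like systems only; `shared_state` and `pshared_demo` are hypothetical names, and error handling is kept minimal.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS under strict -std=c11 */
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t lock;
    long counter;
} shared_state;

/* Parent and child each add `iters`; returns the final counter or -1. */
long pshared_demo(int iters) {
    /* Shared anonymous mapping: the same physical pages after fork(). */
    shared_state *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED) return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    s->counter = 0;

    pid_t pid = fork();
    if (pid < 0) { munmap(s, sizeof *s); return -1; }

    /* Both processes run this loop against the same mutex and counter. */
    for (int i = 0; i < iters; i++) {
        pthread_mutex_lock(&s->lock);
        s->counter++;
        pthread_mutex_unlock(&s->lock);
    }
    if (pid == 0) _exit(0);     /* child is done */

    waitpid(pid, NULL, 0);
    long total = s->counter;
    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof *s);
    return total;
}
```

A production version would usually also set `PTHREAD_MUTEX_ROBUST`, so that a process dying while holding the lock does not leave the other one blocked forever.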
- See `exercises/08_summary` for a hands-on summary that combines atomics, memory ordering, cache effects, and synchronization. Build with `make 08_summary` and run with `make run-08`.
- SYSTEMS_GUIDE.md - Core theory (futex, MESI, memory models)
- API.md - pthread API reference
- `man futex`, `man pthreads`
- Intel SDM - Memory ordering, LOCK prefix
- Drepper: What Every Programmer Should Know About Memory
- McKenney: Linux Kernel Perfbook