Companion proof-of-concept for the article *Parallelizing the Critical Path: OpenMP in Latency-Sensitive Systems*. Each executable demonstrates a core concept from the article using a realistic trading-system scenario — computing technical indicators across a universe of instruments.
- C++20 compiler with OpenMP support (GCC 11+ or Clang 14+)
- CMake 3.14+
On Ubuntu/Debian:
```sh
sudo apt install build-essential cmake
```

With Make (simplest):
```sh
make
```

With CMake (if installed):
```sh
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
```

All binaries are placed in `build/`.
The headline benchmark. Runs a multi-stage signal pipeline across 2000 instruments (5000 ticks each): rolling volatility, a 6-period EMA chain, spectral DFT (12 frequency bins on log-return windows), and a final score aggregation. Reports median serial vs parallel latency with warmup runs.
```sh
./build/benchmark
```

Sample output (8-core / 16-thread Intel i7-11800H, Ubuntu 24.04, GCC 13.3):
```
=== Signal Pipeline Benchmark ===
Instruments: 2000 | Ticks each: 5000
Pipeline: rolling vol -> EMA chain (6 periods) -> spectral DFT (12 bins) -> score
Threads: 16
Warmup: 3 runs | Timed: 10 runs
Serial   (median): 1900332 us
Parallel (median):  227518 us
Speedup: 8.35x
```
Demonstrates the basic `#pragma omp parallel for` pattern — computing fast and slow EMA independently across 500 instruments.
```sh
./build/parallel_for_ema
```

Walks through the progression from the article:
- Data race — racy `signal_count++` produces inconsistent results across trials
- Critical section — correct but serialized
- Atomic — correct with lower overhead
- Reduction — correct with no contention (the right answer)
```sh
./build/race_condition_demo
```

Shows per-thread scratch buffers declared inside a `#pragma omp parallel` block (automatically private), plus the explicit `private()` clause for variables declared outside the region.
```sh
./build/thread_local_storage
```

Compares static, dynamic, and guided scheduling on a heterogeneous instrument universe where tick counts range from 200 to 5000 per symbol — demonstrating why static scheduling leaves performance on the table when per-iteration work varies.
```sh
./build/scheduling_strategies
```

Control thread count and placement via environment variables:
```sh
# Use exactly 4 threads
OMP_NUM_THREADS=4 ./build/benchmark

# Pin threads to cores (important for latency)
OMP_PROC_BIND=true OMP_PLACES=cores ./build/benchmark
```

```
├── CMakeLists.txt
├── Makefile
├── README.md
├── include/
│   └── market_data.h             # Instrument types, EMA/volatility helpers, universe generators
└── src/
    ├── benchmark.cpp             # Multi-stage signal pipeline benchmark
    ├── parallel_for_ema.cpp      # Basic parallel for with EMA
    ├── race_condition_demo.cpp   # Race -> critical -> atomic -> reduction
    ├── thread_local_storage.cpp  # Per-thread scratch buffers
    └── scheduling_strategies.cpp # static vs dynamic vs guided
```
Benchmarks were compiled and run on:
| Component | Detail |
|---|---|
| CPU | Intel Core i7-11800H (8 cores / 16 threads) @ 2.30 GHz (boost 4.60 GHz) |
| L3 Cache | 24 MiB |
| OS | Ubuntu 24.04.3 LTS |
| Compiler | GCC 13.3.0 |
| Flags | -std=c++20 -O2 -fopenmp |
MIT