Matrix Engine is a C++20 matrix computation library with sequential and parallel execution modes.
- C++20-compatible compiler
- CMake 3.16+
cmake -S . -B build
cmake --build build --config Release -j./build/matrix_tests./build/add_benchmark [iters=5] [threads=8]
./build/multiply_benchmark [iters=5] [threads=8]
./build/determinant_benchmark [iters=10] [threads=8]cd benchmark
chmod +x benchmark.sh
./benchmark.shSee .txt output files in /benchmark.
The benchmark was run on an Apple M3 CPU with 8 cores (4 performance and 4 efficiency). The performance will vary across machines and platforms.
The library is organized into three layers:
Matrix (include/Matrix.h) — Template class Matrix<T, rows, cols> with compile-time dimensions. Move-only semantics backed by std::unique_ptr. Provides element access, minor computation, determinant, and addition.
Executors (include/Executor.hpp) — Abstract Executor base with submit(Func, Args...) -> std::future<R> and cancel(). Two concrete implementations:
LockExecutor— mutex + condition variable, tasks queued instd::queueLockFreeExecutor— atomic circular buffer with CAS loops andwait/notify
Parallel Operations (include/ParallelOperation.hpp) — Stateless, reusable operations dispatched over ExecutionMode (SeqMode or ParMode{Executor&}) via std::visit:
MultiplyOperation/AddOperation— row-level parallelization; each worker thread writes to an independent rowDeterminantOperation— parallelizes top-level Laplace cofactor terms only (not recursive calls)PendingTasksGuard— RAII guard that callscancel()on exception,dismiss()on success
Example:
matrix_engine::MultiplyOperation multiplyOperation;
matrix_engine::ExecutionMode const seqMode{matrix_engine::SeqMode{}};
matrix_engine::ExecutionMode const parMode{matrix_engine::ParMode{executor}};
auto seq = multiplyOperation(seqMode, a, b);
auto par = multiplyOperation(parMode, a, b);- Operations (
MultiplyOperation,AddOperation,DeterminantOperation) are independent structs with a templatedoperator(), not subclasses of a common base. - A virtual base would require erasing the matrix type parameters (
T,rows,cols) behind a common interface, forcing runtime dispatch on types that are fully known at compile time. - The template parameters differ per operation (multiply takes two differently-sized matrices; determinant takes a square matrix), so no single virtual signature could cover all three without type erasure overhead.
- Using plain structs keeps the implementation stateless, zero-overhead, and open to overloading for new type combinations without touching a class hierarchy.
- Current design is OCP along the operation axis: adding a new operation (e.g.
TransposeOperation) requires no changes to existing code. It is not OCP along the execution mode axis: adding a new mode (e.g.GpuMode) requires extending theExecutionModevariant and adding arun()overload in every existing operation. This is the inherent expression problem tradeoff ofstd::variant+std::visit— the inverse would be true with a virtual hierarchy. The current tradeoff is intentional since execution modes are stable and new operations are the more likely extension point.
ParModestoresExecutor&; the executor must outlive each operation call.ParModeborrows an executor instead of owning it viastd::unique_ptrorstd::shared_ptr.- Borrowing keeps ownership explicit and avoids hidden lifetime extension.
std::unique_ptrwould imply ownership transfer, which is not desired for a reusable thread pool.std::shared_ptrwould add shared-ownership overhead and make shutdown timing less predictable.- Operations submit jobs and wait before returning, so a non-owning reference is sufficient.
- Executors are designed to be reusable.
cancel()is queue-drain behavior: it removes pending tasks but does not stop tasks already running.PendingTasksGuardis an RAII guard that callscancel()on early exit (for example, exceptions); normal completion callsdismiss().
- Matrix data is stored as
std::unique_ptr<array<array<T, cols>, rows>>rather than a plain member array. - A plain
array<array<T, cols>, rows>member would be allocated inline. For large matrices (e.g. N=12000 as used in benchmarks), this exceeds typical stack limits (~8 MB) — a 12000×12000floatmatrix is ~576 MB — making heap allocation a hard requirement. - Heap allocation via
unique_ptralso makes moves O(1) — only the pointer is transferred — which is important since operations likeminor()anddeterminant()return matrices by value through deep recursion. - Copy constructor and copy assignment are explicitly deleted to enforce move-only semantics and prevent accidental deep copies.
- Each parallel operation submits tasks to the executor and holds a
vector<future<T>>of in-flight jobs. - If an exception is thrown mid-operation (e.g. during task submission), futures already in the vector would be abandoned — their results never waited on — while the executor continues running the associated tasks against now-dangling references.
PendingTasksGuardwraps the executor and the job vector. Its destructor callsexecutor.cancel()to drain the pending queue, then waits on any already-running futures, preventing use-after-free on captured references.- On the success path,
dismiss()deactivates the guard so the destructor is a no-op. - The guard is non-copyable and non-movable to prevent accidental transfer of this responsibility.