Axiom is an open-source, high-performance C++ tensor library that brings NumPy and PyTorch simplicity to native code. With state-of-the-art SIMD vectorization, BLAS acceleration, and Metal GPU support, Axiom delivers HPC-grade performance while maintaining an intuitive API that feels natural to Python developers.
The Axiom library offers ...
- ... Python-familiar API through operator overloading, method chaining, and NumPy-compatible function names
- ... high performance through Accelerate, OpenBLAS, and manually tuned SIMD kernels
- ... vectorization by SSE2/3/4, AVX, AVX2, AVX-512, FMA3/4, ARM NEON/ARMv8, WASM SIMD, RISC-V Vector, and PowerPC VSX
- ... parallel execution by OpenMP with intelligent workload thresholds
- ... full GPU acceleration via Metal Performance Shaders (MPSGraph) — every operation runs on GPU, not just matmul
- ... einops integration for intuitive `rearrange("b h w c -> b c h w")` tensor manipulation
- ... zero-copy views with strides-based memory model eliminating unnecessary data copies
- ... complete dtype coverage including Float16/32/64, Int8-64, Bool, and Complex64/128
- ... portable distribution with dynamically linked BLAS backends for cross-platform deployment
Get an impression of the familiar syntax in the Quick Start section and the impressive performance in the Benchmarks section.
// NumPy: x = np.where(x > 0, x, 0)
x = Tensor::where(x > 0, x, 0.0f); // reassigns an existing tensor, mirroring the NumPy line
// NumPy: y = x.reshape(2, -1).T
auto y = x.reshape({2, -1}).T();
// PyTorch: z = F.softmax(scores, dim=-1)
auto z = scores.softmax(-1);

If you know NumPy or PyTorch, you already know Axiom.
3500+ GFLOPS on M4 Pro. Beats Eigen & PyTorch.
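The exact harness behind this figure isn't reproduced here, but a minimal single-precision matmul throughput check can be built from the Quick Start API alone, as in the sketch below; the matrix size, iteration count, and `steady_clock` timing are illustrative choices, not the official benchmark setup.

```cpp
#include <axiom/axiom.hpp>
#include <chrono>
#include <cstdio>

using namespace axiom;

int main() {
    const int N = 2048;                       // square matmul: 2*N^3 FLOPs per run
    auto A = Tensor::randn({N, N});
    auto B = Tensor::randn({N, N});

    auto C = A.matmul(B);                     // warm-up run

    const int iters = 10;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        C = A.matmul(B);
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count() / iters;
    double gflops = 2.0 * N * N * N / secs / 1e9;
    std::printf("matmul %dx%d: %.1f GFLOPS\n", N, N, gflops);
    return 0;
}
```

Dividing 2·N³ by the per-iteration wall time is the conventional way dense-matmul throughput is reported, so numbers from a sketch like this are directly comparable to the figure above.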
// Einops-style rearrangement
auto img = x.rearrange("b h w c -> b c h w");
// Einops-style reduction (spatial pooling)
auto pooled = x.reduce(
"b (h p1) (w p2) c -> b h w c",
"mean", {{"p1", 2}, {"p2", 2}}
);
// Global average pooling
auto gap = features.reduce("b h w c -> b c", "mean");

Complex transformations, readable code.
- 26 comprehensive test suites covering all operations
- CI/CD pipeline testing CPU and GPU paths
- Cross-platform validation on macOS, Linux, Windows
- NaN/Inf guards with `assert_finite()` safety rails
- Shape assertions with `assert_shape("b h w c")`
Production-ready from day one.
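For illustration, here's how those safety rails might be dropped into a pipeline. The sketch assumes `assert_shape()` and `assert_finite()` are exposed as Tensor methods (they may live elsewhere in the API), and the surrounding tensors are made up.

```cpp
#include <axiom/axiom.hpp>

using namespace axiom;

void forward(const Tensor& images) {
    // Assumed to be a Tensor method; fails fast on an unexpected layout.
    images.assert_shape("b h w c");

    auto activations = images.relu();

    // Assumed to be a Tensor method; catches NaN/Inf before it propagates.
    activations.assert_finite();
}
```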
Latest Release: Axiom 1.0.0
git clone https://github.com/frikallo/axiom.git
cd axiom && make release

Or fetch directly in CMake:
include(FetchContent)
FetchContent_Declare(axiom
GIT_REPOSITORY https://github.com/frikallo/axiom.git
GIT_TAG main)
FetchContent_MakeAvailable(axiom)
target_link_libraries(your_target Axiom::axiom)

#include <axiom/axiom.hpp>
#include <cmath> // std::sqrt used below
using namespace axiom;
int main() {
// Tensor creation - just like NumPy
auto a = Tensor::zeros({3, 4});
auto b = Tensor::ones({4, 5});
auto c = Tensor::randn({3, 4});
auto d = Tensor::linspace(0, 1, 100);
// Intuitive operations
auto result = (a + c).relu().matmul(b);
// Conditional selection - Python's np.where()
auto x = Tensor::randn({100});
auto positive = Tensor::where(x > 0, x, 0.0f);
// Einops-style rearrangement
auto img = Tensor::randn({2, 224, 224, 3});
auto nchw = img.rearrange("b h w c -> b c h w");
// Full transformer attention in 5 lines
auto Q = Tensor::randn({2, 8, 64, 64});
auto K = Tensor::randn({2, 8, 64, 64});
auto V = Tensor::randn({2, 8, 64, 64});
auto scores = Q.matmul(K.transpose(-2, -1)) / std::sqrt(64.0f);
auto output = scores.softmax(-1).matmul(V);
return 0;
}

GPU acceleration? Just change the device. Every operation runs on Metal—no code changes required:
// CPU version
auto x = Tensor::randn({1024, 1024}, DType::Float32, Device::CPU);
// GPU version - same API, 10-20x faster on Apple Silicon
auto x = Tensor::randn({1024, 1024}, DType::Float32, Device::GPU);
// Everything just works: matmul, softmax, reductions, broadcasting, indexing...
auto result = x.matmul(x.T()).softmax(-1).sum({1}); // All on GPU

No other C++ tensor library offers this. Eigen, Armadillo, Blaze—all CPU-only. With Axiom, you get the same clean API with full GPU acceleration on macOS.
Axiom mirrors NumPy and PyTorch APIs so closely that translating Python code is almost mechanical:
| NumPy / PyTorch | Axiom |
|---|---|
| `np.zeros((3,4))` | `Tensor::zeros({3,4})` |
| `np.arange(0, 10, 0.5)` | `Tensor::arange(0, 10, 0.5)` |
| `np.linspace(0, 1, 100)` | `Tensor::linspace(0, 1, 100)` |
| `x.reshape(-1, 4)` | `x.reshape({-1, 4})` |
| `x.transpose(0, 2, 1)` | `x.transpose({0, 2, 1})` |
| `np.concatenate([a,b], axis=1)` | `Tensor::cat({a,b}, 1)` |
| `np.where(cond, a, b)` | `Tensor::where(cond, a, b)` |
| `x[x > 0]` | `x.masked_select(x > 0)` |
| `torch.gather(x, dim, idx)` | `x.gather(dim, idx)` |
| `F.softmax(x, dim=-1)` | `x.softmax(-1)` |
| `F.layer_norm(x, shape)` | `ops::layer_norm(x, w, b)` |
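As a worked example of that mapping (using only calls from the table and the Quick Start; variable names are arbitrary):

```cpp
// NumPy / PyTorch:
//   x = np.where(x > 0, x, 0)
//   y = np.concatenate([x, x], axis=1).reshape(-1, 4)
//   p = F.softmax(y, dim=-1)

auto x = Tensor::randn({8, 2});
x = Tensor::where(x > 0, x, 0.0f);
auto y = Tensor::cat({x, x}, 1).reshape({-1, 4});
auto p = y.softmax(-1);
```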
Full einops pattern syntax for semantic tensor manipulation:
// Reshape and transpose in one operation
auto transposed = x.rearrange("b h w c -> b c h w");
// Flatten spatial dimensions
auto flat = x.rearrange("b h w c -> b (h w) c");
// Patch embedding (Vision Transformer style)
auto patches = img.rearrange("b (h p1) (w p2) c -> b (h w) (p1 p2 c)",
{{"p1", 16}, {"p2", 16}});
// Reduce with pattern
auto pooled = x.reduce("b (h 2) (w 2) c -> b h w c", "mean");
auto gap = features.reduce("b h w c -> b c", "mean");

Axiom automatically selects the fastest available backend:
| Platform | BLAS Backend | Vectorization | GPU |
|---|---|---|---|
| macOS (Apple Silicon) | Accelerate + vDSP | ARM NEON / ARMv8 | Metal (MPSGraph) |
| macOS (Intel) | Accelerate | SSE2-4.2 / AVX / AVX2 | Metal |
| Linux (x86_64) | OpenBLAS | SSE2-4.2 / AVX / AVX2 / AVX-512 / FMA3 | — |
| Linux (ARM) | OpenBLAS | ARMv7 / ARMv8 NEON | — |
| Windows | Native (OpenBLAS optional) | SSE2-4.2 / AVX / AVX2 | — |
| WebAssembly | Native | WASM SIMD | — |
| RISC-V | Native | RISC-V Vector ISA | — |
| PowerPC | Native | VSX | — |
OpenMP parallelization kicks in only above per-operation workload thresholds, so threading overhead is incurred only when it pays off.
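As a rough illustration of that idea (a generic pattern, not Axiom's internal kernel code, with a made-up threshold value): below a minimum element count the loop stays serial, so small tensors never pay the thread-startup cost.

```cpp
#include <cstddef>

// Illustrative only: threshold and kernel shape are assumptions, not Axiom internals.
// Requires compiling with OpenMP enabled.
constexpr std::size_t kParallelThreshold = 1 << 15; // ~32K elements

void add_kernel(const float* a, const float* b, float* out, std::size_t n) {
    if (n < kParallelThreshold) {
        // Small workload: a serial loop avoids OpenMP scheduling overhead.
        for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
        return;
    }
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
        out[i] = a[i] + b[i];
    }
}
```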
Full SIMD Support Matrix (via xsimd)
| Architecture | Instruction Set Extensions |
|---|---|
| x86 (Intel/AMD) | SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, FMA3, AVX2 |
| x86 (AVX-512) | AVX-512 (GCC 7+, Clang, MSVC) |
| x86 (AMD) | All of the above + FMA4 |
| ARM | ARMv7 NEON, ARMv8 NEON |
| WebAssembly | WASM SIMD128 |
| RISC-V | Vector ISA (RVV) |
| PowerPC | VSX |
Axiom uses xsimd for portable SIMD abstraction, automatically dispatching to the optimal instruction set at compile time with runtime fallback to scalar operations.
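For readers who haven't used xsimd, the dispatch pattern looks roughly like the sketch below; this is a generic xsimd example rather than an excerpt from Axiom's kernels.

```cpp
#include <xsimd/xsimd.hpp>
#include <vector>

// Generic xsimd-style kernel sketch: the batch width is chosen at compile
// time for the target architecture, with a scalar loop for the remainder.
void scale(const std::vector<float>& in, std::vector<float>& out, float k) {
    using batch = xsimd::batch<float>;
    const std::size_t width = batch::size;
    const std::size_t n = in.size();
    const std::size_t vec_end = n - (n % width);

    for (std::size_t i = 0; i < vec_end; i += width) {
        batch v = batch::load_unaligned(&in[i]);
        (v * k).store_unaligned(&out[i]);
    }
    for (std::size_t i = vec_end; i < n; ++i) {  // scalar tail
        out[i] = in[i] * k;
    }
}
```

The `batch` width resolves at compile time to whichever instruction set the translation unit targets, and the trailing scalar loop handles elements that don't fill a full register.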
Complete LAPACK-backed linear algebra module:
// Decompositions
auto [U, S, Vh] = linalg::svd(A);
auto [Q, R] = linalg::qr(A);
auto L = linalg::cholesky(A);
auto [eigvals, eigvecs] = linalg::eigh(A);
// Solvers
auto x = linalg::solve(A, b); // Ax = b
auto x_ls = linalg::lstsq(A, b); // Least squares
auto Ainv = linalg::pinv(A); // Pseudoinverse
// Analysis
auto d = linalg::det(A);
auto n = linalg::norm(A, "fro");
auto r = linalg::matrix_rank(A);
auto k = linalg::cond(A);

All operations support batch dimensions: A.shape = (batch, M, N).
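For example, a batched solve might look like this (the shapes are illustrative, and a matching batch dimension on the right-hand side is assumed):

```cpp
// Batched solve: one 4x4 system per batch entry.
auto A = Tensor::randn({8, 4, 4});   // (batch, M, M)
auto b = Tensor::randn({8, 4, 1});   // (batch, M, 1) -- assumed RHS layout
auto x = linalg::solve(A, b);        // expected shape: (8, 4, 1)
```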
// Single tensor
tensor.save("weights.axfb"); // FlatBuffers (fast, zero-copy)
tensor.save("weights.npy"); // NumPy format (Python interop)
auto loaded = Tensor::load("weights.axfb");
// Multiple tensors (model checkpoints)
Tensor::save_tensors({{"weight", W}, {"bias", b}}, "model.axfb");
auto params = Tensor::load_tensors("model.axfb");

Arithmetic & Math
| Operation | Function | Operator |
|---|---|---|
| Addition | `ops::add(a, b)` | `a + b` |
| Subtraction | `ops::subtract(a, b)` | `a - b` |
| Multiplication | `ops::multiply(a, b)` | `a * b` |
| Division | `ops::divide(a, b)` | `a / b` |
| Power | `ops::power(a, b)` | — |
| Modulo | `ops::modulo(a, b)` | `a % b` |
| Square root | `ops::sqrt(a)` | — |
| Exponential | `ops::exp(a)` | — |
| Logarithm | `ops::log(a)` | — |
| Absolute value | `ops::abs(a)` | — |
| Sign | `ops::sign(a)` | — |
| Floor/Ceil | `ops::floor(a)`, `ops::ceil(a)` | — |
| Trigonometric | `ops::sin`, `cos`, `tan` | — |
| Error function | `ops::erf(a)` | — |
Comparison & Logical
| Operation | Function | Operator |
|---|---|---|
| Equal | `ops::equal(a, b)` | `a == b` |
| Not equal | `ops::not_equal(a, b)` | `a != b` |
| Less than | `ops::less(a, b)` | `a < b` |
| Greater than | `ops::greater(a, b)` | `a > b` |
| Logical AND | `ops::logical_and(a, b)` | `a && b` |
| Logical OR | `ops::logical_or(a, b)` | `a \|\| b` |
| Logical NOT | `ops::logical_not(a)` | `!a` |
| Bitwise ops | `ops::bitwise_and/or/xor` | `&`, `\|`, `^` |
Reductions
tensor.sum() // Total sum
tensor.sum({0, 2}) // Sum along axes
tensor.sum({0}, true) // Keep dimensions
tensor.mean(), tensor.max(), tensor.min()
tensor.argmax(axis), tensor.argmin(axis)
tensor.any(), tensor.all() // Boolean reductions
tensor.var(axis, ddof) // Variance (Bessel correction)
tensor.std(axis, ddof) // Standard deviation
tensor.prod(axis) // Product

Shape Manipulation
// Reshape and views
tensor.reshape(new_shape) // View if contiguous, copy otherwise
tensor.view(new_shape) // View only (asserts contiguous)
tensor.flatten() // To 1D
tensor.squeeze() // Remove size-1 dims
tensor.unsqueeze(axis) // Add size-1 dim
// Transpose and permute
tensor.T() // Matrix transpose
tensor.transpose(axes) // Arbitrary permutation
tensor.swapaxes(a, b) // Swap two axes
tensor.moveaxis(src, dst) // Move axis
// Flip and rotate
tensor.flip(axis) // Reverse along axis
tensor.flipud(), tensor.fliplr()
tensor.rot90(k, axes) // Rotate 90° k times
tensor.roll(shift, axis) // Circular shift
// Join and split
Tensor::cat({a, b}, axis) // Concatenate
Tensor::stack({a, b}, axis) // Stack with new axis
tensor.split(n, axis) // Split into n parts
tensor.chunk(n, axis) // Chunk (may be unequal)

Neural Network Operations
// Activations
tensor.relu()
tensor.leaky_relu(0.01f)
tensor.sigmoid()
tensor.tanh()
tensor.gelu()
tensor.silu() // Swish
// Softmax
tensor.softmax(axis)
tensor.log_softmax(axis)
// Normalization
ops::layer_norm(x, weight, bias, axis, eps)
ops::rms_norm(x, weight, axis, eps)
// Dropout (training mode)
auto [out, mask] = ops::dropout(x, 0.1f, training);

Indexing & Selection
// Conditional selection
Tensor::where(cond, a, b) // a where true, b where false
tensor.where(cond, value) // Fluent API
// Masking
tensor.masked_fill(mask, val) // Fill where mask is true
tensor.masked_select(mask) // Extract elements
// Gather/Scatter (PyTorch-style)
tensor.gather(dim, indices)
tensor.scatter(dim, indices, src)
tensor.index_select(dim, indices)
// Diagonal operations
Tensor::diag(v, k) // Vector to diagonal matrix
tensor.diagonal(offset) // Extract diagonal
tensor.trace() // Sum of diagonal
Tensor::tril(m, k) // Lower triangular
Tensor::triu(m, k) // Upper triangular

See docs/ops.md for the complete API reference.
| Platform | Compiler | Build System | Optional |
|---|---|---|---|
| macOS 11+ | Xcode 13+ / Clang 13+ | CMake 3.20+ | Metal GPU |
| Linux (x86_64/ARM) | GCC 10+ / Clang 13+ | CMake 3.20+ | OpenBLAS, OpenMP |
| Windows | MSVC 2019+ | CMake 3.20+ | OpenBLAS |
| WebAssembly | Emscripten 3.0+ | CMake 3.20+ | — |
| RISC-V | GCC 10+ / Clang 13+ | CMake 3.20+ | — |
Axiom automatically detects and links available BLAS libraries:
- Apple Accelerate (macOS) — Preferred on Apple platforms
- OpenBLAS — High-performance open-source BLAS
- Native fallback — Always works, pure C++ implementation
For portable distributions, Axiom can dynamically link BLAS at runtime.
| Category | Types |
|---|---|
| Floating Point | Float16, Float32, Float64 |
| Signed Integer | Int8, Int16, Int32, Int64 |
| Unsigned Integer | UInt8, UInt16, UInt32, UInt64 |
| Boolean | Bool |
| Complex | Complex64, Complex128 |
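Picking a dtype at construction follows the same argument style as the `DType::Float32` / `Device::GPU` example above; whether every factory function accepts a dtype argument in exactly this form is an assumption, so treat this as a sketch:

```cpp
// Dtype selected at construction (argument style follows the GPU example
// above; availability on every factory function is assumed, not verified).
auto half  = Tensor::randn({64, 64}, DType::Float16);
auto ints  = Tensor::zeros({16}, DType::Int32);
auto flags = Tensor::ones({16}, DType::Bool);
auto cplx  = Tensor::zeros({8, 8}, DType::Complex64);
```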
Sketch in NumPy, deploy with Axiom.
Axiom is building toward a future where prototyping in Python and deploying in C++ requires zero mental overhead. The API parity is intentional—your NumPy code translates line-by-line.
- Lazy Evaluation (in development) — Expression graph compilation for automatic kernel fusion and memory optimization
- ONNX Runtime Integration — Load and run ONNX models directly
- Quantization Toolkit — INT8/INT4 quantization for edge deployment
- Custom Op Registration — Extend Axiom with your own kernels
- Full Portability — Single codebase targeting x86, ARM, RISC-V, WebAssembly, and embedded platforms with dynamically linked backends
# Clone
git clone https://github.com/frikallo/axiom.git
cd axiom
# Build (release mode)
make release
# Run tests
make test
# Install system-wide
sudo cmake --install build
# Optional: Build with OpenMP
cmake -B build -DCMAKE_BUILD_TYPE=Release -DAXIOM_USE_OPENMP=ON
cmake --build build

Contributions are welcome! Please ensure:
- Code follows the project style (`make format`)
- All tests pass (`make test`)
- New features include tests
- Documentation is updated
See CONTRIBUTING.md for detailed guidelines.
Axiom is licensed under the MIT License. You are free to use, modify, and distribute Axiom in both open-source and proprietary projects.
See LICENSE for the full license text.
If Axiom is useful in your research, please cite:
@misc{axiom2025,
title={Axiom: High-Performance Tensor Library for C++},
author={Noah Kay},
year={2025},
url={https://github.com/frikallo/axiom}
}