
NumWars

Mixed-Precision Numerics Benchmarks for Rust & Python

NumWars banner

There are many strong libraries for numerical computing. Most of them are written in C, C++, and Fortran, with excellent Rust wrappers and Python bindings on top.

Rust is especially convenient for dependency management and reproducible benchmarking, which makes it a good place to line up apples-to-apples comparisons across native crates and their Python bindings. NumWars exists for the same reason StringWars exists for StringZilla: to compare NumKong against mainstream CPU stacks on the workloads it was built for.

Of course, the APIs and internal kernels of those projects differ. So this repository focuses on the workload families NumKong was designed for and compares their effective throughput using the native unit for each operation family, instead of forcing everything into a single artificial ops/s figure.

Important

The numbers below are reference measurements collected on an Intel Sapphire Rapids CPU in single-threaded mode. They will vary with CPU model, compiler flags, BLAS backend, and problem size. Rebuild and rerun on your own hardware before treating them as absolute.

Benchmarks at a Glance

Packed Matrix Multiplication

NumKong packed dots are mixed-precision by design: i8 inputs produce i32 outputs, bf16 and f16 inputs produce f32 outputs, and f32 inputs produce f64 outputs. The mainstream baselines shown here keep f32 → f32. Against the Rust baselines:

NumKong:
numkong::Tensor::dots_packed i8 → i32    ██████████████████████████████████ 1,357.36 GSO/s
numkong::Tensor::dots_packed bf16 → f32  █████████████████▎                   684.96 GSO/s
numkong::Tensor::dots_packed f16 → f32   ██▊                                  106.63 GSO/s
numkong::Tensor::dots_packed f32 → f64   █                                     42.04 GSO/s

Alternatives:
faer::linalg::matmul::matmul f32 → f32   ██▏                                   81.21 GSO/s
matrixmultiply::sgemm f32 → f32          █▉                                    78.61 GSO/s
ndarray::ArrayBase::dot f32 → f32        █▉                                    78.55 GSO/s
nalgebra::DMatrix × DMatrixᵀ f32 → f32   █▉                                    74.21 GSO/s

Compared to Python:

NumKong:
numkong.dots_packed i8 → i32    ███████████████████████████████████████████ 1,110.31 GSO/s
numkong.dots_packed bf16 → f32  ██████████████████▊                           487.89 GSO/s
numkong.dots_packed f16 → f32   ███▌                                           91.80 GSO/s
numkong.dots_packed f32 → f64   █▋                                             42.69 GSO/s

Alternatives:
numpy.matmul f32 → f32          █████▋                                        145.73 GSO/s

See dots/README.md for details.
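The promotion scheme above can be sketched in plain NumPy. This is an illustration of the accumulator widths only, not NumKong's API:

```python
import numpy as np

# Why packed i8 dots need an i32 accumulator: a single 2048-deep dot
# product of i8 values can reach 2048 * 128 * 128 = 33,554,432 in
# magnitude, far outside i8's [-128, 127] range.
rng = np.random.default_rng(42)
a = rng.integers(-128, 128, size=(4, 2048), dtype=np.int8)
b = rng.integers(-128, 128, size=(2048, 4), dtype=np.int8)

# Widen the inputs so NumPy accumulates in i32, mirroring the
# i8 -> i32 promotion in the table above.
c = a.astype(np.int32) @ b.astype(np.int32)
```

The baseline libraries skip this widening step because their f32 → f32 GEMMs already accumulate in the input type.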

Pairwise Similarity

Single-pair vector kernels at 2048 dimensions. This view merges dot-product and true Euclidean-distance measurements into one throughput-sorted list. NumKong keeps its mixed-precision promotions, while the baseline libraries mostly stay in their input type.

Against the Rust baselines:

NumKong:
numkong::Dot::dot u8 → u32                ████████████████████████████████████ 54.28 GSO/s
numkong::Dot::dot i8 → i32                ████████████████████████████▋        43.18 GSO/s
numkong::Euclidean::euclidean u8 → f32    ███████████████████████████          40.83 GSO/s
numkong::Euclidean::euclidean i8 → f32    ██████████████████████▊              34.10 GSO/s
numkong::Dot::dot bf16 → f32              █████████████▎                       20.09 GSO/s
numkong::Euclidean::euclidean bf16 → f32  ████████▍                            12.65 GSO/s
numkong::Dot::dot f32 → f64               ████                                  6.12 GSO/s
numkong::Euclidean::euclidean f32 → f64   ███▋                                  5.53 GSO/s

Alternatives:
ndarray::ArrayBase::dot f32 → f32         █████▏                                7.75 GSO/s
nalgebra::Matrix::dot f32 → f32           █████                                 7.56 GSO/s
ndarray sqrt((a - b)·(a - b)) f32 → f32   ███▏                                  4.75 GSO/s
nalgebra (a - b).norm() f32 → f32         ███▏                                  4.63 GSO/s

Compared to Python:

NumKong:
numkong.euclidean u8 → f32                  ███████████████████████████████████ 5.65 GSO/s
numkong.euclidean i8 → f32                  ███████████████████████████████▌    5.08 GSO/s
numkong.dot u8 → u32                        ██████████████████████████████▎     4.88 GSO/s
numkong.euclidean f32 → f64                 ████████████████████▌               3.33 GSO/s
numkong.dot i8 → i32                        ████████████████████▏               3.25 GSO/s
numkong.dot f32 → f64                       █████████████████                   2.76 GSO/s
numkong.euclidean bf16 → f32                ██▋                                 0.41 GSO/s
numkong.dot bf16 → f32                      ██▍                                 0.37 GSO/s

Alternatives:
scipy.linalg.blas.sdot f32 → f32            ███████████████████▌                3.14 GSO/s
scipy.spatial.distance.euclidean u8 → f32   ███                                 0.48 GSO/s
scipy.spatial.distance.euclidean i8 → f32   ██▍                                 0.38 GSO/s
scipy.spatial.distance.euclidean f32 → f32  ██▍                                 0.38 GSO/s

See similarity/README.md for details.
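The single-pair kernels and their promotions can be written as short reference implementations in plain NumPy. This sketch mirrors the u8 → u32 dot and the u8 → f32 Euclidean distance from the tables above; it is not NumKong's implementation:

```python
import numpy as np

# Reference single-pair kernels with NumKong-style promotions.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=2048, dtype=np.uint8)
b = rng.integers(0, 256, size=2048, dtype=np.uint8)

dot_u32 = np.dot(a.astype(np.uint32), b.astype(np.uint32))   # u8 -> u32
diff = a.astype(np.int32) - b.astype(np.int32)               # avoid u8 wraparound
euclidean_f32 = np.float32(np.sqrt(np.dot(diff, diff)))      # u8 -> f32
```

Note the widening of `diff` before squaring: subtracting two u8 arrays directly would wrap around modulo 256 and silently corrupt the distance.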

All-Pairs Similarity Matrices

Matrix-vs-matrix comparisons at 2048 rows by 2048 dimensions. These are the packed many-to-many siblings of the pairwise spatial kernels above. The merged lists below include angular and Euclidean metrics, and the headline unit is GSO/s.

Against the Rust baselines:

NumKong:
numkong::Tensor::angulars_packed u8 → f32      ██████████████████████████████ 694.88 GSO/s
numkong::Tensor::angulars_packed i8 → f32      █████████████████████████████▋ 686.93 GSO/s
numkong::Tensor::euclideans_packed i8 → f32    █████████████████████████████▋ 685.67 GSO/s
numkong::Tensor::euclideans_packed u8 → f32    █████████████████████████████  672.37 GSO/s
numkong::Tensor::angulars_packed bf16 → f32    █████████████▏                 304.59 GSO/s
numkong::Tensor::euclideans_packed bf16 → f32  █████████████                  302.61 GSO/s
numkong::Tensor::euclideans_packed f32 → f64   █                               21.22 GSO/s
numkong::Tensor::angulars_packed f32 → f64     █                               20.64 GSO/s

Alternatives:
ndarray angular matrix f32 → f32               █▊                              38.20 GSO/s
nalgebra euclidean matrix f32 → f32            █▊                              37.91 GSO/s
ndarray euclidean matrix f32 → f32             █▊                              37.59 GSO/s
nalgebra angular matrix f32 → f32              █▊                              36.97 GSO/s

Compared to Python through SciPy cdist:

NumKong:
numkong.angulars_packed u8 → f32      ███████████████████████████████████████ 465.04 GSO/s
numkong.euclideans_packed u8 → f32    ██████████████████████████████████████▊ 463.47 GSO/s
numkong.euclideans_packed i8 → f32    ██████████████████████████████████████▊ 463.37 GSO/s
numkong.angulars_packed i8 → f32      ██████████████████████████████████████  454.74 GSO/s
numkong.angulars_packed bf16 → f32    ███████████████████                     226.56 GSO/s
numkong.euclideans_packed bf16 → f32  █████████████████▌                      210.12 GSO/s
numkong.euclideans_packed f32 → f64   █▊                                       20.24 GSO/s
numkong.angulars_packed f32 → f64     █▌                                       19.84 GSO/s

Alternatives:
scipy.cdist euclidean f32 → f64       ▎                                         2.83 GSO/s
scipy.cdist cosine f32 → f64          ▎                                         2.62 GSO/s

See similarities/README.md for details.
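An all-pairs angular (cosine) distance matrix reduces to a single GEMM over row-normalized inputs, which is the reformulation that makes packed many-to-many kernels so much faster than per-pair loops. A plain NumPy sketch, not NumKong's code:

```python
import numpy as np

# All-pairs cosine distances: normalize rows once, then one matmul
# produces the full similarity matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 32)).astype(np.float32)
B = rng.standard_normal((8, 32)).astype(np.float32)

An = A / np.linalg.norm(A, axis=1, keepdims=True)
Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
cosine_dist = 1.0 - An @ Bn.T   # shape (8, 8)
```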

Elementwise Operations

Bandwidth-sensitive elementwise kernels, add and scale, over 1,000,000 elements. The sum kernel is shown as a representative sample. In Rust:

NumKong:
numkong::EachSum i8 → i8      █████████████████████████████████████████████     23.23 GB/s
numkong::EachSum f16 → f16    ██████████████████████████████████████▌           19.93 GB/s
numkong::EachSum bf16 → bf16  ████████████████████████████████████▉             19.08 GB/s
numkong::EachSum f32 → f32    ████████████████████████████████████▎             18.73 GB/s

Alternatives:
serial code f32 → f32         ██████████████████████████████████████▍           19.86 GB/s
nalgebra::add f32 → f32       ██████████████████████████████████████▍           19.82 GB/s
ndarray::add f32 → f32        ██████████████████████████████████████▎           19.79 GB/s

In Python:

NumKong:
numkong.add i8 → i8      █████████████████████████████████████████▋             30.91 GB/s
numkong.add f32 → f32    ███████████████████████████████████████▋               29.39 GB/s
numkong.add f16 → f16    ██████████████████████████████████████▉                28.84 GB/s
numkong.add f64 → f64    ██████████████████████████████████████▉                28.79 GB/s
numkong.add bf16 → bf16  █████████████████████████████████████▍                 27.72 GB/s

Alternatives:
numpy.add i8 → i8        █████████████████████████████████████████████          33.32 GB/s
numpy.add f32 → f32      ██████████████████████████████████▋                    25.65 GB/s
numpy.add f64 → f64      █████████████████████████████████▊                     25.03 GB/s
numpy.add f16 → f16      █▎                                                      0.95 GB/s

See each/README.md for details.
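These kernels are bandwidth-bound, so throughput is reported in GB/s rather than operations. A quick way to estimate that figure yourself is to time the call and divide the bytes moved by the elapsed time. A rough sanity-check sketch, not the repo's harness (which presumably handles warm-up and repetition more carefully):

```python
import time
import numpy as np

# Rough GB/s estimate for numpy.add over 1,000,000 f32 elements.
n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.empty_like(a)

reps = 100
start = time.perf_counter()
for _ in range(reps):
    np.add(a, b, out=out)
elapsed = time.perf_counter() - start

# Three 4-byte streams per element: read a, read b, write out.
gbps = 3 * 4 * n * reps / elapsed / 1e9
print(f"numpy.add: {gbps:.1f} GB/s")
```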

Reductions

Horizontal reductions over 1,000,000 elements. The suite covers sum and row-wise L2 norms. In Rust:

ndarray::ArrayBase::sum f32 → f32        █████████████████████████████████████████████ 32.50 GB/s
polars::ChunkedArray::sum f32 → f32      ███████████████████████████████████████████▎ 31.26 GB/s
numkong::reduce_moments().sum f32 → f64  █████████████████████████████████████████▋ 30.09 GB/s
serial sum loop f32 → f32                ████████▊                               6.38 GB/s

Row-wise L2 norms over a 2048×2048 matrix:

ndarray row norms f64 → f64              █████████████████████████████████████████████ 27.11 GB/s
numkong::Dot self-dot + sqrt bf16 → f32  ████████████████████████████████████████▌ 24.46 GB/s
ndarray row norms f32 → f32              ███████████████████████████████████▉   21.63 GB/s
numkong::Dot self-dot + sqrt f32         █████████████████████████████████▏     20.01 GB/s
serial row norms loop f32 → f32          ██████████▊                             6.54 GB/s

In Python over 1,000,000 elements:

NumKong:
numkong.sum i8 → i8          █████████████████████████████████████████████      32.02 GB/s
numkong.sum f32 → f32        ████████████████████████████████████████▉          29.17 GB/s
numkong.norm f32 → f64       ███████████████████████████████▎                   22.32 GB/s
numkong.sum f64 → f64        █████████████████████████████                      20.68 GB/s
numkong.norm bf16 → f64      █████████████████████████                          17.82 GB/s

Alternatives:
numpy.sum f64 → f64          █████████████████████████████████▉                 24.16 GB/s
numpy.sum f32 → f32          ██████████████████████████▊                        19.06 GB/s
numpy.linalg.norm f64 → f64  ███████████▌                                        8.21 GB/s
numpy.linalg.norm f32 → f64  ██████████▌                                         7.48 GB/s
numpy.sum i8 → i8            ███▊                                                2.68 GB/s

See reduce/README.md for details.
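NumKong's `reduce_moments` promotes f32 inputs to an f64 accumulator, which matters for accuracy as well as speed: a naive f32 running sum loses low-order bits once the accumulator grows much larger than the addends. A small NumPy illustration of the effect (the assumption here is only that the summands are identical f32 values):

```python
import numpy as np

# 10,000 copies of float32(0.1); the exact sum is 10,000 times the
# rounded f32 value of 0.1, roughly 1000.0000149.
x = np.full(10_000, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for v in x:                 # f32 running sum, accumulates rounding error
    naive += v

wide = np.sum(x, dtype=np.float64)   # f64 accumulator, effectively exact
print(naive, wide)
```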

MaxSim

ColBERT-style late interaction with 2048 query vectors, 2048 document vectors, and 2048 dimensions. NumKong promotes f32 → f64 here as well, while ndarray stays in f32. In Rust:

NumKong:
numkong::MaxSimPackedMatrix::score f16 → f32   ██████████████████████████████ 423.69 GSO/s
numkong::MaxSimPackedMatrix::score f32 → f64   █████████████████████████████▌ 415.47 GSO/s
numkong::MaxSimPackedMatrix::score bf16 → f32  ███████████████▊               224.48 GSO/s

Alternatives:
ndarray Q @ Dᵀ max-reduce f32 → f32            ██▋                             38.36 GSO/s

Compared to Python:

NumKong:
numkong.maxsim_packed f16 → f32   ███████████████████████████████████████████ 833.26 GSO/s
numkong.maxsim_packed f32 → f64   ████████████████████████████████████████▏   776.43 GSO/s
numkong.maxsim_packed bf16 → f32  ██████████████████████▏                     428.56 GSO/s

Alternatives:
numpy matmul f32 → f32            ██████▋                                     129.03 GSO/s

See maxsim/README.md for details.
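The MaxSim reduction itself is short enough to state as a reference implementation: for each query vector, take the similarity of its best-matching document vector, then sum over queries. This plain NumPy version is the f32 → f32 baseline pattern benchmarked above, not NumKong's packed kernel:

```python
import numpy as np

# ColBERT-style late interaction on small synthetic inputs.
rng = np.random.default_rng(2)
Q = rng.standard_normal((16, 64)).astype(np.float32)   # query vectors
D = rng.standard_normal((32, 64)).astype(np.float32)   # document vectors

sims = Q @ D.T                   # (16, 32) all-pairs dot products
score = sims.max(axis=1).sum()   # best document vector per query, summed
```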

Geospatial Distances

Throughput over 2048 coordinate pairs, measured in MP/s (millions of coordinate pairs per second). The merged lists below include both Haversine and Vincenty distances.

Against the Rust baselines:

NumKong:
numkong::haversine f32 → f32      ████████████████████████████████████████████ 486.92 MP/s
numkong::haversine f64 → f64      █████████████▊                               151.65 MP/s
numkong::vincenty f32 → f32       ██████▍                                       68.96 MP/s
numkong::vincenty f64 → f64       █▋                                            17.79 MP/s

Alternatives:
geo::Haversine distance f32 → f32 ███▋                                          38.88 MP/s
geo::Haversine distance f64 → f64 ██▎                                           24.07 MP/s
geo::Vincenty distance f64 → f64  ▏                                              1.15 MP/s

Compared to Python and its alternatives:

NumKong:
numkong.haversine f32 → f32           ████████████████████████████████████████ 475.41 MP/s
numkong.haversine f64 → f64           █████████████                            154.92 MP/s
numkong.vincenty f32 → f32            ████▊                                     54.99 MP/s
numkong.vincenty f64 → f64            █▌                                        17.87 MP/s

Alternatives:
geopy.distance.great_circle f64 → f64                                            0.18 MP/s
geopy.distance.geodesic f64 → f64                                              0.0096 MP/s

See geospatial/README.md for details.
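For reference, the textbook Haversine formula in f64, as a scalar NumPy function. The benchmarked kernels vectorize this computation over all 2048 coordinate pairs at once; the radius constant and city coordinates below are illustrative, not taken from the benchmark:

```python
import numpy as np

R = 6371.0088  # mean Earth radius, km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(h))

# Paris to Berlin, roughly 878 km.
print(haversine_km(48.8566, 2.3522, 52.52, 13.405))
```

Vincenty's formula is an iterative refinement on an ellipsoid rather than a sphere, which is why it is an order of magnitude slower in every row above.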

Mesh Alignment

Throughput over point clouds with 2048 3D points each. The unit is MP/s, or million 3D points per second. The labels include the full return signature so RMSD and Kabsch can share one sorted list cleanly. In Rust:

NumKong:
numkong::MeshAlignment::rmsd f64 → f64      ██████████████████████████████████ 971.35 MP/s
numkong::MeshAlignment::rmsd f32 → f32      ████████████████████▊              592.73 MP/s
numkong::MeshAlignment::rmsd f16 → f16      ████████████████████▎              578.61 MP/s
numkong::MeshAlignment::rmsd bf16 → bf16    ███████████████████▉               567.69 MP/s
numkong::MeshAlignment::kabsch f32 → f32    ██████████████▏                    404.69 MP/s
numkong::MeshAlignment::umeyama f32 → f32   ███████████▊                       335.03 MP/s
numkong::MeshAlignment::kabsch bf16 → bf16  █████████▌                         272.09 MP/s
numkong::MeshAlignment::umeyama bf16 → bf16 █████████▍                         268.63 MP/s
numkong::MeshAlignment::kabsch f16 → f16    █████████▎                         264.46 MP/s
numkong::MeshAlignment::umeyama f16 → f16   █████████▎                         264.89 MP/s
numkong::MeshAlignment::kabsch f64 → f64    ████████▋                          245.90 MP/s
numkong::MeshAlignment::umeyama f64 → f64   █████▏                             147.75 MP/s

Alternatives:
nalgebra-based RMSD f32 → f32               ██████████████████▊                537.04 MP/s
nalgebra-based Kabsch f32 → f64             ████▎                              121.63 MP/s
nalgebra-based Umeyama f32 → f64            ███▊                               106.47 MP/s

Compared to Python and its alternatives:

NumKong:
numkong.rmsd f64 → f64                       █████████████████████████████████ 825.51 MP/s
numkong.rmsd f32 → f64                       ██████████████████▋               467.35 MP/s
numkong.kabsch f32 → f64                     █████████▉                        248.48 MP/s
numkong.umeyama f32 → f64                    █████████▉                        248.10 MP/s
numkong.kabsch f64 → f64                     █████████▌                        238.79 MP/s
numkong.umeyama f64 → f64                    ██████▍                           159.25 MP/s

Alternatives:
numpy-based RMSD f32 → f64                   ██▏                                50.49 MP/s
numpy-based RMSD f64 → f64                   █▉                                 46.74 MP/s
biopython SVDSuperimposer (Kabsch) f32 → f64                                     1.22 MP/s
biopython SVDSuperimposer (Kabsch) f64 → f64                                     1.19 MP/s

See mesh/README.md for details.
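The Kabsch problem benchmarked here finds the rotation that best aligns two centered point clouds. The classic SVD formulation fits in a few lines of NumPy; NumKong's kernel differs internally but solves the same optimal-rotation problem:

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R minimizing ||P @ R.T - Q|| for centered point clouds."""
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Recover a known rotation from a synthetic 2048-point cloud.
rng = np.random.default_rng(3)
P = rng.standard_normal((2048, 3))
P -= P.mean(axis=0)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T
R_est = kabsch(P, Q)
```

Umeyama extends this with a scale factor, and RMSD skips the rotation solve entirely, which explains its much higher throughput in the lists above.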

Replicating the Results

Rust

Every Rust benchmark is a Criterion harness behind a Cargo feature gate. Run one suite at a time or all at once:

# One suite — default 2048-element workload
RUSTFLAGS="-C target-cpu=native" \
cargo bench --features bench_similarity --bench bench_similarity

# All suites
RUSTFLAGS="-C target-cpu=native" \
cargo bench --features all

Tuning knobs (environment variables):

Variable                 Default  Purpose
NUMWARS_DIMS             2048     Vector / matrix dimension shared by most suites
NUMWARS_DIMS_HEIGHT      2048     Row count for GEMM workloads (dots, maxsim)
NUMWARS_DIMS_WIDTH       2048     Column count for GEMM workloads (dots, maxsim)
NUMWARS_DIMS_DEPTH       2048     Shared (contraction) dimension for GEMM workloads
NUMWARS_FILTER           (none)   Regex to select benchmarks by name
NUMWARS_WARMUP_SECONDS   3.0      Criterion warm-up time
NUMWARS_PROFILE_SECONDS  10.0     Criterion measurement time
NUMWARS_SAMPLE_SIZE      50       Criterion sample count

Python

Install with uv and run any suite directly:

uv run --with "numkong,numpy,scipy,tabulate,ml_dtypes" \
python similarity/bench.py

Or install all extras and run from the repo root:

pip install -e ".[similarity,each,dots,geospatial,mesh,reduce,similarities]"
python dots/bench.py
python similarities/bench.py

Benchmark Suites

Related Projects

License

Apache 2.0. See LICENSE.
