Deterministic CUDA Kernels for Reproducible Deep Learning
BitExact is a research-driven CUDA library providing bit-exact deterministic GPU tensor operations. It ensures identical floating-point results across runs, batches, and devices, removing nondeterminism from key deep-learning computations.
The library is designed to be plug and play with PyTorch: it serves as a drop-in replacement for selected PyTorch tensor operations while guaranteeing bit-level reproducibility.
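Because results are bit-exact rather than merely close, outputs from repeated calls can be compared with exact equality instead of a tolerance. A minimal sketch using the rms_norm op from the Quick Start below (assumes a CUDA device is available):

```python
import torch
import bitexact

x = torch.randn(64, 64, device="cuda")
w = torch.ones(64, device="cuda")

# Two calls on the same input should produce bit-identical outputs,
# so exact comparison (torch.equal, not torch.allclose) is expected to hold.
y1 = bitexact.rms_norm(x, w)
y2 = bitexact.rms_norm(x, w)
assert torch.equal(y1, y2)
```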
BitExact is particularly suited for:
- Model reproducibility research - verifying training consistency across runs
- Numerical analysis and benchmarking - comparing model outputs with precision guarantees
- Deployment pipelines where deterministic inference is required for compliance or scientific validation
- Quick Start
- API Reference
- Design Reference
- Performance Reference
- Testing
- Project Structure
- Contributing
- Project Status
- Acknowledgements
import torch, bitexact
x = torch.randn(4, 4, device="cuda")
w = torch.ones(4, device="cuda")
y = bitexact.rms_norm(x, w)
print(y)

| Category | Kernel Operation | Reference |
|---|---|---|
| Linear Algebra | Matrix Multiplication | MatMul |
| Normalization | RMS Normalization | RmsNorm |
| Normalization | Layer Normalization | LayerNorm |
| Reductions | Sum | Sum |
| Reductions | Mean | Mean |
| Reductions | Max | Max |
| Reductions | Min | Min |
| Activations | Sigmoid | Sigmoid |
More deterministic kernels may be coming soon.
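For illustration, the reductions in the table can be called the same way as rms_norm in the Quick Start. The function names below (bitexact.sum, bitexact.mean, bitexact.max) are assumptions based on that naming pattern; consult the API Reference for the authoritative signatures.

```python
import torch
import bitexact

x = torch.randn(256, 128, device="cuda")

# Hypothetical calls that assume the kernels are exposed under these names;
# see the API Reference (docs/api.md) for the actual bindings.
total = bitexact.sum(x)    # deterministic sum reduction
avg = bitexact.mean(x)     # deterministic mean
peak = bitexact.max(x)     # deterministic max
print(total, avg, peak)
```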
- Python ≥ 3.9
- CUDA ≥ 12.0
- PyTorch ≥ 2.1
- A C++ compiler (MSVC 2022 / gcc ≥ 9)
git clone https://github.com/aaravkohli1/BitExact.git
cd BitExact
pip install . --no-build-isolation

Alternatively, install from PyPI:

pip install bitexact

Working with CUDA is tricky; hopefully these tips help (a quick smoke test follows below):
- CUDA_HOME environment variable is not set: make sure CUDA is installed and CUDA_HOME is set before running pip install
- cannot find cl.exe: install Visual Studio Build Tools with C++ support (use "Developer PowerShell for VS 2022")
- Build takes a long time: this is normal - CUDA extensions compile during installation
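Once the build finishes, a quick smoke test can confirm that the extension imports and runs on your GPU. This is a minimal sketch (it reuses the Quick Start call and assumes a CUDA-capable device is visible to PyTorch):

```python
import torch
import bitexact

# BitExact targets CUDA tensors, so a GPU must be available.
assert torch.cuda.is_available(), "BitExact requires a CUDA-capable GPU"

x = torch.randn(4, 4, device="cuda")
w = torch.ones(4, device="cuda")
print(bitexact.rms_norm(x, w))
```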
| Operation | Throughput (vs PyTorch) | Notes |
|---|---|---|
| Matrix Multiplication | 0.47x | Slower than cuBLAS; PyTorch's highly tuned GEMM outperforms deterministic reduction. |
| RMS Normalization | 5.09x | Fused mean, sqrt, and scaling operations reduce kernel launches and memory access. |
| Layer Normalization | 1.66x | Fused single-kernel variance reduces global memory passes and improves speed on small tensors. |
| Sum | 1.98x | Optimized shared-memory reduction with fixed traversal order for determinism. |
| Mean | 1.69x | Builds on the Sum kernel with deterministic normalization by element count. |
| Max | 1.75x | Deterministic warp-level reduction; avoids divergent branching used in PyTorch. |
| Min | 1.98x | Similar to Max; uses unified deterministic traversal for all elements. |
| Variance | 1.35x | Uses fused E[xΒ²] - (E[x])Β² formulation with deterministic accumulation. |
| Sigmoid | 0.92x | Identical arithmetic to PyTorch; near-equal performance and perfect bit equivalence. |
| Average | 1.88x | Tests performed on small-scale tensors; PyTorch is optimized for large batch sizes. |
(Benchmarked on NVIDIA GeForce RTX 4060 Ti, PyTorch 2.6.0, CUDA 12.5)
BitExact's performance advantage comes primarily from kernel fusion and a deterministic reduction order, which minimize synchronization and memory traffic. However, PyTorch's fused kernels outperform it in large-batch GEMM and high-throughput workloads. These results emphasize that BitExact prioritizes determinism and reproducibility over raw FLOPS.
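The need for a fixed reduction order comes from floating-point addition not being associative: summing the same values in a different order can change the low-order bits of the result. A small PyTorch illustration of the effect (CPU is enough to see it; the exact outcome depends on hardware and library version):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

# Same values, different accumulation order: the two sums are close but
# not guaranteed to be bit-identical, which is why BitExact fixes the
# traversal order inside its reduction kernels.
forward = x.sum()
reversed_order = x.flip(0).sum()
print(forward.item(), reversed_order.item(), torch.equal(forward, reversed_order))
```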
To see how BitExact benchmarks on your machine, run:
python benchmarks/benchmark.py

Example output:
BitExact vs PyTorch - Benchmark Suite
Operation Torch (ms) BitExact (ms) Speed Max Diff Match
-------------------------------------------------------------------------
MatMul 0.0336 0.0692 0.48x 1.07e-04 True
Sum 0.0086 0.0117 0.73x 1.14e-05 True
Mean 0.0083 0.0079 1.05x 1.12e-08 True
Max 0.0087 0.0117 0.74x 0.00e+00 True
Min 0.0097 0.0080 1.21x 0.00e+00 True
Sigmoid 0.0074 0.0073 1.01x 0.00e+00 True
RMSNorm 0.0430 0.0084 5.12x 1.91e-06 True
LayerNorm 0.0881 0.0547 1.61x 1.91e-06 True
Variance 0.0311 0.0266 1.17x 2.38e-07 True
Note: Matches use atol=1e-4, rtol=1e-6 tolerance (within FP32 rounding).
-------------------------------------------------------------------------
Summary
-------------------------------------------------------------------------
Operations faster than PyTorch: 6/9
All operations deterministic: True
Average speedup: 1.46x
=========================================================================

All measurements use CUDA events for precise GPU timing, with 10 warmup and 100 timed iterations. Run-to-run variance of 5-15% is typical due to GPU boost clocks, thermal state, and driver scheduling. Focus on relative speedup trends rather than absolute millisecond values.
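For reference, the CUDA-event timing pattern described above looks roughly like the sketch below. It is not the benchmark script itself, just the standard idiom in PyTorch with the same 10 warmup and 100 timed iterations:

```python
import torch

def time_op(fn, warmup=10, iters=100):
    # Warm up to reach steady clocks and trigger any lazy initialization.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()                 # wait for the recorded events to complete
    return start.elapsed_time(end) / iters   # average milliseconds per call

x = torch.randn(1024, 1024, device="cuda")
print(f"sum: {time_op(lambda: x.sum()):.4f} ms")
```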
BitExact includes deterministic equality tests for all kernels.
To run the test suite, ensure you have PyTest installed. To install PyTest, run:
pip install -U pytest

Then you can run the test suite with:
pytest tests/

Recommended flags:

- -v: verbose flag (shows the result of each individual test)
- -s: don't capture output (allows setup logs from conftest.py to print)
Example:
pytest tests/ -v -s

Because many tests use randomized tensors, running the suite multiple times helps verify reproducibility and numerical stability. You can run the tests any number of times; the examples below simply use 3 as a placeholder.
Linux
for i in {1..3}; do pytest -v; done

Windows
for ($i = 1; $i -le 3; $i++) { pytest -v }

Troubleshooting

- CUDA OOM: close other GPU workloads, then re-run. The cache is auto-cleared; if needed, re-run with -s to confirm the setup logs.
- No GPU: tests require a CUDA-capable device; CPU fallbacks are not provided.
All tests verify bit-exact equivalence to PyTorchβs reference implementations and ensure reproducibility across multiple runs and devices.
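As a sketch of the kind of check such a test performs (not the actual contents of tests/test_determinism.py), the idea is to run the same op twice on identical inputs and require exact bit equality rather than approximate closeness:

```python
import torch
import bitexact

def test_rms_norm_bit_exact_across_runs():
    torch.manual_seed(0)
    x = torch.randn(32, 32, device="cuda")
    w = torch.ones(32, device="cuda")
    # Exact equality, not allclose: any nondeterminism would flip low-order bits.
    assert torch.equal(bitexact.rms_norm(x, w), bitexact.rms_norm(x, w))
```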
The examples/deterministic_inference.py script demonstrates a small neural network using BitExact kernels (matmul, rms_norm, and sigmoid). Running the example verifies that the network's outputs are bit-for-bit identical across multiple runs, confirming complete GPU determinism.
Run the file with:
python examples/deterministic_inference.py
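For a rough idea of what the script does, the sketch below chains the kernels mentioned above and compares two runs. The bitexact.matmul and bitexact.sigmoid names are assumptions based on the kernel table; refer to the example file for the actual code.

```python
import torch
import bitexact

def tiny_net(x, w1, gamma, w2):
    # Hypothetical forward pass built from BitExact kernels; only rms_norm's
    # signature is shown in the Quick Start, the other names are assumed.
    h = bitexact.rms_norm(bitexact.matmul(x, w1), gamma)
    return bitexact.sigmoid(bitexact.matmul(h, w2))

x = torch.randn(8, 16, device="cuda")
w1 = torch.randn(16, 32, device="cuda")
gamma = torch.ones(32, device="cuda")
w2 = torch.randn(32, 4, device="cuda")

# Bit-for-bit identical outputs across two runs of the same forward pass.
print(torch.equal(tiny_net(x, w1, gamma, w2), tiny_net(x, w1, gamma, w2)))
```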
bitexact/
├── bitexact/                        # Python bindings and high-level API
│   └── __init__.py
│
├── benchmarks/                      # Benchmarking suite for performance comparison
│   ├── benchmark.py
│   └── utils.py
│
├── docs/                            # Technical documentation
│   ├── api.md
│   └── design.md
│
├── examples/                        # Minimal runnable examples
│   ├── basic_usage.py               # Simple demonstration of deterministic ops
│   └── deterministic_inference.py   # Reproducible model inference pipeline
│
├── src/                             # Core CUDA/C++ source
│   ├── bindings.cpp                 # PyTorch extension bindings (exposes kernels to Python)
│   │
│   └── ops/                         # Kernel implementations
│       ├── matmul/                  # Matrix multiplication kernels
│       │   ├── matmul.cu
│       │   └── matmul.cuh
│       │
│       ├── reductions/              # Deterministic reduction kernels
│       │   ├── sum.cu
│       │   ├── sum.cuh
│       │   ├── mean.cu
│       │   ├── mean.cuh
│       │   ├── max.cu
│       │   ├── max.cuh
│       │   ├── min.cu
│       │   ├── min.cuh
│       │   ├── var.cu
│       │   └── var.cuh
│       │
│       ├── normalization/           # Normalization kernels
│       │   ├── rms_norm.cu
│       │   ├── rms_norm.cuh
│       │   ├── layer_norm.cu
│       │   └── layer_norm.cuh
│       │
│       ├── activations/             # Activation kernels
│       │   ├── sigmoid.cu
│       │   └── sigmoid.cuh
│       │
│       └── utils/                   # Shared CUDA utilities
│           ├── cuda_utils.cuh       # Common device helpers (grid-stride loops, etc.)
│           ├── dtype_utils.cuh      # Type casting and precision utilities
│           └── reduction.cuh        # Shared reduction patterns for deterministic ops
│
├── tests/                           # Pytest suite
│   ├── conftest.py
│   └── test_determinism.py
│
├── LICENSE                          # License file
├── README.md                        # Project overview and documentation
└── setup.py                         # Build and installation script
Contributions are welcome! If you have an idea for a Kernel, feel free to implement it (the largest missing one is attention).
Please ensure new kernels:
- Pass the deterministic equality tests (see the testing suite).
- Use warp-synchronous, non-atomic reduction patterns.
- Include both .cu and .cuh files and a corresponding test.
This project began as an experiment following a research article. I found it an interesting problem, so I spent part of my reading week building this library. Determinism still interests me, so I will keep developing the library, but on no fixed schedule.
There are many ways the library could be expanded, outlined in the design document. If you are interested, feel free to make a contribution.
This project draws inspiration from research by Thinking Machines Lab on deterministic GPU computation and reproducible deep learning. Their exploration of bit-exact kernels and floating-point determinism informed the design philosophy of BitExact.