Deterministic CUDA Kernels for Reproducible Deep Learning
BitExact is a research-driven CUDA library providing bit-exact deterministic GPU tensor operations. It ensures identical floating-point results across runs, batches, and devices, removing nondeterminism from key deep-learning computations.
The library is designed to be plug and play with PyTorch: it serves as a drop-in replacement for selected PyTorch tensor operations while guaranteeing bit-level reproducibility.
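Because results are bit-exact rather than merely close, outputs from repeated calls can be compared with exact equality instead of a tolerance. A minimal sketch using the rms_norm op from the Quick Start below (assumes a CUDA device is available):

```python
import torch
import bitexact

x = torch.randn(64, 64, device="cuda")
w = torch.ones(64, device="cuda")

# Two calls on the same input should produce bit-identical outputs,
# so exact comparison (torch.equal, not torch.allclose) is expected to hold.
y1 = bitexact.rms_norm(x, w)
y2 = bitexact.rms_norm(x, w)
assert torch.equal(y1, y2)
```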
BitExact is particularly suited for:
- Model reproducibility research - verifying training consistency across runs
- Numerical analysis and benchmarking - comparing model outputs with precision guarantees
- Deployment pipelines where deterministic inference is required for compliance or scientific validation
- Quick Start
- API Reference
- Design Reference
- Performance Reference
- Testing
- Project Structure
- Contributing
- Project Status
- Acknowledgements
import torch, bitexact
x = torch.randn(4, 4, device="cuda")
w = torch.ones(4, device="cuda")
y = bitexact.rms_norm(x, w)
print(y)

| Category | Kernel Operation | Reference |
|---|---|---|
| Linear Algebra | Matrix Multiplication | MatMul |
| Normalization | RMS Normalization | RmsNorm |
| Normalization | Layer Normalization | LayerNorm |
| Reductions | Sum | Sum |
| Reductions | Mean | Mean |
| Reductions | Max | Max |
| Reductions | Min | Min |
| Activations | Sigmoid | Sigmoid |
More deterministic kernels may be coming soon.
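For illustration, the reductions in the table can be called the same way as rms_norm in the Quick Start. The function names below (bitexact.sum, bitexact.mean, bitexact.max) are assumptions based on that naming pattern; consult the API Reference for the authoritative signatures.

```python
import torch
import bitexact

x = torch.randn(256, 128, device="cuda")

# Hypothetical calls that assume the kernels are exposed under these names;
# see the API Reference (docs/api.md) for the actual bindings.
total = bitexact.sum(x)    # deterministic sum reduction
avg = bitexact.mean(x)     # deterministic mean
peak = bitexact.max(x)     # deterministic max
print(total, avg, peak)
```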
- Python ≥ 3.9
- CUDA ≥ 12.0
- PyTorch ≥ 2.1
- A C++ compiler (MSVC 2022 / gcc ≥ 9)
git clone https://github.com/aaravkohli1/BitExact.git
cd BitExact
pip install . --no-build-isolation

Alternatively, install from PyPI:

pip install bitexact

Working with CUDA is tricky; hopefully these tips help (a quick smoke test follows below):
- CUDA_HOME environment variable is not set: make sure CUDA is installed and CUDA_HOME is set before running pip install
- cannot find cl.exe: install Visual Studio Build Tools with C++ support (use "Developer PowerShell for VS 2022")
- Build takes a long time: this is normal - CUDA extensions compile during installation
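Once the build finishes, a quick smoke test can confirm that the extension imports and runs on your GPU. This is a minimal sketch (it reuses the Quick Start call and assumes a CUDA-capable device is visible to PyTorch):

```python
import torch
import bitexact

# BitExact targets CUDA tensors, so a GPU must be available.
assert torch.cuda.is_available(), "BitExact requires a CUDA-capable GPU"

x = torch.randn(4, 4, device="cuda")
w = torch.ones(4, device="cuda")
print(bitexact.rms_norm(x, w))
```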
| Operation | Throughput (vs PyTorch) | Notes |
|---|---|---|
| Matrix Multiplication | 0.47x | Slower than cuBLAS; PyTorch's highly tuned GEMM outperforms deterministic reduction. |
| RMS Normalization | 5.09x | Fused mean, sqrt, and scaling operations reduce kernel launches and memory access. |
| Layer Normalization | 1.66x | Fused single-kernel variance reduces global memory passes and improves speed on small tensors. |
| Sum | 1.98x | Optimized shared-memory reduction with fixed traversal order for determinism. |
| Mean | 1.69x | Builds on the Sum kernel with deterministic normalization by element count. |
| Max | 1.75x | Deterministic warp-level reduction; avoids divergent branching used in PyTorch. |
| Min | 1.98x | Similar to Max; uses unified deterministic traversal for all elements. |
| Variance | 1.35x | Uses fused E[xΒ²] - (E[x])Β² formulation with deterministic accumulation. |
| Sigmoid | 0.92x | Identical arithmetic to PyTorch; near-equal performance and perfect bit equivalence. |
| Average | 1.88x | Tests performed on small-scale tensors; PyTorch is optimized for large batch sizes. |
(Benchmarked on NVIDIA GeForce RTX 4060 Ti, PyTorch 2.6.0, CUDA 12.5)
BitExact's performance advantage comes primarily from kernel fusion and a deterministic reduction order, which minimize synchronization and memory traffic. However, PyTorch's fused kernels outperform it in large-batch GEMM and high-throughput workloads. These results emphasize that BitExact prioritizes determinism and reproducibility over raw FLOPS.
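The need for a fixed reduction order comes from floating-point addition not being associative: summing the same values in a different order can change the low-order bits of the result. A small PyTorch illustration of the effect (CPU is enough to see it; the exact outcome depends on hardware and library version):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

# Same values, different accumulation order: the two sums are close but
# not guaranteed to be bit-identical, which is why BitExact fixes the
# traversal order inside its reduction kernels.
forward = x.sum()
reversed_order = x.flip(0).sum()
print(forward.item(), reversed_order.item(), torch.equal(forward, reversed_order))
```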
To see how BitExact benchmarks on your machine, run:
python benchmarks/benchmark.py

Example output:
BitExact vs PyTorch - Benchmark Suite
Operation Torch (ms) BitExact (ms) Speed Max Diff Match
-------------------------------------------------------------------------
MatMul 0.0336 0.0692 0.48x 1.07e-04 True
Sum 0.0086 0.0117 0.73x 1.14e-05 True
Mean 0.0083 0.0079 1.05x 1.12e-08 True
Max 0.0087 0.0117 0.74x 0.00e+00 True
Min 0.0097 0.0080 1.21x 0.00e+00 True
Sigmoid 0.0074 0.0073 1.01x 0.00e+00 True
RMSNorm 0.0430 0.0084 5.12x 1.91e-06 True
LayerNorm 0.0881 0.0547 1.61x 1.91e-06 True
Variance 0.0311 0.0266 1.17x 2.38e-07 True
Note: Matches use atol=1e-4, rtol=1e-6 tolerance (within FP32 rounding).
-------------------------------------------------------------------------
Summary
-------------------------------------------------------------------------
Operations faster than PyTorch: 6/9
All operations deterministic: True
Average speedup: 1.46x
=========================================================================

All measurements use CUDA events for precise GPU timing, with 10 warmup and 100 timed iterations. Run-to-run variance of 5-15% is typical due to GPU boost clocks, thermal state, and driver scheduling. Focus on relative speedup trends rather than absolute millisecond values.
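For reference, the CUDA-event timing pattern described above looks roughly like the sketch below. It is not the benchmark script itself, just the standard idiom in PyTorch with the same 10 warmup and 100 timed iterations:

```python
import torch

def time_op(fn, warmup=10, iters=100):
    # Warm up to reach steady clocks and trigger any lazy initialization.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()                 # wait for the recorded events to complete
    return start.elapsed_time(end) / iters   # average milliseconds per call

x = torch.randn(1024, 1024, device="cuda")
print(f"sum: {time_op(lambda: x.sum()):.4f} ms")
```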
BitExact includes deterministic equality tests for all kernels.
To run the test suite, ensure you have PyTest installed. To install PyTest, run:
pip install -U pytest

Then you can run the test suite with:
pytest tests/

Recommended flags:

- -v: verbose flag (shows the result of each individual test)
- -s: don't capture output (allows setup logs from conftest.py to print)
Example:
pytest tests/ -v -s

Because many tests use randomized tensors, running the suite multiple times helps verify reproducibility and numerical stability. You can run the tests any number of times; the examples below simply use 3 as a placeholder.
Linux
for i in {1..3}; do pytest -v; done

Windows
for ($i = 1; $i -le 3; $i++) { pytest -v }

Troubleshooting

- CUDA OOM: close other GPU workloads, then re-run. The cache is auto-cleared; if needed, re-run with -s to confirm the setup logs.
- No GPU: tests require a CUDA-capable device; CPU fallbacks are not provided.
All tests verify bit-exact equivalence to PyTorchβs reference implementations and ensure reproducibility across multiple runs and devices.
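As a sketch of the kind of check such a test performs (not the actual contents of tests/test_determinism.py), the idea is to run the same op twice on identical inputs and require exact bit equality rather than approximate closeness:

```python
import torch
import bitexact

def test_rms_norm_bit_exact_across_runs():
    torch.manual_seed(0)
    x = torch.randn(32, 32, device="cuda")
    w = torch.ones(32, device="cuda")
    # Exact equality, not allclose: any nondeterminism would flip low-order bits.
    assert torch.equal(bitexact.rms_norm(x, w), bitexact.rms_norm(x, w))
```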
The examples/deterministic_inference.py script demonstrates a small neural network using BitExact kernels (matmul, rms_norm, and sigmoid). Running the example verifies that the network's outputs are bit-for-bit identical across multiple runs, confirming complete GPU determinism.
Run the file with:
python examples/deterministic_inference.py
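For a rough idea of what the script does, the sketch below chains the kernels mentioned above and compares two runs. The bitexact.matmul and bitexact.sigmoid names are assumptions based on the kernel table; refer to the example file for the actual code.

```python
import torch
import bitexact

def tiny_net(x, w1, gamma, w2):
    # Hypothetical forward pass built from BitExact kernels; only rms_norm's
    # signature is shown in the Quick Start, the other names are assumed.
    h = bitexact.rms_norm(bitexact.matmul(x, w1), gamma)
    return bitexact.sigmoid(bitexact.matmul(h, w2))

x = torch.randn(8, 16, device="cuda")
w1 = torch.randn(16, 32, device="cuda")
gamma = torch.ones(32, device="cuda")
w2 = torch.randn(32, 4, device="cuda")

# Bit-for-bit identical outputs across two runs of the same forward pass.
print(torch.equal(tiny_net(x, w1, gamma, w2), tiny_net(x, w1, gamma, w2)))
```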
bitexact/
├── bitexact/                        # Python bindings and high-level API
│   └── __init__.py
│
├── benchmarks/                      # Benchmarking suite for performance comparison
│   ├── benchmark.py
│   └── utils.py
│
├── docs/                            # Technical documentation
│   ├── api.md
│   └── design.md
│
├── examples/                        # Minimal runnable examples
│   ├── basic_usage.py               # Simple demonstration of deterministic ops
│   └── deterministic_inference.py   # Reproducible model inference pipeline
│
├── src/                             # Core CUDA/C++ source
│   ├── bindings.cpp                 # PyTorch extension bindings (exposes kernels to Python)
│   │
│   └── ops/                         # Kernel implementations
│       ├── matmul/                  # Matrix multiplication kernels
│       │   ├── matmul.cu
│       │   └── matmul.cuh
│       │
│       ├── reductions/              # Deterministic reduction kernels
│       │   ├── sum.cu
│       │   ├── sum.cuh
│       │   ├── mean.cu
│       │   ├── mean.cuh
│       │   ├── max.cu
│       │   ├── max.cuh
│       │   ├── min.cu
│       │   ├── min.cuh
│       │   ├── var.cu
│       │   └── var.cuh
│       │
│       ├── normalization/           # Normalization kernels
│       │   ├── rms_norm.cu
│       │   ├── rms_norm.cuh
│       │   ├── layer_norm.cu
│       │   └── layer_norm.cuh
│       │
│       ├── activations/             # Activation kernels
│       │   ├── sigmoid.cu
│       │   └── sigmoid.cuh
│       │
│       └── utils/                   # Shared CUDA utilities
│           ├── cuda_utils.cuh       # Common device helpers (grid-stride loops, etc.)
│           ├── dtype_utils.cuh      # Type casting and precision utilities
│           └── reduction.cuh        # Shared reduction patterns for deterministic ops
│
├── tests/                           # Pytest suite
│   ├── conftest.py
│   └── test_determinism.py
│
├── LICENSE                          # License file
├── README.md                        # Project overview and documentation
└── setup.py                         # Build and installation script
Contributions are welcome! If you have an idea for a Kernel, feel free to implement it (the largest missing one is attention).
Please ensure new kernels:
- Pass the deterministic equality tests (see the testing suite).
- Use warp-synchronous, non-atomic reduction patterns.
- Include both .cu and .cuh files and a corresponding test.
This project began as an experiment following a research article. I found it an interesting problem, so I spent part of my reading week building this library. Determinism still interests me, so I will keep developing the library, but on no fixed schedule.
There are many ways the library could be expanded, outlined in the design document. If you are interested, feel free to make a contribution.
This project draws inspiration from research by Thinking Machines Lab on deterministic GPU computation and reproducible deep learning. Their exploration of bit-exact kernels and floating-point determinism informed the design philosophy of BitExact.