BitExact

Deterministic CUDA Kernels for Reproducible Deep Learning


BitExact is a research-driven CUDA library providing bit-exact deterministic GPU tensor operations. It ensures identical floating-point results across runs, batches, and devices, removing nondeterminism from key deep-learning computations.

The library is designed to be plug-and-play with PyTorch: it can serve as a drop-in replacement for selected PyTorch tensor operations while guaranteeing bit-level reproducibility.

BitExact is particularly suited for:

  • Model reproducibility research - verifying training consistency across runs
  • Numerical analysis and benchmarking - comparing model outputs with precision guarantees
  • Deployment pipelines where deterministic inference is required for compliance or scientific validation
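The nondeterminism BitExact removes stems from a basic fact: floating-point addition is not associative, so changing the order in which a GPU combines partial sums changes the low-order bits of the result. A minimal pure-Python illustration (doubles here, but the same effect applies to FP32 on the GPU):

```python
# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)   # False: the two groupings differ in the last bit
print(a - b)    # a tiny, but nonzero, difference
```

On a GPU, atomics and thread scheduling decide this grouping at run time, which is why the same reduction can return different bits on different runs.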


Quick Start Example

import torch, bitexact

x = torch.randn(4, 4, device="cuda")
w = torch.ones(4, device="cuda")

y = bitexact.rms_norm(x, w)
print(y)
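"Bit-exact" here means byte-identical results, not merely equal within a tolerance. A small stdlib helper that makes the distinction concrete (illustrative only, not part of the BitExact API):

```python
import struct

def bits_equal(a: float, b: float) -> bool:
    """True only if a and b share an identical IEEE-754 bit pattern."""
    return struct.pack("<d", a) == struct.pack("<d", b)

print(bits_equal(0.1 + 0.2, 0.3))  # False: close, but not bit-identical
print(bits_equal(0.5, 0.5))        # True
```

In PyTorch, `torch.equal(y1, y2)` gives the analogous exact elementwise comparison for tensors.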

Current Features

| Category | Kernel Operation | Reference |
| --- | --- | --- |
| Linear Algebra | Matrix Multiplication | MatMul |
| Normalization | RMS Normalization | RmsNorm |
| Normalization | Layer Normalization | LayerNorm |
| Reductions | Sum | Sum |
| Reductions | Mean | Mean |
| Reductions | Max | Max |
| Reductions | Min | Min |
| Activations | Sigmoid | Sigmoid |

More Deterministic Kernels May Be Coming Soon

Installation

Prerequisites

  • Python $\geq 3.9$
  • CUDA $\geq 12.0$
  • PyTorch $\geq 2.1$
  • A C++ compiler (MSVC 2022 / gcc $\geq 9$)

From Source

git clone https://github.com/aaravkohli1/BitExact.git
cd BitExact
pip install . --no-build-isolation

PyPI

pip install bitexact

Troubleshooting

Working with CUDA builds can be tricky; these tips cover the most common issues:

  • CUDA_HOME environment variable is not set: make sure CUDA is installed and CUDA_HOME is set before running pip install
  • cannot find cl.exe: install Visual Studio Build Tools with C++ support (Use "Developer PowerShell for VS 2022" )
  • Build takes a long time: this is normal - CUDA extensions compile during installation

Performance at a Glance

| Operation | Throughput (vs PyTorch) | Notes |
| --- | --- | --- |
| Matrix Multiplication | 0.47x | Slower than cuBLAS; PyTorch's highly tuned GEMM outperforms deterministic reduction. |
| RMS Normalization | 5.09x | Fused mean, sqrt, and scaling operations reduce kernel launches and memory access. |
| Layer Normalization | 1.66x | Fused single-kernel variance reduces global memory passes and improves speed on small tensors. |
| Sum | 1.98x | Optimized shared-memory reduction with fixed traversal order for determinism. |
| Mean | 1.69x | Builds on the Sum kernel with deterministic normalization by element count. |
| Max | 1.75x | Deterministic warp-level reduction; avoids divergent branching used in PyTorch. |
| Min | 1.98x | Similar to Max; uses unified deterministic traversal for all elements. |
| Variance | 1.35x | Uses fused E[x²] - (E[x])² formulation with deterministic accumulation. |
| Sigmoid | 0.92x | Identical arithmetic to PyTorch; near-equal performance and perfect bit equivalence. |
| Average | 1.88x | Tests performed on small-scale tensors; PyTorch is optimized for large batch sizes. |

(Benchmarked on NVIDIA GeForce RTX 4060 Ti, PyTorch 2.6.0, CUDA 12.5)

Interpretation of Results

BitExact’s performance advantage comes primarily from kernel fusion and deterministic reduction order, which minimize synchronization and memory traffic. However, PyTorch’s fused kernels outperform in large-batch GEMM and high-throughput workloads. These results emphasize that BitExact prioritizes determinism and reproducibility over raw FLOPS.
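"Deterministic reduction order" means the combination tree is fixed by the input layout rather than by thread scheduling. The idea can be sketched in pure Python as a pairwise (tree) reduction; this illustrates the concept only, not BitExact's actual CUDA scheme (see src/ops/reductions/ for that):

```python
def pairwise_sum(xs):
    """Sum by a fixed binary tree: same input -> same bits, every call."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

data = [0.1] * 7 + [1e16, -1e16]
r1 = pairwise_sum(data)
r2 = pairwise_sum(data)
assert r1 == r2  # bit-identical across calls, by construction
```

Because the tree shape depends only on the input length, the result never varies with how work happens to be scheduled, which is exactly the property atomic-based GPU reductions lack.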

Local Benchmarks

To see how BitExact benchmarks on your machine, run:

python benchmarks/benchmark.py

Example output

BitExact vs PyTorch - Benchmark Suite

Operation      Torch (ms)  BitExact (ms)      Speed     Max Diff    Match
-------------------------------------------------------------------------
MatMul             0.0336         0.0692      0.48x     1.07e-04     True
Sum                0.0086         0.0117      0.73x     1.14e-05     True
Mean               0.0083         0.0079      1.05x     1.12e-08     True
Max                0.0087         0.0117      0.74x     0.00e+00     True
Min                0.0097         0.0080      1.21x     0.00e+00     True
Sigmoid            0.0074         0.0073      1.01x     0.00e+00     True
RMSNorm            0.0430         0.0084      5.12x     1.91e-06     True
LayerNorm          0.0881         0.0547      1.61x     1.91e-06     True
Variance           0.0311         0.0266      1.17x     2.38e-07     True

Note: Matches use atol=1e-4, rtol=1e-6 tolerance (within FP32 rounding).
-------------------------------------------------------------------------
Summary
-------------------------------------------------------------------------
Operations faster than PyTorch: 6/9
All operations deterministic: True
Average speedup: 1.46x
=========================================================================

All measurements use CUDA events for precise GPU timing with 10 warmup and 100 timed iterations. Run-to-run variance of 5-15% is typical due to GPU boost clocks, thermal state, and driver scheduling. Focus on relative speedup trends rather than absolute millisecond values.

Testing

BitExact includes deterministic equality tests for all kernels.

To run the test suite, first install pytest:

pip install -U pytest

Then you can run the test suite with:

pytest tests/

Recommended Flags

  • -v - Verbose flag (shows results of each individual test)
  • -s - Don’t capture output (allows setup logs from conftest.py)

Example:

pytest tests/ -v -s

Because many tests use randomized tensors, running the suite multiple times helps verify reproducibility and numerical stability. You can run the tests any number of times; the examples below simply use 3 as a placeholder.

Linux

for i in {1..3}; do pytest -v; done

Windows

for ($i = 1; $i -le 3; $i++) { pytest -v }

Troubleshooting

  • CUDA OOM: close other GPU workloads, then re-run. Cache is auto-cleared; if needed, re-run with -s to confirm setup logs.
  • No GPU: tests require a CUDA-capable device; CPU fallbacks are not provided.

All tests verify bit-exact equivalence to PyTorch’s reference implementations and ensure reproducibility across multiple runs and devices.
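Each determinism test follows the same pattern: run the operation twice on identical input and demand exact equality rather than an `allclose` tolerance. A pytest-style sketch with a pure-Python stand-in for the kernel call (the real suite, in tests/test_determinism.py, exercises the CUDA ops on the GPU):

```python
import random

def test_run_to_run_determinism():
    random.seed(0)
    x = [random.random() for _ in range(1024)]
    # Stand-in for a BitExact kernel: a fixed left-to-right traversal.
    r1 = sum(x)
    r2 = sum(x)
    assert r1 == r2  # exact equality, not an approximate tolerance

test_run_to_run_determinism()
```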

Deterministic Inference

The examples/deterministic_inference.py script demonstrates a small neural network using BitExact kernels (matmul, rms_norm, and sigmoid). Running the example verifies that the network’s outputs are bit-for-bit identical across multiple runs, confirming complete GPU determinism.

Run the file with:

python examples/deterministic_inference.py
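For reference, RMS normalization computes, per row, `y[i] = x[i] * w[i] / sqrt(mean(x²) + eps)`. A pure-Python sketch of that arithmetic (the `eps` default below is a common choice, not confirmed against BitExact's actual signature; see docs/api.md):

```python
import math

def rms_norm_ref(x, w, eps=1e-6):
    """Reference RMS norm: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i]."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * wi for v, wi in zip(x, w)]

print(rms_norm_ref([3.0, 4.0], [1.0, 1.0], eps=0.0))
```

Computed with a fixed reduction order for the mean of squares, as BitExact's kernel does on the GPU, the same input always yields the same bits.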

Project Structure


bitexact/
├── bitexact/                         # Python bindings and high-level API
│   └── __init__.py
│
├── benchmarks/                       # Benchmarking suite for performance comparison
│   ├── benchmark.py
│   └── utils.py
│
├── docs/                             # Technical documentation
│   ├── api.md
│   └── design.md
│
├── examples/                         # Minimal runnable examples
│   ├── basic_usage.py                # Simple demonstration of deterministic ops
│   └── deterministic_inference.py    # Reproducible model inference pipeline
│
├── src/                              # Core CUDA/C++ source
│   ├── bindings.cpp                  # PyTorch extension bindings (exposes kernels to Python)
│   │
│   └── ops/                          # Kernel implementations
│       ├── matmul/                   # Matrix multiplication kernels
│       │   ├── matmul.cu
│       │   └── matmul.cuh
│       │
│       ├── reductions/               # Deterministic reduction kernels
│       │   ├── sum.cu
│       │   ├── sum.cuh
│       │   ├── mean.cu
│       │   ├── mean.cuh
│       │   ├── max.cu
│       │   ├── max.cuh
│       │   ├── min.cu
│       │   ├── min.cuh
│       │   ├── var.cu
│       │   └── var.cuh
│       │
│       ├── normalization/            # Normalization kernels
│       │   ├── rms_norm.cu
│       │   ├── rms_norm.cuh
│       │   ├── layer_norm.cu
│       │   └── layer_norm.cuh
│       │
│       ├── activations/              # Activation kernels
│       │   ├── sigmoid.cu
│       │   └── sigmoid.cuh
│       │
│       └── utils/                    # Shared CUDA utilities
│           ├── cuda_utils.cuh        # Common device helpers (grid-stride loops, etc.)
│           ├── dtype_utils.cuh       # Type casting and precision utilities
│           └── reduction.cuh         # Shared reduction patterns for deterministic ops
│
├── tests/                            # Pytest suite
│   ├── conftest.py
│   └── test_determinism.py
│
├── LICENSE                           # License file
├── README.md                         # Project overview and documentation
└── setup.py                          # Build and installation script

Contributions

Contributions are welcome! If you have an idea for a kernel, feel free to implement it (the largest missing one is attention).

Please ensure new kernels:

  1. Pass the deterministic equality tests (see the testing suite).
  2. Use warp-synchronous, non-atomic reduction patterns.
  3. Include both .cu and .cuh files and a corresponding test.

Project Status

This project began as an experiment following a research article. I found it an interesting problem, so I spent part of my reading week building this library. Determinism remains a problem I find genuinely interesting, so I will keep developing the library, though on no fixed schedule.

There are many ways the library could be expanded, outlined in the design document. If you are interested, feel free to make a contribution.

Acknowledgements

This project draws inspiration from research by Thinking Machines Lab on deterministic GPU computation and reproducible deep learning. Their exploration of bit-exact kernels and floating-point determinism informed the design philosophy of BitExact.

