A hardware-agnostic parallel autograd engine with pluggable accelerator backends, built from scratch in C++17/CUDA.
ParaGrad is a minimal deep learning framework built around one idea: reduce the entire compute layer to ~19 primitive tensor operations, define them as an abstract interface, and let each hardware target implement only those primitives. The autograd engine, optimizer, and training loop sit above this boundary and are completely hardware-agnostic.
Inspired by micrograd and tinygrad.
- GCC 9+ or Clang 10+ with C++17 support
- OpenMP (ships with GCC)
- NVIDIA GPU + CUDA 11+ (for the `cuda` and `cublas` backends)
```shell
make info
make all
make test
make debug

make BACKEND=cuda all
make BACKEND=cuda test
make BACKEND=cublas all
make BACKEND=cublas bench_cublas

make BACKEND=cuda CUDA_ARCH=sm_80 all            # target a specific GPU arch
make BACKEND=cuda CUDA_HOME=/usr/local/cuda all  # point at a CUDA install
```

On the MGHPCC cluster:

```shell
module load cuda/12.8.0
make BACKEND=cuda CUDA_ARCH=sm_70 all
```

Running tests:

```shell
make test
make test_unit
make test_integration
make debug
make BACKEND=cuda test
```

Test output: `── test_name ✓` for passes, full output on failure.
```shell
make examples

# Train
./build/cpu/train_lm data/shakespeare.txt

# Generate from checkpoint
./build/cpu/generate_lm checkpoint.pgr

# On the cluster (see scripts/)
sbatch scripts/train_shakespeare.sbatch
sbatch scripts/train_tinystories.sbatch
```

Download data:
```shell
mkdir -p data && cd data
wget https://raw.githubusercontent.com/fgnt/mnist/master/train-images-idx3-ubyte.gz
wget https://raw.githubusercontent.com/fgnt/mnist/master/train-labels-idx1-ubyte.gz
wget https://raw.githubusercontent.com/fgnt/mnist/master/t10k-images-idx3-ubyte.gz
wget https://raw.githubusercontent.com/fgnt/mnist/master/t10k-labels-idx1-ubyte.gz
gunzip *.gz && cd ..
```

Train:

```shell
./build/cpu/mnist_train
./build/cpu/mnist_train data/ 5 64 0.1
```
```shell
# Thread scaling study
for t in 1 2 4 8; do
    OMP_NUM_THREADS=$t ./build/cpu/bench_backends
done
```

Architecture: 784 → 256 → 128 → 10, tanh activations, MSE loss.
```shell
make benches                  # build all benchmark binaries
make bench                    # transformer-shaped GEMM (CPU)
make bench_all BACKEND=cuda   # full suite: backends + fusion + training
make bench_all BACKEND=cublas # includes cuBLAS comparison
```

CPU: Cascade Lake 8-core node (`OMP_NUM_THREADS=8`) | GPU: NVIDIA Tesla V100-SXM2-32GB (SM 7.0, 80 SMs) | CUDA: 12.8
Shapes are GPT-2-small/medium FFN and attention layers.
fwd+bwd = 6×M×K×N flops: 2·M·K·N for the forward GEMM, plus 2·M·K·N for each of the two backward GEMMs (input gradient and weight gradient).
Backend Comparison - Transformer-Shaped GEMM (fwd+bwd)

| shape      | M×K×N        | cpu ms | cpu GF/s | cuda ms | cuda GF/s | cublas ms | cublas GF/s |
|------------|--------------|-------:|---------:|--------:|----------:|----------:|------------:|
| ffn_up_s   | 64×768×3072  | 346.9  | 2.6      | 0.497   | 1823.4    | 0.119     | 7602.8      |
| ffn_down_s | 64×3072×768  | 225.0  | 4.0      | 0.450   | 2013.6    | 0.125     | 7234.7      |
| attn_qkv   | 64×768×2304  | 170.1  | 4.0      | 0.335   | 2029.0    | 0.098     | 6968.5      |
| attn_out   | 64×768×768   | 57.1   | 4.0      | 0.124   | 1822.6    | 0.044     | 5113.4      |
| ffn_up_m   | 32×1024×4096 | 535.3  | 1.5      | 0.515   | 1564.1    | 0.136     | 5931.4      |
| ffn_down_m | 32×4096×1024 | 289.3  | 2.8      | 0.485   | 1660.8    | 0.136     | 5915.3      |
Speedup vs CPU: CUDA 461–1039× | cuBLAS ≈1300–3940×
cuBLAS vs CUDA: 2.8–4.2× (Tensor Core vs FP32 FFMA)
CPU Thread Scaling - Cascade Lake 8-core

| shape      | T=1 GF/s | T=2 GF/s | T=4 GF/s | T=8 GF/s |
|------------|---------:|---------:|---------:|---------:|
| ffn_up_s   | 2.7      | 2.9      | 3.0      | 2.9      |
| ffn_down_s | 3.6      | 4.3      | 4.6      | 4.6      |
| ffn_up_m   | 1.5      | 1.7      | 1.7      | 1.7      |
| ffn_down_m | 2.6      | 3.0      | 3.1      | 3.2      |

Throughput saturates at T=4: the kernel is DRAM bandwidth-bound, not compute-bound.
Op Fusion - element-wise chain (neg→exp→tanh→gelu, fwd+bwd)

| n       | cpu unfused | cpu fused | speedup | cuda unfused | cuda fused | speedup |
|--------:|------------:|----------:|--------:|-------------:|-----------:|--------:|
| 1024    | 0.023 ms    | 0.023 ms  | 0.99x   | 0.260 ms     | 0.260 ms   | 1.00x   |
| 16384   | 0.277 ms    | 0.310 ms  | 0.90x   | 0.277 ms     | 0.276 ms   | 1.00x   |
| 65536   | 1.379 ms    | 1.258 ms  | 1.10x   | 0.524 ms     | 0.523 ms   | 1.00x   |
| 262144  | 5.311 ms    | 4.705 ms  | 1.13x   | 1.493 ms     | 1.494 ms   | 1.00x   |
| 1048576 | 20.770 ms   | 19.230 ms | 1.08x   | 3.443 ms     | 3.425 ms   | 1.01x   |
Fusion speedup is near-unity on both backends at these workload sizes.
The GEMM dominates step time; element-wise savings are negligible in total.
Fusion is expected to pay off when chains are long and GEMMs are absent.
Op Fusion - matmul→bias_add→gelu epilogue

| shape                | cpu unfused | cpu fused | speedup | cuda unfused | cuda fused | speedup |
|----------------------|------------:|----------:|--------:|-------------:|-----------:|--------:|
| small (64×256→1024)  | 21.48 ms    | 21.83 ms  | 0.98x   | 0.124 ms     | 0.126 ms   | 0.98x   |
| medium (64×768→3072) | 128.10 ms   | 129.20 ms | 0.99x   | 0.872 ms     | 0.876 ms   | 1.00x   |
| large (32×1024→4096) | 121.60 ms   | 119.40 ms | 1.02x   | 0.482 ms     | 0.487 ms   | 0.99x   |
NVRTC JIT Amortisation (CUDA)

| variant                 | time                            |
|-------------------------|---------------------------------|
| unfused (baseline)      | 1.54 ms                         |
| fused, cold (1st call)  | 0.58 ms (NVRTC compile included) |
| fused, warm (cached)    | 1.54 ms (~1.00× vs unfused)      |
The warm cached kernel is statistically tied with the unfused path on these shapes.
End-to-End Training Throughput - tokens/sec (fused)

mlp_small: 784→256→128→10, gelu activations

| batch | cpu tok/s | cuda tok/s | cublas tok/s |
|------:|----------:|-----------:|-------------:|
| 32    | 4,079     | 68,558     | 64,012       |
| 64    | 4,093     | 218,124    | 225,910      |
| 128   | 6,405     | 337,692    | 365,362      |
| 256   | 8,741     | 336,008    | 338,756      |
| 512   | 10,715    | 352,494    | 369,127      |
mlp_large: 512→2048→2048→512, gelu activations

| batch | cpu tok/s | cuda tok/s | cublas tok/s |
|------:|----------:|-----------:|-------------:|
| 32    | 55        | 28,285     | 60,642       |
| 64    | 54        | 36,769     | 62,559       |
| 128   | 121       | 42,851     | 65,892       |
| 256   | 216       | 47,090     | 70,052       |
| 512   | 360       | 57,968     | 111,992      |
Fused ≈ unfused throughout (consistent with fusion null result above).
cuBLAS pulls 1.5–2.1× ahead of hand-written CUDA on mlp_large (widest at batch 32 and 512).