A tensor compiler in Go — autodiff is a graph rewrite, and kernels fuse across the forward/backward seam.
anneal is a from-scratch Go port of tinygrad's modern, rangeify-era core. It takes tensor programs, lowers them through a graph-rewrite compiler, and emits fused GPU kernels. It trains a small MLP, a small convolutional network, and a char-level nanoGPT end-to-end on real GPU hardware via WebGPU, and runs GPT-2-small forward with bit-identical output to HuggingFace's reference implementation.
It is a research project and a learning vehicle, built deliberately in phases. It is not (yet) a drop-in replacement for a production framework — see Status for exactly what v1 does and doesn't do.
Most autodiff libraries record a tape and replay it. anneal doesn't.
- It's a compiler, not an autodiff library. Everything (forward ops, gradients, movement ops) is a single immutable IR node, the UOp. Computation is suspended until you Realize(), at which point the whole program is one graph the compiler can rewrite, schedule, and fuse.
- Gradients are a rewrite pass. Backward() doesn't build closures; it injects gradient UOps into the same graph as the forward pass. The scheduler then fuses kernels across the forward/backward boundary, an optimization that's structurally impossible with a tape.
- Movement ops are range arithmetic, not copies. reshape, permute, expand, pad, shrink, and flip never move data. They become index math (the rangeify model), and the only thing that ever materializes a buffer is the scheduler.
- It runs in the browser. The same compiler builds to WASM and powers the live visualizer, which runs the real compiler, not a mock.
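To make the "gradients are a rewrite pass" idea concrete, here is a minimal, self-contained sketch in plain Go. It is not anneal's code or API: the Node, Graph, Grad, and Eval names are hypothetical, the op set is reduced to Add and Mul, and interning is done with a simple map. What it shows is the shape of the technique: nodes are immutable and hash-consed, and the backward pass only adds new nodes to the same graph, reusing forward nodes as sources.

```go
package main

import "fmt"

// Op is a toy opcode set; it is not anneal's ops enum.
type Op int

const (
	Const Op = iota
	Var
	Add
	Mul
)

// Node is an immutable expression node, created only through Graph.node.
type Node struct {
	Op   Op
	Name string  // set for Var
	Val  float64 // set for Const
	Src  [2]*Node
}

type key struct {
	op   Op
	name string
	val  float64
	a, b *Node
}

// Graph interns nodes: structurally equal nodes share one pointer,
// so pointer equality is structural equality.
type Graph struct{ interned map[key]*Node }

func NewGraph() *Graph { return &Graph{interned: map[key]*Node{}} }

func (g *Graph) node(op Op, name string, val float64, a, b *Node) *Node {
	k := key{op, name, val, a, b}
	if n, ok := g.interned[k]; ok {
		return n
	}
	n := &Node{Op: op, Name: name, Val: val, Src: [2]*Node{a, b}}
	g.interned[k] = n
	return n
}

func (g *Graph) Const(v float64) *Node { return g.node(Const, "", v, nil, nil) }
func (g *Graph) Var(s string) *Node    { return g.node(Var, s, 0, nil, nil) }
func (g *Graph) Add(a, b *Node) *Node  { return g.node(Add, "", 0, a, b) }
func (g *Graph) Mul(a, b *Node) *Node  { return g.node(Mul, "", 0, a, b) }

// Grad maps every node reachable from root to a node computing d(root)/d(node).
// It records nothing at run time; it only builds more graph.
func (g *Graph) Grad(root *Node) map[*Node]*Node {
	var order []*Node // post-order: sources before consumers
	seen := map[*Node]bool{}
	var visit func(n *Node)
	visit = func(n *Node) {
		if n == nil || seen[n] {
			return
		}
		seen[n] = true
		visit(n.Src[0])
		visit(n.Src[1])
		order = append(order, n)
	}
	visit(root)

	grads := map[*Node]*Node{root: g.Const(1)}
	accumulate := func(n, contrib *Node) {
		if prev, ok := grads[n]; ok {
			contrib = g.Add(prev, contrib)
		}
		grads[n] = contrib
	}
	// Walk consumers before sources so each upstream gradient is complete.
	for i := len(order) - 1; i >= 0; i-- {
		n, up := order[i], grads[order[i]]
		switch n.Op {
		case Add:
			accumulate(n.Src[0], up)
			accumulate(n.Src[1], up)
		case Mul:
			accumulate(n.Src[0], g.Mul(up, n.Src[1]))
			accumulate(n.Src[1], g.Mul(up, n.Src[0]))
		}
	}
	return grads
}

// Eval interprets a node; a real compiler would schedule and fuse instead.
func Eval(n *Node, env map[string]float64) float64 {
	switch n.Op {
	case Const:
		return n.Val
	case Var:
		return env[n.Name]
	case Add:
		return Eval(n.Src[0], env) + Eval(n.Src[1], env)
	default: // Mul
		return Eval(n.Src[0], env) * Eval(n.Src[1], env)
	}
}

func main() {
	g := NewGraph()
	x, y := g.Var("x"), g.Var("y")
	loss := g.Add(g.Mul(x, y), x) // loss = x*y + x
	grads := g.Grad(loss)
	env := map[string]float64{"x": 3, "y": 4}
	fmt.Println(Eval(grads[x], env), Eval(grads[y], env))
}
```

Because the gradient of x is itself a node whose sources include the forward node y, a scheduler looking at this graph sees forward and backward work as one program. That shared structure is what a tape-based design does not expose.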
In the visualizer (and throughout the project) color encodes architecture:
anneal ships a single CLI, anneal, which is the fastest way to see it work.
# install the CLI
go install github.com/georgebuilds/anneal/cmd/anneal@latest
# or, from a clone:
git clone https://github.com/georgebuilds/anneal && cd anneal
go build ./cmd/anneal
Then:
anneal doctor # check your environment can reach a WebGPU device
anneal train mlp # train the MLP with a live TUI dashboard (also: conv, dynmlp --batch=N)
anneal train nanogpt # char-level transformer trained end to end on Shakespeare
anneal gpt2 sample "Hello" # forward GPT-2-small from HuggingFace weights, sample text
anneal graph # dump the UOp graph for a program
anneal kernels # show the scheduled, fused kernels and their WGSL
anneal explain add # explain the rewrite/gradient rules for an op
anneal doctor is the right first command: anneal links the platform WebGPU driver at runtime (zero-CGO), so doctor tells you whether a usable device is present before anything else.
The tensor API will feel familiar if you've used tinygrad or numpy. The key difference is the lazy/realize boundary:
import "github.com/georgebuilds/anneal/tensor"
// ... build a model and a forward pass producing `loss` ...
loss.Backward() // injects gradient UOps into the same graph (teal → ember)
loss.Realize() // schedule, fuse across the seam (gold), compile to WGSL, run
For runnable, end-to-end code, including parameter setup, the training loop, optimizer steps, and generation, see examples/: mlp.go, conv.go, dynmlp.go, nanogpt.go (char-level transformer training), and gpt2/ (HF safetensors load + BPE + autoregressive sample). Those are the canonical reference for the current API surface.
uop/ UOp IR: arena, interning, ops enum, dtype
rewrite/ PatternMatcher, graph-rewrite driver, symbolic rules
shape/ View, ShapeTracker, movement ops
schedule/ rangeify, realize-map, bufferize, kernel split
codegen/ UOp tree → linear instrs → WGSL; opt.go (Opt seam, four kernel transforms), beam.go (BEAM autotuning)
backend/ device abstraction; webgpu/ first
tensor/ Tensor API, ops, autodiff (gradient.go), realize
nn/ Linear, Conv2d, MaxPool2D, Embedding, LayerNorm, CausalSelfAttention,
MLP, Block, GPT, activations, SGD, Adam, Parameter
cmd/anneal/ the CLI
viz/ the WASM visualizer
examples/ mlp.go, conv.go, dynmlp.go, nanogpt.go, gpt2/
internal/
assets/ SHA-pinned downloader for Shakespeare corpus and HF GPT-2 weights
The full architecture — the UOp arena and interning model, the rewrite driver, the rangeify indexing model, the 10-pass scheduler, and the design decisions behind them — lives in SPEC.md. Read it before making non-trivial changes.
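As a rough intuition for "movement ops are index math" before reading SPEC.md, the sketch below models a 2-D view as a function from logical coordinates to a flat buffer offset. This is an illustration under stated assumptions, not anneal's shape/ or schedule/ code: the View type and its methods are hypothetical, and Go closures stand in for the index expressions that anneal keeps in its IR. Permute, flip, and shrink only compose index functions; the single Materialize call is the only place data is read or written.

```go
package main

import "fmt"

// View describes a logical 2-D tensor as index math over a flat buffer.
// Hypothetical type for illustration only.
type View struct {
	Rows, Cols int
	At         func(r, c int) int // logical (r, c) -> flat buffer offset
}

// Contiguous is a row-major view over a rows x cols buffer.
func Contiguous(rows, cols int) View {
	return View{rows, cols, func(r, c int) int { return r*cols + c }}
}

// Permute swaps the two axes by swapping the coordinates it forwards.
func (v View) Permute() View {
	return View{v.Cols, v.Rows, func(r, c int) int { return v.At(c, r) }}
}

// FlipCols reverses the column axis by mirroring the column coordinate.
func (v View) FlipCols() View {
	return View{v.Rows, v.Cols, func(r, c int) int { return v.At(r, v.Cols-1-c) }}
}

// Shrink keeps rows [r0, r1) and columns [c0, c1) by offsetting coordinates.
func (v View) Shrink(r0, r1, c0, c1 int) View {
	return View{r1 - r0, c1 - c0, func(r, c int) int { return v.At(r+r0, c+c0) }}
}

// Materialize is the only function here that touches buffer contents.
func Materialize(buf []float32, v View) []float32 {
	out := make([]float32, 0, v.Rows*v.Cols)
	for r := 0; r < v.Rows; r++ {
		for c := 0; c < v.Cols; c++ {
			out = append(out, buf[v.At(r, c)])
		}
	}
	return out
}

func main() {
	buf := []float32{0, 1, 2, 3, 4, 5} // a 2x3 tensor, row-major
	v := Contiguous(2, 3).Permute().FlipCols()
	fmt.Println(Materialize(buf, v))
}
```

Chaining any number of movement ops here costs no copies; the cost is paid once, when something finally asks for a buffer.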
The line between shipped capabilities and deferred ones is intentional, not accidental. That line has moved since the project started — dynamic-batch training and JIT have landed — but the harder items remain deliberate non-goals for now.
| Capability | Status |
|---|---|
| Reverse-mode autodiff | ✅ Full, via graph rewrite |
| Backend | ✅ WebGPU (native + WASM) |
| Shapes — static | ✅ |
| Shapes — dynamic batch (symbolic) | ✅ NewSymbolicBatchInput + RealizeWithBinding |
| Symbolic shapes — split/merge a symbolic axis, sym pad/shrink, multi-dim sym dispatch | ✅ Shipped |
| Dynamic seq-length tensor API | ⛔ Deferred (the capability is in; the seq-length input constructor is the open work) |
| JIT | ✅ Capture/replay (tensor.JIT) |
| Schedule cache | ✅ Memoized on structural key |
| Devices | Single device |
| Dtypes | f16 ✅ (with shader-f16); bf16 ✅ storage-only (f32 compute); fp8 ⛔ Deferred |
| Multi-device | ⛔ Deferred |
| Image dtypes | ⛔ Deferred |
| BEAM autotuning | ✅ Env-gated (ANNEAL_BEAM=1 to search); persistent disk cache |
The original milestone is met: train a small MLP and a small conv net end-to-end on GPU, with gradients produced by the rewrite pass and kernels fused across the forward/backward boundary. Since then, the following have all shipped: dynamic-batch training (dynmlp, symbolic batch dim); general symbolic axis movement (split/merge a symbolic dim, sym pad/shrink, multi-dim sym dispatch with the symbolic axis in any position on both kernel-output and input buffers); JIT capture/replay; a schedule cache; epilogue fusion (Pass 5 now elides a reduce-output BUFFERIZE into a single downstream elementwise consumer); and BEAM autotuning (env-gated, disk-cached). The remaining deferrals listed above are intentional.
Kernel autotuning: LOCAL applies to multi-dim symbolic kernels. TILE stays unavailable on symbolic axes because WGSL forbids non-const workgroup sizes, a hard platform ceiling. UPCAST and VECTORIZE bail outside matmul shapes for a pre-existing reason unrelated to symbolic shapes. Symbolic kernels still run correctly via the identity codegen path.
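The status table describes the schedule cache as memoized on a structural key. One way to picture that, as a sketch only, is below. The Expr, Key, and ScheduleCache names are hypothetical, and the string serialization is an assumption chosen for readability; nothing here reflects how anneal actually derives its key. The point is that two separately built graphs with the same structure hit the same cache entry, so scheduling runs once.

```go
package main

import (
	"fmt"
	"strings"
)

// Expr is a toy expression node standing in for a graph to be scheduled.
type Expr struct {
	Op  string
	Src []*Expr
}

// Key serializes the structure of e, so structurally identical graphs
// yield the same string even when built from different allocations.
// Assumes op names contain no parentheses or commas.
func Key(e *Expr) string {
	var b strings.Builder
	var walk func(n *Expr)
	walk = func(n *Expr) {
		b.WriteString(n.Op)
		b.WriteByte('(')
		for i, s := range n.Src {
			if i > 0 {
				b.WriteByte(',')
			}
			walk(s)
		}
		b.WriteByte(')')
	}
	walk(e)
	return b.String()
}

// ScheduleCache memoizes an expensive scheduling function on Key.
type ScheduleCache struct {
	entries map[string]string
	Misses  int // number of times the scheduler actually ran
}

func NewScheduleCache() *ScheduleCache {
	return &ScheduleCache{entries: map[string]string{}}
}

// Get returns the cached schedule for e, running schedule only on a miss.
func (c *ScheduleCache) Get(e *Expr, schedule func(*Expr) string) string {
	k := Key(e)
	if s, ok := c.entries[k]; ok {
		return s
	}
	c.Misses++
	s := schedule(e)
	c.entries[k] = s
	return s
}

func main() {
	build := func() *Expr {
		return &Expr{Op: "add", Src: []*Expr{{Op: "x"}, {Op: "mul", Src: []*Expr{{Op: "y"}, {Op: "z"}}}}}
	}
	cache := NewScheduleCache()
	schedule := func(e *Expr) string { return "kernel for " + Key(e) }
	cache.Get(build(), schedule)
	cache.Get(build(), schedule) // same structure, different pointers: cache hit
	fmt.Println("scheduler runs:", cache.Misses)
}
```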
Contributions are welcome, but anneal has a small set of hard invariants (immutable IR, identity equality via interning, no reflection in the rewrite hot path, no copies from movement ops, no SMT solver in indexing) that keep the design coherent. Please read CONTRIBUTING.md before opening a PR.
anneal is largely a port of, and owes its architecture to, tinygrad by the tinygrad authors. The reference target is a pinned tinygrad commit (see CONTRIBUTING.md); blog-era LazyBuffer/Linearizer descriptions of tinygrad do not describe this design.
GPU access is via gogpu/wgpu and goffi (zero-CGO).