georgebuilds/anneal

anneal

A tensor compiler in Go — autodiff is a graph rewrite, and kernels fuse across the forward/backward seam.

Visualizer · Architecture (SPEC) · Contributing


anneal is a from-scratch Go port of tinygrad's modern, rangeify-era core. It takes tensor programs, lowers them through a graph-rewrite compiler, and emits fused GPU kernels. It trains a small MLP, a small convolutional network, and a char-level nanoGPT end-to-end on real GPU hardware via WebGPU, and runs GPT-2-small forward with bit-identical output to HuggingFace's reference implementation.

It is a research project and a learning vehicle, built deliberately in phases. It is not (yet) a drop-in replacement for a production framework — see Status for exactly what v1 does and doesn't do.

What anneal is

Most autodiff libraries record a tape and replay it. anneal doesn't.

  • It's a compiler, not an autodiff library. Everything (forward ops, gradients, movement ops) is expressed as one immutable IR node type, the UOp. Computation is deferred until you call Realize(), at which point the whole program is one graph the compiler can rewrite, schedule, and fuse.
  • Gradients are a rewrite pass. Backward() doesn't build closures; it injects gradient UOps into the same graph as the forward pass. The scheduler then fuses kernels across the forward/backward boundary — an optimization that's structurally impossible with a tape.
  • Movement ops are range arithmetic, not copies. reshape, permute, expand, pad, shrink, and flip never move data. They become index math (the rangeify model), and the only thing that ever materializes a buffer is the scheduler.
  • It runs in the browser. The same compiler builds to WASM and powers the live visualizer, which runs the real compiler, not a mock.
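
To make the "gradients are a rewrite pass" idea above concrete, here is a minimal, self-contained sketch. It is not anneal's code or API: the Node type, the four-op set, and the Grad and Eval helpers are all hypothetical, chosen only to show that reverse-mode differentiation can produce new graph nodes that reference the forward nodes directly, rather than recording a tape of closures.

```go
package main

import "fmt"

// Op is a toy op set (an assumption; anneal's real op enum lives in uop/).
type Op int

const (
	Const Op = iota
	Input
	Add
	Mul
)

// Node is an expression node that is never modified after construction.
type Node struct {
	Op   Op
	Val  float64 // used by Const
	Name string  // used by Input
	Src  []*Node
}

func constant(v float64) *Node { return &Node{Op: Const, Val: v} }
func input(name string) *Node  { return &Node{Op: Input, Name: name} }
func add(a, b *Node) *Node     { return &Node{Op: Add, Src: []*Node{a, b}} }
func mul(a, b *Node) *Node     { return &Node{Op: Mul, Src: []*Node{a, b}} }

// topo returns the nodes reachable from root, sources before consumers.
func topo(root *Node) []*Node {
	var order []*Node
	seen := map[*Node]bool{}
	var visit func(n *Node)
	visit = func(n *Node) {
		if seen[n] {
			return
		}
		seen[n] = true
		for _, s := range n.Src {
			visit(s)
		}
		order = append(order, n)
	}
	visit(root)
	return order
}

// accumulate sums gradient contributions by building an Add node.
func accumulate(grads map[*Node]*Node, n, g *Node) {
	if prev, ok := grads[n]; ok {
		grads[n] = add(prev, g)
		return
	}
	grads[n] = g
}

// Grad maps each node to a new node computing d(root)/d(node).
// The gradient nodes reuse forward nodes as sources, so forward and
// backward live in one graph.
func Grad(root *Node) map[*Node]*Node {
	order := topo(root)
	grads := map[*Node]*Node{root: constant(1)}
	for i := len(order) - 1; i >= 0; i-- { // consumers before sources
		n := order[i]
		g, ok := grads[n]
		if !ok {
			continue
		}
		switch n.Op {
		case Add:
			accumulate(grads, n.Src[0], g)
			accumulate(grads, n.Src[1], g)
		case Mul:
			accumulate(grads, n.Src[0], mul(g, n.Src[1]))
			accumulate(grads, n.Src[1], mul(g, n.Src[0]))
		}
	}
	return grads
}

// Eval interprets a graph; a real compiler would schedule and fuse instead.
func Eval(n *Node, env map[string]float64) float64 {
	switch n.Op {
	case Const:
		return n.Val
	case Input:
		return env[n.Name]
	case Add:
		return Eval(n.Src[0], env) + Eval(n.Src[1], env)
	case Mul:
		return Eval(n.Src[0], env) * Eval(n.Src[1], env)
	}
	panic("unknown op")
}

func main() {
	x, w := input("x"), input("w")
	loss := add(mul(x, w), w) // loss = x*w + w
	grads := Grad(loss)
	env := map[string]float64{"x": 3, "w": 2}
	fmt.Println(Eval(grads[w], env), Eval(grads[x], env)) // 4 2
}
```

Because the gradient for w is itself a graph that shares the forward node x, a scheduler working on such a graph can see forward and backward computation together, which is the property the bullets above rely on.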

In the visualizer (and throughout the project) color encodes architecture:

forward (teal) · backward (ember) · fused (gold)

Quickstart

anneal ships a single CLI, anneal, which is the fastest way to see it work.

# install the CLI
go install github.com/georgebuilds/anneal/cmd/anneal@latest

# or, from a clone:
git clone https://github.com/georgebuilds/anneal && cd anneal
go build ./cmd/anneal

Then:

anneal doctor               # check your environment can reach a WebGPU device
anneal train mlp            # train the MLP with a live TUI dashboard (also: conv, dynmlp --batch=N)
anneal train nanogpt        # char-level transformer trained end to end on Shakespeare
anneal gpt2 sample "Hello"  # forward GPT-2-small from HuggingFace weights, sample text
anneal graph                # dump the UOp graph for a program
anneal kernels              # show the scheduled, fused kernels and their WGSL
anneal explain add          # explain the rewrite/gradient rules for an op

anneal doctor is the right first command: anneal links the platform WebGPU driver at runtime (zero-CGO), so doctor tells you whether a usable device is present before anything else.

Using anneal as a library

The tensor API will feel familiar if you've used tinygrad or numpy. The key difference is the lazy/realize boundary:

import "github.com/georgebuilds/anneal/tensor"

// ... build a model and a forward pass producing `loss` ...

loss.Backward()   // injects gradient UOps into the same graph (teal → ember)
loss.Realize()    // schedule, fuse across the seam (gold), compile to WGSL, run

For runnable, end-to-end code, including parameter setup, the training loop, optimizer steps, and generation, see examples/: mlp.go, conv.go, dynmlp.go, nanogpt.go (char-level transformer training), and gpt2/ (HF safetensors load + BPE + autoregressive sample). Those are the canonical reference for the current API surface.
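
The lazy/realize boundary described in this section can also be illustrated independently of anneal's API. The sketch below is a toy under stated assumptions: the Lazy type and its FromSlice, Add, Mul, and Realize methods are hypothetical names, and the "compiler" is just a recursive per-element evaluator. What it shows is the shape of the idea: operations only build a graph, and Realize walks the whole expression in one loop without materializing intermediates.

```go
package main

import "fmt"

// Lazy is a hypothetical deferred 1-D tensor: either a leaf holding data
// or an op node holding its sources. Building one runs no arithmetic.
type Lazy struct {
	op   string // "load", "add", or "mul"
	n    int    // element count
	data []float64
	src  []*Lazy
}

// FromSlice wraps existing data as a leaf node.
func FromSlice(v []float64) *Lazy { return &Lazy{op: "load", n: len(v), data: v} }

func binary(op string, a, b *Lazy) *Lazy {
	if a.n != b.n {
		panic("length mismatch")
	}
	return &Lazy{op: op, n: a.n, src: []*Lazy{a, b}}
}

// Add and Mul only record the operation in the graph.
func (a *Lazy) Add(b *Lazy) *Lazy { return binary("add", a, b) }
func (a *Lazy) Mul(b *Lazy) *Lazy { return binary("mul", a, b) }

// at evaluates the whole expression for a single element index.
func (t *Lazy) at(i int) float64 {
	switch t.op {
	case "load":
		return t.data[i]
	case "add":
		return t.src[0].at(i) + t.src[1].at(i)
	case "mul":
		return t.src[0].at(i) * t.src[1].at(i)
	}
	panic("unknown op")
}

// Realize is the only place work happens: one loop over the output,
// with no intermediate buffer for the inner Add.
func (t *Lazy) Realize() []float64 {
	out := make([]float64, t.n)
	for i := range out {
		out[i] = t.at(i)
	}
	return out
}

func main() {
	a := FromSlice([]float64{1, 2, 3})
	b := FromSlice([]float64{10, 20, 30})
	y := a.Add(b).Mul(a)     // graph only, nothing computed yet
	fmt.Println(y.Realize()) // [11 44 99]
}
```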

Project layout

uop/         UOp IR: arena, interning, ops enum, dtype
rewrite/     PatternMatcher, graph-rewrite driver, symbolic rules
shape/       View, ShapeTracker, movement ops
schedule/    rangeify, realize-map, bufferize, kernel split
codegen/     UOp tree → linear instrs → WGSL; opt.go (Opt seam, four kernel transforms), beam.go (BEAM autotuning)
backend/     device abstraction; webgpu/ first
tensor/      Tensor API, ops, autodiff (gradient.go), realize
  nn/        Linear, Conv2d, MaxPool2D, Embedding, LayerNorm, CausalSelfAttention,
             MLP, Block, GPT, activations, SGD, Adam, Parameter
cmd/anneal/  the CLI
viz/         the WASM visualizer
examples/    mlp.go, conv.go, dynmlp.go, nanogpt.go, gpt2/
internal/
  assets/    SHA-pinned downloader for Shakespeare corpus and HF GPT-2 weights
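
As a rough illustration of what a pattern-matching graph-rewrite driver like the one listed under rewrite/ does, here is a small sketch. It assumes nothing about anneal's actual PatternMatcher: the Node type, the Rule signature, and the three algebraic rules are hypothetical, and the driver is a plain bottom-up rewrite that reapplies rules until none match.

```go
package main

import "fmt"

// Node is a toy expression node, treated as immutable: rewrites build
// new nodes instead of editing existing ones.
type Node struct {
	Op   string // "const", "var", "add", or "mul"
	Val  float64
	Name string
	Src  []*Node
}

// Rule returns a replacement node and true when it matches n.
type Rule func(n *Node) (*Node, bool)

func isConst(n *Node, v float64) bool { return n.Op == "const" && n.Val == v }

// rules is a hypothetical set of symbolic simplifications.
var rules = []Rule{
	func(n *Node) (*Node, bool) { // x * 1 -> x
		if n.Op == "mul" && isConst(n.Src[1], 1) {
			return n.Src[0], true
		}
		return nil, false
	},
	func(n *Node) (*Node, bool) { // x + 0 -> x
		if n.Op == "add" && isConst(n.Src[1], 0) {
			return n.Src[0], true
		}
		return nil, false
	},
	func(n *Node) (*Node, bool) { // fold an op on two constants
		if len(n.Src) != 2 || n.Src[0].Op != "const" || n.Src[1].Op != "const" {
			return nil, false
		}
		a, b := n.Src[0].Val, n.Src[1].Val
		switch n.Op {
		case "add":
			return &Node{Op: "const", Val: a + b}, true
		case "mul":
			return &Node{Op: "const", Val: a * b}, true
		}
		return nil, false
	},
}

// Rewrite simplifies sources first, then applies the first matching
// rule and rewrites the result again until no rule matches.
func Rewrite(n *Node, rs []Rule) *Node {
	src := make([]*Node, len(n.Src))
	changed := false
	for i, s := range n.Src {
		src[i] = Rewrite(s, rs)
		if src[i] != s {
			changed = true
		}
	}
	if changed {
		n = &Node{Op: n.Op, Val: n.Val, Name: n.Name, Src: src}
	}
	for _, r := range rs {
		if out, ok := r(n); ok {
			return Rewrite(out, rs)
		}
	}
	return n
}

func c(v float64) *Node { return &Node{Op: "const", Val: v} }

func main() {
	x := &Node{Op: "var", Name: "x"}
	// (x * 1) + (2 + -2) simplifies all the way back to x.
	expr := &Node{Op: "add", Src: []*Node{
		{Op: "mul", Src: []*Node{x, c(1)}},
		{Op: "add", Src: []*Node{c(2), c(-2)}},
	}}
	fmt.Println(Rewrite(expr, rules) == x) // true
}
```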

The full architecture — the UOp arena and interning model, the rewrite driver, the rangeify indexing model, the 10-pass scheduler, and the design decisions behind them — lives in SPEC.md. Read it before making non-trivial changes.
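
The claim that movement ops are index math rather than copies can be sketched with ordinary strided views. This is a deliberately simpler stand-in, not a description of the rangeify model in SPEC.md: the View struct and its Contiguous, Permute, Flip, and Index helpers are hypothetical. The point is only that permuting or flipping changes how an index maps to a buffer offset while the buffer itself is untouched.

```go
package main

import "fmt"

// View describes how a multi-dimensional index maps into a flat buffer.
type View struct {
	Shape   []int
	Strides []int
	Offset  int
}

// Contiguous builds a row-major view over a buffer of the given shape.
func Contiguous(shape ...int) View {
	strides := make([]int, len(shape))
	acc := 1
	for i := len(shape) - 1; i >= 0; i-- {
		strides[i] = acc
		acc *= shape[i]
	}
	return View{Shape: append([]int(nil), shape...), Strides: strides}
}

func (v View) clone() View {
	return View{
		Shape:   append([]int(nil), v.Shape...),
		Strides: append([]int(nil), v.Strides...),
		Offset:  v.Offset,
	}
}

// Permute reorders axes by reordering shape and strides; no data moves.
func (v View) Permute(axes ...int) View {
	out := View{
		Shape:   make([]int, len(axes)),
		Strides: make([]int, len(axes)),
		Offset:  v.Offset,
	}
	for i, a := range axes {
		out.Shape[i] = v.Shape[a]
		out.Strides[i] = v.Strides[a]
	}
	return out
}

// Flip reverses one axis by negating its stride and moving the offset
// to that axis's last element.
func (v View) Flip(axis int) View {
	out := v.clone()
	out.Offset += (v.Shape[axis] - 1) * v.Strides[axis]
	out.Strides[axis] = -v.Strides[axis]
	return out
}

// Index converts a multi-dimensional index into a flat buffer offset.
func (v View) Index(idx ...int) int {
	off := v.Offset
	for i, x := range idx {
		off += x * v.Strides[i]
	}
	return off
}

func main() {
	buf := []float64{0, 1, 2, 3, 4, 5} // a 2x3 matrix, row-major
	v := Contiguous(2, 3)
	t := v.Permute(1, 0) // 3x2 transpose
	f := v.Flip(1)       // each row reversed
	fmt.Println(buf[t.Index(2, 1)], buf[f.Index(0, 0)], buf[f.Index(1, 2)]) // 5 2 3
}
```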

Status

The line between shipped and deferred capabilities is deliberate. It has moved since the project started (dynamic-batch training and JIT have landed), but the harder items remain non-goals for now.

Capability                          Status
Reverse-mode autodiff               ✅ Full, via graph rewrite
Backend                             ✅ WebGPU (native + WASM)
Shapes: static
Shapes: dynamic batch (symbolic)    NewSymbolicBatchInput + RealizeWithBinding
Symbolic shapes: split/merge a      ✅ Shipped
  symbolic axis, sym pad/shrink,
  multi-dim sym dispatch
Dynamic seq-length tensor API       ⛔ Deferred (the capability is in; the seq-length
                                    input constructor is the open work)
JIT                                 ✅ Capture/replay (tensor.JIT)
Schedule cache                      ✅ Memoized on structural key
Devices                             Single device
Dtypes                              f16 ✅ (with shader-f16); bf16 ✅ storage-only
                                    (f32 compute); fp8 ⛔ Deferred
Multi-device                        ⛔ Deferred
Image dtypes                        ⛔ Deferred
BEAM autotuning                     ✅ Env-gated (ANNEAL_BEAM=1 to search);
                                    persistent disk cache

The original milestone is met: train a small MLP and a small conv net end-to-end on GPU, with gradients produced by the rewrite pass and kernels fused across the forward/backward boundary.

Since then, the following have all shipped: dynamic-batch training (dynmlp, symbolic batch dim), general symbolic axis movement (split/merge a symbolic dim, sym pad/shrink, multi-dim sym dispatch with the symbolic axis in any position on both kernel-output and input buffers), JIT capture/replay, a schedule cache, epilogue fusion (Pass 5 now elides a reduce-output BUFFERIZE into a single downstream elementwise consumer), and BEAM autotuning (env-gated, disk-cached). The remaining deferrals listed above are intentional.

Kernel autotuning: LOCAL applies to multi-dim symbolic kernels. TILE stays unavailable on symbolic axes because WGSL forbids non-constant workgroup sizes, a hard platform ceiling. UPCAST and VECTORIZE bail outside matmul shapes for a pre-existing reason unrelated to symbolic shapes. Symbolic kernels still run correctly via the identity codegen path.

Contributing

Contributions are welcome, but anneal has a small set of hard invariants (immutable IR, identity equality via interning, no reflection in the rewrite hot path, no copies from movement ops, no SMT solver in indexing) that keep the design coherent. Please read CONTRIBUTING.md before opening a PR.
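
For readers new to the "identity equality via interning" invariant, the following sketch shows the general technique (hash-consing). It is not anneal's arena: the UOp fields, the Arena type, and the Intern method are hypothetical. Constructing the same op over the same sources twice returns the same pointer, so structural equality reduces to a pointer comparison. The key is built with strconv and strings.Builder rather than fmt, in the spirit of the no-reflection invariant.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// UOp is a toy interned node. By convention it is never mutated after
// the arena hands it out.
type UOp struct {
	id  int // position in the arena, used to build structural keys
	Op  string
	Arg int
	Src []*UOp
}

// Arena owns every node and guarantees one node per distinct structure.
type Arena struct {
	nodes []*UOp
	index map[string]*UOp
}

func NewArena() *Arena { return &Arena{index: map[string]*UOp{}} }

// Intern returns the existing node for (op, arg, src...) or creates it.
// Sources are already interned, so their ids fully describe the subtree.
// Op names are assumed not to contain '/'.
func (a *Arena) Intern(op string, arg int, src ...*UOp) *UOp {
	var key strings.Builder
	key.WriteString(op)
	key.WriteByte('/')
	key.WriteString(strconv.Itoa(arg))
	for _, s := range src {
		key.WriteByte('/')
		key.WriteString(strconv.Itoa(s.id))
	}
	k := key.String()
	if n, ok := a.index[k]; ok {
		return n
	}
	n := &UOp{id: len(a.nodes), Op: op, Arg: arg, Src: append([]*UOp(nil), src...)}
	a.nodes = append(a.nodes, n)
	a.index[k] = n
	return n
}

func main() {
	a := NewArena()
	x := a.Intern("INPUT", 0)
	y := a.Intern("INPUT", 1)
	e1 := a.Intern("ADD", 0, x, y)
	e2 := a.Intern("ADD", 0, x, y) // same structure, same pointer
	e3 := a.Intern("ADD", 0, y, x) // different source order, different node
	fmt.Println(e1 == e2, e1 == e3) // true false
}
```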

Credits

anneal is largely a port of, and owes its architecture to, tinygrad by the tinygrad authors. The reference target is a pinned tinygrad commit (see CONTRIBUTING.md); blog-era LazyBuffer/Linearizer descriptions of tinygrad do not describe this design.

GPU access is via gogpu/wgpu and goffi (zero-CGO).

License
