anneal

A tensor compiler in Go — autodiff is a graph rewrite, and kernels fuse across the forward/backward seam.

Visualizer · Architecture (SPEC) · Contributing

anneal is a from-scratch Go port of tinygrad's modern, rangeify-era core. It takes tensor programs, lowers them through a graph-rewrite compiler, and emits fused GPU kernels. It trains a small MLP, a small convolutional network, and a char-level nanoGPT end-to-end on real GPU hardware via WebGPU, and runs GPT-2-small forward with bit-identical output to HuggingFace's reference implementation.

It is a research project and a learning vehicle, built deliberately in phases. It is not (yet) a drop-in replacement for a production framework — see Status for exactly what v1 does and doesn't do.

What anneal is

Most autodiff libraries record a tape and replay it. anneal doesn't.

It's a compiler, not an autodiff library. Everything — forward ops, gradients, movement ops — is a single immutable IR node (the UOp). Computation is suspended until you Realize(), at which point the whole program is one graph the compiler can rewrite, schedule, and fuse.
Gradients are a rewrite pass. Backward() doesn't build closures; it injects gradient UOps into the same graph as the forward pass. The scheduler then fuses kernels across the forward/backward boundary — an optimization that's structurally impossible with a tape.
Movement ops are range arithmetic, not copies. reshape, permute, expand, pad, shrink, and flip never move data. They become index math (the rangeify model), and the only thing that ever materializes a buffer is the scheduler.
It runs in the browser. The same compiler builds to WASM and powers the live visualizer, which runs the real compiler, not a mock.
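
To make "movement ops are index math" concrete, here is a minimal sketch under stated assumptions. It uses a simplified strided view, which is not anneal's rangeify representation; the `View` type and its methods are hypothetical names invented for illustration. The point it shows is only that permute and flip can be expressed as changes to how a coordinate maps to a buffer position, without touching the buffer.

```go
package main

import "fmt"

// View describes how to read a flat buffer as an N-d tensor.
// Hypothetical type for illustration; not anneal's shape.View.
type View struct {
	Shape   []int
	Strides []int
	Offset  int
}

// Contiguous returns a row-major view for the given shape.
func Contiguous(shape ...int) View {
	strides := make([]int, len(shape))
	acc := 1
	for i := len(shape) - 1; i >= 0; i-- {
		strides[i] = acc
		acc *= shape[i]
	}
	return View{Shape: shape, Strides: strides}
}

// Permute reorders axes by reordering shape and strides; no data moves.
func (v View) Permute(axes ...int) View {
	out := View{Shape: make([]int, len(axes)), Strides: make([]int, len(axes)), Offset: v.Offset}
	for i, a := range axes {
		out.Shape[i], out.Strides[i] = v.Shape[a], v.Strides[a]
	}
	return out
}

// Flip reverses one axis by negating its stride and shifting the offset.
func (v View) Flip(axis int) View {
	out := View{
		Shape:   append([]int(nil), v.Shape...),
		Strides: append([]int(nil), v.Strides...),
		Offset:  v.Offset + (v.Shape[axis]-1)*v.Strides[axis],
	}
	out.Strides[axis] = -v.Strides[axis]
	return out
}

// Index maps a logical coordinate to a position in the flat buffer.
func (v View) Index(idx ...int) int {
	pos := v.Offset
	for i, x := range idx {
		pos += x * v.Strides[i]
	}
	return pos
}

func main() {
	buf := []float32{0, 1, 2, 3, 4, 5} // logical shape (2, 3)
	v := Contiguous(2, 3)
	t := v.Permute(1, 0) // transpose: shape (3, 2)
	f := v.Flip(1)       // reverse each row
	fmt.Println(buf[t.Index(2, 1)], buf[f.Index(0, 0)]) // prints "5 2"
}
```

Both `t` and `f` read the same `buf`; only the coordinate-to-position arithmetic differs.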

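In the visualizer (and throughout the project) color encodes architecture: teal marks the forward pass, ember marks gradient UOps, and gold marks the seam where kernels fuse across the two.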

Quickstart

anneal ships a single CLI, anneal, which is the fastest way to see it work.

# install the CLI
go install github.com/georgebuilds/anneal/cmd/anneal@latest

# or, from a clone:
git clone https://github.com/georgebuilds/anneal && cd anneal
go build ./cmd/anneal

Then:

anneal doctor               # check your environment can reach a WebGPU device
anneal train mlp            # train the MLP with a live TUI dashboard (also: conv, dynmlp --batch=N)
anneal train nanogpt        # char-level transformer trained end to end on Shakespeare
anneal gpt2 sample "Hello"  # forward GPT-2-small from HuggingFace weights, sample text
anneal graph                # dump the UOp graph for a program
anneal kernels              # show the scheduled, fused kernels and their WGSL
anneal explain add          # explain the rewrite/gradient rules for an op

anneal doctor is the right first command: anneal links the platform WebGPU driver at runtime (zero-CGO), so doctor tells you whether a usable device is present before anything else.

Using anneal as a library

The tensor API will feel familiar if you've used tinygrad or numpy. The key difference is the lazy/realize boundary:

import "github.com/georgebuilds/anneal/tensor"

// ... build a model and a forward pass producing `loss` ...

loss.Backward()   // injects gradient UOps into the same graph (teal → ember)
loss.Realize()    // schedule, fuse across the seam (gold), compile to WGSL, run

For runnable, end-to-end code, including parameter setup, the training loop, optimizer steps, and generation, see examples/: mlp.go, conv.go, dynmlp.go, nanogpt.go (char-level transformer training), and gpt2/ (HF safetensors load + BPE + autoregressive sample). Those are the canonical reference for the current API surface.
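
For intuition about the lazy/realize boundary and the idea that gradients are more nodes in the same graph, here is a toy sketch that does not use anneal's API. Everything in it (`Node`, `Graph`, `Grad`, `Eval`) is hypothetical and invented for illustration. It also swaps in a simpler technique: a recursive symbolic derivative rather than the reverse-mode rewrite pass anneal describes. What it does show is interned, immutable nodes, gradient nodes that reuse forward nodes, and evaluation deferred until a value is requested.

```go
package main

import "fmt"

// Node is an immutable expression node. Toy type, not anneal's UOp.
type Node struct {
	Op   string  // "const", "var", "add", "mul"
	Name string  // variable name when Op == "var"
	Val  float64 // value when Op == "const"
	Src  [2]*Node
}

// Graph interns nodes so structurally equal nodes share one pointer.
type Graph struct{ table map[Node]*Node }

func NewGraph() *Graph { return &Graph{table: map[Node]*Node{}} }

func (g *Graph) intern(n Node) *Node {
	if p, ok := g.table[n]; ok {
		return p
	}
	p := &n
	g.table[n] = p
	return p
}

func (g *Graph) Const(v float64) *Node { return g.intern(Node{Op: "const", Val: v}) }
func (g *Graph) Var(name string) *Node { return g.intern(Node{Op: "var", Name: name}) }
func (g *Graph) Add(a, b *Node) *Node  { return g.intern(Node{Op: "add", Src: [2]*Node{a, b}}) }
func (g *Graph) Mul(a, b *Node) *Node  { return g.intern(Node{Op: "mul", Src: [2]*Node{a, b}}) }

// Grad returns a node for d(out)/d(wrt), built from nodes in the same graph.
func (g *Graph) Grad(out, wrt *Node) *Node {
	switch out.Op {
	case "var":
		if out == wrt { // identity comparison works because nodes are interned
			return g.Const(1)
		}
		return g.Const(0)
	case "add":
		return g.Add(g.Grad(out.Src[0], wrt), g.Grad(out.Src[1], wrt))
	case "mul": // product rule; reuses the forward nodes out.Src[0] and out.Src[1]
		return g.Add(
			g.Mul(g.Grad(out.Src[0], wrt), out.Src[1]),
			g.Mul(out.Src[0], g.Grad(out.Src[1], wrt)))
	}
	return g.Const(0) // derivative of a constant
}

// Eval walks the graph only when a value is actually requested.
func Eval(n *Node, env map[string]float64) float64 {
	switch n.Op {
	case "const":
		return n.Val
	case "var":
		return env[n.Name]
	case "add":
		return Eval(n.Src[0], env) + Eval(n.Src[1], env)
	default: // "mul"
		return Eval(n.Src[0], env) * Eval(n.Src[1], env)
	}
}

func main() {
	g := NewGraph()
	x, w := g.Var("x"), g.Var("w")
	loss := g.Mul(g.Mul(x, w), g.Mul(x, w)) // (x*w)^2; the inner product is one shared node
	dw := g.Grad(loss, w)                   // more nodes in the same graph; nothing evaluated yet
	env := map[string]float64{"x": 3, "w": 2}
	fmt.Println(Eval(loss, env), Eval(dw, env)) // prints "36 36"
}
```

Because forward and gradient nodes live in one graph, a later pass can in principle see both at once, which is the property the README's fusion claim depends on.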

Project layout

uop/         UOp IR: arena, interning, ops enum, dtype
rewrite/     PatternMatcher, graph-rewrite driver, symbolic rules
shape/       View, ShapeTracker, movement ops
schedule/    rangeify, realize-map, bufferize, kernel split
codegen/     UOp tree → linear instrs → WGSL; opt.go (Opt seam, four kernel transforms), beam.go (BEAM autotuning)
backend/     device abstraction; webgpu/ first
tensor/      Tensor API, ops, autodiff (gradient.go), realize
  nn/        Linear, Conv2d, MaxPool2D, Embedding, LayerNorm, CausalSelfAttention,
             MLP, Block, GPT, activations, SGD, Adam, Parameter
cmd/anneal/  the CLI
viz/         the WASM visualizer
examples/    mlp.go, conv.go, dynmlp.go, nanogpt.go, gpt2/
internal/
  assets/    SHA-pinned downloader for Shakespeare corpus and HF GPT-2 weights

The full architecture — the UOp arena and interning model, the rewrite driver, the rangeify indexing model, the 10-pass scheduler, and the design decisions behind them — lives in SPEC.md. Read it before making non-trivial changes.
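
As a rough mental model for what a pattern-driven rewrite pass does, the following sketch applies a few algebraic rules bottom-up until none match. It is an illustration only: `Expr`, `Rule`, and `Rewrite` are hypothetical names, the rules are generic simplifications rather than anneal's symbolic rules, and SPEC.md remains the authority on the real driver.

```go
package main

import "fmt"

// Expr is a tiny expression tree standing in for an IR node (hypothetical).
type Expr struct {
	Op   string // "const", "var", "add", "mul"
	Name string
	Val  int
	Src  []*Expr
}

// Rule returns a replacement and true when it matches, else nil and false.
type Rule func(e *Expr) (*Expr, bool)

func isConst(e *Expr, v int) bool { return e.Op == "const" && e.Val == v }

var rules = []Rule{
	func(e *Expr) (*Expr, bool) { // x + 0 -> x
		if e.Op == "add" && isConst(e.Src[1], 0) {
			return e.Src[0], true
		}
		return nil, false
	},
	func(e *Expr) (*Expr, bool) { // x * 1 -> x
		if e.Op == "mul" && isConst(e.Src[1], 1) {
			return e.Src[0], true
		}
		return nil, false
	},
	func(e *Expr) (*Expr, bool) { // x * 0 -> 0
		if e.Op == "mul" && isConst(e.Src[1], 0) {
			return e.Src[1], true
		}
		return nil, false
	},
}

// Rewrite rebuilds the tree bottom-up, then applies rules at each node
// until none match. The input tree is never mutated.
func Rewrite(e *Expr) *Expr {
	if len(e.Src) > 0 {
		src := make([]*Expr, len(e.Src))
		for i, s := range e.Src {
			src[i] = Rewrite(s)
		}
		e = &Expr{Op: e.Op, Name: e.Name, Val: e.Val, Src: src}
	}
	for changed := true; changed; {
		changed = false
		for _, r := range rules {
			if out, ok := r(e); ok {
				e, changed = out, true
				break
			}
		}
	}
	return e
}

func (e *Expr) String() string {
	switch e.Op {
	case "const":
		return fmt.Sprint(e.Val)
	case "var":
		return e.Name
	}
	return "(" + e.Src[0].String() + " " + e.Op + " " + e.Src[1].String() + ")"
}

func main() {
	x, y := &Expr{Op: "var", Name: "x"}, &Expr{Op: "var", Name: "y"}
	zero, one := &Expr{Op: "const", Val: 0}, &Expr{Op: "const", Val: 1}
	e := &Expr{Op: "add", Src: []*Expr{
		{Op: "mul", Src: []*Expr{x, one}},
		{Op: "mul", Src: []*Expr{y, zero}},
	}}
	fmt.Println(e, "=>", Rewrite(e)) // prints "((x mul 1) add (y mul 0)) => x"
}
```

Each rule here strictly shrinks the expression, so the loop terminates; a real rule set needs its own termination argument.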

Status

The line between shipped capabilities and deferred ones is intentional, not accidental. That line has moved since the project started — dynamic-batch training and JIT have landed — but the harder items remain deliberate non-goals for now.

Capability	Status
Reverse-mode autodiff	✅ Full, via graph rewrite
Backend	✅ WebGPU (native + WASM)
Shapes — static	✅
Shapes — dynamic batch (symbolic)	✅ `NewSymbolicBatchInput` + `RealizeWithBinding`
Symbolic shapes — split/merge a symbolic axis, sym pad/shrink, multi-dim sym dispatch	✅ Shipped
Dynamic seq-length tensor API	⛔ Deferred (the capability is in; the seq-length input constructor is the open work)
JIT	✅ Capture/replay (`tensor.JIT`)
Schedule cache	✅ Memoized on structural key
Devices	Single device
Dtypes	f16 ✅ (with shader-f16); bf16 ✅ storage-only (f32 compute); fp8 ⛔ Deferred
Multi-device	⛔ Deferred
Image dtypes	⛔ Deferred
BEAM autotuning	✅ Env-gated (ANNEAL_BEAM=1 to search); persistent disk cache

The original milestone is met: anneal trains a small MLP and a small conv net end-to-end on GPU, with gradients produced by the rewrite pass and kernels fused across the forward/backward boundary.

Since then, the following have all shipped: dynamic-batch training (dynmlp, symbolic batch dim); general symbolic axis movement (split/merge a symbolic dim, sym pad/shrink, and multi-dim sym dispatch with the symbolic axis in any position on both kernel-output and input buffers); JIT capture/replay; a schedule cache; epilogue fusion (Pass 5 now elides a reduce-output BUFFERIZE into a single downstream elementwise consumer); and BEAM autotuning (env-gated, disk-cached). The remaining deferrals listed above are intentional.

Kernel autotuning: LOCAL applies to multi-dim symbolic kernels. TILE stays unavailable on symbolic axes because WGSL forbids non-const workgroup sizes, a hard platform ceiling. UPCAST and VECTORIZE bail outside matmul shapes for a pre-existing reason unrelated to symbolic shapes. Symbolic kernels still run correctly via the identity codegen path.
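
The effect of fusing an elementwise consumer into a reduce can be pictured with plain CPU loops. This is a conceptual sketch only, written under the assumption that "epilogue fusion" means applying the elementwise op to the reduce accumulator before the single store; the function names are hypothetical and nothing here reflects anneal's scheduler or generated WGSL.

```go
package main

import "fmt"

// rowSumScaleUnfused runs two passes: a reduce that writes an intermediate
// buffer, then an elementwise pass that reads it back.
func rowSumScaleUnfused(x [][]float32, s float32) []float32 {
	tmp := make([]float32, len(x)) // intermediate buffer, materialized
	for i, row := range x {
		for _, v := range row {
			tmp[i] += v
		}
	}
	out := make([]float32, len(x))
	for i, v := range tmp {
		out[i] = v * s
	}
	return out
}

// rowSumScaleFused applies the elementwise op to the accumulator before the
// single store, so the intermediate buffer never exists.
func rowSumScaleFused(x [][]float32, s float32) []float32 {
	out := make([]float32, len(x))
	for i, row := range x {
		var acc float32
		for _, v := range row {
			acc += v
		}
		out[i] = acc * s
	}
	return out
}

func main() {
	x := [][]float32{{1, 2, 3}, {4, 5, 6}}
	fmt.Println(rowSumScaleUnfused(x, 0.5), rowSumScaleFused(x, 0.5)) // prints "[3 7.5] [3 7.5]"
}
```

Both functions return the same values; the fused version does one pass and one allocation fewer.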

Contributing

Contributions are welcome, but anneal has a small set of hard invariants (immutable IR, identity equality via interning, no reflection in the rewrite hot path, no copies from movement ops, no SMT solver in indexing) that keep the design coherent. Please read CONTRIBUTING.md before opening a PR.

Credits

anneal is largely a port of, and owes its architecture to, tinygrad by the tinygrad authors. The reference target is a pinned tinygrad commit (see CONTRIBUTING.md); blog-era LazyBuffer/Linearizer descriptions of tinygrad do not describe this design.

GPU access is via gogpu/wgpu and goffi (zero-CGO).
