A Go-native deep learning framework built on libtorch.
Same GPU kernels as PyTorch. No Python. No GIL. Just Go.
Status: Succeeded by floDl
goDl is no longer actively developed. Go's garbage collector cannot deterministically manage GPU (VRAM) memory, creating fundamental limitations for training workloads. floDl is a complete Rust port where `Drop` provides deterministic resource management. goDl remains available as-is under the MIT license.
Graph Builder • Quick Start • Features • PyTorch → goDl • Tutorials • Architecture
goDl's fluent graph builder lets you describe complex architectures as readable data flow — no boilerplate, no graph construction commands.
```go
model, err := graph.From(nn.MustLinear(2, 16)). // input projection
	Through(nn.NewGELU()).         // activation
	Through(nn.MustLayerNorm(16)). // normalization
	Also(nn.MustLinear(16, 16)).   // residual connection
	Through(nn.MustLinear(16, 2)). // output projection
	Build()
```

That's a trainable model. `Also` adds the residual — input flows through the Linear and gets added to its output. `Build()` returns a `graph.Graph` that implements `nn.Module` — you can nest it inside other graphs.
Things get interesting when architectures get complex:
```go
g, err := graph.From(encoder).Tag("encoded").                // tag for later
	Split(headA, headB, headC).Merge(graph.Mean()).          // multi-head + merge
	Loop(refinementBlock).For(3).Tag("refined").             // iterate 3 times
	Gate(router, expertA, expertB).Using("encoded").         // soft routing with context
	Switch(selector, lightPath, heavyPath).Using("refined"). // hard routing
	Through(graph.StateAdd()).Using("memory").Tag("memory"). // recurrent state
	Loop(decoder).While(haltCondition, 10).                  // adaptive computation
	Through(outputHead).
	Build()
```

Every construct — `Split`/`Merge`, `Also`, `Loop`, `Gate`, `Switch`, `Map`, `Tag`/`Using` — composes cleanly. Sub-graphs nest like any module. Forward references (`Using` before `Tag`) carry state across calls, enabling recurrent architectures without special-casing.
The graph executes nodes at the same topological level in parallel via goroutines — independent branches run concurrently without any extra code.
See the Graph Builder Tutorial and the full showcase that exercises every builder method.
Requirements: Docker (with NVIDIA Container Toolkit for GPU support).
```sh
git clone https://github.com/fab2s/goDl.git
cd goDl
make image    # build dev container (Go + libtorch + CUDA)
make test     # run all 482 tests (CPU + CUDA)
make test-cpu # run without GPU
make doc      # local doc server (pkg.go.dev style)
make shell    # interactive shell in container
```

```go
// Task: learn cumulative sum — [a, b] → [a, a+b]

// Build the model.
model, err := graph.From(nn.MustLinear(2, 16)).
	Through(nn.NewGELU()).
	Through(nn.MustLayerNorm(16)).
	Also(nn.MustLinear(16, 16)).
	Through(nn.MustLinear(16, 2)).
	Build()

// Set up training.
optimizer := nn.NewAdam(model.Parameters(), 0.01)
model.SetTraining(true)

// Training loop.
for loader.Next() {
	input, target := loader.Batch()
	pred := model.Forward(autograd.NewVariable(input, true))
	loss := nn.MSELoss(pred, autograd.NewVariable(target, false))

	optimizer.ZeroGrad()
	loss.Backward()
	nn.ClipGradNorm(model.Parameters(), 1.0)
	optimizer.Step()
}
```

See examples/train/ for the complete runnable version with data generation and evaluation.
| Layer | What it does |
|---|---|
| Tensor | Immutable, chainable API with error propagation. CPU and CUDA. |
| Autograd | Reverse-mode automatic differentiation. Full backward for every op. |
| NN Modules | Linear, Conv2d, ConvTranspose2d, LayerNorm, BatchNorm, Dropout, Embedding, GRUCell, LSTMCell |
| Activations | ReLU, Sigmoid, Tanh, GELU, SiLU, Softmax |
| Losses | MSELoss, CrossEntropyLoss, BCEWithLogitsLoss, L1Loss, SmoothL1Loss, KLDivLoss |
| Optimizers | SGD (with momentum), Adam, AdamW |
| LR Scheduling | StepDecay, Cosine, Warmup (composable), ReduceOnPlateau |
| Mixed Precision | Float16/BFloat16 dtype casting, GradScaler for loss scaling |
| Method | What it does |
|---|---|
| `From(m).Through(m)` | Linear chain |
| `Input(names...)` | Auxiliary graph inputs, accessible via `Using(name)` — multi-input graphs |
| `Split(m...).Merge(op)` | Parallel branches, merged by `Add()`, `Mean()`, or `Cat(dim)` |
| `Also(m)` | Residual connection: input + m(input) |
| `Tag(name)` / `Using(refs...)` | Named references — backward (same pass) or forward (across calls) |
| `Loop(body).For(n)` | Fixed iteration with BPTT |
| `Loop(body).While(cond, max)` | Condition before body (0..max iterations) |
| `Loop(body).Until(cond, max)` | Condition after body (1..max iterations) |
| `Gate(router, experts...)` | Soft routing — all experts execute, weighted combination |
| `Switch(selector, branches...)` | Hard routing — only selected branch executes |
| `Map(body).Each()` | Apply body to each element along dim 0 |
| `Map(body).Over(tag)` | Iterate over a tagged tensor |
| `Map(body).Slices(n)` | Decompose last dim into n slices, map, recompose |
| `.Batched()` | Fast path for Map — full batch in one call |
| `TagGroup(name)` | Name parallel branches: `Split(...).TagGroup("head")` → `head_0`, `head_1`, ... |
| `g.ForwardCtx(ctx, inputs...)` | Context-aware execution — timeouts, cancellation for loops and maps |
| Tool | What it does |
|---|---|
| `nn.ClipGradNorm` | L2 norm gradient clipping |
| `nn.ClipGradValue` | Element-wise gradient clamping |
| `g.Freeze(tags...)` / `g.Unfreeze(tags...)` | Freeze parameters by tag name |
| `nn.SaveParameters` / `nn.LoadParameters` | Binary checkpoint format (file path or io.Writer) |
| `KaimingUniform`/`Normal`, `XavierUniform`/`Normal` | Weight initialization |
| `data.Loader` | Batched data loading with parallel prefetch and shuffle |
| LR schedulers | StepDecay, Cosine, Warmup, ReduceOnPlateau (composable) |
| `nn.GradScaler` | Dynamic loss scaling for mixed precision (float16) training |
| `nn.CastParameters` | Cast model parameters to any dtype (Float16, BFloat16, etc.) |
Beyond Module and TrainToggler, modules can implement optional
interfaces that the graph recognizes automatically:
| Interface | Method | What happens |
|---|---|---|
| `Resettable` | `Reset(batchSize int64, device tensor.Device)` | Graph auto-calls before each Forward — modules with per-forward state (attention location, counter, accumulator) reset cleanly on the correct device |
| `Traced` | `Trace() *Variable` | Loop executor collects return value before first iteration and after each step — `g.Traces(tag)` returns the full trajectory |
| `NamedInputModule` | `ForwardNamed(stream, refs)` | Loop and node `Using` refs arrive as a named map instead of positional args |
| `RefValidator` | `RefNames() []string` | Build-time validation that exactly the expected `Using` refs are wired |
| `SubModuler` | `SubModules() []Module` | Declares child modules — framework walks the tree for device placement, training mode, state detachment, and reset |
| `DeviceMover` | `MoveToDevice(device)` | Moves non-parameter tensors (running stats, buffers) when `SetDevice` is called |
| `Detachable` | `Detach()` | Breaks gradient chains on retained state — called recursively by `DetachState` |
These compose: a loop body that implements Resettable + Traced +
NamedInputModule gets auto-reset, per-iteration trace collection, and
named ref forwarding — all handled by the graph, no manual wiring.
For composite user modules, implementing SubModuler is all it takes —
the framework handles parameter collection, device moves, training mode,
and state detachment recursively.
Tags double as observation points — collect metrics during training, flush
to epoch history, and query trends to drive training decisions. Record
injects external metrics (losses, hit rates) into the same pipeline:
```go
for epoch := range epochs {
	for _, batch := range loader {
		pred := g.Forward(batch.Input)
		g.Collect("hidden") // from graph tag
		loss := nn.CrossEntropyLoss(pred, target)
		g.Record("loss", loss.Item())                      // external metric
		g.Record("hit_rate", computeHitRate(pred, target)) // any float64
	}
	g.Flush() // promotes batch means → epoch history (Collected + Recorded)

	if g.Trend("loss").Stalled(5, 1e-4) {
		scheduler.Decay()
	}
	if g.Trend("loss").Improving(3) {
		g.Unfreeze("decoder")
	}
}
```

| Method | What it does |
|---|---|
| `g.Tagged(tag)` | Access a tagged node's output after Forward |
| `g.Traces(tag)` | Access per-iteration side outputs from `Traced` loop bodies |
| `g.Log(tags...)` | Print current tagged values (hookable via `OnLog`) |
| `g.Collect(tags...)` / `g.Flush(tags...)` | Batch → epoch metric collection (from graph tags) |
| `g.Record(tag, values...)` | Inject external metrics into the same Collect/Flush pipeline |
| `g.Trend(tag)` | Epoch-level trend: Slope, Stalled, Improving, Converged |
| `g.Trends(tags...)` | Group trends: AllImproving, AnyStalled, MeanSlope (expands TagGroups) |
| `g.Sub(tag)` | Reach into a sub-graph's metrics — no extra Forward needed |
| `g.ETA(totalEpochs)` | Estimated remaining wall-clock time from flush cadence |
| `g.FlushCount()` / `g.Elapsed()` | How many flushes and total wall time since first |
| `g.WriteLog(path, total, tags...)` | Human-readable text log with per-epoch metrics and ETA |
| `trend.Latest()` | Most recent epoch value (convenience for `Values()[len-1]`) |
TagGroup names parallel branches from Split with auto-suffixed tags.
Trends and TimingTrends expand groups for aggregate queries:
```go
g, _ := graph.From(encoder).
	Split(headA, headB, headC).TagGroup("head"). // head_0, head_1, head_2
	Merge(graph.Mean()).
	Build()

// Training loop with group observation.
for epoch := range epochs {
	for _, batch := range loader {
		g.Forward(batch.Input)
		g.Collect("head_0", "head_1", "head_2")
	}
	g.Flush()

	if g.Trends("head").AllImproving(5) {
		fmt.Println("all heads improving")
	}
	if g.Trends("head").AnyStalled(5, 1e-4) {
		fmt.Println("at least one head stalled")
	}
	fmt.Printf("mean slope: %.4f\n", g.Trends("head").MeanSlope(5))
}
```

`ForwardCtx` threads Go's `context.Context` through the graph. Loops and maps check for cancellation between iterations — enabling wall-clock timeouts that Python cannot express:
```go
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
result := g.ForwardCtx(ctx, input) // loops abort if time runs out
```

This is a Go-native advantage. Python's GIL prevents cooperative cancellation inside a forward pass. torch.compile breaks on dynamic control flow. In goDl, context propagation is zero-cost when unused (context.Background() checks compile to a nil return) and naturally composes with loops, maps, and parallel branches.
Combined with the observation layer, this enables patterns like trend-driven loop halts, adaptive computation time with hard deadlines, and graceful training interruption — all impossible or clunky in Python.
```go
fmt.Println(g.DOT())         // Graphviz DOT with parameter counts
svg, _ := g.SVG("model.svg") // render to SVG

// Timing-annotated: nodes colored green→yellow→red by execution time.
g.EnableProfiling()
g.Forward(input)
g.SVGWithProfile("profile.svg")

// Training curves as self-contained HTML (open in any browser).
g.PlotHTML("training.html", "loss", "head") // expands TagGroups
g.ExportTrends("metrics.csv", "loss")       // CSV for external tools
g.WriteLog("training.log", 100, "loss")     // text log with ETA
```

Node shapes indicate type (input, output, loop, map, switch, activation, normalization). Parameter counts appear on each node. Forward-ref state loops are shown as dotted edges. Profiled graphs show per-node durations and parallelism efficiency per level.
Every differentiable path is verified against finite-difference gradients:
- 40 autograd op-level checks (every op + compositions)
- 10 module-level checks (every NN module, input + parameter gradients)
- 11 exact optimizer step verifications (SGD, Adam, AdamW)
- 482 tests total, all passing with race detector
Python adds ~3–5 µs of framework overhead to every GPU operation (interpreter, GIL, argument parsing, dispatch chain). For large operations like a 1024×1024 matmul, this is noise. For architectures built on many small sequential operations — recurrent steps, iterative refinement, multi-head attention with independent heads — this overhead dominates. The GPU starves between kernel launches.
Python's Global Interpreter Lock prevents parallel kernel dispatch. Independent model branches must dispatch kernels sequentially from a single thread, even when the GPU has dozens of idle Streaming Multiprocessors.
torch.compile partially addresses this by tracing and fusing operations, but
it breaks on data-dependent control flow and requires recompilation when loop
counts or branch structure change — exactly the dynamic architectures that
need help most.
Not C++ because writing and iterating on model architectures in C++ is slow and error-prone. Go provides compiled-language performance with a much shorter feedback loop: fast compilation, simple tooling, readable code.
Not Rust because Rust's main advantage — the borrow checker — cannot reason
about tensor memory in libtorch's C allocator. Meanwhile Go has concrete
advantages: goroutines are simpler than async for parallel dispatch, compilation
is seconds not minutes, the code reads close to pseudocode, and the tooling
(go test, go build, go vet) just works.
A framework nobody uses solves nothing. Go hits the right trade-off between performance and accessibility for this domain.
See docs/design/cuda-dispatch.md for the full dispatch overhead analysis.
+-----------------------------------------------------------+
| User Code / Model Definitions |
+-----------------------------------------------------------+
| graph/ Fluent builder, parallel execution, DOT/SVG |
+-----------------------------------------------------------+
| nn/ Modules, losses, optimizers, checkpoints |
+-----------------------------------------------------------+
| autograd/ Reverse-mode AD, gradient tracking |
+-----------------------------------------------------------+
| tensor/ Immutable chainable API, CPU + CUDA |
+-----------------------------------------------------------+
| internal/libtorch/ CGo bindings to libtorch C++ |
+-----------------------------------------------------------+
| libtorch / CUDA / ROCm / MPS / CPU |
+-----------------------------------------------------------+
The same GPU kernels that power PyTorch run the actual math. goDl replaces everything above them: the dispatch path, autograd tracking, operator composition, and execution scheduling.
Since goDl binds libtorch — not CUDA directly — it inherits libtorch's backend support: NVIDIA (CUDA), AMD (ROCm), Intel (XPU), Apple Silicon (MPS), and CPU. Switching hardware is a build flag, not a code change.
Coming from PyTorch? The PyTorch → goDl Migration Guide has side-by-side examples for every common operation — tensors, autograd, modules, losses, optimizers, training loops, and the graph builder.
Step-by-step guides from basics to advanced, each with runnable examples:
- Tensors — creation, ops, chaining, error handling
- Autograd — variables, gradients, backward pass
- Modules — Linear, Conv2d, normalization, RNN cells
- Training — losses, optimizers, data loading, full loop
- Graph Builder — the fluent API from simple to complex
- Advanced Graphs — forward refs, loops, gates, switches
- Visualization — DOT/SVG output, reading diagrams
- Utilities — checkpoints, clipping, freezing, initialization
- Roadmap — phased development plan
- CUDA Dispatch Analysis — overhead breakdown and performance thesis
- Trajectory Thesis — geometric intuition behind the project
- `examples/train/` — complete training loop (data loading, loss, backward, optimizer)
- `examples/showcase/` — every graph builder method in one graph
goDl is open-sourced software licensed under the MIT license.
