A lightweight, scalar-valued automatic differentiation engine and neural network library implemented from scratch in Modern C++ (C++17). Inspired by Andrej Karpathy's micrograd, built for clarity and systems correctness.
Most ML engineers use PyTorch or TensorFlow without understanding what happens under the hood. This project implements the core machinery from scratch:
- A `Value` node that tracks data, gradients, and graph edges
- Automatic construction of a Directed Acyclic Graph (DAG) during the forward pass
- A topological sort + chain rule backward pass for exact gradient computation
- A full MLP built on top of the engine
No external libraries. Pure C++ STL.
Value (node)
├── data → the scalar value
├── grad → accumulated gradient
├── _backward → std::function for local gradient computation
└── _prev → shared_ptr edges to parent nodes
Neuron → Layer → MLP
Memory model: Every node is heap-allocated via std::shared_ptr. The graph owns its nodes through _prev edges — no manual memory management, no leaks. std::enable_shared_from_this ensures safe self-referencing during the backward pass.
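A minimal sketch of what that node layout can look like in code. Field names follow the diagram above; the actual definitions in value.h may differ in detail:

```cpp
// Sketch only: node layout with shared ownership along _prev edges.
#include <functional>
#include <memory>
#include <vector>

struct Value : std::enable_shared_from_this<Value> {
    double data = 0.0;                          // the scalar value
    double grad = 0.0;                          // accumulated gradient
    std::function<void()> _backward = [] {};    // local gradient rule, default no-op
    std::vector<std::shared_ptr<Value>> _prev;  // owning edges to parent nodes

    explicit Value(double d) : data(d) {}
};

using ValuePtr = std::shared_ptr<Value>;

// Every node is heap-allocated behind a shared_ptr, so a parent stays alive
// for as long as any node downstream still points at it.
inline ValuePtr make_value(double d) { return std::make_shared<Value>(d); }
```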
Backward pass: Topological sort via DFS, then reverse traversal applying each node's _backward lambda. Raw pointers used internally for traversal (no ownership) — shared_ptr overhead only where lifetime management is needed.
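Roughly, that traversal looks like the following. This continues the sketch above; `backward` is written here as a free function purely for illustration, the real engine may expose it as a member of `Value`:

```cpp
// Sketch only: DFS topological sort over raw Value* (observation, not
// ownership), then a reverse sweep that applies each node's _backward rule.
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

inline void backward(const std::shared_ptr<Value>& root) {
    std::vector<Value*> topo;
    std::unordered_set<Value*> visited;

    std::function<void(Value*)> dfs = [&](Value* v) {
        if (!visited.insert(v).second) return;      // already visited
        for (auto& p : v->_prev) dfs(p.get());      // visit parents first
        topo.push_back(v);                          // node comes after its parents
    };
    dfs(root.get());

    root->grad = 1.0;                               // d(root)/d(root) = 1
    for (auto it = topo.rbegin(); it != topo.rend(); ++it)
        (*it)->_backward();                         // chain rule, output -> inputs
}
```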
| Operation | Backward Rule |
|---|---|
| `a + b` | ∂/∂a = 1, ∂/∂b = 1 |
| `a * b` | ∂/∂a = b, ∂/∂b = a |
| `tanh(a)` | ∂/∂a = 1 - tanh²(a) |
| `relu(a)` | ∂/∂a = 1 if a > 0, else 0 |
| `a + 2.0` | scalar overload |
| `3.0 * a` | scalar overload |
XOR: a non-linear problem that cannot be solved by a single linear layer.
epoch 0 | loss: 5.57
epoch 500 | loss: 0.046
epoch 900 | loss: 0.019
[0, 0] pred: -0.95 target: -1 ✅
[0, 1] pred: 0.93 target: 1 ✅
[1, 0] pred: 0.94 target: 1 ✅
[1, 1] pred: -0.93 target: -1 ✅
Two interleaved spirals — visualized as ASCII decision boundary in terminal.
............................................................
....................++++++++++++++++++++....................
.................+++++++++++++++++++++++....................
..............++++++++++++++++++++++++......................
.........++++++++++++++++++++++++++.........................
++++++++++++++++++++++++++++++++............................
Requirements: g++ with C++17 support
git clone https://github.com/fourlhs/micrograd-cpp
cd micrograd-cpp
g++ -std=c++17 -Wall -I include src/main.cpp -o micrograd
./micrograd

With CMake:
mkdir build && cd build
cmake ..
make
./micrograd

micrograd-cpp/
├── include/
│ ├── value.h ← Value node, operators, backward
│ ├── nn.h ← Neuron, Layer, MLP, zero_grad
│ └── spiral.h ← dataset generation, ASCII visualization
├── src/
│ └── main.cpp ← XOR and Spiral demos
├── CMakeLists.txt
└── README.md
Why shared_ptr for graph nodes?
A single Value can be a parent to multiple nodes (e.g. a * a). Shared ownership prevents premature deallocation while the graph is alive.
Why += for gradients?
If a node appears multiple times in the graph, its gradient contributions must be accumulated — not overwritten.
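A concrete case, using the sketches above: in y = a * a the same node is a parent twice, so both uses must add into a->grad (this is also the shared-ownership situation from the previous question):

```cpp
// Sketch only: one node feeding the same operation twice. With +=, the two
// local contributions add up to the correct dy/da = 2a; plain assignment
// would silently drop one of them.
#include <iostream>

int main() {
    auto a = make_value(3.0);
    auto y = a * a;                 // `a` appears twice in y's _prev
    backward(y);
    std::cout << a->grad << "\n";   // prints 6 (= 2 * a), both contributions accumulated
    return 0;
}
```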
Why topological sort before backward?
Gradients must flow from the output back to the inputs in dependency order. The topological sort guarantees that every node has received its full gradient before it propagates that gradient along its _prev edges.
Why raw pointers inside backward()?
The topological sort needs to visit nodes without affecting their lifetime. Raw pointers express "I observe this, I don't own it" — no unnecessary reference count overhead.
- [micrograd](https://github.com/karpathy/micrograd) by Andrej Karpathy