Deep learning from scratch. No TensorFlow, no PyTorch. Just NumPy and math.
I got tired of treating neural networks like black boxes, so I implemented everything from the ground up. If you want to actually understand what's happening inside these models instead of just calling .fit(), you're in the right place.
single-layer-perceptron.py → Your first neural network
mlp.py → Going deeper with multiple layers
vanilla-rnn.py → Sequential data, take one
lstm.py → Sequential data, but it actually works
seq2seq.py → Encoder-decoder architecture
attention.py → The thing that made transformers possible
nested-attention.py → A new perspective from a NeurIPS 2025 paper
One layer. Forward pass, backward pass, weight updates. Solves XOR (barely). If you're starting from zero, start here.
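For reference, here's roughly what that forward/backward/update loop looks like for a single sigmoid unit (a minimal sketch trained on AND for illustration, not the repo's exact code):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [0], [0], [1]], dtype=float)  # AND gate (linearly separable)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 1))
b = np.zeros((1, 1))
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass
    a = sigmoid(X @ W + b)
    # Backward pass: chain rule through the MSE loss and the sigmoid
    dz = (a - y) * a * (1 - a)
    dW = X.T @ dz / len(X)
    db = dz.mean(axis=0, keepdims=True)
    # Weight update (vanilla gradient descent)
    W -= lr * dW
    b -= lr * db

print(np.round(sigmoid(X @ W + b), 2))  # should approach [0, 0, 0, 1]
```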
Multiple hidden layers. Different activation functions. Proper weight initialization. Actually learns complex patterns.
What you get:
- Configurable depth (add as many layers as you want)
- ReLU, Sigmoid, Tanh activations
- He/Xavier initialization
- Works on XOR, logic gates, whatever
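If you're wondering what He/Xavier initialization boils down to, it's just picking the right scale for the random weights. A small sketch (the `init_layer` helper is made up for illustration, not the repo's API):

```python
import numpy as np

def init_layer(fan_in, fan_out, activation="relu", seed=0):
    """He init for ReLU layers, Xavier/Glorot (simple form) for tanh/sigmoid."""
    rng = np.random.default_rng(seed)
    if activation == "relu":
        std = np.sqrt(2.0 / fan_in)   # He
    else:
        std = np.sqrt(1.0 / fan_in)   # Xavier
    return rng.normal(scale=std, size=(fan_in, fan_out)), np.zeros(fan_out)

W1, b1 = init_layer(2, 8, "relu")      # hidden layer
W2, b2 = init_layer(8, 1, "sigmoid")   # output layer
```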
Your introduction to sequences. Has memory, processes one step at a time, backpropagates through time.
Problems it solves:
- Predict next value in sequence
- Remember binary patterns
- Learn character-level patterns
Warning: Gradients explode. I added clipping. You're welcome.
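The clipping is just a global-norm rescale. Something like this sketch (not the repo's exact code):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Global-norm clipping: if the combined L2 norm of all gradients
    exceeds max_norm, shrink every gradient by the same factor."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / (total_norm + 1e-8)) for g in grads]
    return grads
```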
RNNs but they can actually remember things long-term. Four gates doing their magic.
The gates:
- Forget gate: What to throw away
- Input gate: What to remember
- Cell gate: New information
- Output gate: What to output
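In code, one LSTM step with those four gates looks roughly like this (a sketch using the standard formulation; the variable names are mine, not necessarily the repo's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b each hold parameters for the f, i, g, o gates
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: what to throw away
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate: what to remember
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # cell candidate: new information
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate: what to expose
    c = f * c_prev + i * g                               # updated cell state
    h = o * np.tanh(c)                                   # updated hidden state
    return h, c

# Toy usage: input size 3, hidden size 2
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(2, 3)) for g in "figo"}
U = {g: rng.normal(size=(2, 2)) for g in "figo"}
b = {g: np.zeros(2) for g in "figo"}
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, U, b)
```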
Why it's better:
- Remembers stuff across 15+ timesteps
- Far less prone to vanishing gradients than a vanilla RNN
- Can reverse sequences, add numbers from different positions
Test cases:
# Remembers first value after 14 noise steps
Input: [0.75, noise, noise, ..., noise]
Output: 0.75 # It actually remembers!
# Sequence reversal
Input: [1, 2, 3, 4]
Output: [4, 3, 2, 1] # Perfect

Two LSTMs talking to each other. Encoder reads, decoder writes.
How it works:
- Encoder LSTM processes input → creates "thought vector"
- Thought vector = compressed understanding of input
- Decoder LSTM generates output from thought vector
Uses teacher forcing during training: the decoder is fed the correct previous output instead of its own prediction, which makes training faster and more stable.
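Teacher forcing as a small sketch (the `decoder_step` callable and squared-error loss are illustrative assumptions, not the repo's API):

```python
import numpy as np

def run_decoder(target, decoder_step, start_token, hidden, teacher_forcing=True):
    """Unroll a decoder over `target`, summing a squared-error loss."""
    loss, decoder_input = 0.0, start_token
    for t in range(len(target)):
        prediction, hidden = decoder_step(decoder_input, hidden)
        loss += float((prediction - target[t]) ** 2)
        # Teacher forcing: feed the ground-truth value, not the model's own prediction
        decoder_input = target[t] if teacher_forcing else prediction
    return loss

# Toy usage: a fake "decoder" that just mixes its input with its hidden state
toy_step = lambda x, h: (0.5 * x + h, 0.9 * h + 0.1 * x)
print(run_decoder(np.array([1.0, 2.0, 3.0]), toy_step, start_token=0.0, hidden=0.0))
```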
Examples:
- Reverse sequences: [1,2,3] → [3,2,1]
- Translate: [1,2] → ['A','B'] (numbers to letters)
The bottleneck problem: Everything must pass through one vector. That's where attention comes in.
Solves seq2seq's bottleneck. Decoder can now "look at" all encoder states, not just the last one.
The idea:
- Compute attention weights: how much to focus on each input position
- Create context vector: weighted sum of encoder states
- Decoder uses context at each step
What you see:
Attention Visualization:
1 2 3 4
4 | █ █ ██ ████ ← Focuses on position 4
3 | █ █ ██ ████ ← Also focuses on 4
2 | █ █ ██ ████
1 | █ █ ██ ████
For reversal, attention should go right-to-left. For copying, diagonal. You can actually see what the model is thinking.
Implemented: Bahdanau (additive) attention. The classic one from 2015.
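Additive attention boils down to a score, a softmax, and a weighted sum. A standalone sketch (shapes and names are assumptions, not the repo's exact code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(decoder_h, encoder_states, W_dec, W_enc, v):
    # score_i = v^T tanh(W_dec @ h_dec + W_enc @ h_enc_i)
    scores = np.array([
        v @ np.tanh(W_dec @ decoder_h + W_enc @ h_enc)
        for h_enc in encoder_states
    ])
    weights = softmax(scores)                                     # focus per position
    context = np.sum(weights[:, None] * encoder_states, axis=0)   # weighted sum
    return context, weights

# Toy usage: 4 encoder states of size 3
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 3))
ctx, w = bahdanau_attention(rng.normal(size=3), enc,
                            rng.normal(size=(3, 3)), rng.normal(size=(3, 3)),
                            rng.normal(size=3))
print(np.round(w, 3))  # attention weights over the 4 input positions
```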
This one's different. Based on a NeurIPS 2025 paper that reframes everything as nested optimization problems.
The insight: Linear attention's memory update M_t = M_{t-1} + v_t*k_t^T is actually gradient descent on an optimization problem.
Two levels:
- Level 1 (slow): Projection weights W_k, W_v, W_q → trained on dataset
- Level 2 (fast): Memory matrix M_t → updated every timestep
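Here's that reading of the memory update as a gradient step, in a few lines (a sketch of the idea; the exact inner loss used in the paper may differ):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
M = np.zeros((d, d))                      # fast memory (level 2)
k, v = rng.normal(size=d), rng.normal(size=d)

# Inner loss the memory is implicitly minimizing at this step: L(M) = -v^T M k
grad = -np.outer(v, k)                    # dL/dM
M_gradient_step = M - 1.0 * grad          # one gradient-descent step with lr = 1

# The familiar linear-attention update: M_t = M_{t-1} + v_t k_t^T
M_linear_attention = M + np.outer(v, k)

print(np.allclose(M_gradient_step, M_linear_attention))  # True: same update
```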
Continuum Memory System:
- Fast memory: updates every step
- Medium memory: updates every 2-4 steps
- Slow memory: updates every 8+ steps
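A multi-timescale setup can be as simple as giving each memory its own update period. A sketch (the periods 1/4/8 are illustrative picks from the ranges above, not the paper's exact schedule):

```python
import numpy as np

d = 4
memories = {"fast": np.zeros((d, d)), "medium": np.zeros((d, d)), "slow": np.zeros((d, d))}
periods  = {"fast": 1, "medium": 4, "slow": 8}

rng = np.random.default_rng(0)
for t in range(1, 17):
    k, v = rng.normal(size=d), rng.normal(size=d)
    for name, M in memories.items():
        if t % periods[name] == 0:        # each level updates on its own clock
            memories[name] = M + np.outer(v, k)
```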
Why this matters:
- Shows what's actually happening inside attention
- Suggests new architectures with more levels
- Matches how the brain works (multi-timescale processing)
numpy

That's it. Pure Python + NumPy.
python mlp.py
python lstm.py
python attention.py
# etc.

Each file is standalone. No imports between files. Run whatever you want.
If you're learning:
- Start with single-layer-perceptron.py - understand forward/backward pass
- Then mlp.py - see how depth helps
- Then vanilla-rnn.py - sequences and BPTT
- Then lstm.py - gates and long-term memory
- Then seq2seq.py - encoder-decoder
- Then attention.py - see where transformers came from
- Finally nested-attention.py - new theoretical perspective
Fundamentals:
- Forward pass (easy)
- Backward pass (the hard part)
- Gradient descent
- Why certain activations work
- Weight initialization matters
RNN stuff:
- Hidden states
- Backpropagation through time
- Why gradients explode/vanish
- How gradient clipping saves you
Advanced:
- Why LSTM gates work
- Teacher forcing
- Attention mechanisms
- Multi-timescale learning
- What "memory" actually means
Because reading papers is one thing. Implementing from scratch is another. When you have to compute every gradient by hand, you actually understand what's happening.
Also, most "from scratch" implementations cheat and use autograd. This doesn't. Every gradient is derived and implemented manually.
Gradient checking: Not included, but you should add it if you modify anything. Finite differences are your friend.
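If you do add it, a central-difference check is only a few lines (a sketch, not part of the repo):

```python
import numpy as np

def grad_check(f, analytic_grad, W, eps=1e-5, tol=1e-6):
    """Compare analytic dL/dW against central finite differences, entry by entry."""
    num_grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        f_plus = f(W)
        W[idx] = orig - eps
        f_minus = f(W)
        W[idx] = orig
        num_grad[idx] = (f_plus - f_minus) / (2 * eps)
    rel_err = np.abs(num_grad - analytic_grad) / (np.abs(num_grad) + np.abs(analytic_grad) + 1e-12)
    return rel_err.max() < tol, rel_err.max()

# Toy usage: L(W) = sum(W**2), so dL/dW = 2W
W = np.random.default_rng(0).normal(size=(3, 3))
print(grad_check(lambda W: np.sum(W ** 2), 2 * W, W.copy()))
```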
Batch size: Mostly 1 for simplicity. You can extend to mini-batches.
Optimizers: Mostly vanilla SGD. I added momentum and Adam variants in nested learning because that's what the paper does.
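For reference, the momentum update is tiny (a sketch; Adam layers per-parameter adaptive scaling on top of the same idea):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity - lr * grad   # decaying running average of past gradients
    return w + velocity, velocity

# Toy usage: minimize sum(w**2), whose gradient is 2*w
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = sgd_momentum_step(w, 2 * w, v)
print(np.round(w, 4))  # close to [0, 0]
```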
Performance: This is educational code. It's slow. Don't train GPT on this.
- LSTM: Hochreiter & Schmidhuber, 1997
- Seq2Seq: Sutskever et al., 2014
- Attention: Bahdanau et al., 2015
- Nested Learning: Behrouz et al., NeurIPS 2025
MLP on XOR:
Epoch 5000, Loss: 0.0002
Input: [0,0], Predicted: 0.0001, True: 0
Input: [0,1], Predicted: 0.9998, True: 1
Input: [1,0], Predicted: 0.9997, True: 1
Input: [1,1], Predicted: 0.0003, True: 0
Accuracy: 100%
LSTM remembering:
First value (to remember): 0.75
[14 timesteps of noise]
Final prediction: 0.7498
Error: 0.0002
Attention on reversal:
Input: [1, 2, 3, 4]
Expected: [4, 3, 2, 1]
Predicted: [4, 3, 2, 1] ✓
Attention shows model looking at rightmost positions first
- Attention model learns the task but doesn't always show perfect diagonal/monotonic alignment
- That's normal - the model finds the easiest solution (using hidden state + attention)
- For strict attention alignment you'd need architectural constraints
Found a bug? Better way to explain something? PR it.
Want to add GRU, Transformer, whatever? Go for it.
Apache License.
If you're using this to learn, actually run the code. Change hyperparameters. Break things. See what happens when you remove gradient clipping. Watch it diverge. That's how you learn.
Don't just read the code. Type it out yourself. Seriously.
No frameworks were harmed in the making of this repository.