Deep learning from scratch. No TensorFlow, no PyTorch. Just NumPy and math.
I got tired of treating neural networks like black boxes, so I implemented everything from the ground up. If you want to actually understand what's happening inside these models instead of just calling .fit(), you're in the right place.
single-layer-perceptron.py → Your first neural network
mlp.py → Going deeper with multiple layers
vanilla-rnn.py → Sequential data, take one
lstm.py → Sequential data, but it actually works
seq2seq.py → Encoder-decoder architecture
attention.py → The thing that made transformers possible
nested-attention.py → A new perspective from a NeurIPS 2025 paper
One layer. Forward pass, backward pass, weight updates. Solves XOR (barely). If you're starting from zero, start here.
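For reference, here's roughly what that forward/backward/update loop looks like for a single sigmoid unit (a minimal sketch trained on AND for illustration, not the repo's exact code):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [0], [0], [1]], dtype=float)  # AND gate (linearly separable)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 1))
b = np.zeros((1, 1))
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass
    a = sigmoid(X @ W + b)
    # Backward pass: chain rule through the MSE loss and the sigmoid
    dz = (a - y) * a * (1 - a)
    dW = X.T @ dz / len(X)
    db = dz.mean(axis=0, keepdims=True)
    # Weight update (vanilla gradient descent)
    W -= lr * dW
    b -= lr * db

print(np.round(sigmoid(X @ W + b), 2))  # should approach [0, 0, 0, 1]
```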
Multiple hidden layers. Different activation functions. Proper weight initialization. Actually learns complex patterns.
What you get:
- Configurable depth (add as many layers as you want)
- ReLU, Sigmoid, Tanh activations
- He/Xavier initialization
- Works on XOR, logic gates, whatever
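If you're wondering what He/Xavier initialization boils down to, it's just picking the right scale for the random weights. A small sketch (the `init_layer` helper is made up for illustration, not the repo's API):

```python
import numpy as np

def init_layer(fan_in, fan_out, activation="relu", seed=0):
    """He init for ReLU layers, Xavier/Glorot (simple form) for tanh/sigmoid."""
    rng = np.random.default_rng(seed)
    if activation == "relu":
        std = np.sqrt(2.0 / fan_in)   # He
    else:
        std = np.sqrt(1.0 / fan_in)   # Xavier
    return rng.normal(scale=std, size=(fan_in, fan_out)), np.zeros(fan_out)

W1, b1 = init_layer(2, 8, "relu")      # hidden layer
W2, b2 = init_layer(8, 1, "sigmoid")   # output layer
```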
Your introduction to sequences. Has memory, processes one step at a time, backpropagates through time.
Problems it solves:
- Predict next value in sequence
- Remember binary patterns
- Learn character-level patterns
Warning: Gradients explode. I added clipping. You're welcome.
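The clipping is just a global-norm rescale. Something like this sketch (not the repo's exact code):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Global-norm clipping: if the combined L2 norm of all gradients
    exceeds max_norm, shrink every gradient by the same factor."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / (total_norm + 1e-8)) for g in grads]
    return grads
```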
RNNs but they can actually remember things long-term. Four gates doing their magic.
The gates:
- Forget gate: What to throw away
- Input gate: What to remember
- Cell gate: New information
- Output gate: What to output
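In code, one LSTM step with those four gates looks roughly like this (a sketch using the standard formulation; the variable names are mine, not necessarily the repo's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b each hold parameters for the f, i, g, o gates
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: what to throw away
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate: what to remember
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # cell candidate: new information
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate: what to expose
    c = f * c_prev + i * g                               # updated cell state
    h = o * np.tanh(c)                                   # updated hidden state
    return h, c

# Toy usage: input size 3, hidden size 2
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(2, 3)) for g in "figo"}
U = {g: rng.normal(size=(2, 2)) for g in "figo"}
b = {g: np.zeros(2) for g in "figo"}
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, U, b)
```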
Why it's better:
- Remembers stuff across 15+ timesteps
- Far less prone to vanishing gradients than a vanilla RNN
- Can reverse sequences, add numbers from different positions
Test cases:
# Remembers first value after 14 noise steps
Input: [0.75, noise, noise, ..., noise]
Output: 0.75 # It actually remembers!
# Sequence reversal
Input: [1, 2, 3, 4]
Output: [4, 3, 2, 1] # Perfect

Two LSTMs talking to each other. Encoder reads, decoder writes.
How it works:
- Encoder LSTM processes input → creates "thought vector"
- Thought vector = compressed understanding of input
- Decoder LSTM generates output from thought vector
Uses teacher forcing during training: the decoder is fed the correct previous output instead of its own prediction, which makes training faster and more stable.
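Teacher forcing as a small sketch (the `decoder_step` callable and squared-error loss are illustrative assumptions, not the repo's API):

```python
import numpy as np

def run_decoder(target, decoder_step, start_token, hidden, teacher_forcing=True):
    """Unroll a decoder over `target`, summing a squared-error loss."""
    loss, decoder_input = 0.0, start_token
    for t in range(len(target)):
        prediction, hidden = decoder_step(decoder_input, hidden)
        loss += float((prediction - target[t]) ** 2)
        # Teacher forcing: feed the ground-truth value, not the model's own prediction
        decoder_input = target[t] if teacher_forcing else prediction
    return loss

# Toy usage: a fake "decoder" that just mixes its input with its hidden state
toy_step = lambda x, h: (0.5 * x + h, 0.9 * h + 0.1 * x)
print(run_decoder(np.array([1.0, 2.0, 3.0]), toy_step, start_token=0.0, hidden=0.0))
```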
Examples:
- Reverse sequences: [1,2,3] → [3,2,1]
- Translate: [1,2] → ['A','B'] (numbers to letters)
The bottleneck problem: Everything must pass through one vector. That's where attention comes in.
Solves seq2seq's bottleneck. Decoder can now "look at" all encoder states, not just the last one.
The idea:
- Compute attention weights: how much to focus on each input position
- Create context vector: weighted sum of encoder states
- Decoder uses context at each step
What you see:
Attention Visualization:
1 2 3 4
4 | █ █ ██ ████ ← Focuses on position 4
3 | █ █ ██ ████ ← Also focuses on 4
2 | █ █ ██ ████
1 | █ █ ██ ████
For reversal, attention should go right-to-left. For copying, diagonal. You can actually see what the model is thinking.
Implemented: Bahdanau (additive) attention. The classic one from 2015.
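Additive attention boils down to a score, a softmax, and a weighted sum. A standalone sketch (shapes and names are assumptions, not the repo's exact code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(decoder_h, encoder_states, W_dec, W_enc, v):
    # score_i = v^T tanh(W_dec @ h_dec + W_enc @ h_enc_i)
    scores = np.array([
        v @ np.tanh(W_dec @ decoder_h + W_enc @ h_enc)
        for h_enc in encoder_states
    ])
    weights = softmax(scores)                                     # focus per position
    context = np.sum(weights[:, None] * encoder_states, axis=0)   # weighted sum
    return context, weights

# Toy usage: 4 encoder states of size 3
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 3))
ctx, w = bahdanau_attention(rng.normal(size=3), enc,
                            rng.normal(size=(3, 3)), rng.normal(size=(3, 3)),
                            rng.normal(size=3))
print(np.round(w, 3))  # attention weights over the 4 input positions
```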
This one's different. Based on a NeurIPS 2025 paper that reframes everything as nested optimization problems.
The insight: Linear attention's memory update M_t = M_{t-1} + v_t*k_t^T is actually gradient descent on an optimization problem.
Two levels:
- Level 1 (slow): Projection weights W_k, W_v, W_q → trained on dataset
- Level 2 (fast): Memory matrix M_t → updated every timestep
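Here's that reading of the memory update as a gradient step, in a few lines (a sketch of the idea; the exact inner loss used in the paper may differ):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
M = np.zeros((d, d))                      # fast memory (level 2)
k, v = rng.normal(size=d), rng.normal(size=d)

# Inner loss the memory is implicitly minimizing at this step: L(M) = -v^T M k
grad = -np.outer(v, k)                    # dL/dM
M_gradient_step = M - 1.0 * grad          # one gradient-descent step with lr = 1

# The familiar linear-attention update: M_t = M_{t-1} + v_t k_t^T
M_linear_attention = M + np.outer(v, k)

print(np.allclose(M_gradient_step, M_linear_attention))  # True: same update
```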
Continuum Memory System:
- Fast memory: updates every step
- Medium memory: updates every 2-4 steps
- Slow memory: updates every 8+ steps
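A multi-timescale setup can be as simple as giving each memory its own update period. A sketch (the periods 1/4/8 are illustrative picks from the ranges above, not the paper's exact schedule):

```python
import numpy as np

d = 4
memories = {"fast": np.zeros((d, d)), "medium": np.zeros((d, d)), "slow": np.zeros((d, d))}
periods  = {"fast": 1, "medium": 4, "slow": 8}

rng = np.random.default_rng(0)
for t in range(1, 17):
    k, v = rng.normal(size=d), rng.normal(size=d)
    for name, M in memories.items():
        if t % periods[name] == 0:        # each level updates on its own clock
            memories[name] = M + np.outer(v, k)
```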
Why this matters:
- Shows what's actually happening inside attention
- Suggests new architectures with more levels
- Matches how the brain works (multi-timescale processing)
numpy

That's it. Pure Python + NumPy.
python mlp.py
python lstm.py
python attention.py
# etc.

Each file is standalone. No imports between files. Run whatever you want.
If you're learning:
- Start with single-layer-perceptron.py - understand forward/backward pass
- Then mlp.py - see how depth helps
- Then vanilla-rnn.py - sequences and BPTT
- Then lstm.py - gates and long-term memory
- Then seq2seq.py - encoder-decoder
- Then attention.py - see where transformers came from
- Finally nested-attention.py - new theoretical perspective
Fundamentals:
- Forward pass (easy)
- Backward pass (the hard part)
- Gradient descent
- Why certain activations work
- Weight initialization matters
RNN stuff:
- Hidden states
- Backpropagation through time
- Why gradients explode/vanish
- How gradient clipping saves you
Advanced:
- Why LSTM gates work
- Teacher forcing
- Attention mechanisms
- Multi-timescale learning
- What "memory" actually means
Because reading papers is one thing. Implementing from scratch is another. When you have to compute every gradient by hand, you actually understand what's happening.
Also, most "from scratch" implementations cheat and use autograd. This doesn't. Every gradient is derived and implemented manually.
Gradient checking: Not included, but you should add it if you modify anything. Finite differences are your friend.
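If you do add it, a central-difference check is only a few lines (a sketch, not part of the repo):

```python
import numpy as np

def grad_check(f, analytic_grad, W, eps=1e-5, tol=1e-6):
    """Compare analytic dL/dW against central finite differences, entry by entry."""
    num_grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        f_plus = f(W)
        W[idx] = orig - eps
        f_minus = f(W)
        W[idx] = orig
        num_grad[idx] = (f_plus - f_minus) / (2 * eps)
    rel_err = np.abs(num_grad - analytic_grad) / (np.abs(num_grad) + np.abs(analytic_grad) + 1e-12)
    return rel_err.max() < tol, rel_err.max()

# Toy usage: L(W) = sum(W**2), so dL/dW = 2W
W = np.random.default_rng(0).normal(size=(3, 3))
print(grad_check(lambda W: np.sum(W ** 2), 2 * W, W.copy()))
```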
Batch size: Mostly 1 for simplicity. You can extend to mini-batches.
Optimizers: Mostly vanilla SGD. I added momentum and Adam variants in nested learning because that's what the paper does.
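For reference, the momentum update is tiny (a sketch; Adam layers per-parameter adaptive scaling on top of the same idea):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity - lr * grad   # decaying running average of past gradients
    return w + velocity, velocity

# Toy usage: minimize sum(w**2), whose gradient is 2*w
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = sgd_momentum_step(w, 2 * w, v)
print(np.round(w, 4))  # close to [0, 0]
```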
Performance: This is educational code. It's slow. Don't train GPT on this.
- LSTM: Hochreiter & Schmidhuber, 1997
- Seq2Seq: Sutskever et al., 2014
- Attention: Bahdanau et al., 2015
- Nested Learning: Behrouz et al., NeurIPS 2025
MLP on XOR:
Epoch 5000, Loss: 0.0002
Input: [0,0], Predicted: 0.0001, True: 0
Input: [0,1], Predicted: 0.9998, True: 1
Input: [1,0], Predicted: 0.9997, True: 1
Input: [1,1], Predicted: 0.0003, True: 0
Accuracy: 100%
LSTM remembering:
First value (to remember): 0.75
[14 timesteps of noise]
Final prediction: 0.7498
Error: 0.0002
Attention on reversal:
Input: [1, 2, 3, 4]
Expected: [4, 3, 2, 1]
Predicted: [4, 3, 2, 1] ✓
Attention shows model looking at rightmost positions first
- Attention model learns the task but doesn't always show perfect diagonal/monotonic alignment
- That's normal - the model finds the easiest solution (using hidden state + attention)
- For strict attention alignment you'd need architectural constraints
Found a bug? Better way to explain something? PR it.
Want to add GRU, Transformer, whatever? Go for it.
Apache License.
If you're using this to learn, actually run the code. Change hyperparameters. Break things. See what happens when you remove gradient clipping. Watch it diverge. That's how you learn.
Don't just read the code. Type it out yourself. Seriously.
No frameworks were harmed in the making of this repository.