
elite-ball-knowledge

Deep learning from scratch. No TensorFlow, no PyTorch. Just NumPy and math.

What's this?

I got tired of treating neural networks like black boxes, so I implemented everything from the ground up. If you want to actually understand what's happening inside these models instead of just calling .fit(), you're in the right place.

What's inside

single-layer-perceptron.py  → Your first neural network
mlp.py                      → Going deeper with multiple layers
vanilla-rnn.py              → Sequential data, take one
lstm.py                     → Sequential data, but it actually works
seq2seq.py                  → Encoder-decoder architecture
attention.py                → The thing that made transformers possible
nested-attention.py         → New perspective from NeurIPS 2025 paper

single-layer-perceptron.py

One layer. Forward pass, backward pass, weight updates. Handles linearly separable problems like AND and OR (XOR is exactly where a single layer fails, which is why mlp.py exists). If you're starting from zero, start here.
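
The whole idea fits in a few lines. This is a sketch, not the file's exact code: the OR task, learning rate, and epoch count here are made up for illustration.

import numpy as np

# Single-layer perceptron sketch (illustrative, not the file's exact code).
# Learns OR, a linearly separable problem one layer can handle.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [1]], dtype=float)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 1))
b = np.zeros((1, 1))
lr = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    y_hat = sigmoid(X @ W + b)                 # forward pass
    grad = (y_hat - y) * y_hat * (1 - y_hat)   # backward pass: dMSE/dz through the sigmoid
    W -= lr * X.T @ grad / len(X)              # weight update
    b -= lr * grad.mean(axis=0, keepdims=True)

print(np.round(sigmoid(X @ W + b), 2))         # approaches [0, 1, 1, 1]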

mlp.py

Multiple hidden layers. Different activation functions. Proper weight initialization. Actually learns complex patterns.

What you get:

  • Configurable depth (add as many layers as you want)
  • ReLU, Sigmoid, Tanh activations
  • He/Xavier initialization
  • Works on XOR, logic gates, whatever
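
For a feel of what He initialization plus ReLU looks like, here's a hedged sketch; mlp.py's actual API, layer sizes, and variable names may differ.

import numpy as np

# He initialization + ReLU forward pass, sketched (mlp.py's API may differ).
def he_init(fan_in, fan_out, rng):
    # Variance 2/fan_in keeps ReLU activations from shrinking layer by layer.
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
sizes = [2, 8, 8, 1]                           # configurable depth: just extend this list
weights = [he_init(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, n)) for n in sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                    # hidden layers
    return a @ weights[-1] + biases[-1]        # linear output layer

print(forward(np.array([[0.0, 1.0]])).shape)   # (1, 1)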

vanilla-rnn.py

Your introduction to sequences. Has memory, processes one step at a time, backpropagates through time.

Problems it solves:

  • Predict next value in sequence
  • Remember binary patterns
  • Learn character-level patterns

Warning: Gradients explode. I added clipping. You're welcome.
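
The recurrence and the clipping trick, sketched. Variable names and sizes here are illustrative, not the file's exact code.

import numpy as np

# Vanilla RNN step plus gradient clipping, sketched.
def rnn_step(x_t, h_prev, Wxh, Whh, bh):
    # h_t = tanh(Wxh x_t + Whh h_{t-1} + b): the hidden state is the memory.
    return np.tanh(x_t @ Wxh + h_prev @ Whh + bh)

def clip_gradients(grads, max_norm=5.0):
    # Rescale every gradient if their combined norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

rng = np.random.default_rng(0)
Wxh, Whh, bh = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros((1, 4))
h = np.zeros((1, 4))
for t in range(5):
    h = rnn_step(rng.normal(size=(1, 3)), h, Wxh, Whh, bh)

exploding = [np.full((4, 4), 10.0)]
print(np.linalg.norm(clip_gradients(exploding)[0]))   # rescaled down to 5.0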

lstm.py

RNNs but they can actually remember things long-term. Four gates doing their magic.

The gates:

  • Forget gate: What to throw away
  • Input gate: What to remember
  • Cell gate: New information
  • Output gate: What to output
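
In code, one LSTM step looks roughly like this. It's a sketch: lstm.py's variable names and weight layout may differ.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [x_t, h_{t-1}] to all four gate pre-activations, stacked side by side.
    z = np.concatenate([x_t, h_prev], axis=1) @ W + b
    H = h_prev.shape[1]
    f = sigmoid(z[:, 0*H:1*H])    # forget gate: what to throw away
    i = sigmoid(z[:, 1*H:2*H])    # input gate: what to remember
    g = np.tanh(z[:, 2*H:3*H])    # cell (candidate) gate: new information
    o = sigmoid(z[:, 3*H:4*H])    # output gate: what to output
    c_t = f * c_prev + i * g      # cell state carries the long-term memory
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(D + H, 4 * H))
b = np.zeros((1, 4 * H))
h, c = np.zeros((1, H)), np.zeros((1, H))
h, c = lstm_step(rng.normal(size=(1, D)), h, c, W, b)
print(h.shape, c.shape)   # (1, 4) (1, 4)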

Why it's better:

  • Remembers stuff across 15+ timesteps
  • Far less prone to vanishing gradients (the cell state gives gradients a direct path)
  • Can reverse sequences, add numbers from different positions

Test cases:

# Remembers first value after 14 noise steps
Input:  [0.75, noise, noise, ..., noise]
Output: 0.75  # It actually remembers!

# Sequence reversal
Input:  [1, 2, 3, 4]
Output: [4, 3, 2, 1]  # Perfect

seq2seq.py

Two LSTMs talking to each other. Encoder reads, decoder writes.

How it works:

  1. Encoder LSTM processes input → creates "thought vector"
  2. Thought vector = compressed understanding of input
  3. Decoder LSTM generates output from thought vector

Uses teacher forcing during training: the decoder is fed the correct previous output instead of its own prediction, which makes training faster and more stable.
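
Here's what teacher forcing means in the decoder loop, as a sketch. The helper names are hypothetical, not seq2seq.py's.

# Teacher forcing, sketched (these names are hypothetical, not seq2seq.py's).
# With teacher forcing on, the decoder's next input is the ground-truth token,
# not its own (possibly wrong) previous prediction.
def decode(decoder_step, h, c, start_token, targets, teacher_forcing=True):
    outputs, prev = [], start_token
    for t in range(len(targets)):
        y_t, h, c = decoder_step(prev, h, c)
        outputs.append(y_t)
        prev = targets[t] if teacher_forcing else y_t
    return outputs

echo = lambda prev, h, c: (prev, h, c)                 # dummy decoder for illustration
print(decode(echo, None, None, 0, [3, 2, 1]))          # [0, 3, 2]  (fed the true tokens)
print(decode(echo, None, None, 0, [3, 2, 1], False))   # [0, 0, 0]  (fed its own outputs)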

Examples:

  • Reverse sequences: [1,2,3] → [3,2,1]
  • Translate: [1,2] → ['A','B'] (numbers to letters)

The bottleneck problem: Everything must pass through one vector. That's where attention comes in.

attention.py

Solves seq2seq's bottleneck. Decoder can now "look at" all encoder states, not just the last one.

The idea:

  • Compute attention weights: how much to focus on each input position
  • Create context vector: weighted sum of encoder states
  • Decoder uses context at each step
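
The score → softmax → context pipeline in NumPy, as a sketch; attention.py's shapes and names may differ.

import numpy as np

# Additive (Bahdanau) attention for one decoder step, sketched.
def additive_attention(dec_hidden, enc_states, Wa, Ua, va):
    # Score each encoder state h_i against the decoder state s_t: v^T tanh(Wa s_t + Ua h_i).
    scores = np.tanh(dec_hidden @ Wa + enc_states @ Ua) @ va   # shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                   # softmax over input positions
    context = weights @ enc_states                             # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
T, H = 4, 5                                  # 4 input positions, hidden size 5
enc_states = rng.normal(size=(T, H))
dec_hidden = rng.normal(size=(H,))
Wa, Ua, va = rng.normal(size=(H, H)), rng.normal(size=(H, H)), rng.normal(size=(H,))
context, weights = additive_attention(dec_hidden, enc_states, Wa, Ua, va)
print(np.round(weights, 3), weights.sum())   # one weight per position, summing to 1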

What you see:

Attention Visualization:
     1    2    3    4
  4 | █    █    ██   ████    ← Focuses on position 4
  3 | █    █    ██   ████    ← Also focuses on 4
  2 | █    █    ██   ████
  1 | █    █    ██   ████

For reversal, attention should go right-to-left. For copying, diagonal. You can actually see what the model is thinking.

Implemented: Bahdanau (additive) attention. The classic one from 2015.

nested-attention.py

This one's different. Based on a NeurIPS 2025 paper that reframes everything as nested optimization problems.

The insight: Linear attention's memory update M_t = M_{t-1} + v_t*k_t^T is actually gradient descent on an optimization problem.
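
Here's that reading made concrete as a toy sketch. The learning rate of 1 and the dot-product objective are my simplification; the paper adds more structure.

import numpy as np

# Linear-attention memory update read as one gradient-descent step (toy sketch).
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
M = np.zeros((d_v, d_k))                  # fast memory matrix M_{t-1}
k_t = rng.normal(size=d_k)
v_t = rng.normal(size=d_v)

# The plain update: M_t = M_{t-1} + v_t k_t^T
M_plain = M + np.outer(v_t, k_t)

# The same update as gradient descent (lr = 1) on L(M) = -v_t^T M k_t,
# whose gradient with respect to M is -v_t k_t^T.
grad = -np.outer(v_t, k_t)
M_gd = M - 1.0 * grad

print(np.allclose(M_plain, M_gd))         # True: identical updates
print(np.round(M_plain @ k_t, 3))         # querying with q_t = k_t returns v_t scaled by ||k_t||^2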

Two levels:

  • Level 1 (slow): Projection weights W_k, W_v, W_q → trained on dataset
  • Level 2 (fast): Memory matrix M_t → updated every timestep

Continuum Memory System:

  • Fast memory: updates every step
  • Medium memory: updates every 2-4 steps
  • Slow memory: updates every 8+ steps
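
A toy schedule makes the frequencies concrete. The update rule and sizes here are illustrative, not the file's exact code.

import numpy as np

# Toy continuum-memory schedule: three memories, updated at different frequencies.
rng = np.random.default_rng(0)
d = 4
memories = {name: np.zeros((d, d)) for name in ("fast", "medium", "slow")}
periods = {"fast": 1, "medium": 4, "slow": 8}   # update every 1 / 4 / 8 timesteps
updates = {name: 0 for name in periods}

for t in range(1, 17):                          # walk a 16-step sequence
    k_t, v_t = rng.normal(size=d), rng.normal(size=d)
    for name, period in periods.items():
        if t % period == 0:                     # slower memories skip most steps
            memories[name] += np.outer(v_t, k_t)
            updates[name] += 1

print(updates)   # {'fast': 16, 'medium': 4, 'slow': 2}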

Why this matters:

  • Shows what's actually happening inside attention
  • Suggests new architectures with more levels
  • Matches how the brain works (multi-timescale processing)

Requirements

numpy

That's it. Pure Python + NumPy.

Running the code

python mlp.py
python lstm.py
python attention.py
# etc.

Each file is standalone. No imports between files. Run whatever you want.

Learning order

If you're learning:

  1. Start with single-layer-perceptron.py - understand forward/backward pass
  2. Then mlp.py - see how depth helps
  3. Then vanilla-rnn.py - sequences and BPTT
  4. Then lstm.py - gates and long-term memory
  5. Then seq2seq.py - encoder-decoder
  6. Then attention.py - see where transformers came from
  7. Finally nested-attention.py - new theoretical perspective

What you'll learn

Fundamentals:

  • Forward pass (easy)
  • Backward pass (the hard part)
  • Gradient descent
  • Why certain activations work
  • Why weight initialization matters

RNN stuff:

  • Hidden states
  • Backpropagation through time
  • Why gradients explode/vanish
  • How gradient clipping saves you

Advanced:

  • Why LSTM gates work
  • Teacher forcing
  • Attention mechanisms
  • Multi-timescale learning
  • What "memory" actually means

Why I built this

Because reading papers is one thing. Implementing from scratch is another. When you have to compute every gradient by hand, you actually understand what's happening.

Also, most "from scratch" implementations cheat and use autograd. This doesn't. Every gradient is derived and implemented manually.

Implementation notes

Gradient checking: Not included, but you should add it if you modify anything. Finite differences are your friend.
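
If you do add it, a generic check looks like this (not in the repo; a sketch to adapt):

import numpy as np

# Generic finite-difference gradient check (not in the repo; adapt as needed).
def grad_check(loss_fn, params, analytic_grad, eps=1e-5):
    numeric = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        orig = params[idx]
        params[idx] = orig + eps
        loss_plus = loss_fn(params)
        params[idx] = orig - eps
        loss_minus = loss_fn(params)
        params[idx] = orig                       # restore the parameter
        numeric[idx] = (loss_plus - loss_minus) / (2 * eps)
    denom = np.linalg.norm(numeric) + np.linalg.norm(analytic_grad) + 1e-12
    return np.linalg.norm(numeric - analytic_grad) / denom   # want roughly 1e-7 or smaller

# Sanity check on loss = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0, 3.0])
print(grad_check(lambda p: 0.5 * np.sum(p ** 2), w, w.copy()))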

Batch size: Mostly 1 for simplicity. You can extend to mini-batches.

Optimizers: Mostly vanilla SGD. I added momentum and Adam variants in nested learning because that's what the paper does.
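
For reference, momentum is just plain SGD with a velocity term; this is a generic sketch, not the repo's exact code.

import numpy as np

# SGD with momentum, sketched generically (not the repo's exact code).
def sgd_momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    velocity = beta * velocity + grad       # decayed running sum of past gradients
    return param - lr * velocity, velocity

w = np.array([1.0, -1.0])
v = np.zeros_like(w)
for _ in range(3):
    w, v = sgd_momentum_step(w, 2 * w, v)   # gradient of ||w||^2 is 2w
print(np.round(w, 3))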

Performance: This is educational code. It's slow. Don't train GPT on this.

Papers

  • LSTM: Hochreiter & Schmidhuber, 1997
  • Seq2Seq: Sutskever et al., 2014
  • Attention: Bahdanau et al., 2015
  • Nested Learning: Behrouz et al., NeurIPS 2025

Results you'll see

MLP on XOR:

Epoch 5000, Loss: 0.0002
Input: [0,0], Predicted: 0.0001, True: 0
Input: [0,1], Predicted: 0.9998, True: 1
Input: [1,0], Predicted: 0.9997, True: 1
Input: [1,1], Predicted: 0.0003, True: 0
Accuracy: 100%

LSTM remembering:

First value (to remember): 0.75
[14 timesteps of noise]
Final prediction: 0.7498
Error: 0.0002

Attention on reversal:

Input:     [1, 2, 3, 4]
Expected:  [4, 3, 2, 1]
Predicted: [4, 3, 2, 1] ✓

Attention shows model looking at rightmost positions first

Known issues

  • Attention model learns the task but doesn't always show perfect diagonal/monotonic alignment
  • That's normal - the model finds the easiest solution (using hidden state + attention)
  • For strict attention alignment you'd need architectural constraints

Contributing

Found a bug? Better way to explain something? PR it.

Want to add GRU, Transformer, whatever? Go for it.

License

Apache License.

Final note

If you're using this to learn, actually run the code. Change hyperparameters. Break things. See what happens when you remove gradient clipping. Watch it diverge. That's how you learn.

Don't just read the code. Type it out yourself. Seriously.


No frameworks were harmed in the making of this repository.
