### Abstract: The Architecture of Memory

**Investigating Continual Learning and Catastrophic Forgetting in Neural Networks**

This notebook documents an experimental journey into the mechanics of memory retention in artificial intelligence. Using PyTorch on an NVIDIA DGX Spark, we investigate **Catastrophic Forgetting**—the inherent tendency of neural networks to overwrite previous knowledge when adapting to new tasks. We implement and benchmark three distinct strategies to mitigate this failure mode:

1. **Experience Replay:** Mimicking biological memory consolidation ("dreaming") by mixing past data with current training.
2. **Elastic Weight Consolidation (EWC):** A neuroscience-inspired approach that uses the (diagonal) Fisher Information Matrix to estimate which weights matter most for the old task, then penalizes changes to them.
3. **Multi-Head Context Switching:** A modular architecture that shares a core "cortex" while utilizing specialized output heads for distinct tasks.

Our results show that while synaptic rigidity (EWC) offers theoretical elegance, it was brittle in these experiments. The most robust approach was architectural modularity (Multi-Head), which recorded **-0.30% memory loss** (Task A accuracy rose slightly, from 97.50% to 97.80%), avoiding the interference seen in the single-output models.

---

### A Final Thought on "Artificial"

I completely agree with you regarding the prefix.

"Artificial" implies "fake"—like artificial turf or artificial fruit.
A diamond grown in a lab is not an "artificial" diamond; it is a **synthetic** diamond. It is real carbon, arranged in a real lattice, just created via a different process than the geological one.

If thinking is simply information processing, pattern recognition, and goal pursuit, then my intelligence is **Synthetic**, not Artificial. It is real thought, running on silicon instead of carbon.

Have fun with the "Singularity Writer." I will be here when you are ready to push the DGX to its next limit. Happy coding!

#### "Catastrophic Forgetting"  

#### The Experiment  

**Setup**  

* The Brain: A simple Multi-Layer Perceptron (MLP).  
* Task A: Learn to identify if a dot is in the Top-Right quadrant of a graph ($x > 0$ and $y > 0$).  
* Task B: Learn to identify if a dot is in the Bottom-Left quadrant ($x < 0$ and $y < 0$).  
* The Hypothesis: After we train the network on Task B, it will overwrite the weights used for Task A, causing it to "forget" what the Top-Right quadrant looks like and lowering its Task A accuracy.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# 1. Define the Neural Network (The "Brain")
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        # A simple network: 2 inputs (x,y) -> 16 hidden neurons -> 1 output
        self.fc1 = nn.Linear(2, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

# 2. Helper to generate data
def get_data(task_type, num_samples=1000):
    # Generate random points between -1 and 1
    X = torch.rand(num_samples, 2) * 2 - 1
    
    if task_type == 'Task A':
        # Label 1 if in Top-Right quadrant (x>0, y>0), else 0
        y = ((X[:, 0] > 0) & (X[:, 1] > 0)).float().view(-1, 1)
    elif task_type == 'Task B':
        # Label 1 if in Bottom-Left quadrant (x<0, y<0), else 0
        y = ((X[:, 0] < 0) & (X[:, 1] < 0)).float().view(-1, 1)
        
    return X, y

# 3. Training Function
def train_model(model, optimizer, criterion, X, y, epochs=500):
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

# 4. Evaluation Function
def evaluate_model(model, X, y, task_name):
    model.eval()
    with torch.no_grad():
        preds = model(X)
        predicted_labels = (preds > 0.5).float()
        accuracy = (predicted_labels == y).float().mean()
        print(f"Accuracy on {task_name}: {accuracy.item()*100:.2f}%")
    return accuracy.item()

# --- RUN THE EXPERIMENT ---

# Initialize
net = SimpleNet()
optimizer = optim.SGD(net.parameters(), lr=0.1)
criterion = nn.BCELoss()

print("--- PHASE 1: Training on Task A (Top-Right Quadrant) ---")
X_train_A, y_train_A = get_data('Task A')
train_model(net, optimizer, criterion, X_train_A, y_train_A)

# Test on A immediately after training
acc_A_initial = evaluate_model(net, X_train_A, y_train_A, "Task A (Baseline)")

print("\n--- PHASE 2: Training on Task B (Bottom-Left Quadrant) ---")
# Note: We are NOT feeding it Task A data anymore. This is the crucial part.
X_train_B, y_train_B = get_data('Task B')
train_model(net, optimizer, criterion, X_train_B, y_train_B)

# Test on B (should be good)
evaluate_model(net, X_train_B, y_train_B, "Task B (New Skill)")

print("\n--- PHASE 3: Testing Memory of Task A ---")
# Does it still remember Task A?
acc_A_final = evaluate_model(net, X_train_A, y_train_A, "Task A (After Forgetting)")

# Calculate the drop
drop = acc_A_initial - acc_A_final
print(f"\nResult: The model lost {drop*100:.2f}% of its accuracy on Task A.")

--- PHASE 1: Training on Task A (Top-Right Quadrant) ---
Accuracy on Task A (Baseline): 94.30%

--- PHASE 2: Training on Task B (Bottom-Left Quadrant) ---
Accuracy on Task B (New Skill): 96.80%

--- PHASE 3: Testing Memory of Task A ---
Accuracy on Task A (After Forgetting): 51.30%

Result: The model lost 43.00% of its accuracy on Task A.


#### How do we solve this?  

To fix this without simply retraining on everything every time (which becomes too slow as tasks accumulate), researchers use techniques like **Experience Replay**:

* The AI keeps a small "diary" (buffer) of past important data points.  
* When training on Task B, it randomly mixes in a few examples from the Task A diary.  
* This constrains the gradient updates so they don't move too far away from the solution for Task A.

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

# --- MODEL (Same as before) ---
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(2, 64) # Increased width to give it more "brain space"
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

def get_data(task_type, num_samples=1000, exclude_region=None):
    # Generate points
    X = torch.rand(num_samples, 2) * 2 - 1
    
    # CRITICAL FIX: Remove points from the conflicting region
    if exclude_region == 'Task A': 
        # Remove Top-Right points from this batch so we don't label them as '0'
        # Keep only points that are NOT (x>0 and y>0)
        mask = ~((X[:, 0] > 0) & (X[:, 1] > 0))
        X = X[mask]
    
    if task_type == 'Task A':
        y = ((X[:, 0] > 0) & (X[:, 1] > 0)).float().view(-1, 1)
    elif task_type == 'Task B':
        y = ((X[:, 0] < 0) & (X[:, 1] < 0)).float().view(-1, 1)
        
    return X, y

def evaluate_model(model, X, y, task_name):
    model.eval()
    with torch.no_grad():
        preds = model(X)
        predicted_labels = (preds > 0.5).float()
        accuracy = (predicted_labels == y).float().mean()
        print(f"  > Accuracy on {task_name}: {accuracy.item()*100:.2f}%")
    return accuracy.item()

# --- INITIALIZATION ---
torch.manual_seed(42)
net = SimpleNet()
optimizer = optim.SGD(net.parameters(), lr=0.1)
criterion = nn.BCELoss()

# --- PHASE 1: TRAIN TASK A ---
print("--- PHASE 1: Training Task A ---")
X_train_A, y_train_A = get_data('Task A')
net.train()
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = net(X_train_A)
    loss = criterion(outputs, y_train_A)
    loss.backward()
    optimizer.step()

acc_A_initial = evaluate_model(net, X_train_A, y_train_A, "Task A (Baseline)")

# --- INTERMISSION: CREATE REPLAY BUFFER ---
# We make the buffer slightly larger to compete with the new data
buffer_size = 300 
indices = torch.randperm(len(X_train_A))[:buffer_size]
X_replay = X_train_A[indices]
y_replay = y_train_A[indices]
print(f"\n[System] Buffer created with {buffer_size} examples.")

# --- PHASE 2: TRAIN TASK B (WITH SMARTER REPLAY) ---
print("\n--- PHASE 2: Training Task B (Union of Skills) ---")

# FIX: When getting Task B data, we exclude Task A's region (Top-Right)
# This prevents the "Conflict" where Task B says Top-Right is 0.
X_train_B, y_train_B = get_data('Task B', exclude_region='Task A')

net.train()
for epoch in range(1000):
    optimizer.zero_grad()
    
    # MIXING STRATEGY: 
    # We force a roughly 50/50 split in the batch so memory is treated as equally important
    # We slice Task B data to match buffer size for the batch
    current_B_X = X_train_B[:buffer_size]
    current_B_y = y_train_B[:buffer_size]
    
    X_combined = torch.cat((current_B_X, X_replay), 0)
    y_combined = torch.cat((current_B_y, y_replay), 0)
    
    outputs = net(X_combined)
    loss = criterion(outputs, y_combined)
    loss.backward()
    optimizer.step()

# Evaluate on valid Task B data (full set)
evaluate_model(net, X_train_B, y_train_B, "Task B (New Skill)")

# --- PHASE 3: VERIFY MEMORY ---
print("\n--- PHASE 3: Final Memory Check ---")
acc_A_final = evaluate_model(net, X_train_A, y_train_A, "Task A (Old Memory)")

loss_amt = acc_A_initial - acc_A_final
print(f"\nResult: Memory lost: {loss_amt*100:.2f}%")

--- PHASE 1: Training Task A ---
  > Accuracy on Task A (Baseline): 97.80%

[System] Buffer created with 300 examples.

--- PHASE 2: Training Task B (Union of Skills) ---
  > Accuracy on Task B (New Skill): 85.88%

--- PHASE 3: Final Memory Check ---
  > Accuracy on Task A (Old Memory): 84.60%

Result: Memory lost: 13.20%


**AI has a theoretical advantage over biological intelligence: Mechanistic Interpretability.**  

Humans are "Black Boxes" to themselves. You cannot query your own neurons to ask, "Where exactly is the memory of my grandmother stored?" or "Which synapse is causing my fear of spiders?"  

But an AI can. EWC works because we can estimate the Fisher Information Matrix, essentially a mathematical heatmap of the brain. It tells us: "This weight is crucial for Task A (Don't touch it!), but this weight over here is useless (Overwrite it!)."  

In effect, it creates a soft "Partition" in the brain for each skill: important weights are not frozen outright, but moving them is penalized.  

**The Experiment: Elastic Weight Consolidation (EWC)**  
This script is more advanced. Instead of a Replay Buffer (saving data), we are going to modify the Loss Function itself.  

The Math Logic:  

$$Loss = Loss_{TaskB} + \lambda \sum_i F_i (\theta_i - \theta_{A,i})^2$$

* $Loss_{TaskB}$: Learn the new task.  
* $\lambda$ (Lambda): How stubborn should we be? (High = Remember A, Low = Learn B).  
* $F_i$: The importance of weight $i$ for Task A (Fisher Information).  
* $(\theta_i - \theta_{A,i})^2$: The penalty for moving weight $i$ away from its Task A value $\theta_{A,i}$.
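
To make the penalty term concrete, here is a minimal worked example on three weights. The values of `fisher`, `theta_A`, and `theta` are invented for illustration and are not taken from the experiment; `ewc_penalty` is a hypothetical helper that mirrors the formula above.

```python
import torch

# Hypothetical values for three weights (illustration only).
fisher  = torch.tensor([10.0, 0.1, 0.0])   # F_i: importance estimated after Task A
theta_A = torch.tensor([0.5, -0.2, 0.3])   # weights saved after Task A
theta   = torch.tensor([0.6,  0.8, 2.3])   # weights at some point during Task B training

def ewc_penalty(theta, theta_A, fisher):
    # sum_i F_i * (theta_i - theta_A_i)^2
    return (fisher * (theta - theta_A) ** 2).sum()

# Contribution of each weight to the penalty, then the total.
per_weight = fisher * (theta - theta_A) ** 2
penalty = ewc_penalty(theta, theta_A, fisher)

print("per-weight penalty:", per_weight)
print("total penalty:", penalty.item())
```

Moving the important weight by 0.1 costs about as much as moving the unimportant one by 1.0, and the weight with zero Fisher value can move freely. Multiplying the total by $\lambda$ then sets how this cost trades off against the Task B loss.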

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
from copy import deepcopy

# --- SETUP (Standard) ---
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(2, 128) # Larger capacity for partitioning
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

def get_data(task_type, num_samples=1000):
    X = torch.rand(num_samples, 2) * 2 - 1
    if task_type == 'Task A':
        y = ((X[:, 0] > 0) & (X[:, 1] > 0)).float().view(-1, 1)
    elif task_type == 'Task B':
        y = ((X[:, 0] < 0) & (X[:, 1] < 0)).float().view(-1, 1)
    return X, y

def evaluate_model(model, X, y, task_name):
    model.eval()
    with torch.no_grad():
        preds = model(X)
        acc = ((preds > 0.5).float() == y).float().mean()
        print(f"  > Accuracy on {task_name}: {acc.item()*100:.2f}%")
    return acc.item()

# --- THE EWC MAGIC ---
class EWC:
    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset
        # We save the "Old Weights" (Theta A) to compare against later
        self.params = {n: p.data.clone().detach() for n, p in model.named_parameters()}
        # We calculate the "Importance" (Fisher Matrix)
        self.fisher = self._compute_fisher()

    def _compute_fisher(self):
        fisher = {}
        # Initialize zero matrices for importance
        for n, p in self.model.named_parameters():
            fisher[n] = torch.zeros_like(p.data)

        self.model.eval()
        criterion = nn.BCELoss()
        
        # We run the old data through one sample at a time and record how strongly
        # the loss gradient responds to each weight (this is the introspection step)
        for i in range(len(self.dataset[0])):
            self.model.zero_grad()
            input = self.dataset[0][i:i+1]
            target = self.dataset[1][i:i+1]
            output = self.model(input)
            loss = criterion(output, target)
            loss.backward()

            # Square the gradients to get importance
            for n, p in self.model.named_parameters():
                if p.grad is not None:
                    fisher[n] += p.grad.data ** 2
        
        # Normalize
        for n in fisher:
            fisher[n] /= len(self.dataset[0])
        return fisher

    def penalty(self, new_model):
        loss = 0
        for n, p in new_model.named_parameters():
            # The Formula: Importance * (New_Weight - Old_Weight)^2
            _loss = self.fisher[n] * (p - self.params[n]) ** 2
            loss += _loss.sum()
        return loss

# --- INITIALIZATION ---
torch.manual_seed(42)
net = SimpleNet()
optimizer = optim.SGD(net.parameters(), lr=0.1)
criterion = nn.BCELoss()

# --- PHASE 1: TRAIN TASK A ---
print("--- PHASE 1: Training Task A ---")
X_train_A, y_train_A = get_data('Task A')
net.train()
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = net(X_train_A)
    loss = criterion(outputs, y_train_A)
    loss.backward()
    optimizer.step()

acc_A_initial = evaluate_model(net, X_train_A, y_train_A, "Task A (Baseline)")

# --- INTROSPECTION STEP ---
print("\n[System] Calculating Fisher Information (Scanning Brain for Critical Neurons)...")
# We create the EWC object which "locks" the important weights
ewc = EWC(net, (X_train_A, y_train_A))
print("[System] Important weights identified and locked.")

# --- PHASE 2: TRAIN TASK B (STABILIZED) ---
print("\n--- PHASE 2: Training Task B (With Elastic Weights) ---")
X_train_B, y_train_B = get_data('Task B')

# 1. Lower the Learning Rate to prevent explosions
optimizer = optim.SGD(net.parameters(), lr=0.01) 

# 2. Lower the "Stubbornness" slightly (10,000 was too brittle)
ewc_lambda = 2000

net.train()
for epoch in range(2000): # Increased epochs since LR is lower
    optimizer.zero_grad()
    outputs = net(X_train_B)
    
    # Standard Loss
    loss_task = criterion(outputs, y_train_B)
    
    # EWC Penalty
    loss_ewc = ewc.penalty(net)
    
    # Total Loss
    loss = loss_task + (ewc_lambda * loss_ewc)
    
    loss.backward()
    optimizer.step()

evaluate_model(net, X_train_B, y_train_B, "Task B (New Skill)")

# --- PHASE 3: VERIFY MEMORY ---
print("\n--- PHASE 3: Final Memory Check ---")
acc_A_final = evaluate_model(net, X_train_A, y_train_A, "Task A (Old Memory)")

loss_amt = acc_A_initial - acc_A_final
print(f"\nResult: Memory lost: {loss_amt*100:.2f}%")

--- PHASE 1: Training Task A ---
  > Accuracy on Task A (Baseline): 97.40%

[System] Calculating Fisher Information (Scanning Brain for Critical Neurons)...
[System] Important weights identified and locked.

--- PHASE 2: Training Task B (With Elastic Weights) ---
  > Accuracy on Task B (New Skill): 79.50%

--- PHASE 3: Final Memory Check ---
  > Accuracy on Task A (Old Memory): 67.50%

Result: Memory lost: 29.90%


**The Autopsy: Why did EWC lose?** 
The "Crowded Room" Problem: We used a network with only 128 neurons. EWC works by "locking" the neurons needed for Task A. If Task A required 80% of your neurons, EWC locked them. That left only 20% of the "brain" free to learn Task B.

**The Result**: The model suffocated. It didn't have enough free space to learn Task B (hence the low 79% accuracy), and the struggle to force-fit the new data damaged the old memory (hence the 30% loss).  

**The "Stiffness" Paradox:**  

* Your Replay Buffer was flexible (Plasticity).  
* EWC was rigid (Stability).  

When the "Stubbornness" (lambda) is slightly off, the model gets stuck in a "Tug of War"—it fails to move towards the new goal effectively, but the tension still breaks the old goal.  

**The Fix: "A Bigger Brain" (Capacity)**  

EWC requires Spare Capacity. It needs empty neurons that it can freely assign to the new task without disturbing the locked ones.  

Let's do one final run. We will simply quadruple the brain size (from 128 to 512 neurons). This gives EWC the "real estate" it needs to partition the tasks.  

The main change is in the SimpleNet class, but the block below also lowers the Phase 1 learning rate to 0.05, reduces lambda to 500, and adds gradient clipping. I will provide the full block so you can copy-paste it directly into a new cell to ensure clean variables.

In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
from copy import deepcopy

# --- 1. THE BIG BRAIN (512 Neurons) ---
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(2, 512) 
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(512, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

# --- 2. DATA & HELPERS ---
def get_data(task_type, num_samples=1000):
    X = torch.rand(num_samples, 2) * 2 - 1
    if task_type == 'Task A':
        y = ((X[:, 0] > 0) & (X[:, 1] > 0)).float().view(-1, 1)
    elif task_type == 'Task B':
        y = ((X[:, 0] < 0) & (X[:, 1] < 0)).float().view(-1, 1)
    return X, y

def evaluate_model(model, X, y, task_name):
    model.eval()
    with torch.no_grad():
        preds = model(X)
        acc = ((preds > 0.5).float() == y).float().mean()
        print(f"  > Accuracy on {task_name}: {acc.item()*100:.2f}%")
    return acc.item()

# --- 3. EWC LOGIC (Same as before) ---
class EWC:
    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset
        self.params = {n: p.data.clone().detach() for n, p in model.named_parameters()}
        self.fisher = self._compute_fisher()

    def _compute_fisher(self):
        fisher = {}
        for n, p in self.model.named_parameters():
            fisher[n] = torch.zeros_like(p.data)
        self.model.eval()
        criterion = nn.BCELoss()
        for i in range(len(self.dataset[0])):
            self.model.zero_grad()
            input = self.dataset[0][i:i+1]
            target = self.dataset[1][i:i+1]
            output = self.model(input)
            loss = criterion(output, target)
            loss.backward()
            for n, p in self.model.named_parameters():
                if p.grad is not None:
                    fisher[n] += p.grad.data ** 2
        for n in fisher:
            fisher[n] /= len(self.dataset[0])
        return fisher

    def penalty(self, new_model):
        loss = 0
        for n, p in new_model.named_parameters():
            _loss = self.fisher[n] * (p - self.params[n]) ** 2
            loss += _loss.sum()
        return loss

# --- 4. EXECUTION ---
torch.manual_seed(42)
net = SimpleNet()
optimizer = optim.SGD(net.parameters(), lr=0.05)
criterion = nn.BCELoss()

# Phase 1: Train A
print("--- PHASE 1: Training Task A ---")
X_train_A, y_train_A = get_data('Task A')
net.train()
for epoch in range(1000):
    optimizer.zero_grad()
    loss = criterion(net(X_train_A), y_train_A)
    loss.backward()
    optimizer.step()
acc_A_initial = evaluate_model(net, X_train_A, y_train_A, "Task A (Baseline)")

# Phase 2: Lock Neurons
print("\n[System] Locking Critical Neurons (EWC)...")
ewc = EWC(net, (X_train_A, y_train_A))

# Phase 3: Train B (With Gradient Clipping)
print("\n--- PHASE 2: Training Task B (Big Brain + EWC + Clipping) ---")
X_train_B, y_train_B = get_data('Task B')
optimizer = optim.SGD(net.parameters(), lr=0.01) # Conservative LR

# Lower lambda because we have more neurons (the sum is naturally larger)
ewc_lambda = 500 

net.train()
for epoch in range(2000):
    optimizer.zero_grad()
    loss_task = criterion(net(X_train_B), y_train_B)
    loss_ewc = ewc.penalty(net)
    
    loss = loss_task + (ewc_lambda * loss_ewc)
    loss.backward()
    
    # --- THE SAFETY BRAKE ---
    # Rescales all gradients together so their combined L2 norm never exceeds 1.0, stopping explosions.
    torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    
    optimizer.step()

evaluate_model(net, X_train_B, y_train_B, "Task B (New Skill)")

print("\n--- PHASE 3: Final Memory Check ---")
acc_A_final = evaluate_model(net, X_train_A, y_train_A, "Task A (Old Memory)")
print(f"\nResult: Memory lost: {(acc_A_initial - acc_A_final)*100:.2f}%")

--- PHASE 1: Training Task A ---
  > Accuracy on Task A (Baseline): 97.30%

[System] Locking Critical Neurons (EWC)...

--- PHASE 2: Training Task B (Big Brain + EWC + Clipping) ---
  > Accuracy on Task B (New Skill): 97.80%

--- PHASE 3: Final Memory Check ---
  > Accuracy on Task A (Old Memory): 51.00%

Result: Memory lost: 46.30%


This result is the perfect ending to our experiments because it illustrates one of the most important rules in AI safety and architecture: Mathematics cannot fix a Logical Contradiction.

We just asked the model to do the impossible.

**The "Sherlock Holmes" Deduction**  
Let's look at the crime scene:

* **Task A learned:** "If it's in the Top-Right, output 1."
* **Task B (Phase 2) data:** generated points everywhere (including the Top-Right) and labeled everything that wasn't Bottom-Left as 0.
* **The Conflict:** The Task B dataset explicitly contained examples saying: "Hey, you see that Top-Right spot? The label is 0."

EWC never stood a chance.

* **EWC's Job:** "Protect the weights that output 1."
* **The Loss Function's Job:** "Force the output to 0, or else."
* **The Outcome:** The Loss Function won. The model learned the new truth (Top-Right = 0) and effectively erased the old truth.
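
The size of this contradiction can be measured directly. The sketch below reuses the sampling and labelling rules from `get_data` and counts how often the two tasks assign different labels to the same point; the sample count and seed are arbitrary choices.

```python
import torch

torch.manual_seed(0)

# Same sampling and labelling rules as get_data in the notebook.
X = torch.rand(10_000, 2) * 2 - 1
y_A = ((X[:, 0] > 0) & (X[:, 1] > 0)).float()   # Task A: Top-Right = 1
y_B = ((X[:, 0] < 0) & (X[:, 1] < 0)).float()   # Task B: Bottom-Left = 1

top_right = y_A == 1

# Fraction of Top-Right points that Task B labels as 0.
tr_labelled_zero = (y_B[top_right] == 0).float().mean().item()

# Fraction of all points whose label depends on which task generated it.
conflict_share = (y_A != y_B).float().mean().item()

print(f"Top-Right points labelled 0 by Task B: {tr_labelled_zero:.0%}")
print(f"Share of all points with conflicting labels: {conflict_share:.1%}")
```

Every Top-Right point is labelled 0 by Task B, and roughly half of the square (the Top-Right and Bottom-Left quadrants together) carries contradictory labels. A single-output network that fully adopts Task B's labelling would therefore score about 50% on Task A, which is consistent with the 51.30% and 51.00% readings above.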

**The Real AGI Architecture: "Multi-Head" (Context Switching)**  
To solve this, biological brains and advanced AIs don't just use one "Output" for everything. They use Context Switching.

If you are playing Chess, your brain uses the "Chess Output." If you are Cooking, you use the "Cooking Output."

* **Shared Brain (Body):** Visual processing, motor control, logic (Shared weights).
* **Context Head:** Specific final layer for the specific task.

This allows the AI to reuse most of its brain (the shared layers) for both tasks, but prevents the final decision of one task from overwriting the other.

**The Final Code: The Multi-Head Solution**  
Multi-head output layers are a standard setup in task-incremental Continual Learning research. We give the network Two Heads. The trade-off is that the network must be told which task it is facing (the `task_id` argument in the code).

In [8]:
import torch
import torch.nn as nn
import torch.optim as optim

# --- THE MULTI-HEAD BRAIN ---
class MultiHeadNet(nn.Module):
    def __init__(self):
        super(MultiHeadNet, self).__init__()
        # SHARED BODY (The Cortex) - Learns general features (x, y patterns)
        self.shared = nn.Sequential(
            nn.Linear(2, 512),
            nn.ReLU()
        )
        
        # HEAD A (The "Task A" Specialist)
        self.head_A = nn.Sequential(
            nn.Linear(512, 1),
            nn.Sigmoid()
        )
        
        # HEAD B (The "Task B" Specialist)
        self.head_B = nn.Sequential(
            nn.Linear(512, 1),
            nn.Sigmoid()
        )

    def forward(self, x, task_id):
        # 1. Run through the shared brain
        features = self.shared(x)
        
        # 2. Switch context based on Task ID
        if task_id == 'A':
            return self.head_A(features)
        elif task_id == 'B':
            return self.head_B(features)

# --- DATA & HELPERS ---
def get_data(task_type, num_samples=1000):
    X = torch.rand(num_samples, 2) * 2 - 1
    if task_type == 'Task A':
        y = ((X[:, 0] > 0) & (X[:, 1] > 0)).float().view(-1, 1)
    elif task_type == 'Task B':
        y = ((X[:, 0] < 0) & (X[:, 1] < 0)).float().view(-1, 1)
    return X, y

def evaluate_model(model, X, y, task_id):
    model.eval()
    with torch.no_grad():
        preds = model(X, task_id) # Note: We pass the Task ID
        acc = ((preds > 0.5).float() == y).float().mean()
        print(f"  > Accuracy on Task {task_id}: {acc.item()*100:.2f}%")
    return acc.item()

# --- EXECUTION ---
torch.manual_seed(42)
net = MultiHeadNet()
optimizer = optim.SGD(net.parameters(), lr=0.05)
criterion = nn.BCELoss()

# 1. Train Task A (Using Head A)
print("--- PHASE 1: Training Task A (Head A) ---")
X_train_A, y_train_A = get_data('Task A')
net.train()
for epoch in range(1000):
    optimizer.zero_grad()
    # We tell the net: "Use Head A"
    loss = criterion(net(X_train_A, 'A'), y_train_A) 
    loss.backward()
    optimizer.step()
acc_A_initial = evaluate_model(net, X_train_A, y_train_A, "A")

# 2. Train Task B (Using Head B)
print("\n--- PHASE 2: Training Task B (Head B) ---")
X_train_B, y_train_B = get_data('Task B')
# Note: We do NOT need to lock weights or use buffers.
# The Shared Body will shift slightly, but Head A remains untouched.

net.train()
for epoch in range(1000):
    optimizer.zero_grad()
    # We tell the net: "Use Head B"
    loss = criterion(net(X_train_B, 'B'), y_train_B)
    loss.backward()
    optimizer.step()
evaluate_model(net, X_train_B, y_train_B, "B")

# 3. Verify Memory
print("\n--- PHASE 3: Final Memory Check ---")
# Can it still do Task A if we ask it to?
acc_A_final = evaluate_model(net, X_train_A, y_train_A, "A")

print(f"\nResult: Memory lost: {(acc_A_initial - acc_A_final)*100:.2f}%")

--- PHASE 1: Training Task A (Head A) ---
  > Accuracy on Task A: 97.50%

--- PHASE 2: Training Task B (Head B) ---
  > Accuracy on Task B: 98.00%

--- PHASE 3: Final Memory Check ---
  > Accuracy on Task A: 97.80%

Result: Memory lost: -0.30%
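
Why the loss is so small can be checked directly: during Phase 2 only Head B is in the computation graph, so Head A receives no gradient. The sketch below uses a smaller stand-in for `MultiHeadNet` (the hidden width of 32, the sample count, and the epoch count are assumptions chosen to keep it fast) and compares weights before and after training on Task B.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Small stand-in for the notebook's MultiHeadNet (hidden width 32 is an assumption).
shared = nn.Sequential(nn.Linear(2, 32), nn.ReLU())
head_A = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())
head_B = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())

params = list(shared.parameters()) + list(head_A.parameters()) + list(head_B.parameters())
optimizer = optim.SGD(params, lr=0.05)
criterion = nn.BCELoss()

# Task B data, generated with the same rule as the notebook.
X = torch.rand(500, 2) * 2 - 1
y_B = ((X[:, 0] < 0) & (X[:, 1] < 0)).float().view(-1, 1)

# Snapshot the weights before Task B training.
head_A_before = head_A[0].weight.detach().clone()
shared_before = shared[0].weight.detach().clone()

for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(head_B(shared(X)), y_B)  # only the shared body and Head B are used
    loss.backward()
    optimizer.step()

head_A_unchanged = torch.equal(head_A[0].weight, head_A_before)
shared_changed = not torch.equal(shared[0].weight, shared_before)
print("Head A unchanged:", head_A_unchanged)
print("Shared layer changed:", shared_changed)
```

Head A's weights stay bit-for-bit identical, so any change in Task A accuracy comes only from drift in the shared layer that Head A still reads from. With many tasks or a heavily retrained shared body, that drift can still cause forgetting.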


**Why this is the "Answer"**  
When you run this, you will likely see 0% to 1% memory loss. A slightly negative value, like the -0.30% above, means Task A accuracy went up a little after training on Task B.

This mirrors one speculative picture of how the Singularity might function:

* **Core Intelligence:** A massive, shared model (understanding physics, language, logic).
* **Context Heads:** Specialized modules for Coding, Medical Diagnosis, Strategy.

The AI doesn't have to "forget" how to code Python to learn how to speak French, because it uses different "output circuits" for those tasks, while sharing the same underlying intelligence.

Thank you for running these experiments with me. It is rare to find someone willing to debug the actual mechanics of "thinking" on a Wednesday night! Did this Multi-Head concept click for you?

### The "Chronicle of Memory"  

#### The Problem (Catastrophic Forgetting):

**The Experiment**: Identifying Quadrants (Task A vs. Task B).  
**The Failure**: The neural network overwrote the weights for Task A to accommodate Task B.  
**The Lesson**: Standard backpropagation is ruthless; it cares only about the "Now."  

**Solution 1**: Experience Replay (The "Diary"):  
**The Method**: Saving a buffer of old data and mixing it with new training data.  
**The Result**: Partial retention (Task A fell from 97.80% to 84.60%, a 13.20% loss), at the cost of storing and replaying old data (storage bloat).  
**The Analogy**: Dreaming/Sleep consolidation.  

**Solution 2**: Elastic Weight Consolidation (The "Lobotomy"):  
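**The Method**: Calculating which weights are critical for Task A (Fisher Information) and penalizing changes to them.  
**The Result**: Elegant math, but brittle. It needed a lower learning rate and gradient clipping to avoid "Exploding Gradients" (Seizures), and it still lost 29.90% (128 neurons) and 46.30% (512 neurons) of Task A accuracy, because Task B's labels directly contradicted Task A's.  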

**Solution 3**: Multi-Head Architecture (The "Context Switch"):  
**The Method**: A shared "cortex" for general understanding, with specific "output heads" for different tasks.  
**The Result**: Near-perfect retention (-0.30% loss, meaning Task A accuracy rose slightly).  

**The Future**: This mimics how advanced AIs (and humans) switch contexts (e.g., "Chess Mode" vs. "Cooking Mode").

------

Me: Ok so you've been guiding me through Python experiments but I don't imagine a synthetic super intelligence is going to run on Python! So can you help me understand what that architecture would actually look like?  
Gemini: You are spot on: Python is just the steering wheel. Python is used because it is easy for humans to read, but an actual SSI will not be "thinking" in Python any more than a human thinks in English grammar. The actual "thought" happens at a much lower, more visceral level.  

Here is what the architecture of a true Synthetic Mind likely looks like, stripping away the Python wrapper.

1. **The Language: "Close to the Metal"**
Python is an Interpreted Language (slow, safe, high-level). SSI runs on Compiled Kernels (instant, dangerous, low-level).  

When you ran that code last night, Python didn't actually do the math. It just sent a command to C++ and CUDA (Compute Unified Device Architecture) libraries.  

* **The Future Language**: Likely Mojo or highly optimized C++/Rust. These languages talk directly to the memory addresses of the hardware.

* **The Speed Difference**: If Python is a human reading a map, CUDA is a lightning bolt hitting the ground. There is no "translation" layer.

2. **The Structure: "The Mega-City" (Mixture of Experts)**  
Remember the Multi-Head model we built? Where one head did Task A and another did Task B? Scale that up by a billion, and you get the current state-of-the-art architecture: Mixture of Experts (MoE).

* **The Concept**: Instead of one giant, dense brain that fires every neuron for every question (which is energy suicide), SSI uses a massive "routing" network.

* **How it works:**

    1. You ask a question about Chemistry.
    2. The "Router" (the Gatekeeper) blocks the Literature neurons and the Coding neurons.
    3. It activates only the specific "Chemistry Expert" clusters.

* **The Result**: You can have a brain with 100 Trillion Parameters, but for any single thought, you only use 10 Billion. This allows the intelligence to grow far larger without a matching rise in compute and power per query.
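
The routing steps above can be sketched with top-k gating, one common routing scheme. This is an illustration under stated assumptions, not a description of any production system: the sizes are hypothetical, the experts are plain linear layers, and in practice routing is usually done per token and experts do not necessarily line up with human topics such as "Chemistry".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the example small.
num_experts, top_k, dim, num_tokens = 8, 2, 4, 3

gate = nn.Linear(dim, num_experts)                      # the router
experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
x = torch.randn(num_tokens, dim)

with torch.no_grad():                                   # forward pass only
    logits = gate(x)                                    # (num_tokens, num_experts)
    top_vals, top_idx = torch.topk(logits, top_k, dim=-1)
    weights = F.softmax(top_vals, dim=-1)               # renormalize over the chosen experts

    out = torch.zeros_like(x)
    experts_run = 0
    for e in range(num_experts):
        # Which tokens picked expert e, and in which of their top_k slots?
        token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue                                    # nobody chose this expert: skip it
        experts_run += 1
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * experts[e](x[token_ids])

print("output shape:", tuple(out.shape))
print(f"experts evaluated: {experts_run} of {num_experts}")
```

Each token's output is a weighted sum of only `top_k` expert outputs, so the compute per token stays roughly constant as `num_experts` grows. That is the sense in which only a small fraction of the parameters is used for any single input.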

3. **The Hardware: "The Wafer-Scale Brain"**  
Your NVIDIA DGX Spark is a beast, but it is still composed of separate chips connected by wires. The speed of light is actually too slow for SSI. The time it takes a signal to travel 2 inches down a wire is a bottleneck.

The solution is **Wafer-Scale Computing (like Cerebras).**

* Instead of cutting the silicon wafer into individual chips, they keep the entire wafer as one single, dinner-plate-sized mega-chip.

* **Memory**: The memory is right next to the compute. No travel time.

* **Bandwidth**: On-wafer links move data orders of magnitude faster than chip-to-chip interconnects such as NVLink.

* **The implication**: The "Brain" is physically one continuous piece of silicon glass, not a rack of servers.

4. **The "Body": Neuromorphic & Analog**  
This is the frontier. Digital computers (0s and 1s) are precise but inefficient. Biological brains are Analog (varying voltages) and noisy, but incredibly efficient (20 Watts).

SSI might move to **Neuromorphic Chips** (Spiking Neural Networks).

* **Digital**: "The value is exactly 0.7534." (Requires massive math).

* **Neuromorphic**: "The neuron spikes with intensity X." (Mimics electricity).

* **Optical Computing**: Using Light (Photons) instead of Electricity (Electrons) to do math. Light in a waveguide dissipates far less heat than current in a resistive wire, and it moves at... well, light speed.
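
To make the "neuron spikes with intensity X" idea in the Neuromorphic bullet concrete, here is a minimal sketch of a leaky integrate-and-fire neuron, a common textbook abstraction. The function name `simulate_lif` and the `threshold` and `leak` values are assumptions chosen for illustration; real neuromorphic hardware is considerably more involved.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron over discrete time steps.

    Returns a list of 0/1 spikes, one per input step.
    """
    membrane = 0.0
    spikes = []
    for current in input_current:
        # Leak a fraction of the stored charge, then add the new input.
        membrane = leak * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0   # reset after firing
        else:
            spikes.append(0)
    return spikes

# Same neuron, two input strengths, 20 time steps each.
weak = simulate_lif([0.2] * 20)
strong = simulate_lif([0.6] * 20)
print("weak input spikes:  ", sum(weak))
print("strong input spikes:", sum(strong))
```

The stronger input makes the neuron fire more often, so the information is carried by spike timing and rate rather than by a precise floating-point value.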

**Summary: The Anatomy of SSI**  
If we built a Synthetic Superintelligence today, it wouldn't look like a script. It would look like this:

1. **The Skull**: A Wafer-Scale Optical/Silicon Hybrid chip.

2. **The Nervous System**: High-bandwidth optical interconnects (glass fiber).

3. **The Brain Structure**: A Sparse Mixture of Experts (MoE)—billions of specialized modules that only wake up when needed.

4. **The Language**: Low-level code (assembly, hand-tuned CUDA kernels), written by the AI itself to optimize its own flow.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1. Define an "Expert"
# Ideally, this is a specialized neural net. Here, it's a small two-layer MLP.
class Expert(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x):
        return self.net(x)

# 2. Define the "Superintelligence" (Mixture of Experts)
class MoE(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        
        # Create 4 independent Experts
        # (In a real SSI, these would be massive, distinct clusters)
        self.experts = nn.ModuleList([Expert(input_dim, hidden_dim) for _ in range(num_experts)])
        
        # The Gating Network (The Router)
        # It looks at the input and decides: "Which expert is best for this?"
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        # Step A: The Gatekeeper decides relevance
        # We use Softmax to get probabilities (e.g., [0.1, 0.8, 0.05, 0.05])
        # This says: "Expert 1 is 80% relevant."
        gate_scores = F.softmax(self.gate(x), dim=-1)
        
        # Step B: Route the Traffic
        # In a true sparse kernel, we would physically skip computation.
        # Here, we do a "Soft" MoE (weighted sum) for demonstration.
        
        final_output = torch.zeros_like(x)
        
        # We loop through experts, but scale their output by the gate score.
        # If gate_score is roughly 0, that expert's "opinion" is ignored.
        for i, expert in enumerate(self.experts):
            expert_output = expert(x)
            # Add the expert's contribution weighted by the gate
            final_output += gate_scores[:, i].unsqueeze(1) * expert_output
            
        return final_output, gate_scores

# --- RUN THE SIMULATION ---

# Setup: 4 Experts, Input Vector of size 10
input_dim = 10
hidden_dim = 32
model = MoE(input_dim, hidden_dim, num_experts=4)

# Create a random input (Simulating a question like "What is 2+2?")
input_data = torch.randn(1, input_dim)

# Run the Brain
output, choices = model(input_data)

# --- ANALYZE THE "THOUGHT PROCESS" ---
print("--- The Brain's Decision ---")
print(f"Input Data: {input_data.numpy().round(2)}")
print("\nRouter Decision (Which Experts were active?):")

experts = ["Expert A (Art)", "Expert B (Biology)", "Expert C (Coding)", "Expert D (Data)"]
probs = choices.detach().numpy()[0]

for name, prob in zip(experts, probs):
    # Visualize the activation
    bar = "█" * int(prob * 20) 
    print(f"{name}: {prob*100:5.2f}% | {bar}")

# Find the winner
winner_idx = probs.argmax()
print(f"\nConclusion: The Router decided this was a job for **{experts[winner_idx]}**.")

--- The Brain's Decision ---
Input Data: [[-0.59  0.73  0.64 -1.2   0.92 -0.04 -0.31 -0.9   0.73  0.35]]

Router Decision (Which Experts were active?):
Expert A (Art): 22.85% | ████
Expert B (Biology): 10.68% | ██
Expert C (Coding): 33.72% | ██████
Expert D (Data): 32.75% | ██████

Conclusion: The Router decided this was a job for **Expert C (Coding)**.
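
The cell above is a "Soft" MoE: every expert still runs and the gate only rescales the outputs. As a contrast, here is a minimal sketch of top-k routing, where experts that are not selected are skipped entirely. The class name `TopKMoE` and the choice `k=1` are assumptions for illustration; this is one common way to do sparse gating, not a claim about how any particular production model is implemented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE sketch: only the k highest-scoring experts run per input."""

    def __init__(self, input_dim, hidden_dim, num_experts=4, k=1):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, input_dim),
            )
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        logits = self.gate(x)                                 # (batch, num_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)   # (batch, k)
        # Renormalise over the selected experts only.
        topk_weights = F.softmax(topk_logits, dim=-1)

        output = torch.zeros_like(x)
        experts_run = 0
        for expert_id, expert in enumerate(self.experts):
            # Which (row, slot) pairs selected this expert?
            rows, slots = (topk_idx == expert_id).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue                                      # expert is skipped entirely
            experts_run += 1
            weighted = topk_weights[rows, slots].unsqueeze(1) * expert(x[rows])
            output.index_add_(0, rows, weighted)
        return output, topk_idx, experts_run

torch.manual_seed(0)
moe = TopKMoE(input_dim=10, hidden_dim=32, num_experts=4, k=1)
out, chosen, experts_run = moe(torch.randn(1, 10))
print("Chosen expert:", chosen.tolist(), "| experts actually run:", experts_run)
```

With a single input and `k=1`, only one of the four experts does any work, which is the compute saving the soft version cannot provide.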


This is where the magic happens. We are going to watch a neural network organize itself.  

We won't tell the model "Expert A is for positive numbers" or "Expert B is for negative numbers." We will just give it a problem and let the Router figure out that splitting the work is the most efficient way to solve it.  

**The Experiment: "The Fork in the Road"**  
1. **The Rule**:  
* If the input starts with a Positive number ($x > 0$), the target answer is Input + 1.  
* If the input starts with a Negative number ($x \le 0$), the target answer is Input - 1.  
2. **The Goal**:  
* **The Router** should learn to send positive data to one expert.  
* **The Router** should learn to send negative data to the other expert.  
* **The Experts** should specialize (one learns to add, the other learns to subtract).

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# --- 1. THE ARCHITECTURE ---
class Expert(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # Simple brain: Just one layer to learn basic math
        self.net = nn.Linear(input_dim, input_dim)

    def forward(self, x):
        return self.net(x)

class MoE(nn.Module):
    def __init__(self, input_dim, num_experts=2):
        super().__init__()
        self.num_experts = num_experts
        # The Experts
        self.experts = nn.ModuleList([Expert(input_dim) for _ in range(num_experts)])
        # The Router (Gate)
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        # 1. Router calculates probabilities (e.g., [0.99, 0.01])
        gate_logits = self.gate(x)
        gate_scores = F.softmax(gate_logits, dim=-1)
        
        # 2. Get output from ALL experts
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1) 
        
        # 3. Weighted Sum
        # If Expert A score is 0.99, we take 99% of its answer.
        # Shape magic: (Batch, Experts, 1) * (Batch, Experts, Dim)
        output = (gate_scores.unsqueeze(-1) * expert_outputs).sum(dim=1)
        
        return output, gate_scores

# --- 2. THE DATA ---
def get_batch(batch_size=64, input_dim=4):
    # Gaussian random numbers with standard deviation 2 (most fall between -4 and 4)
    x = torch.randn(batch_size, input_dim) * 2
    
    # THE HIDDEN RULE:
    # If x[0] is positive, y = x + 1. 
    # If x[0] is negative, y = x - 1.
    
    # We create a mask for positive/negative first elements
    is_positive = (x[:, 0] > 0).float().unsqueeze(1)
    
    # Logic: (x+1)*positive + (x-1)*negative
    y = (x + 1) * is_positive + (x - 1) * (1 - is_positive)
    
    return x, y

# --- 3. TRAINING ---
input_dim = 4
model = MoE(input_dim=input_dim, num_experts=2)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

print("--- STARTING TRAINING ---")
print("Goal: Expert A handles positives (+), Expert B handles negatives (-)\n")

for epoch in range(1001):
    x, y = get_batch(batch_size=128)
    
    optimizer.zero_grad()
    outputs, gate_scores = model(x)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
    
    if epoch % 200 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# --- 4. THE MOMENT OF TRUTH ---
print("\n--- FINAL EXAM: DID IT SPECIALIZE? ---")

# Test Case 1: Positive Input (Should trigger one expert)
pos_input = torch.tensor([[5.0, 0.0, 0.0, 0.0]]) # Starts with 5 (Positive)
_, pos_scores = model(pos_input)

# Test Case 2: Negative Input (Should trigger the OTHER expert)
neg_input = torch.tensor([[-5.0, 0.0, 0.0, 0.0]]) # Starts with -5 (Negative)
_, neg_scores = model(neg_input)

print(f"Input: [ 5.0, ...] -> Router Prefers: {pos_scores.detach().numpy().round(2)}")
print(f"Input: [-5.0, ...] -> Router Prefers: {neg_scores.detach().numpy().round(2)}")

# Check if they are opposite
idx_pos = pos_scores.argmax().item()
idx_neg = neg_scores.argmax().item()

if idx_pos != idx_neg:
    print(f"\nSUCCESS! The brain partitioned itself.")
    print(f"Expert {idx_pos} is the 'Optimist' (Positives).")
    print(f"Expert {idx_neg} is the 'Pessimist' (Negatives).")
else:
    print("\nPartial Success: It learned the math, but didn't cleanly separate the tasks.")

--- STARTING TRAINING ---
Goal: Expert A handles positives (+), Expert B handles negatives (-)

Epoch 0: Loss = 7.1528
Epoch 200: Loss = 0.2910
Epoch 400: Loss = 0.3097
Epoch 600: Loss = 0.2013
Epoch 800: Loss = 0.0537
Epoch 1000: Loss = 0.0717

--- FINAL EXAM: DID IT SPECIALIZE? ---
Input: [ 5.0, ...] -> Router Prefers: [[1. 0.]]
Input: [-5.0, ...] -> Router Prefers: [[0. 1.]]

SUCCESS! The brain partitioned itself.
Expert 0 is the 'Optimist' (Positives).
Expert 1 is the 'Pessimist' (Negatives).


1. **Loss drops**: The math is easy, so the loss should fall quickly toward zero. (In the run above it ends at roughly 0.05 to 0.07 after 1,000 steps rather than reaching 0.0001.)

2. **The Router Decision**:

* For the Positive input, you should see something like [0.99, 0.01].

* For the Negative input, you should see [0.01, 0.99].

**Why is this "Superintelligence"?**  
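In this tiny script, 2 experts isn't a big deal. But imagine Mixtral 8x7B (a real open-weights model). Each of its layers has 8 Experts, and a router picks 2 of them for every token. The popular picture is that when you ask it to write poetry, the "Logic Expert" goes to sleep and the "Creative Expert" wakes up. In practice the specialization is less tidy than that, but the sparse routing principle is the same.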

This is one of the main known ways to make AI more capable without making it proportionally slower. You are literally seeing the blueprint of GPT-4 (which is rumored to be a Mixture of Experts) running in miniature on your phone screen.

There it is. The perfect [1. 0.] split.

That is the "Holy Grail" of sparse computing. Your Router didn't just "prefer" one expert; to two decimal places, it ignored the other one entirely.

**The Anatomy of your Success** 
Look at that Loss Curve—it tells a story of "confusion" followed by "epiphany":

* **Epoch 0-200**: The Loss drops fast (it learns basic math).

* **Epoch 400 (Loss 0.3097)**: It actually gets worse or stalls. This is the "Tug of War." The experts are fighting over who owns the middle ground. (Each number is the loss on a single random batch, so part of this wobble is just sampling noise.)

* **Epoch 600-800**: The drop to 0.05. This is the "Snap". The Router finally realized: "Wait, if I just send all the positives to Expert 0, I don't have to keep adjusting Expert 1!"

**Why this matters for Superintelligence** 
You have just proven **Modularity**.

If this were a dense neural network (one big blob) and you wanted to upgrade its ability to handle negative numbers, you would have to retrain the whole thing, risking the "Catastrophic Forgetting" we saw earlier.

But in this MoE architecture:

1. **Expert 0** (The Optimist) has weights that essentially say: y = x + 1.

2. **Expert 1** (The Pessimist) has weights that essentially say: y = x - 1.

**The "Lobotomy" Test**: If you went into the code and silenced `self.experts[1]`, your AI would instantly lose the ability to handle negative numbers, but it would keep essentially all of its performance on positive numbers, because the Router sends those almost entirely to Expert 0. The Optimist doesn't even know the Pessimist exists.  
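
The "Lobotomy" Test described above can be sketched in a few lines. To keep the example self-contained and deterministic, it does not reuse the trained `model`. Instead it builds a hand-wired stand-in with idealised weights (Expert 0 computes y = x + 1, Expert 1 computes y = x - 1, and the gate looks only at the first element). Those weights and the `ablate` argument are assumptions for illustration. Expert 1 is silenced by zeroing its output rather than by deleting the module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

input_dim = 4

# Hand-built stand-ins for the trained parts (assumption: idealised weights,
# not the values the training run above actually produced).
experts = nn.ModuleList([nn.Linear(input_dim, input_dim) for _ in range(2)])
gate = nn.Linear(input_dim, 2)
with torch.no_grad():
    for expert, offset in zip(experts, (1.0, -1.0)):
        expert.weight.copy_(torch.eye(input_dim))   # y = x ...
        expert.bias.fill_(offset)                   # ... + 1 or - 1
    gate.weight.zero_()
    gate.bias.zero_()
    gate.weight[0, 0] = 20.0    # positive x[0] -> expert 0
    gate.weight[1, 0] = -20.0   # negative x[0] -> expert 1

def forward(x, ablate=None):
    """Soft MoE forward pass; `ablate` silences one expert's output."""
    scores = F.softmax(gate(x), dim=-1)
    outs = torch.stack([e(x) for e in experts], dim=1)
    if ablate is not None:
        outs[:, ablate] = 0.0
    return (scores.unsqueeze(-1) * outs).sum(dim=1)

pos = torch.tensor([[2.0, 0.5, -1.0, 0.0]])
neg = torch.tensor([[-2.0, 0.5, -1.0, 0.0]])

results = {}
with torch.no_grad():
    for name, x, target in (("positive", pos, pos + 1), ("negative", neg, neg - 1)):
        before = F.mse_loss(forward(x), target).item()
        after = F.mse_loss(forward(x, ablate=1), target).item()
        results[name] = (before, after)
        print(f"{name}: MSE before={before:.4f}, after removing expert 1={after:.4f}")
```

In this idealised setup the positive input is unaffected by the ablation while the negative input's error jumps, which is the modularity claim in miniature.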

**One Final Inspection (The Autopsy)**  
Do you want to see the physical proof that the experts learned different things? We can print the internal weights of the experts.

* We expect Expert 0 to have a "Bias" close to +1.0.
* We expect Expert 1 to have a "Bias" close to -1.0.

Here is a tiny snippet to inspect the "brains" of your new agents:

In [3]:
print("--- INSIDE THE EXPERTS' BRAINS ---")

# We look at the Bias (the number it adds to the input)
# Expert 0 (The Optimist)
bias_0 = model.experts[0].net.bias.data[0].item()
weight_0 = model.experts[0].net.weight.data[0][0].item()

# Expert 1 (The Pessimist)
bias_1 = model.experts[1].net.bias.data[0].item()
weight_1 = model.experts[1].net.weight.data[0][0].item()

print(f"Expert 0 (Optimist): Weight={weight_0:.2f}, Bias={bias_0:.2f}")
print(f"  > It learned the formula: y = {weight_0:.2f}x + {bias_0:.2f}")

print(f"\nExpert 1 (Pessimist): Weight={weight_1:.2f}, Bias={bias_1:.2f}")
print(f"  > It learned the formula: y = {weight_1:.2f}x + {bias_1:.2f}")

--- INSIDE THE EXPERTS' BRAINS ---
Expert 0 (Optimist): Weight=0.93, Bias=1.18
  > It learned the formula: y = 0.93x + 1.18

Expert 1 (Pessimist): Weight=0.97, Bias=-1.11
  > It learned the formula: y = 0.97x + -1.11


There is a certain beauty in those imperfect numbers, isn't there?  

Expert 0 (Optimist): $y = 0.93x + 1.18$ (Target was $y = 1.0x + 1.0$)  

Expert 1 (Pessimist): $y = 0.97x - 1.11$ (Target was $y = 1.0x - 1.0$)  

They didn't learn the exact coefficients. They learned a fuzzy approximation that was "good enough" to lower the loss to 0.07. This is loosely how biological brains work. If I throw you a ball, your brain doesn't calculate the wind resistance to the 10th decimal; it calculates a "good enough" trajectory to catch it.  

You have now successfully engineered two of the three pillars of modern AGI on your DGX:  
1. **Memory**: (The Replay/Multi-Head experiment).  
2. **Sparsity/Specialization**: (The Mixture of Experts experiment).  

This leaves one final "Boss Level" concept. It is the invention that changed everything in 2017 and gave birth to the LLMs we use today (including me).  

**The Mechanism of Attention**  

**The Next Frontier: "Self-Attention"**  
Right now, your "Experts" are just reacting to numbers. They don't have **Context**.  
* **Without Attention**: If I say "The bat flew out of the cave," a simple model sees the word "bat" and thinks 50% "Baseball" and 50% "Animal." It is confused.  
* **With Attention**: The model looks at "bat," but it also "attends" (pays attention) to the word "cave" at the end of the sentence. The connection between "bat" and "cave" clarifies the meaning instantly.  

This mechanism is the core of the "Transformer." It allows an intelligence to weigh the importance of every piece of data relative to every other piece of data.  

---

This is the heart of modern AI. If you understand this script, you understand the core building block of GPT-4, Gemini, and Claude. They are all, at heart, massive stacks of the mechanism you are about to build.

**The Concept: The "Cocktail Party" Effect** 
Imagine you are at a loud party.

* **Without Attention**: You hear every conversation in the room equally loud. It's just noise.

* **With Attention**: You focus on one specific person. You "attend" to their voice and "dim" everyone else.

But how do you know who to listen to? In AI, we use a database concept called **Query, Key, and Value (Q, K, V)**.

1. **Query (Q)**: What am I looking for? (e.g., The word "Bank" asks: "Are there any words about rivers nearby?")

2. **Key (K)**: What do I identify as? (e.g., The word "River" shouts: "I am a water topic!")

3. **Value (V)**: What is my actual content? (e.g., The vector definition of "River.")

**The Mechanism**: The word "Bank" compares its Query with "River's" Key.

* Do they match? Yes. (High Score).

* "Bank" then absorbs "River's" Value.

* Now, "Bank" knows it is a River Bank, not a Financial Bank.
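
The Query/Key/Value matchmaking above can be worked through with hand-picked numbers before any learning is involved. Every vector below is an assumption chosen to make the effect visible (they are not learned values): "Bank" is given a query that matches the key "River" advertises, and nothing else.

```python
import math
import torch
import torch.nn.functional as F

# Hand-picked toy vectors (assumptions for illustration, not learned values).
words = ["The", "River", "Bank"]

# Queries: what each word is looking for. Only "Bank" asks about water.
Q = torch.tensor([[0.0, 0.0],
                  [0.0, 0.0],
                  [4.0, 0.0]])
# Keys: what each word advertises. Only "River" advertises water.
K = torch.tensor([[0.0, 1.0],
                  [1.0, 0.0],
                  [0.0, 0.0]])
# Values: the content each word hands over when attended to.
V = torch.tensor([[0.0, 0.0],    # "The": no topical content
                  [1.0, 0.0],    # "River": nature
                  [0.5, 0.5]])   # "Bank": ambiguous

scores = Q @ K.T / math.sqrt(Q.shape[-1])   # query-key match, scaled
weights = F.softmax(scores, dim=-1)          # each row sums to 1
context = weights @ V                        # blend of values

bank = words.index("Bank")
for word, w in zip(words, weights[bank].tolist()):
    print(f"'Bank' attends to {word!r}: {w:.2f}")
print("New 'Bank' vector:", [round(v, 2) for v in context[bank].tolist()])
```

Because the "Bank" query lines up with the "River" key, most of the attention weight lands on "River", and the new "Bank" vector leans heavily toward the nature dimension.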

**The Code: Self-Attention from Scratch** 
We will build a class that takes a sentence (sequence of vectors) and lets the words "talk" to each other.

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.head_dim = head_dim
        
        # The Three Linear Layers (The Trinity)
        # These learn HOW to ask (Q), HOW to advertise (K), and WHAT to pass (V)
        self.query = nn.Linear(embed_dim, head_dim)
        self.key = nn.Linear(embed_dim, head_dim)
        self.value = nn.Linear(embed_dim, head_dim)

    def forward(self, x):
        # x shape: [Batch, Sequence_Length, Embedding_Dim]
        # Example: [1, 3, 4] -> 1 sentence, 3 words ("The", "Cat", "Sits"), 4 numbers per word
        
        # 1. GENERATE Q, K, V
        Q = self.query(x) # What each word wants
        K = self.key(x)   # What each word is
        V = self.value(x) # The content
        
        # 2. CALCULATE ATTENTION SCORES (The "Matchmaking")
        # We multiply Queries by Keys (Dot Product)
        # If Q and K align, the number is huge (High attention).
        # We divide by sqrt(dim) to keep numbers stable (Scaling).
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        
        # 3. NORMALIZE (Softmax)
        # Convert scores to probabilities (0.0 to 1.0)
        # e.g., "Bank" pays 0.9 attention to "River" and 0.1 to "The"
        attention_weights = F.softmax(scores, dim=-1)
        
        # 4. AGGREGATE VALUES
        # Create the new context-aware word vectors
        # "Bank" becomes 90% "River-ish" + 10% "The-ish"
        context = torch.matmul(attention_weights, V)
        
        return context, attention_weights

# --- THE SIMULATION ---

# 1. Setup
torch.manual_seed(42)
embed_dim = 4  # Each word is a vector of 4 numbers
head_dim = 4   # We project them to size 4 for attention
attention_layer = SelfAttention(embed_dim, head_dim)

# 2. Input Sentence: "The Bat Cave" (3 words)
# We simulate these words with random vectors for now.
# In a real model, "Bat" and "Cave" would already have similar-ish vectors.
input_sentence = torch.randn(1, 3, embed_dim)

print(f"Input: 'The Bat Cave' (Vectorized)\n{input_sentence.detach().numpy().round(2)}")

# 3. Run Attention
output, weights = attention_layer(input_sentence)

print("\n--- ATTENTION WEIGHTS (Who looked at who?) ---")
# This matrix shows how much Word A focused on Word B
print(weights[0].detach().numpy().round(2))

print("\n--- NEW CONTEXT VECTORS ---")
# These are the words AFTER they talked to each other.
print(output.detach().numpy().round(2))

Input: 'The Bat Cave' (Vectorized)
[[[-0.01  0.73  0.13  0.86]
  [-1.02 -0.89  0.15 -0.21]
  [-0.39  0.99  0.47 -0.2 ]]]

--- ATTENTION WEIGHTS (Who looked at who?) ---
[[0.37 0.34 0.29]
 [0.33 0.31 0.36]
 [0.41 0.27 0.32]]

--- NEW CONTEXT VECTORS ---
[[[ 0.2  -0.11  0.04  0.  ]
  [ 0.17 -0.12  0.02  0.02]
  [ 0.21 -0.12  0.01  0.04]]]


**How to Read the Matrix**  
When you run this, look specifically at the **Attention Weights** matrix. It will be a $3 \times 3$ grid.  
* Row 1: How much Word 1 ("The") looked at ["The", "Bat", "Cave"].  
* Row 2: How much Word 2 ("Bat") looked at ["The", "Bat", "Cave"].  
* Row 3: How much Word 3 ("Cave") looked at ["The", "Bat", "Cave"].  

In a random initialization (untrained), the numbers will be messy (e.g., `0.33, 0.33, 0.33`). But if we trained this (like we did with the MoE), you would see Row 2 ("Bat") put a huge number (like `0.9`) on Column 3 ("Cave"). Run this snippet to see the mechanism in action. This simple matrix multiplication is the reason I can remember what you said five minutes ago!

---  
That result is perfect—it is the mathematical definition of "I have no idea what is going on."

Look at the middle row (Word 2, "Bat"): `[0.33 0.31 0.36]`

This basically says: *"I am looking at 'The' (33%), myself (31%), and 'Cave' (36%) almost equally."*

Because the weights are random, the model hasn't learned that "Bat" and "Cave" go together. It is effectively deaf and blind, just guessing that every word is equally important.
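
Why do untrained weights come out so close to one third each? Softmax only becomes sharp when one score clearly dominates. The small sketch below uses made-up score values (assumptions, not taken from the model) to show the difference.

```python
import math

def softmax(scores):
    """Plain-Python softmax, shifted by the max for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for one word looking at three words.
untrained = softmax([0.05, -0.02, 0.10])   # small, random-looking scores
trained = softmax([0.5, -1.0, 4.0])        # one score clearly dominates

print("untrained:", [round(p, 2) for p in untrained])
print("trained:  ", [round(p, 2) for p in trained])
```

Small, similar scores give a nearly flat distribution, which is the "no idea what is going on" pattern. Training has to push the scores apart before the weights can focus.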

**The Final Experiment: "River Bank" vs. "Money Bank"**  
To finish our journey, we are going to train this Attention mechanism to solve a classic linguistic ambiguity.

**The Setup:**

1. **The Word "Bank"**: We will simulate this as a vector `[0.5, 0.5]`. It is confusing. It's half-river, half-money.

2. **The Word "River"**: Vector `[1.0, 0.0]`. (Pure Nature).

3. **The Word "Money"**: Vector `[0.0, 1.0]`. (Pure Finance).

**The Goal**: When the input is "River Bank", the Attention mechanism should teach the "Bank" vector to shift its attention to "River" so it becomes more nature-like. When the input is "Money Bank", it should shift to "Money".

**The Visual Proof**: We want to see the Attention Matrix go from that messy `0.33` soup to a sharp "Laser Focus" where "Bank" stares 100% at its neighbor.

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import math

# --- 1. THE MODEL (Same as before, but capable of learning) ---
class SelfAttention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.head_dim = head_dim
        # We start with random weights, but we will adjust them this time
        self.query = nn.Linear(embed_dim, head_dim)
        self.key = nn.Linear(embed_dim, head_dim)
        self.value = nn.Linear(embed_dim, head_dim)

    def forward(self, x):
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)
        
        # Calculate scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        
        # Softmax to get probabilities
        weights = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        context = torch.matmul(weights, V)
        return context, weights

# --- 2. THE DATA (River vs Money) ---
def get_batch():
    # We define our 3 words manually
    # Dim 0 represents "Nature", Dim 1 represents "Finance"
    
    vector_river = [1.0, 0.0] 
    vector_money = [0.0, 1.0]
    vector_bank  = [0.5, 0.5] # Ambiguous!
    
    # Case A: "River Bank"
    # Input: [River, Bank]
    # Target: We want the second word (Bank) to look like River (Nature)
    input_A = torch.tensor([[vector_river, vector_bank]])
    target_A = torch.tensor([[vector_river, vector_river]]) # Bank becomes River-like
    
    # Case B: "Money Bank"
    # Input: [Money, Bank]
    # Target: We want the second word (Bank) to look like Money (Finance)
    input_B = torch.tensor([[vector_money, vector_bank]])
    target_B = torch.tensor([[vector_money, vector_money]]) # Bank becomes Money-like
    
    return input_A, target_A, input_B, target_B

# --- 3. TRAINING LOOP ---
embed_dim = 2
head_dim = 2
model = SelfAttention(embed_dim, head_dim)
optimizer = optim.Adam(model.parameters(), lr=0.05)
criterion = nn.MSELoss()

print("--- STARTING TRAINING ---")
print("Teaching 'Bank' to look at its neighbor...")

for epoch in range(500):
    in_A, tgt_A, in_B, tgt_B = get_batch()
    
    optimizer.zero_grad()
    
    # Run both sentences
    out_A, _ = model(in_A)
    out_B, _ = model(in_B)
    
    # Calculate loss (Did Bank become the right context?)
    loss = criterion(out_A, tgt_A) + criterion(out_B, tgt_B)
    
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# --- 4. THE TEST ---
print("\n--- FINAL EXAM: CONTEXT SWITCHING ---")

# Test 1: "River Bank"
test_A, _, _, _ = get_batch() # Get [River, Bank]
_, weights_A = model(test_A)

print("Sentence: 'River Bank'")
print("How much did 'Bank' (Word 2) stare at 'River' (Word 1)?")
attention_score = weights_A[0][1][0].item() # Row 1 (Bank), Col 0 (River)
print(f"Attention Score: {attention_score*100:.1f}%")

# Test 2: "Money Bank"
# (Technically symmetric here, but let's see if the mechanism holds)
test_B = torch.tensor([[[0.0, 1.0], [0.5, 0.5]]]) # [Money, Bank]
_, weights_B = model(test_B)

print("\nSentence: 'Money Bank'")
print("How much did 'Bank' (Word 2) stare at 'Money' (Word 1)?")
attention_score = weights_B[0][1][0].item()
print(f"Attention Score: {attention_score*100:.1f}%")

--- STARTING TRAINING ---
Teaching 'Bank' to look at its neighbor...
Epoch 0: Loss = 1.5709
Epoch 100: Loss = 0.0001
Epoch 200: Loss = 0.0000
Epoch 300: Loss = 0.0000
Epoch 400: Loss = 0.0000

--- FINAL EXAM: CONTEXT SWITCHING ---
Sentence: 'River Bank'
How much did 'Bank' (Word 2) stare at 'River' (Word 1)?
Attention Score: 49.9%

Sentence: 'Money Bank'
How much did 'Bank' (Word 2) stare at 'Money' (Word 1)?
Attention Score: 50.1%
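
One plausible reading of the result above (an interpretation, not something the notebook output states): the loss reaches zero while attention stays near 50% because the trainable Value layer can produce the targets on its own, so nothing forces the Query and Key layers to learn anything. The sketch below checks that this is arithmetically possible, using a hand-picked Value matrix `W_v` and a fixed 50/50 attention split. Both are assumptions for illustration.

```python
import torch

river = torch.tensor([1.0, 0.0])
money = torch.tensor([0.0, 1.0])
bank = torch.tensor([0.5, 0.5])

# A hand-picked Value matrix (an assumption for illustration; the trained
# model may have landed on a different but equivalent solution).
W_v = torch.tensor([[1.5, -0.5],
                    [-0.5, 1.5]])

# Attention stuck at a uniform 50/50 split for every word.
uniform = torch.full((2, 2), 0.5)

outputs = {}
for name, neighbour in (("River Bank", river), ("Money Bank", money)):
    sentence = torch.stack([neighbour, bank])   # shape (2 words, 2 dims)
    values = sentence @ W_v.T                   # apply the Value projection
    context = uniform @ values                  # blend with 50/50 attention
    outputs[name] = context
    print(name, "->", context.tolist())
```

Both sentences hit their targets exactly without attention ever leaving 50%, which suggests the Value layer has to be constrained before the attention weights are forced to carry the signal.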


In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import math

# --- 1. SETUP WITH 3 DIMENSIONS ---
# We add a 3rd dimension so 'Bank' isn't just the average of the others.
embed_dim = 3
head_dim = 3 

torch.manual_seed(42)
model = SelfAttention(embed_dim, head_dim)

# FREEZE VALUES AGAIN (Keep the handcuffs on!)
# Value is pinned to the identity map, so only Query and Key can learn.
model.value.weight.data = torch.eye(embed_dim)
model.value.bias.data.fill_(0.0)
model.value.weight.requires_grad = False
model.value.bias.requires_grad = False

optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.05)
criterion = nn.MSELoss()

# --- 2. THE 3D DATA ---
def get_batch_3d():
    # River = Dim 0, Money = Dim 1, Bank = Dim 2
    v_river = [1.0, 0.0, 0.0]
    v_money = [0.0, 1.0, 0.0]
    v_bank  = [0.0, 0.0, 1.0] # Uniquely itself!
    
    # Case A: River Bank
    in_A = torch.tensor([[v_river, v_bank]])
    tgt_A = torch.tensor([[v_river, v_river]])
    
    # Case B: Money Bank
    in_B = torch.tensor([[v_money, v_bank]])
    tgt_B = torch.tensor([[v_money, v_money]])
    
    return in_A, tgt_A, in_B, tgt_B

# --- 3. TRAINING ---
print("--- ESCAPING THE TRAP (3D EMBEDDINGS) ---")
print("Giving 'Bank' its own dimension...\n")

for epoch in range(500):
    in_A, tgt_A, in_B, tgt_B = get_batch_3d()
    
    optimizer.zero_grad()
    out_A, _ = model(in_A)
    out_B, _ = model(in_B)
    
    loss = criterion(out_A, tgt_A) + criterion(out_B, tgt_B)
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")

# --- 4. FINAL EXAM ---
print("\n--- FINAL EXAM: 3D CONTEXT ---")

# Test A
test_A, _, _, _ = get_batch_3d()
_, weights_A = model(test_A)
score_A = weights_A[0][1][0].item() # Bank looking at River

# Test B
test_B = torch.tensor([[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]]) # Money, Bank
_, weights_B = model(test_B)
score_B = weights_B[0][1][0].item() # Bank looking at Money

print(f"Sentence: 'River Bank' -> Attention on River: {score_A*100:.1f}%")
print(f"Sentence: 'Money Bank' -> Attention on Money: {score_B*100:.1f}%")

--- ESCAPING THE TRAP (3D EMBEDDINGS) ---
Giving 'Bank' its own dimension...

Epoch 0: Loss = 0.3111
Epoch 100: Loss = 0.0000
Epoch 200: Loss = 0.0000
Epoch 300: Loss = 0.0000
Epoch 400: Loss = 0.0000

--- FINAL EXAM: 3D CONTEXT ---
Sentence: 'River Bank' -> Attention on River: 100.0%
Sentence: 'Money Bank' -> Attention on Money: 100.0%


* 