# 'Seeing' digits: an interpretability dive into MLPs and CNNs trained on MNIST data

## 1. The beginning
I wanted to start a project/notebook focused specifically on interpretability—both to get a taste of the tools used in this area and to better understand the thought process behind analyzing what neural networks actually learn. Since my micrograd build ended with a failed MNIST experiment, this felt like the perfect place to pick things back up.

> "Failed" might be the wrong word. Micrograd technically ran on MNIST, but it was painfully slow, and the resulting computation graph was almost comical. That said, building something from scratch and seeing it run, even slowly, was a great experiment. It gave me a better understanding of the operations I will be using regularly (like autodiff and backprop) and of how actual ML frameworks like PyTorch optimize them.

In this project, I’ll use basic PyTorch structures to mirror the MLP I built in micrograd, but with a modern framework and better tools. Along the way, I’ll add:

- Hooks to capture internal activations
- Dimensionality reduction methods like PCA
- Visualizations of how neurons respond to different inputs

I don’t claim to fully understand all of these tools yet, but part of the point is to learn by doing, document that process clearly, and build a foundation I can use in future projects. Once the MLP is running I’ll upgrade to a CNN to see what kinds of differences emerge, both in performance and (hopefully) in how the network learns to represent digits.

The broader goal is to build a strong foundation in the interpretability toolkit so I can apply these ideas to current (and future!) projects—like my toy AlphaZero agent, my LLM alignment project (to come!), and maybe even visual experiments with TransformerLens.

## 2. The Multi-layer perceptron (MLP)

### Overview:
I'm currently writing a more in-depth educational post on MLPs, building them from the ground up with full mathematical motivation. For this notebook, I’ll keep things fairly lightweight and just outline the key structures and operations so we can understand what’s happening with minimal background.

An *MLP* is one of the simplest machine learning models—made up of a few key building blocks: basic linear algebra (vector sums and matrix multiplication), a nonlinear activation function, and a mechanism for learning via backpropagation. By stacking layers of *neurons* (perceptrons) where each output depends on all inputs from the previous layer, we build what's called a **fully connected neural network**. Each operation we use is differentiable, which means we can not only pass data forward through the network but also compute how to adjust the model using gradients and optimization as we go through training data.




### 2.1. The perceptron (aka the neuron): 

We should just think of a perceptron (neuron) as a function that takes in many inputs and outputs a single number. Inside a neuron is a list of numbers called *weights* (initially randomized and then updated as the model trains), one for each input to the neuron, along with an optional bias (a single number). The neuron takes its input as a vector of numbers, multiplies each input by the corresponding weight, adds up the results, and then adds the bias. This gives us a single number $y$, to which we then apply a non-linear function $f$ to get the output of the neuron, $\hat{y}$.
> **Example**. Our input will be three dimensional: $x_1=4, x_2=8, x_3=7$. Our weights will be variables $w_1, w_2, w_3$ and bias $b$. The first step is to multiply all the inputs and weights together, take the sum and add a bias: $$x_1\cdot w_1 + x_2\cdot w_2 + x_3\cdot w_3 + b= \boxed{4w_1 + 8w_2 + 7w_3 + b = y}.$$ Note that the output $y$ is a single number. Lastly, we take some non-linear function $f$ and apply it to $y$. For this example we take the hyperbolic tangent $$f(x) = \tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}$$ (where $e^x$ is the exponential function). This function takes values between $-1$ and $1$, and so we think of the values close to $1$ as indicating a strong activation and values close to $-1$ as very little activation (for the neuron analogy anyways!). The end result, i.e. the output of this neuron will be
> $$ \hat{y} = f\left(\sum_{i=1}^3 w_i\cdot x_i + b\right) = f(y) = \frac{e^{2y} - 1}{e^{2y} + 1}. $$

So, the neuron takes in many inputs (say $n$) and outputs a single number $\hat{y}$, obtained by passing the weighted sum through a non-linear *activation* function. Without the activation function we could not model non-linear data: every layer would be a linear (affine) map, and a composition of such maps is again linear, so even a complicated model with many layers and MANY neurons would collapse to a single linear function. There are many options for the activation function, each with benefits and drawbacks: see [Wikipedia](https://en.wikipedia.org/wiki/Activation_function) for more.
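
To make the "everything would be linear" point concrete, here is a minimal sketch that stacks two layers with no activation and checks that they equal one linear layer. The layer sizes (3 → 4 → 2) and the random weights are assumptions chosen for illustration only.

```python
import torch

torch.manual_seed(0)  # fixed seed so the check is repeatable

# Two "layers" with no activation function: x -> W1 x + b1 -> W2 (W1 x + b1) + b2
W1, b1 = torch.randn(4, 3), torch.randn(4)
W2, b2 = torch.randn(2, 4), torch.randn(2)

x = torch.tensor([4.0, 8.0, 7.0])
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer with combined weights and bias
W, b = W2 @ W1, W2 @ b1 + b2
one_linear_layer = W @ x + b

collapses = torch.allclose(two_linear_layers, one_linear_layer, atol=1e-3)

# With tanh in between, the result is no longer given by that single linear map
with_tanh = W2 @ torch.tanh(W1 @ x + b1) + b2
differs = not torch.allclose(with_tanh, one_linear_layer, atol=1e-3)

print(f"two linear layers == one linear layer: {collapses}")
print(f"tanh in between changes the result:   {differs}")
```

The combined matrix `W2 @ W1` is exactly why depth alone adds nothing without a non-linearity in between.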

```python
import numpy as np
import torch


# OPTION 1: direct(ish) port from micrograd
# The Neuron1 class takes the number of inputs, nin, and initializes random (randn = standard normal) weights and a bias.
# Calling a Neuron1 on a tensor x of nin inputs returns tanh(y) if nonlin=True, otherwise the linear result y = x @ w + b.

class Neuron1:
    def __init__(self, nin, nonlin=True):
        self.w = torch.randn(nin, requires_grad=True)
        self.b = torch.randn(1,   requires_grad=True)
        self.nonlin = nonlin

    def __call__(self, x):
        y = x @ self.w + self.b
        return torch.tanh(y) if self.nonlin else y

    def parameters(self):
        return [self.w, self.b]
```

```python
N1 = Neuron1(3)

print(f'Neuron1 parameters: {N1.parameters()}')

x = torch.tensor([4,8,7]).float()

print(f'For the input {x} we obtain an output of: {N1(x).item()}')
```

```text
Neuron1 parameters: [tensor([-0.8774,  1.8562, -1.3348], requires_grad=True), tensor([1.2367], requires_grad=True)]
For the input tensor([4., 8., 7.]) we obtain an output of: 0.9968927502632141
```


```python
import torch.nn as nn

# OPTION 2: utilize the nn.Module class and its efficiencies
# This looks almost identical to Option 1, except we don't manually create the weights and bias, nor do we directly compute the linear output.
# We also don't need to define the neuron's parameters() method, as this is all handled inside nn.Module!

class Neuron2(nn.Module):
    def __init__(self, nin, nonlin=True):
        super().__init__()
        self.linear = nn.Linear(nin, 1)
        self.nonlin = nonlin

    def forward(self, x):
        y = self.linear(x)
        return torch.tanh(y) if self.nonlin else y
```

```python
N2 = Neuron2(3)

print("Neuron2 parameters:")
for name, param in N2.named_parameters():
    print(f"  {name}:{param.data.numpy().flatten()}")

x = torch.tensor([4,8,7]).float()

print(f'For the input {x} we obtain an output of: {N2(x).item()}')
```

```text
Neuron2 parameters:
  linear.weight:[-0.3273866  -0.45594507  0.06844634]
  linear.bias:[-0.5026901]
For the input tensor([4., 8., 7.]) we obtain an output of: -0.9999056458473206
```
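
The two neurons above give different outputs only because their weights are initialized randomly and independently. As a quick sanity check, the sketch below gives both styles the same parameters and confirms that they compute the same function. The weight and bias values are hypothetical, hand-picked so the result is reproducible.

```python
import torch
import torch.nn as nn

# Hand-picked (hypothetical) parameters so the result is reproducible
w = torch.tensor([0.1, -0.2, 0.05])
b = torch.tensor([0.3])
x = torch.tensor([4.0, 8.0, 7.0])

# Option 1 style: explicit dot product plus bias, then tanh
manual_out = torch.tanh(x @ w + b)

# Option 2 style: an nn.Linear layer holding the same numbers
linear = nn.Linear(3, 1)
with torch.no_grad():
    # nn.Linear stores its weight with shape (out_features, in_features) = (1, 3)
    linear.weight.copy_(w.unsqueeze(0))
    linear.bias.copy_(b)
module_out = torch.tanh(linear(x))

# Both evaluate tanh(4*0.1 + 8*(-0.2) + 7*0.05 + 0.3) = tanh(-0.55)
print(manual_out.item(), module_out.item())
```

The one detail to watch is the weight shape: `nn.Linear` keeps one row per output neuron, which is the same row-per-neuron layout used for the layer matrix in the next section.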


### 2.2. The layer:

**Key idea:** A *layer* is a collection of neurons that all receive the same input vector and independently compute their outputs.

Each neuron has its own set of weights and bias, and produces a single output like we outlined above. So, if a layer has $m$ neurons and receives an $n$-dimensional input, the output will be an $m$-dimensional vector. Here is a little sketch of the flow through a layer:

$$ (x_1, x_2, \dots, x_n) \rightarrow \fbox{Layer} \rightarrow (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_m) $$
> **Example (continued):** Above we had a simple example of a neuron with $\tanh$ activation. To form a layer we place a few neurons side by side, all reading the same input. Let's use 2 neurons to keep things simple: each neuron has 3 weights and a bias, and these will generally differ between neurons. We can add subscripts to distinguish between the weights and bias of each neuron:
> $$ \begin{aligned} n_1 &: (w_{1,1},\; w_{1,2},\; w_{1,3},\;\; b_1) \\ n_2 &: (w_{2,1},\; w_{2,2},\; w_{2,3},\;\; b_2) \end{aligned} $$
> so the first subscript identifies the neuron and the second identifies the weight within that neuron. Assuming the same inputs as the previous example, the result of this layer will be
> $$ \begin{aligned} \hat{y}_1 &= \tanh(4w_{1,1} + 8w_{1,2} + 7w_{1,3} + b_1) = f\left( \sum_{i=1}^3 w_{1,i}\cdot x_i + b_1\right) \\ \hat{y}_2 &= \tanh(4w_{2,1} + 8w_{2,2} + 7w_{2,3} + b_2) = f\left( \sum_{i=1}^3 w_{2,i}\cdot x_i + b_2\right) \end{aligned} $$
> or, using matrices/vectors we can write the weights, bias and input as:
> $$ W_1 = \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}, \qquad B_1 = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \qquad X = \begin{pmatrix} 4 \\ 8 \\ 7 \end{pmatrix}. $$
> And using this, the operations of layer 1 can be written succinctly using linear algebra:
> $$ f\left( W_1\cdot X + B_1\right). $$

The notation used at the end of the example generalizes nicely to arbitrary dimensions. If the layer has $n$ inputs and $m$ neurons, then $W_1$ is a matrix (a 2-dimensional array) with $m$ rows and $n$ columns, $X$ is a column vector (a 1-dimensional array) with $n$ rows, and $B_1$ is a column vector with $m$ rows. The result $Y = W_1\cdot X + B_1$ is a column vector with $m$ rows (one for each neuron's output), and the activation function is applied to each entry of $Y$ separately:

$$ \hat{Y} = f(Y) = f(W_1\cdot X + B_1). $$
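
Here is a small sketch of the matrix form for a layer with $m = 2$ neurons and $n = 3$ inputs, compared against running the two neurons one at a time. The numeric weights and biases are made up for illustration, and the vectors are stored as 1-dimensional tensors rather than explicit columns.

```python
import torch

# Hypothetical weights and biases for a layer with m = 2 neurons and n = 3 inputs
W1 = torch.tensor([[0.1, -0.2, 0.05],
                   [0.2,  0.1, -0.1]])   # shape (m, n) = (2, 3): one row per neuron
B1 = torch.tensor([0.3, -0.1])           # shape (m,): one bias per neuron
X = torch.tensor([4.0, 8.0, 7.0])        # shape (n,)

# Whole layer at once: f(W1 X + B1), with f = tanh applied entry by entry
Y_hat = torch.tanh(W1 @ X + B1)

# The same computation, one neuron (one row of W1) at a time
per_neuron = torch.stack(
    [torch.tanh(W1[i] @ X + B1[i]) for i in range(W1.shape[0])]
)

print(Y_hat.shape)                        # one output per neuron
print(torch.allclose(Y_hat, per_neuron))  # matrix form agrees with the per-neuron loop
```

Frameworks use the matrix form because a single matrix multiplication is far faster than looping over neurons in Python.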

### 2.3. The Multi-Layer Perceptron:

An MLP is just a series of *fully connected* layers. The only requirement is that the output size of one layer matches the input size of the next—otherwise, the data can't flow forward.

Here’s a diagram of a simple MLP with 3 inputs, two hidden layers (with 4 and 6 neurons respectively), and a single output neuron:
> <img src="basic_mlp.png" style="width: 500px; height: 400px;">
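
The architecture in the diagram can be sketched in a few lines with `nn.Sequential`. The layer sizes (3 → 4 → 6 → 1) come from the diagram; the choice of $\tanh$ between layers and of a purely linear output layer are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

# Layer sizes taken from the diagram: 3 inputs -> 4 -> 6 -> 1 output.
# The tanh activations and the linear final layer are assumptions for this sketch.
mlp = nn.Sequential(
    nn.Linear(3, 4), nn.Tanh(),   # hidden layer 1: 3*4 weights + 4 biases
    nn.Linear(4, 6), nn.Tanh(),   # hidden layer 2: 4*6 weights + 6 biases
    nn.Linear(6, 1),              # output neuron:  6*1 weights + 1 bias
)

x = torch.tensor([4.0, 8.0, 7.0])
out = mlp(x)

# Total parameter count: (12 + 4) + (24 + 6) + (6 + 1) = 53
n_params = sum(p.numel() for p in mlp.parameters())
print(out.shape, n_params)
```

Each `nn.Linear(a, b)` must take as input exactly what the previous layer outputs. Changing, say, the second layer to `nn.Linear(5, 6)` would raise a shape error on the forward pass.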

### 2.4. Training an MLP

#### 2.4.1. On the level of neurons
1) **Feedforward**: The neuron computes a weighted sum of its inputs plus a bias term:
       $$y = x\cdot w + b$$
   It then optionally applies a nonlinear activation function (e.g. tanh, ReLU) to produce the output:
       $$\hat{y} = \text{activation}(y)$$

2) **Loss calculation**: The output from the forward pass is compared to the target label using a loss function, which outputs a scalar that quantifies how far off the prediction is.
> For example, for a regression task we might use the squared error (averaging it over all samples gives the mean squared error):
> $$ \mathcal{L} = (y_{\text{pred}} - y_{\text{true}})^2 $$

3) **Compute gradients (via backpropagation)**:
Using the chain rule from calculus, we can compute the gradient of the loss with respect to each parameter (i.e. how much changing that parameter would affect the loss). These gradients tell us how to adjust each weight and bias to make the loss smaller.
This process of applying the chain rule backward through the entire computation graph is called **backpropagation**. In PyTorch, it's triggered by calling `loss.backward()`, which populates the `.grad` field of each parameter with its gradient. Gradients accumulate across calls, so they need to be reset between training steps.

> **Note**: The gradient points in the direction of steepest increase in the loss function. Since we want to minimize the loss, we take a step in the *opposite* direction; this is **gradient descent**.
   
4) **Update the parameters**:
After computing gradients, we update the weights and biases using **gradient descent**: each parameter is adjusted by subtracting its gradient scaled by a small step size (the learning rate, $\alpha$):
> Mathematically we are doing the following:
> $$ w_i\rightarrow w_i - \alpha \frac{\partial \mathcal{L}}{\partial w_i} $$
> $$ b\rightarrow b - \alpha \frac{\partial \mathcal{L}}{\partial b} $$

In PyTorch, this can be done manually:

```python
# assumes `model`, the learning rate `alpha`, and a completed loss.backward() call
with torch.no_grad():           # the update itself should not be tracked by autograd
    for p in model.parameters():
        p -= alpha * p.grad     # step against the gradient
        p.grad.zero_()          # reset so the next backward() starts from zero
```

5) **Repeat**:  
These steps are repeated over many passes through the dataset (epochs), gradually refining the parameters to minimize the loss and improve the model's predictions.
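
Putting the five steps together, a full training loop might look like the sketch below. Everything specific in it is an assumption for illustration: the toy regression data (learning $y = x_1 + x_2 - x_3$), the small network, the learning rate, and the number of epochs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data (made up for illustration): learn y = x1 + x2 - x3
X = torch.randn(64, 3)
y_true = (X[:, 0] + X[:, 1] - X[:, 2]).unsqueeze(1)   # shape (64, 1)

model = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 1))
alpha = 0.05  # learning rate

losses = []
for epoch in range(200):                        # 5) repeat
    y_pred = model(X)                           # 1) feedforward
    loss = ((y_pred - y_true) ** 2).mean()      # 2) loss: mean squared error
    model.zero_grad()                           # clear gradients left from the previous step
    loss.backward()                             # 3) backpropagation fills p.grad
    with torch.no_grad():                       # 4) gradient descent update
        for p in model.parameters():
            p -= alpha * p.grad
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In practice the manual update in step 4 is usually handed to an optimizer such as `torch.optim.SGD`, but writing it out keeps the link to the update rule above visible.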

### 2.5. Implementation

#### 2.5.1. Neurons in PyTorch (two ways!)

## 3. MLP Implementation on MNIST

## 4. Interpretability Experiments

## 5. The Convolutional neural network (CNN)

## 6. MNIST upgraded via CNN

## 7. Reflections + future directions