# How Neural Networks Work

A multilayer perceptron (MLP) consists of multiple layers of neurons: an input layer, one or more hidden layers, and an output layer. Each neuron in one layer is connected to every neuron in the next layer. The network transforms its input through these layers, alternating linear and nonlinear functions, to predict the target output.

## Structure of a Neural Network
For an MLP with:
- Input layer: $ X = [x_1, x_2, \dots, x_n] $
- Hidden layers: activations $ a^{(l)} $, weights $ W^{(l)} $, and biases $ b^{(l)} $, where $ l $ denotes the layer index.
- Output layer: $ \hat{y} = a^{(L)} = f(z^{(L)}) $, where $ L $ is the final layer and $ f $ is the output activation function (the formulas below take $ f = \sigma $).

The MLP performs the following operations:
- Linear transformation: $ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} $, with $ a^{(0)} = X $ as the input
- Nonlinear activation: $ a^{(l)} = \sigma(z^{(l)}) $

For the output layer:
- Output: $ \hat{y} = \sigma(W^{(L)} a^{(L-1)} + b^{(L)}) $
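
As a rough illustration of these two operations, here is a minimal forward pass in plain Python. The 2-3-1 layer sizes and all parameter values are arbitrary assumptions chosen for the sketch, and `forward` is a hypothetical helper, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as the activation for every layer."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Run a forward pass and return the activations of every layer.

    weights[l] is a list of rows (one row per neuron in that layer);
    biases[l] holds one bias per neuron.
    """
    activations = [x]  # a^(0) is the input itself
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # Linear step: z = W a_prev + b
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        # Nonlinear step: a = sigmoid(z)
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

# Hypothetical 2-3-1 network with arbitrary illustrative parameters.
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],  # W^(1): 3 neurons, 2 inputs each
    [[0.3, -0.1, 0.2]],                      # W^(2): 1 neuron, 3 inputs
]
biases = [[0.0, 0.1, -0.1], [0.05]]

activations = forward([1.0, 2.0], weights, biases)
y_hat = activations[-1]
print(y_hat)  # a single value strictly between 0 and 1
```

Keeping every layer's activations, rather than only the final output, is what later makes the backward pass possible.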

### Loss Function
The loss $ \mathcal{L} $ measures the difference between the predicted output $ \hat{y} $ and the true label $ y $. Common choices are Mean Squared Error (MSE) for regression and Cross-Entropy Loss for classification. For a single example, the squared-error loss is:

$$
\mathcal{L} = \frac{1}{2} (\hat{y} - y)^2 \quad \text{(for regression)}
$$
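
A small sketch of this loss and its derivative in plain Python; the function names are made up for illustration. The factor $ \frac{1}{2} $ is there so that the derivative with respect to $ \hat{y} $ is simply $ \hat{y} - y $.

```python
def squared_error(y_hat, y):
    """Half squared error for a single prediction: 0.5 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2

def squared_error_grad(y_hat, y):
    """Derivative of the loss with respect to the prediction y_hat."""
    return y_hat - y

print(squared_error(0.8, 1.0))       # close to 0.02
print(squared_error_grad(0.8, 1.0))  # close to -0.2
```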

### Backpropagation Algorithm
Backpropagation computes the gradient of the loss function with respect to the model’s parameters (weights and biases); gradient descent then uses these gradients to update the parameters. To calculate the gradients, we use the **chain rule** of calculus.

#### Chain Rule
If a function $ y $ depends on $ u $, and $ u $ depends on $ x $, the derivative of $ y $ with respect to $ x $ can be found using:

$$
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
$$
In the MLP, backpropagation applies this rule to calculate the gradients layer by layer, starting from the output and propagating backward through the network.
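
To make the chain rule concrete, the sketch below compares the analytic derivative of a composite function with a central finite difference. The functions $ u = 3x + 1 $ and $ y = u^2 $ are arbitrary choices for illustration.

```python
def u_of_x(x):
    return 3.0 * x + 1.0      # du/dx = 3

def y_of_u(u):
    return u ** 2             # dy/du = 2u

x = 2.0
u = u_of_x(x)
analytic = (2.0 * u) * 3.0    # chain rule: dy/du * du/dx

# Central finite difference as an independent numerical check
h = 1e-6
numeric = (y_of_u(u_of_x(x + h)) - y_of_u(u_of_x(x - h))) / (2 * h)

print(analytic)  # 42.0
print(numeric)   # approximately 42
```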

#### Steps in Backpropagation:
1. **Forward Pass**: Compute activations for each layer, $ a^{(l)} $, and store them.
2. **Loss Calculation**: Compute the loss $ \mathcal{L}(\hat{y}, y) $.
3. **Backpropagate Errors**:
   - Calculate the gradient of the loss with respect to the output $ \hat{y} $.
   - Propagate the error backward using the chain rule to obtain the gradient for each layer's weights and biases.

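The gradients are computed as:

$$
\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}
$$

where $ \delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)} $ is the error term for layer $ l $, treated as a column vector and computed recursively from the output layer backward:

$$
\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \odot \sigma'(z^{(L)})
$$

$$
\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})
$$

Here $ \odot $ denotes element-wise multiplication.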

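4. **Update Weights**: After calculating the gradients, update the weights (and, in the same way, the biases) using gradient descent:

$$
W^{(l)} = W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad b^{(l)} = b^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial b^{(l)}}
$$

where $ \eta $ is the learning rate.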
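
The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: sigmoid activations everywhere, the half squared-error loss, a hypothetical 2-2-1 network with arbitrary parameters, and helper names (`forward`, `backward`, `loss_fn`) invented for the example. A finite-difference check on one weight is included as a sanity test of the gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return activations per layer (activations[0] is the input)."""
    activations = [x]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w * a for w, a in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

def loss_fn(y_hat, y):
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

def backward(activations, weights, y):
    """Return per-layer gradients of the loss w.r.t. weights and biases."""
    n_layers = len(weights)
    grads_W, grads_b = [None] * n_layers, [None] * n_layers
    # Output error: (y_hat - y) * sigma'(z), with sigma'(z) = a * (1 - a)
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(activations[-1], y)]
    for l in reversed(range(n_layers)):
        a_prev = activations[l]
        # dL/dW = delta * a_prev^T (outer product); dL/db = delta
        grads_W[l] = [[d * a for a in a_prev] for d in delta]
        grads_b[l] = list(delta)
        if l > 0:
            # Propagate backward: delta_prev = (W^T delta) * sigma'(z_prev)
            W = weights[l]
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads_W, grads_b

# Hypothetical 2-2-1 network with arbitrary parameters
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.3, -0.1]]]
biases = [[0.0, 0.1], [0.05]]
x, y = [1.0, 2.0], [1.0]

acts = forward(x, weights, biases)
grads_W, grads_b = backward(acts, weights, y)

# Finite-difference check on a single weight
h = 1e-6
weights[0][1][0] += h
loss_plus = loss_fn(forward(x, weights, biases)[-1], y)
weights[0][1][0] -= 2 * h
loss_minus = loss_fn(forward(x, weights, biases)[-1], y)
weights[0][1][0] += h
numeric = (loss_plus - loss_minus) / (2 * h)
print(grads_W[0][1][0], numeric)  # the two values should agree closely

# One gradient-descent step on all weights and biases
eta = 0.5
loss_before = loss_fn(acts[-1], y)
for l in range(len(weights)):
    for i in range(len(weights[l])):
        for j in range(len(weights[l][i])):
            weights[l][i][j] -= eta * grads_W[l][i][j]
        biases[l][i] -= eta * grads_b[l][i]
loss_after = loss_fn(forward(x, weights, biases)[-1], y)
print(loss_before, loss_after)  # the loss should decrease after the step
```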

### Toy Example

Let's take a simple 2-layer MLP with one input, one hidden layer (2 neurons), and one output. The input is $ x = 1 $, the target is $ y = 0 $, and we use a sigmoid activation function.

Values below are computed at full precision and shown rounded to three decimals.

1. **Initialization**:
   - Input to hidden weights (one per hidden neuron): $ W^{(1)} = [0.15, 0.25]^T $
   - Hidden to output weights: $ W^{(2)} = [0.4, 0.45] $
   - Biases: $ b^{(1)} = [0.35, 0.35]^T $, $ b^{(2)} = 0.6 $

2. **Forward pass**:

   $$
   z^{(1)} = W^{(1)} x + b^{(1)} = [0.5, 0.6]
   $$

   $$
   a^{(1)} = \sigma(z^{(1)}) = [0.622, 0.646]
   $$

   $$
   z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} = 0.4 \cdot 0.622 + 0.45 \cdot 0.646 + 0.6 = 1.140
   $$

   $$
   \hat{y} = \sigma(z^{(2)}) = 0.758
   $$

3. **Loss calculation**:

   $$
   \mathcal{L} = \frac{1}{2} (\hat{y} - y)^2 = 0.287
   $$

4. **Backward pass**, using $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $:
   - Output layer error:

     $$
     \delta^{(2)} = (\hat{y} - y) \cdot \sigma'(z^{(2)}) = 0.758 \cdot 0.184 \approx 0.139
     $$
   - Hidden layer error:

     $$
     \delta^{(1)} = (W^{(2)})^T \delta^{(2)} \odot \sigma'(z^{(1)}) = 0.139 \cdot [0.4, 0.45] \odot [0.235, 0.229] = [0.013, 0.014]
     $$

5. **Update weights** (with learning rate $ \eta = 0.5 $):

   $$
   W^{(2)} = W^{(2)} - \eta \, \delta^{(2)} (a^{(1)})^T = [0.357, 0.405]
   $$

   $$
   W^{(1)} = W^{(1)} - \eta \, \delta^{(1)} x = [0.143, 0.243]^T
   $$
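
The toy example can be recomputed in a few lines of plain Python. The sketch assumes one input-to-hidden weight per hidden neuron, $ W^{(1)} = [0.15, 0.25] $, and the same bias of 0.35 for both hidden neurons, which is the reading that reproduces $ z^{(1)} = [0.5, 0.6] $. Only the weights are updated here, as in the steps above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, eta = 1.0, 0.0, 0.5
W1 = [0.15, 0.25]   # assumed: one weight per hidden neuron
b1 = [0.35, 0.35]   # assumed: same bias for both hidden neurons
W2 = [0.40, 0.45]
b2 = 0.60

# Forward pass
z1 = [w * x + b for w, b in zip(W1, b1)]
a1 = [sigmoid(z) for z in z1]
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = sigmoid(z2)
loss = 0.5 * (y_hat - y) ** 2

# Backward pass, using sigma'(z) = a * (1 - a)
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)
delta1 = [delta2 * w * a * (1.0 - a) for w, a in zip(W2, a1)]

# Gradient-descent update of the weights
W2_new = [w - eta * delta2 * a for w, a in zip(W2, a1)]
W1_new = [w - eta * d * x for w, d in zip(W1, delta1)]

print([round(v, 3) for v in a1])        # [0.622, 0.646]
print(round(z2, 3), round(y_hat, 3))    # 1.14 0.758
print(round(loss, 3))                   # 0.287
print(round(delta2, 3))                 # 0.139
print([round(v, 3) for v in delta1])    # [0.013, 0.014]
print([round(v, 3) for v in W2_new])    # [0.357, 0.405]
print([round(v, 3) for v in W1_new])    # [0.143, 0.243]
```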
   

The same 1-2-1 architecture can be trained end to end in PyTorch. The parameters here are randomly initialized rather than set to the toy values, and `nn.MSELoss` omits the $ \frac{1}{2} $ factor, so the numbers differ from the hand calculation and vary between runs:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple MLP model
class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 2)  # 1 input, 2 hidden neurons
        self.fc2 = nn.Linear(2, 1)  # 2 hidden neurons, 1 output

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

# Initialize model, loss function, and optimizer
model = SimpleMLP()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.5)

# Sample data
x = torch.tensor([[1.0]])  # Input
y = torch.tensor([[0.0]])  # Target

# Training loop
epochs = 500
for epoch in range(epochs):
    # Forward pass
    output = model(x)
    loss = criterion(output, y)

    # Backward pass
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()  # Compute gradients
    optimizer.step()  # Update weights and biases

    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Final output
print(f'Final prediction: {model(x).item()}')
```

Output from one run:

```text
Epoch 0, Loss: 0.1016080379486084
Epoch 100, Loss: 0.004436768125742674
Epoch 200, Loss: 0.0020264554768800735
Epoch 300, Loss: 0.0012776795774698257
Epoch 400, Loss: 0.0009210538701154292
Final prediction: 0.02673455886542797
```




## Computation Graph (Automatic Differentiation)

 ```{image}  https://github.com/akkefa/ml-notes/releases/download/v0.1.0/computational_graph.png
:align: center
:alt: Computation Graph
:width: 80%
```

Backpropagation calculates the gradient of the loss function with respect to each weight in the neural network. It does this by:
1. **Forward Pass:** The network makes predictions, and the loss is computed.
2. **Backward Pass (Backpropagation):** The gradient of the loss w.r.t. each parameter is computed by following the chain rule along the computation graph.

### Key Concepts:
- **Chain Rule:** The chain rule is used to compute the derivative of a composite function. If a variable $ z $ depends on $ y $, and $ y $ depends on $ x $, then:
  
  $$
  \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
  $$

- **Computation Graph:** Backpropagation operates on a computation graph where:
  - Each node represents an operation or variable.
  - Arrows represent data dependencies.
  
  During backpropagation, the gradient flows backward through the edges, from the loss to each parameter.
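
The idea of gradients flowing backward along graph edges can be sketched with a tiny scalar reverse-mode differentiator. The `Node` class below is a hypothetical minimal design for illustration (it supports only multiplication and subtraction) and is not how any particular library is implemented.

```python
class Node:
    """A scalar value in a computation graph (hypothetical minimal class)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, local derivative d(self)/d(parent))
        self.parents = parents

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def backward(self):
        """Accumulate d(self)/d(node) into node.grad for every ancestor."""
        # Topological order, so a node's grad is complete before it is used
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local  # chain rule along one edge

a, b, c = Node(2.0), Node(3.0), Node(4.0)
out = (a * b) - c   # out = a * b - c
out.backward()
print(a.grad, b.grad, c.grad)  # 3.0 2.0 -1.0
```

Each operation records only its local derivatives during the forward pass; the backward pass multiplies them along every path from the output to the inputs.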

---

### Detailed Example of Backpropagation with a Simple Neural Network

Let’s consider a simple network with one input $ x $, one weight $ w $, and an output $ y $:

$$
y = w \cdot x
$$
$$
L = \frac{1}{2} (y - t)^2
$$

where $ t $ is the target value and $ L $ is the loss function.

#### Forward Pass:
1. Compute the output: 
   
   $$
   y = w \cdot x
   $$
2. Compute the loss:
   
   $$
   L = \frac{1}{2} (y - t)^2
   $$

#### Backward Pass (Backpropagation):
We want to compute the gradient of the loss $ L $ w.r.t. the weight $ w $, i.e., $ \frac{dL}{dw} $.

Using the chain rule:
1. Compute $ \frac{dL}{dy} $:
   
   $$
   \frac{dL}{dy} = (y - t)
   $$
2. Compute $ \frac{dy}{dw} $:
   
   $$
   \frac{dy}{dw} = x
   $$
3. Combine using the chain rule:
   
   $$
   \frac{dL}{dw} = \frac{dL}{dy} \cdot \frac{dy}{dw} = (y - t) \cdot x
   $$

This is the gradient that backpropagation computes and uses to update the weight $ w $ during training.
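
As a small sketch of how this gradient drives training, the code below evaluates $ (y - t) \cdot x $ and then repeats the gradient-descent update for a few steps. The values $ x = 2 $, $ w = 1.5 $, $ t = 5 $ and the learning rate of 0.1 are illustrative assumptions.

```python
def loss(w, x, t):
    y = w * x
    return 0.5 * (y - t) ** 2

def grad_w(w, x, t):
    return (w * x - t) * x   # dL/dw = (y - t) * x

x, t = 2.0, 5.0   # illustrative input and target
w = 1.5           # illustrative initial weight
eta = 0.1         # assumed learning rate

print(grad_w(w, x, t))  # -4.0

# Repeated updates move w toward t / x = 2.5, where the loss is zero
for step in range(20):
    w -= eta * grad_w(w, x, t)

print(w, loss(w, x, t))  # w is close to 2.5 and the loss is close to 0
```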

---

### Computation Graph for Backpropagation

Let's build the computation graph for the above example:

1. $ z_1 = w \cdot x $ (Multiplication)
2. $ z_2 = z_1 - t $ (Subtraction)
3. $ L = \frac{1}{2} z_2^2 $ (Squared error)

Each node in the graph represents an operation or variable, and we can propagate gradients backward from the loss function to compute the gradients of all inputs.
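
The backward traversal of this three-node graph can be written out by hand. The sketch below uses illustrative values ($ x = 2 $, $ w = 1.5 $, $ t = 5 $) and visits the nodes in reverse order, multiplying each incoming gradient by the node's local derivative.

```python
x, w, t = 2.0, 1.5, 5.0   # illustrative values

# Forward pass: evaluate each node and keep the intermediate values
z1 = w * x          # multiplication node
z2 = z1 - t         # subtraction node
L = 0.5 * z2 ** 2   # squared-error node

# Backward pass: visit the nodes in reverse order
dL_dL = 1.0
dL_dz2 = dL_dL * z2       # d(0.5 * z2^2)/dz2 = z2
dL_dz1 = dL_dz2 * 1.0     # d(z1 - t)/dz1 = 1
dL_dt = dL_dz2 * -1.0     # d(z1 - t)/dt = -1
dL_dw = dL_dz1 * x        # d(w * x)/dw = x
dL_dx = dL_dz1 * w        # d(w * x)/dx = w

print(L, dL_dw, dL_dx, dL_dt)  # 2.0 -4.0 -3.0 2.0
```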


### Example in PyTorch: Backpropagation

Let’s implement the above example using PyTorch's autograd:

**Explanation:**
1. We define a scalar $ x $, weight $ w $, and target $ t $.
2. We compute the output $ y = w \cdot x $ and the loss $ L = \frac{1}{2}(y - t)^2 $.
3. Using `loss.backward()`, PyTorch computes the gradient of the loss w.r.t. $ w $ via backpropagation. The result is stored in `w.grad`.


```python
import torch

# Initialize input, weight, and target
x = torch.tensor(2.0, requires_grad=False)
w = torch.tensor(1.5, requires_grad=True)
t = torch.tensor(5.0, requires_grad=False)

# Forward pass: Compute output and loss
y = w * x
loss = 0.5 * (y - t)**2

# Backward pass: Compute gradients
loss.backward()

# Output the gradient of the loss w.r.t. w
print(f"Gradient of the loss with respect to w: {w.grad}")
```

```text
Gradient of the loss with respect to w: -4.0
```
