# Backpropagation

Backpropagation is a widely used algorithm for training neural networks. It enables the computation of gradients with respect to the network's parameters, such as weights and biases, by propagating the error (difference between predicted and actual output) backward through the network. These gradients are then used to update the parameters using gradient descent optimization.

The backpropagation algorithm consists of the following steps:

- **Forward Propagation:** Compute the outputs of each layer in the network using the current weights and biases.
- **Loss Calculation:** Compute the loss between the predicted output and the true output using the chosen loss function.
- **Backward Pass and Gradient Calculation:** Propagate the error backward through the network to compute the gradients of the weights and biases.
- **Parameter Updates:** Update the weights and biases by subtracting the gradients multiplied by the learning rate. This adjusts the parameters to minimize the loss function.
- **Repeat:** Repeat the four steps above until the loss converges or a maximum number of iterations is reached.
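
The steps above can be sketched on the smallest possible model. This is a minimal illustration, not a general implementation: it assumes a single linear neuron, a mean squared error loss, and synthetic data, and every name in it (`true_w`, `losses`, and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption made for this sketch)
X = rng.normal(size=(64, 3))
true_w = np.array([[1.5], [-2.0], [0.5]])
y = X @ true_w + 0.3

w = np.zeros((3, 1))
b = np.zeros((1, 1))
learning_rate = 0.1
losses = []

for step in range(200):
    # 1. Forward propagation
    y_pred = X @ w + b
    # 2. Loss calculation (mean squared error)
    loss = np.mean((y_pred - y) ** 2)
    losses.append(loss)
    # 3. Backward pass: gradient of the loss w.r.t. the prediction, then the parameters
    grad_pred = 2 * (y_pred - y) / len(X)
    dw = X.T @ grad_pred
    db = grad_pred.sum(axis=0, keepdims=True)
    # 4. Parameter updates
    w -= learning_rate * dw
    b -= learning_rate * db

print(losses[0], losses[-1])
```

The loop stops after a fixed number of iterations; a convergence test on the change in `loss` would be an equally valid stopping rule.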

## Loss Functions
Before diving into backpropagation, let's discuss the choice of loss functions. The loss function measures the discrepancy between the predicted output of the neural network and the true output. Common loss functions include:

1. **Mean Squared Error (MSE):** MSE is typically used for regression problems.

    $MSE = \frac{1}{N} \sum (y_{pred} - y_{true})^2$

    where `y_pred` is the predicted output, `y_true` is the true output, and `N` is the number of samples.

2. **Cross-Entropy Loss:** Cross-entropy loss is commonly used for classification problems.

    $CrossEntropy = -\frac{1}{N} \sum y_{true} \log(y_{pred})$

    where `y_pred` is the predicted output (probability distribution), `y_true` is the true output (one-hot encoded), and `N` is the number of samples.
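
Both formulas translate directly into NumPy. The sketch below assumes `y_pred` and `y_true` are arrays with one row per sample; the function names and the `eps` clipping constant (used only to avoid `log(0)`) are choices made for this example.

```python
import numpy as np

def mse(y_pred, y_true):
    # Average of the squared differences over all samples
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(y_pred, y_true, eps=1e-12):
    # Clip probabilities away from 0 so the logarithm stays finite
    y_pred = np.clip(y_pred, eps, 1.0)
    # Sum over classes, then average over the N samples
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

mse_value = mse(np.array([2.5, 0.0]), np.array([3.0, -0.5]))
print(mse_value)  # 0.25

probs = np.array([[0.5, 0.5], [0.25, 0.75]])   # predicted distributions
labels = np.array([[1.0, 0.0], [0.0, 1.0]])    # one-hot targets
ce_value = cross_entropy(probs, labels)
print(ce_value)
```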

## Backward Pass and Gradient Calculation

During the backward pass, the gradients of the loss with respect to the parameters are computed using the chain rule. Here are the equations for computing the gradients in a standard feedforward neural network with a single hidden layer:

![Backpropagation](./../../assets/backprop.jpg)

### Notation

$x_{ij} => \begin{cases}
    i: i^{th}\ data\ point \\
    j: index\ of\ dimension/feature
\end{cases}$

$f_{ij}/O_{ij} => \begin{cases}
    i: Current\ Layer\ Index \\
    j: Corresponding\ Neuron\ Index\ for\ that\ Layer
\end{cases}$

$w_{ij}^k => \begin{cases}
    i: From (Corresponding\ Neuron\ Index\ for\ Current\ Layer) \\
    j: To (Corresponding\ Neuron\ Index\ for\ Next\ Layer) \\
    k: Next\ Layer\ Index
\end{cases}$

$b_{ij} => \begin{cases}
    i: Next\ Layer\ Index \\
    j: To (Corresponding\ Neuron\ Index\ for\ Next\ Layer) \\
\end{cases}$

$
w^1 = \begin{bmatrix}
w_{11}^1 & w_{12}^1 \\
w_{21}^1 & w_{22}^1 \\
w_{31}^1 & w_{32}^1 \\
\end{bmatrix}_{3X2} \quad\quad\quad\quad\quad
w^2 = \begin{bmatrix}
w_{11}^2 \\
w_{21}^2
\end{bmatrix}_{2X1}
$

$
b^1 = \begin{bmatrix}
b_{11} & b_{12}
\end{bmatrix}_{1X2} \quad\quad\quad\quad\quad
b^2 = \begin{bmatrix}
b_{21}
\end{bmatrix}_{1X1}
$

$Shape\ of\ w^i = (Number\ of\ Neurons\ in\ Current\ Layer)\ X\ (Number\ of\ Neurons\ in\ Next\ Layer)$

$Shape\ of\ b^i = 1\ X\ (Number\ of\ Neurons\ in\ Next\ Layer)$
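
These shape rules can be checked with a quick forward pass. The sketch below assumes the 3-2-1 network from the notation above, a batch of 5 data points, and a sigmoid activation; the batch size, the activation, and the variable names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

X = rng.normal(size=(5, 3))    # 5 data points, 3 features each
W1 = rng.normal(size=(3, 2))   # w^1: 3 input features -> 2 hidden neurons
b1 = np.zeros((1, 2))          # b^1: one bias per hidden neuron
W2 = rng.normal(size=(2, 1))   # w^2: 2 hidden neurons -> 1 output neuron
b2 = np.zeros((1, 1))          # b^2: one bias for the output neuron

O1 = sigmoid(X @ W1 + b1)      # hidden layer outputs, shape (5, 2)
O2 = sigmoid(O1 @ W2 + b2)     # output layer outputs, shape (5, 1)

print(O1.shape, O2.shape)      # (5, 2) (5, 1)
```

The bias rows have a leading dimension of 1, so NumPy broadcasting adds the same bias to every data point in the batch.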

### Gradients for Output Layer (w.r.t $w^2$)

The gradient of the loss function with respect to the model's parameters (weights and biases) points in the direction of steepest ascent of the loss. By moving the parameters in the opposite direction, we reduce the loss and improve the model's performance.

$
\frac{∂L}{∂w_{11}^2} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂w_{11}^2} \quad\quad\quad\quad\quad
\frac{∂L}{∂w_{21}^2} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂w_{21}^2} \\
\frac{∂L}{∂b_{21}} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂b_{21}}
$

In vectorized form, these gradients can be written as:

```python
delta_output = (y_pred - y_true) * activation_derivative(output)
dW_output = 1/N * (hidden_output.T @ delta_output)
db_output = 1/N * sum(delta_output)
```

where `y_pred` is the predicted output, `y_true` is the true output, `N` is the number of samples, `activation_derivative` is the derivative of the activation function applied to the output, `hidden_output` is the output of the hidden layer, `delta_output` is the error term for the output layer, `dW_output` is the gradient of the weights connecting the hidden layer to the output layer, and `db_output` is the gradient of the biases in the output layer. The factor `(y_pred - y_true)` in `delta_output` is the derivative of a squared-error loss; a different loss function changes this factor.
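
To make the chain rule concrete, here is a worked instance under two assumptions that the text above does not fix: a squared-error loss $L = \frac{1}{2}(O_{21} - y)^2$ for a single data point, and an output neuron $O_{21} = \sigma(z_{21})$ with pre-activation $z_{21} = w_{11}^2 O_{11} + w_{21}^2 O_{12} + b_{21}$. The symbols $z_{21}$, $\sigma$, and $y$ are introduced here only for illustration.

$
\frac{∂L}{∂O_{21}} = O_{21} - y \quad\quad\quad\quad\quad
\frac{∂O_{21}}{∂w_{11}^2} = \sigma'(z_{21}) * O_{11} \\
\frac{∂L}{∂w_{11}^2} = (O_{21} - y) * \sigma'(z_{21}) * O_{11}
$

The factor $(O_{21} - y) * \sigma'(z_{21})$ appears in all three output-layer gradients and corresponds to `delta_output`. For the bias, $\frac{∂z_{21}}{∂b_{21}} = 1$, so the bias gradient is this factor on its own.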

### Gradients for Hidden Layer (w.r.t $w^1$)

$
\frac{∂L}{∂w_{11}^1} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{11}} * \frac{∂O_{11}}{∂w_{11}^1} \quad\quad\quad\quad\quad\quad\quad
\frac{∂L}{∂w_{12}^2} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{12}^2} * \frac{∂O_{12}}{∂w_{12}^1} \\
\frac{∂L}{∂w_{21}^1} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{11}} * \frac{∂O_{11}}{∂w_{21}^1} \quad\quad\quad\quad\quad\quad\quad
\frac{∂L}{∂w_{22}^2} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{12}^2} * \frac{∂O_{12}}{∂w_{22}^1} \\
\frac{∂L}{∂w_{31}^1} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{11}} * \frac{∂O_{11}}{∂w_{31}^1} \quad\quad\quad\quad\quad\quad\quad
\frac{∂L}{∂w_{32}^2} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{12}^2} * \frac{∂O_{12}}{∂w_{32}^1} \\
\frac{∂L}{∂b_{11}} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{11}} * \frac{∂O_{11}}{∂b_{11}} \quad\quad\quad\quad\quad\quad\quad
\frac{∂L}{∂b_{12}} = \frac{∂L}{∂O_{21}} * \frac{∂O_{21}}{∂O_{12}} * \frac{∂O_{12}}{∂b_{12}}
$

In vectorized form, these gradients can be written as:

```python
delta_hidden = (delta_output @ W_output.T) * activation_derivative(hidden_output)
dW_hidden = 1/N * (input.T @ delta_hidden)
db_hidden = 1/N * sum(delta_hidden)
```

where `input` is the matrix of input data, `W_output` is the weight matrix connecting the hidden layer to the output layer, `delta_hidden` is the error term for the hidden layer, `dW_hidden` is the gradient of the weights connecting the input layer to the hidden layer, and `db_hidden` is the gradient of the biases in the hidden layer.
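
One way to gain confidence in these formulas is a numerical gradient check. The sketch below is an illustration under stated assumptions: sigmoid activations in both layers, the loss $\frac{1}{2N} \sum (y_{pred} - y_{true})^2$, and random data. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

def loss_fn(X, y, W1, b1, W2, b2):
    _, output = forward(X, W1, b1, W2, b2)
    return 0.5 * np.mean(np.sum((output - y) ** 2, axis=1))

def analytic_grads(X, y, W1, b1, W2, b2):
    N = X.shape[0]
    hidden, output = forward(X, W1, b1, W2, b2)
    # Sigmoid derivative expressed through the layer outputs: s * (1 - s)
    delta_output = (output - y) * output * (1 - output)
    dW2 = hidden.T @ delta_output / N
    db2 = delta_output.sum(axis=0, keepdims=True) / N
    delta_hidden = (delta_output @ W2.T) * hidden * (1 - hidden)
    dW1 = X.T @ delta_hidden / N
    db1 = delta_hidden.sum(axis=0, keepdims=True) / N
    return dW1, db1, dW2, db2

def numeric_grad(f, param, eps=1e-6):
    # Central finite differences, perturbing one entry of param at a time
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = f()
        param[idx] = old - eps
        minus = f()
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.uniform(size=(5, 1))
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(1, 2))
W2, b2 = rng.normal(size=(2, 1)), rng.normal(size=(1, 1))

f = lambda: loss_fn(X, y, W1, b1, W2, b2)
analytic = analytic_grads(X, y, W1, b1, W2, b2)
numeric = [numeric_grad(f, p) for p in (W1, b1, W2, b2)]
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(max_diff)
```

If the backward-pass formulas are implemented correctly, `max_diff` should be many orders of magnitude smaller than the gradients themselves.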

## Parameter Updates
After calculating the gradients, the parameters (weights and biases) are updated using gradient descent to minimize the loss function. The update rule for a parameter `P` is:

$
P_{updated} = P_{old} - learning\_rate * dP
$

where `learning_rate` is a hyperparameter that determines the step size for the parameter update, and `dP` is the gradient of the loss with respect to that parameter.
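
A single update step can be traced by hand. In the sketch below, the helper `sgd_update`, the dictionary layout, and the numbers are all made up for illustration.

```python
import numpy as np

def sgd_update(params, grads, learning_rate):
    # Apply P_updated = P_old - learning_rate * dP to every parameter in place
    for name in params:
        params[name] -= learning_rate * grads[name]

params = {"W": np.array([[0.5, -0.25]]), "b": np.array([[0.125]])}
grads = {"W": np.array([[0.5, -1.0]]), "b": np.array([[0.25]])}

sgd_update(params, grads, learning_rate=0.5)

print(params["W"])  # [[0.25 0.25]]
print(params["b"])  # [[0.]]
```

A positive gradient entry lowers the parameter and a negative one raises it, which is what moving against the gradient means.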

```python
import numpy as np

class MLP:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Initialize weights with random values
        self.W1 = np.random.randn(input_size, hidden_size)
        self.W2 = np.random.randn(hidden_size, output_size)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        # Expects x to already be a sigmoid output s, for which the derivative is s * (1 - s)
        return x * (1 - x)

    def forward(self, X):
        self.hidden_layer = self.sigmoid(np.dot(X, self.W1))
        self.output_layer = self.sigmoid(np.dot(self.hidden_layer, self.W2))

    def backward(self, X, y, learning_rate):
        # Calculate the error terms (prediction minus target, as in the equations above)
        output_error = self.output_layer - y
        output_delta = output_error * self.sigmoid_derivative(self.output_layer)
        hidden_error = output_delta.dot(self.W2.T)
        hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_layer)

        # Gradient descent step: subtract the gradients scaled by the learning rate
        self.W2 -= self.hidden_layer.T.dot(output_delta) * learning_rate
        self.W1 -= X.T.dot(hidden_delta) * learning_rate

    def train(self, X, y, epochs=1000, learning_rate=0.01):
        for epoch in range(epochs):
            self.forward(X)
            self.backward(X, y, learning_rate)

    def predict(self, X):
        self.forward(X)
        return self.output_layer
```

In this example, we implement a simple MLP with one hidden layer and a sigmoid activation function. For brevity, the network has weights only and no bias terms. The `forward` method performs the forward pass, and the `backward` method computes the gradients and updates the weights using the backpropagation algorithm. The `train` method runs one forward pass and one backward pass over the full training set in each epoch.

```python
# Training data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Create an MLP object
mlp = MLP(input_size=2, hidden_size=4, output_size=1)

# Train the MLP
mlp.train(X, y, epochs=10000, learning_rate=0.1)

# Test the MLP
predictions = mlp.predict(X)
print(predictions)
```

Output (exact values vary between runs because the weights are initialized randomly):

```
[[0.08292542]
 [0.92638874]
 [0.92569163]
 [0.05634042]]
```


In this example, we create an MLP with 2 input neurons, 4 hidden neurons, and 1 output neuron. We train the network on the `XOR` problem using the input data `X` and target outputs `y` for 10,000 epochs with a learning rate of 0.1.