### 1. **Overview of Backpropagation**
Backpropagation is the algorithm that computes the gradients needed to train a neural network, so that gradient descent can reduce the error between the predicted output and the actual target. Training with it involves three phases:
- **Forward pass:** Compute the output of the network.
- **Backward pass:** Compute gradients of the loss with respect to the network's parameters using the chain rule.
- **Parameter update:** Adjust weights and biases using gradient descent.

### 2. **Forward Pass**
For a single layer, the output is:
$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$
$$
a^{(l)} = \sigma(z^{(l)})
$$
Here:
- $ W^{(l)} $: Weights of layer $ l $
- $ b^{(l)} $: Bias of layer $ l $
- $ a^{(l-1)} $: Activations from the previous layer
- $ \sigma $: Activation function

The output of the network is:
$$
\hat{y} = a^{(L)}
$$
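
The layer recursion above can be sketched as a loop over layers. This is a minimal sketch, assuming sigmoid activations in every layer and column-vector inputs; the function name `forward` and the `params` list of `(W, b)` pairs are illustrative choices, not something the text prescribes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Return the pre-activations z and activations a of every layer.

    params is a list of (W, b) pairs, one per layer (hypothetical layout).
    activations[0] is the input x; activations[-1] is the prediction y_hat.
    """
    activations = [x]
    pre_activations = []
    a = x
    for W, b in params:
        z = W @ a + b      # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigmoid(z)     # a^(l) = sigma(z^(l))
        pre_activations.append(z)
        activations.append(a)
    return pre_activations, activations

# Randomly initialised network: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros((4, 1))),
          (rng.normal(size=(2, 4)), np.zeros((2, 1)))]
x = rng.normal(size=(3, 1))
zs, acts = forward(x, params)
print(acts[-1].shape)  # shape of y_hat: (2, 1)
```

Caching every $ z^{(l)} $ and $ a^{(l)} $ is deliberate: the backward pass reuses them.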

### 3. **Loss Function**
Define the loss function $ \mathcal{L} $. For example, the squared-error loss for a single training example (the factor $ \frac{1}{2} $ cancels when differentiating):
$$
\mathcal{L} = \frac{1}{2} \sum_i (y_i - \hat{y}_i)^2
$$
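
As a small illustration, this loss and its gradient with respect to the prediction can be written as follows. The function names and the sample values are assumptions made for the example.

```python
import numpy as np

def half_squared_error(y, y_hat):
    """L = 0.5 * sum_i (y_i - y_hat_i)^2"""
    return 0.5 * np.sum((y - y_hat) ** 2)

def half_squared_error_grad(y, y_hat):
    """dL/dy_hat = y_hat - y (the factor 1/2 cancels the exponent 2)."""
    return y_hat - y

y = np.array([0.0, 1.0, 0.5])       # targets (made-up values)
y_hat = np.array([0.2, 0.7, 0.9])   # predictions (made-up values)
loss = half_squared_error(y, y_hat)       # 0.5 * (0.04 + 0.09 + 0.16), about 0.145
grad = half_squared_error_grad(y, y_hat)  # about [0.2, -0.3, 0.4]
print(loss, grad)
```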

### 4. **Backward Pass**
#### Gradient Computation
The goal is to compute gradients of $ \mathcal{L} $ with respect to all weights and biases.

**Step 1: Compute error at the output layer**
$$
\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial z^{(L)}} = (a^{(L)} - y) \odot \sigma'(z^{(L)})
$$
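
This expression follows from the chain rule. As a short worked derivation, assuming the squared-error loss from Section 3 and an activation applied element-wise:
$$
\frac{\partial \mathcal{L}}{\partial a^{(L)}_i} = -(y_i - a^{(L)}_i) = a^{(L)}_i - y_i, \qquad \frac{\partial a^{(L)}_i}{\partial z^{(L)}_i} = \sigma'(z^{(L)}_i)
$$
$$
\delta^{(L)}_i = \frac{\partial \mathcal{L}}{\partial a^{(L)}_i} \cdot \frac{\partial a^{(L)}_i}{\partial z^{(L)}_i} = (a^{(L)}_i - y_i)\, \sigma'(z^{(L)}_i)
$$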

**Step 2: Backpropagate error to hidden layers**

For each layer $ l = L-1, \dots, 1 $ (going backward):
$$
\delta^{(l)} = \left({W^{(l+1)}}^T \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})
$$

**Step 3: Gradients for weights and biases**
$$
\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} {a^{(l-1)}}^T
$$
$$
\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}
$$

Here:
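- $ \delta^{(l)} $: Error term at layer $ l $, i.e. $ \partial \mathcal{L} / \partial z^{(l)} $
- $ \sigma'(z^{(l)}) $: Derivative of the activation function, evaluated element-wise at $ z^{(l)} $
- $ \odot $: Element-wise (Hadamard) product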
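
The three steps can be sketched for a network of any depth and compared against a finite-difference estimate. This is a sketch under stated assumptions: all layers use the sigmoid, the loss is the squared-error loss from Section 3, and the names `backward`, `loss`, and the `params` list of `(W, b)` pairs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, params):
    a = x
    for W, b in params:
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((y - a) ** 2)

def backward(x, y, params):
    """Return a list of (dL/dW, dL/db) pairs, one per layer."""
    # Forward pass, caching activations (activations[0] is the input x)
    activations = [x]
    for W, b in params:
        activations.append(sigmoid(W @ activations[-1] + b))

    grads = [None] * len(params)
    # Step 1: output-layer error; for the sigmoid, sigma'(z) = a * (1 - a)
    a_out = activations[-1]
    delta = (a_out - y) * a_out * (1 - a_out)
    for l in reversed(range(len(params))):
        # Step 3: gradients from this layer's delta and the incoming activation
        grads[l] = (delta @ activations[l].T, delta)
        if l > 0:
            # Step 2: propagate the error one layer back
            W = params[l][0]
            a_prev = activations[l]
            delta = (W.T @ delta) * a_prev * (1 - a_prev)
    return grads

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=(3, 1))),
          (rng.normal(size=(1, 3)), rng.normal(size=(1, 1)))]
x = rng.normal(size=(2, 1))
y = np.array([[0.3]])
grads = backward(x, y, params)

# Central-difference estimate for one entry of the first weight matrix
eps = 1e-6
W1, b1 = params[0]
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (loss(x, y, [(W_plus, b1), params[1]])
           - loss(x, y, [(W_minus, b1), params[1]])) / (2 * eps)
print("analytic:", grads[0][0][0, 1], "numeric:", numeric)
```

The two printed values should agree to many decimal places; a large gap usually points to an indexing or transpose mistake in the backward pass.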

### 5. **Parameter Update**
Using gradient descent, update parameters:
$$
W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}
$$
$$
b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial b^{(l)}}
$$
Here, $\eta $ is the learning rate.
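
The update rule can be sketched as a small helper. The function name `gradient_descent_step` and the list-of-pairs layout for parameters and gradients are hypothetical choices for illustration, and the numbers are made up.

```python
import numpy as np

def gradient_descent_step(params, grads, eta):
    """Return new (W, b) pairs after one step: theta <- theta - eta * grad."""
    return [(W - eta * dW, b - eta * db)
            for (W, b), (dW, db) in zip(params, grads)]

params = [(np.array([[1.0, 2.0]]), np.array([[0.5]]))]
grads = [(np.array([[0.1, -0.2]]), np.array([[0.3]]))]
new_params = gradient_descent_step(params, grads, eta=0.5)
print(new_params[0][0])  # W becomes [[0.95, 2.1]]
print(new_params[0][1])  # b becomes [[0.35]]
```

Returning new arrays instead of modifying the inputs keeps the old parameters available, which can help when comparing the loss before and after a step.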

Let us walk through an example using numerical values and a Python implementation with NumPy to clarify the backpropagation process.

---

### **Example Neural Network Setup**
We’ll use a small neural network with:
- 1 input layer (2 neurons)
- 1 hidden layer (2 neurons)
- 1 output layer (1 neuron)

#### Initial weights and biases:
- $W^{(1)} = \begin{bmatrix} 0.15 & 0.25 \\ 0.20 & 0.30 \end{bmatrix}, b^{(1)} = \begin{bmatrix} 0.35 \\ 0.35 \end{bmatrix} $
- $W^{(2)} = \begin{bmatrix} 0.40 & 0.50 \end{bmatrix}, b^{(2)} = 0.60 $

#### Input and target:
$$
x = \begin{bmatrix} 0.05 \\ 0.10 \end{bmatrix}, \quad y = 0.01
$$

### **Forward Pass (Numerical Example)**
The steps involve computing activations layer by layer.

#### 1. **Hidden Layer**:
$$
z^{(1)} = W^{(1)}x + b^{(1)} = \begin{bmatrix} 0.15 & 0.25 \\ 0.20 & 0.30 \end{bmatrix} \begin{bmatrix} 0.05 \\ 0.10 \end{bmatrix} + \begin{bmatrix} 0.35 \\ 0.35 \end{bmatrix}
$$
$$
z^{(1)} = \begin{bmatrix} 0.15 \cdot 0.05 + 0.25 \cdot 0.10 + 0.35 \\ 0.20 \cdot 0.05 + 0.30 \cdot 0.10 + 0.35 \end{bmatrix} = \begin{bmatrix} 0.3825 \\ 0.3900 \end{bmatrix}
$$

Apply the sigmoid activation function:
$$
a^{(1)} = \sigma(z^{(1)}) = \frac{1}{1 + e^{-z^{(1)}}}
$$
$$
a^{(1)} = \begin{bmatrix} 0.59448 \\ 0.59628 \end{bmatrix}
$$

#### 2. **Output Layer**:
$$
z^{(2)} = W^{(2)}a^{(1)} + b^{(2)} = \begin{bmatrix} 0.40 & 0.50 \end{bmatrix} \begin{bmatrix} 0.59448 \\ 0.59628 \end{bmatrix} + 0.60
$$
$$
z^{(2)} = 1.13593
$$

Apply the sigmoid activation:
$$
\hat{y} = \sigma(z^{(2)}) = \frac{1}{1 + e^{-z^{(2)}}}
$$
$$
\hat{y} = 0.75693
$$

#### Loss:
Using the squared-error loss:
$$
\mathcal{L} = \frac{1}{2} (y - \hat{y})^2 = \frac{1}{2} (0.01 - 0.75693)^2 = 0.27895
$$

---

### **Backward Pass (Numerical Example)**

#### 1. **Output Layer Gradients**:
Error at the output layer:
$$
\delta^{(2)} = \frac{\partial \mathcal{L}}{\partial z^{(2)}} = (\hat{y} - y) \cdot \sigma'(z^{(2)})
$$
$$
\sigma'(z^{(2)}) = \hat{y}(1 - \hat{y}) = 0.75693 \cdot (1 - 0.75693) = 0.18399
$$
$$
\delta^{(2)} = (0.75693 - 0.01) \cdot 0.18399 = 0.13743
$$

Gradients for $W^{(2)} $ and $b^{(2)} $:
$$
\frac{\partial \mathcal{L}}{\partial W^{(2)}} = \delta^{(2)} {a^{(1)}}^T = \begin{bmatrix} 0.08170 & 0.08194 \end{bmatrix}
$$
$$
\frac{\partial \mathcal{L}}{\partial b^{(2)}} = \delta^{(2)} = 0.13743
$$

#### 2. **Hidden Layer Gradients**:
Error at the hidden layer:
$$
\delta^{(1)} = \left({W^{(2)}}^T \delta^{(2)}\right) \odot \sigma'(z^{(1)})
$$
$$
{W^{(2)}}^T \delta^{(2)} = \begin{bmatrix} 0.40 \\ 0.50 \end{bmatrix} \cdot 0.13743 = \begin{bmatrix} 0.05497 \\ 0.06871 \end{bmatrix}
$$
$$
\sigma'(z^{(1)}) = a^{(1)} \odot (1 - a^{(1)}) = \begin{bmatrix} 0.24107 \\ 0.24073 \end{bmatrix}
$$
$$
\delta^{(1)} = \begin{bmatrix} 0.05497 \cdot 0.24107 \\ 0.06871 \cdot 0.24073 \end{bmatrix} = \begin{bmatrix} 0.01325 \\ 0.01654 \end{bmatrix}
$$

Gradients for $W^{(1)} $ and $b^{(1)} $:
$$
\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \delta^{(1)} x^T = \begin{bmatrix} 0.01325 \\ 0.01654 \end{bmatrix} \begin{bmatrix} 0.05 & 0.10 \end{bmatrix} = \begin{bmatrix} 0.00066 & 0.00133 \\ 0.00083 & 0.00165 \end{bmatrix}
$$
$$
\frac{\partial \mathcal{L}}{\partial b^{(1)}} = \delta^{(1)} = \begin{bmatrix} 0.01325 \\ 0.01654 \end{bmatrix}
$$

---

### **Python Implementation**

Let us implement this in Python using NumPy.

```python
import numpy as np

# Sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Inputs and target
x = np.array([[0.05], [0.10]])
y = 0.01

# Initial weights and biases
W1 = np.array([[0.15, 0.25], [0.20, 0.30]])
b1 = np.array([[0.35], [0.35]])
W2 = np.array([[0.40, 0.50]])
b2 = np.array([[0.60]])

# Forward pass
z1 = np.dot(W1, x) + b1
a1 = sigmoid(z1)
z2 = np.dot(W2, a1) + b2
a2 = sigmoid(z2)
loss = 0.5 * (y - a2)**2

# Backward pass
delta2 = (a2 - y) * sigmoid_derivative(z2)
grad_W2 = np.dot(delta2, a1.T)
grad_b2 = delta2

delta1 = np.dot(W2.T, delta2) * sigmoid_derivative(z1)
grad_W1 = np.dot(delta1, x.T)
grad_b1 = delta1

# Print gradients
print("Gradients for W2:", grad_W2)
print("Gradients for b2:", grad_b2)
print("Gradients for W1:", grad_W1)
print("Gradients for b1:", grad_b1)
```

```
Gradients for W2: [[0.08169586 0.08194416]]
Gradients for b2: [[0.13742501]]
Gradients for W1: [[0.00066259 0.00132519]
 [0.00082706 0.00165411]]
Gradients for b1: [[0.01325186]
 [0.01654114]]
```
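
To see the parameter update from Section 5 applied to this example, the sketch below repeats the forward pass, backward pass, and update in a loop. The learning rate `eta = 0.5` and the 100 iterations are assumptions chosen for illustration; the text does not specify them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same inputs, target, and initial parameters as in the example
x = np.array([[0.05], [0.10]])
y = 0.01
W1 = np.array([[0.15, 0.25], [0.20, 0.30]])
b1 = np.array([[0.35], [0.35]])
W2 = np.array([[0.40, 0.50]])
b2 = np.array([[0.60]])

eta = 0.5  # learning rate (assumed value, not given in the text)
losses = []
for step in range(100):
    # Forward pass
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    losses.append(0.5 * np.sum((y - a2) ** 2))

    # Backward pass (for the sigmoid, sigma'(z) = a * (1 - a));
    # both deltas are computed before any parameter is changed
    delta2 = (a2 - y) * a2 * (1 - a2)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)

    # Parameter update
    W2 -= eta * (delta2 @ a1.T)
    b2 -= eta * delta2
    W1 -= eta * (delta1 @ x.T)
    b1 -= eta * delta1

print(f"loss before training: {losses[0]:.5f}")
print(f"loss after {len(losses)} steps: {losses[-1]:.5f}")
```

The first recorded loss is the loss of the initial parameters, so comparing it with the last value shows how much the repeated updates reduced the error on this single training pair.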


```python
# Mathematical Formulas and Python Implementation of Activation Functions

# --- Mathematical Formulas ---
# Sigmoid:          f(x) = 1 / (1 + e^(-x))
# Sigmoid Derivative: f'(x) = f(x) * (1 - f(x))

# Tanh:             f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
# Tanh Derivative:  f'(x) = 1 - f(x)^2

# ReLU:             f(x) = max(0, x)
# ReLU Derivative:  f'(x) = 1 if x > 0 else 0 (the value at x = 0 is a convention)

# Leaky ReLU:       f(x) = x if x > 0 else alpha * x (alpha is a small constant, e.g., 0.01)
# Leaky ReLU Derivative: f'(x) = 1 if x > 0 else alpha

# Softmax:          f(x)_i = e^(x_i) / sum_j e^(x_j)
# Softmax Derivative (Jacobian): d f(x)_i / d x_j = f(x)_i * (1 - f(x)_i) if i == j,
#                     and -f(x)_i * f(x)_j if i != j

import numpy as np

# Sigmoid Activation Function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    sig = sigmoid(x)
    return sig * (1 - sig)

# Tanh Activation Function
def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

# ReLU Activation Function
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return np.where(x > 0, 1, 0)

# Leaky ReLU Activation Function
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1, alpha)

# Softmax Activation Function
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement by subtracting max(x)
    return exp_x / np.sum(exp_x, axis=0, keepdims=True)

def softmax_derivative(softmax_output):
    # Takes the softmax output (not the raw input) and returns the full Jacobian matrix
    s = softmax_output.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

# Examples for each activation function
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

print("Sigmoid:", sigmoid(x))
print("Sigmoid Derivative:", sigmoid_derivative(x))

print("Tanh:", tanh(x))
print("Tanh Derivative:", tanh_derivative(x))

print("ReLU:", relu(x))
print("ReLU Derivative:", relu_derivative(x))

print("Leaky ReLU:", leaky_relu(x))
print("Leaky ReLU Derivative:", leaky_relu_derivative(x))

x_for_softmax = np.array([2.0, 1.0, 0.1])
softmax_output = softmax(x_for_softmax)
print("Softmax:", softmax_output)
print("Softmax Derivative:", softmax_derivative(softmax_output))
```

```
Sigmoid: [0.11920292 0.26894142 0.5        0.73105858 0.88079708]
Sigmoid Derivative: [0.10499359 0.19661193 0.25       0.19661193 0.10499359]
Tanh: [-0.96402758 -0.76159416  0.          0.76159416  0.96402758]
Tanh Derivative: [0.07065082 0.41997434 1.         0.41997434 0.07065082]
ReLU: [0. 0. 0. 1. 2.]
ReLU Derivative: [0 0 0 1 1]
Leaky ReLU: [-0.02 -0.01  0.    1.    2.  ]
Leaky ReLU Derivative: [0.01 0.01 0.01 1.   1.  ]
Softmax: [0.65900114 0.24243297 0.09856589]
Softmax Derivative: [[ 0.22471864 -0.1597636  -0.06495503]
 [-0.1597636   0.18365923 -0.02389562]
 [-0.06495503 -0.02389562  0.08885066]]
```
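
Unlike the element-wise activations, softmax has a full Jacobian matrix as its derivative, and that matrix is easy to get wrong. One way to gain confidence in it is a finite-difference comparison, sketched below. The helper `numerical_jacobian` and the name `softmax_jacobian` are hypothetical, introduced only for this check; the softmax itself is restated here for 1-D inputs.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exp_x / np.sum(exp_x)

def softmax_jacobian(s):
    """Jacobian built from the softmax output s: diag(s) - s s^T."""
    s = s.reshape(-1, 1)
    return np.diagflat(s) - s @ s.T

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian; column j approximates d f / d x_j."""
    n = x.size
    J = np.zeros((f(x).size, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

x = np.array([2.0, 1.0, 0.1])
analytic = softmax_jacobian(softmax(x))
numeric = numerical_jacobian(softmax, x)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
print("row sums:", analytic.sum(axis=1))
```

The difference should be tiny, and each row of the Jacobian sums to approximately zero because the softmax outputs always sum to one.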
