# Understanding Backpropagation

Backpropagation is one of the foundational algorithms in neural network training. Let me break this down for you:

## What is Backpropagation?

Backpropagation (or "backprop") is a specific algorithm, not just a general term for a learning system. It is a method for efficiently calculating gradients in neural networks by applying the chain rule of calculus layer by layer, from the output back to the input, so that intermediate results are reused rather than recomputed.

The key idea of backpropagation is to calculate how each parameter (weight and bias) in the network affects the final output error, so that an optimizer can then adjust those parameters to reduce the error. It does this by propagating the error signal backward through the network, from the output layer toward the input layer, hence the name.
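To make the chain rule concrete, here is a sketch for a single sigmoid neuron. The notation is assumed for illustration only: input $x$, weight $w$, bias $b$, pre-activation $z$, output $y$, target $t$, squared error $E$, and $\sigma$ for the sigmoid function.

```latex
\begin{aligned}
z &= w x + b, \qquad y = \sigma(z), \qquad E = (y - t)^2 \\
\frac{\partial E}{\partial w}
  &= \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial z}\,\frac{\partial z}{\partial w}
   = 2\,(y - t)\;\sigma(z)\bigl(1 - \sigma(z)\bigr)\;x \\
\frac{\partial E}{\partial b}
  &= \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial z}\,\frac{\partial z}{\partial b}
   = 2\,(y - t)\;\sigma(z)\bigl(1 - \sigma(z)\bigr)
\end{aligned}
```

Both gradients share the factor $2\,(y - t)\,\sigma(z)(1 - \sigma(z))$. Computing such shared factors once and reusing them is what makes backpropagation efficient in deeper networks.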

## When is Backpropagation Used?

Backpropagation is used during the training phase of neural networks. Specifically:

1. During each training iteration, after a forward pass calculates the network's prediction
2. After the loss function computes how far off the prediction is from the actual target
3. Before the optimization algorithm (like gradient descent) updates the parameters

Virtually all modern neural networks - from simple MLPs to complex architectures like CNNs, RNNs, and Transformers - rely on backpropagation for training.
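The ordering of forward pass, loss, backward pass, and parameter update can be sketched with a single sigmoid neuron. This is a minimal illustration, not a general training loop: the model, the values of `x`, `t`, `w`, `b`, and the learning rate are all assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single-neuron model: y = sigmoid(w * x + b)
x, t = 0.5, 0.7      # one training example (input, target)
w, b = 0.2, 0.1      # initial parameters (arbitrary choice)
lr = 0.5             # learning rate (arbitrary choice)

losses = []
for step in range(100):
    # 1. Forward pass: compute the prediction
    y = sigmoid(w * x + b)
    # 2. Loss: squared error against the target
    loss = (y - t) ** 2
    losses.append(loss)
    # 3. Backward pass: the chain rule gives the gradients
    delta = 2 * (y - t) * y * (1 - y)   # dE/dz
    grad_w, grad_b = delta * x, delta
    # 4. Optimizer step: plain gradient descent
    w -= lr * grad_w
    b -= lr * grad_b

print(losses[0], losses[-1])
```

Step 3 is the part that backpropagation covers. Step 4 belongs to the optimizer and could be swapped for another update rule without changing how the gradients are computed.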

## Backpropagation Exercise

Let's design a simple exercise to understand backpropagation. We'll work with a minimal neural network with:
- 2 input neurons
- 2 hidden neurons
- 1 output neuron

```txt
Input layer (x):      [x₁, x₂]
Hidden layer (h):     [h₁, h₂]  
Output layer (y):     [y]

Weights:
- W₁: 2×2 matrix connecting input to hidden
- W₂: 2×1 matrix connecting hidden to output

Biases:
- b₁: for hidden layer
- b₂: for output layer

Activation function: sigmoid σ(x) = 1/(1+e^(-x))
Loss function: Mean Squared Error (MSE)
```
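Written out as equations, the network above and its gradients look as follows. This is a sketch under assumed conventions: $x$ and $h$ are row vectors (matching `np.dot(X, w1)` in the code below), $t$ is the target, $E$ is the squared error for a single sample, $\odot$ is the element-wise product, and $\delta_1$, $\delta_2$ are helper symbols introduced here for the layer-wise error terms.

```latex
\begin{aligned}
h &= \sigma(x W_1 + b_1), \qquad y = \sigma(h W_2 + b_2), \qquad E = (y - t)^2 \\[4pt]
\delta_2 &= 2\,(y - t)\; y\,(1 - y) \\
\frac{\partial E}{\partial W_2} &= h^{\top} \delta_2, \qquad
\frac{\partial E}{\partial b_2} = \delta_2 \\[4pt]
\delta_1 &= \bigl(\delta_2 W_2^{\top}\bigr) \odot h \odot (1 - h) \\
\frac{\partial E}{\partial W_1} &= x^{\top} \delta_1, \qquad
\frac{\partial E}{\partial b_1} = \delta_1
\end{aligned}
```

The terms $y(1 - y)$ and $h \odot (1 - h)$ are the sigmoid derivative expressed through the sigmoid's own output, which is why the derivative can be evaluated directly on activations.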

In [None]:
import numpy as np

# Define the sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid, expressed in terms of the sigmoid's output:
# pass in s = sigmoid(z), not the pre-activation z itself
def sigmoid_derivative(s):
    return s * (1 - s)

# Input data
X = np.array([0.5, 0.3])

# Target output
t = np.array([0.7])

# Seed for reproducibility (unused here: the weights below are fixed constants)
np.random.seed(42)

# Initialize weights for the hidden layer and output layer
w1 = np.array([[0.2, 0.4], [0.1, 0.3]])
w2 = np.array([[0.5], [0.6]])
b1 = np.array([0.1, 0.2])
b2 = np.array([0.1])

# Learning rate
lr = 0.1

# Training the network
for epoch in range(1):
    # Forward pass
    hidden_layer_input = np.dot(X, w1) + b1
    hidden_layer_output = sigmoid(hidden_layer_input)
    final_output = np.dot(hidden_layer_output, w2) + b2
    y = sigmoid(final_output)

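    # Backpropagation
    error = np.square(y - t)  # squared error for this single sample
    # Output delta: dE/dy times the sigmoid derivative at the output
    d_predicted_output = 2 * (y - t) * sigmoid_derivative(y)

    # Propagate the output delta back to the hidden layer (uses w2 before its update)
    error_hidden_layer = d_predicted_output.dot(w2.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

    # Gradient descent update: step against the gradient to reduce the error
    w2 -= lr * hidden_layer_output.reshape(-1, 1).dot(d_predicted_output.reshape(1, -1))
    w1 -= lr * X.reshape(-1, 1).dot(d_hidden_layer.reshape(1, -1))
    # With a single sample, each bias gradient is just that layer's delta
    b2 -= lr * d_predicted_output
    b1 -= lr * d_hidden_layer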

print(hidden_layer_output)
print(y)
print(error)

[0.55724785 0.62010643]
[0.67932855]
[0.00042731]


In [3]:
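# Backward pass (manual calculation for gradients)
# Restore the initial parameters so the gradients are evaluated at the same
# point as the forward pass above (the training loop modified them in place)
w1 = np.array([[0.2, 0.4], [0.1, 0.3]])
w2 = np.array([[0.5], [0.6]])
b1 = np.array([0.1, 0.2])
b2 = np.array([0.1])

# The gradient of E with respect to y
grad_E_y = 2 * (y - t)
# The gradient of E with respect to W₂: output delta times each hidden activation
grad_E_W2 = grad_E_y * sigmoid_derivative(y) * hidden_layer_output.reshape(-1, 1)
# The gradient of E with respect to h: output delta sent back through W₂
grad_E_h = grad_E_y * sigmoid_derivative(y) * w2.T
# The gradient of E with respect to W₁: hidden delta times each input
grad_E_W1 = grad_E_h * sigmoid_derivative(hidden_layer_output) * X.reshape(-1, 1)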

print(grad_E_y)
print(grad_E_W2)
print(grad_E_h)
print(grad_E_W1)

[-0.04134291]
[[-0.00501868]
 [-0.0055848 ]]
[[-0.00450314 -0.00540377]]
[[-0.00055551 -0.00063649]
 [-0.00033331 -0.0003819 ]]
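A common way to sanity-check hand-derived gradients like these is to compare them against finite differences. The sketch below rebuilds the same 2-2-1 network from the initial parameters of the exercise. The helper names `loss` and `numerical_grad`, the step size `eps`, and the tolerance are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, b1, w2, b2, x, t):
    """Squared error of the 2-2-1 sigmoid network for one sample."""
    h = sigmoid(x @ w1 + b1)
    y = sigmoid(h @ w2 + b2)
    return float(np.sum((y - t) ** 2))

# Same input, target, and initial parameters as the exercise
x = np.array([0.5, 0.3])
t = np.array([0.7])
w1 = np.array([[0.2, 0.4], [0.1, 0.3]])
w2 = np.array([[0.5], [0.6]])
b1 = np.array([0.1, 0.2])
b2 = np.array([0.1])

# Analytic gradients from backpropagation
h = sigmoid(x @ w1 + b1)
y = sigmoid(h @ w2 + b2)
delta2 = 2 * (y - t) * y * (1 - y)        # output delta, shape (1,)
grad_w2 = np.outer(h, delta2)             # shape (2, 1)
delta1 = (delta2 @ w2.T) * h * (1 - h)    # hidden delta, shape (2,)
grad_w1 = np.outer(x, delta1)             # shape (2, 2)

def numerical_grad(param, eps=1e-6):
    """Central differences; perturbs `param` in place and restores it."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = loss(w1, b1, w2, b2, x, t)
        param[idx] = old - eps
        minus = loss(w1, b1, w2, b2, x, t)
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Both checks should agree if the analytic gradients are correct
print(np.allclose(grad_w1, numerical_grad(w1), atol=1e-8))
print(np.allclose(grad_w2, numerical_grad(w2), atol=1e-8))
```

The finite-difference version needs two forward passes per parameter, so it is only practical as a check on small networks. Backpropagation obtains all gradients from one forward and one backward pass.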


In [4]:
# Parameter updates (manual calculation)
w1 = w1 - lr * grad_E_W1
w2 = w2 - lr * grad_E_W2
b1 = b1 - lr * grad_E_h * sigmoid_derivative(hidden_layer_output)
b2 = b2 - lr * grad_E_y * sigmoid_derivative(y)

print(w1)
print(w2)
print(b1)
print(b2)

[[0.20005613 0.40006431]
 [0.10003368 0.30003858]]
[[0.50050706]
 [0.60056425]]
[[0.10011357 0.20012976]]
[0.10090993]
