Here is a derivation of the back propagation equations based on the provided lecture notes.

### Back Propagation Equations for a Neural Network

The back propagation algorithm is a key component of training artificial neural networks (ANNs). It allows the network to learn from its mistakes by adjusting the weights and biases that connect the neurons. This process involves calculating the gradient of the cost function with respect to each parameter in the network, and then updating the parameters in the direction that reduces the cost.

#### **Setting up the Problem**

*   Let's consider a feedforward neural network with an input layer, one or more hidden layers, and an output layer. 
*   The input layer is defined by the input data **x**. 
*   The output layer produces the model output **y**, which is compared to the target value **t** using a cost or loss function. 
*   The network has a set of parameters **Θ**, which includes the weights and biases of the connections between neurons.
*   The goal of training is to find the values of **Θ** that minimize the cost function; back propagation supplies the gradients that gradient descent needs to do so.

#### **Forward Pass**

In the forward pass, the input data is fed through the network to generate an output. Each neuron computes a weighted sum of its inputs, applies an activation function, and passes the result to the next layer, until the output layer is reached. The weighted input (pre-activation) of neuron $j$ in layer $l$ is given by:

$z^{(l)}_j = \sum_{i=1}^{M_{l-1}} w^{(l)}_{ji} a^{(l-1)}_i + b^{(l)}_j$,

where:

*   $z^{(l)}_j$ is the weighted input (pre-activation) of neuron $j$ in layer $l$.
*   $w^{(l)}_{ji}$ is the weight of the connection from neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$ (receiving neuron first, the index order used in the back propagation equations below).
*   $a^{(l-1)}_i$ is the output (activation) of neuron $i$ in layer $l-1$.
*   $b^{(l)}_j$ is the bias of neuron $j$ in layer $l$.
*   $M_{l-1}$ is the number of neurons in layer $l-1$.

The output of a neuron is then calculated by applying an activation function to its weighted input:

$a^{(l)}_j = \sigma(z^{(l)}_j)$,

where $\sigma$ is the activation function.
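
As a minimal sketch of these two steps, the following pure-Python snippet computes the weighted inputs and outputs of a single layer. The logistic sigmoid as activation function, the name `layer_forward`, the nested-list layout `W[j][i]` (weight from neuron `i` in the previous layer to neuron `j`), and the numeric values are all assumptions made for illustration; they do not come from the lecture notes.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice of activation function)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, a_prev):
    """Return (z, a) for one layer.

    z[j] = sum_i W[j][i] * a_prev[i] + b[j]   (weighted input)
    a[j] = sigmoid(z[j])                      (neuron output)
    """
    z = [sum(w_ji * a_i for w_ji, a_i in zip(row, a_prev)) + b_j
         for row, b_j in zip(W, b)]
    a = [sigmoid(z_j) for z_j in z]
    return z, a


# Hypothetical layer with 2 inputs and 2 neurons.
W = [[0.5, -0.5],
     [1.0, 2.0]]
b = [0.0, -1.0]
z, a = layer_forward(W, b, [1.0, 1.0])
print(z, a)
```

Stacking calls to `layer_forward`, feeding each layer's `a` into the next, gives the full forward pass.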

#### **Calculating the Error**

The cost function compares the network's output with the target value; for a squared-error cost, its derivative with respect to the output is simply their difference. This error signal is then propagated back through the network to update the weights and biases.

#### **Back Propagation**

The back propagation algorithm starts at the output layer and works backwards to the input layer. The error at each layer is used to calculate the gradient of the cost function with respect to the weights and biases in that layer.

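The error of neuron $j$ in layer $l$ is defined as $\delta^{(l)}_j = \partial C / \partial z^{(l)}_j$. At the output layer, the chain rule gives: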

$\delta^{(L)}_j = \sigma'(z^{(L)}_j) \frac{\partial C}{\partial (a^{(L)}_j)}$,

where:

*   $\delta^{(L)}_j$ is the error of neuron $j$ in the output layer (layer $L$).
*   $\sigma'(z^{(L)}_j)$ is the derivative of the activation function evaluated at $z^{(L)}_j$.
*   $\frac{\partial C}{\partial (a^{(L)}_j)}$ is the partial derivative of the cost function with respect to the output of neuron $j$ in the output layer.

The error at a hidden layer $l$ is calculated using the errors from the layer above it ($l+1$):

$\delta^{(l)}_j = \left( \sum_{k} w^{(l+1)}_{kj} \delta^{(l+1)}_k \right) \sigma'(z^{(l)}_j)$,

where:

*   $\delta^{(l)}_j$ is the error of neuron $j$ in layer $l$.
*   $\delta^{(l+1)}_k$ is the error of neuron $k$ in layer $l+1$.
*   $w^{(l+1)}_{kj}$ is the weight of the connection between neuron $j$ in layer $l$ and neuron $k$ in layer $l+1$.
*   $\sigma'(z^{(l)}_j)$ is the derivative of the activation function evaluated at $z^{(l)}_j$.
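
The two error formulas above can be sketched in pure Python as follows. This is an illustration under stated assumptions: a logistic sigmoid activation, a squared-error cost (so that $\partial C / \partial a_j = a_j - t_j$), the hypothetical helper names `output_delta` and `hidden_delta`, and hand-picked numeric values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the logistic sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def output_delta(z_out, a_out, target):
    """delta_j = sigma'(z_j) * dC/da_j, with dC/da_j = a_j - t_j (squared error assumed)."""
    return [sigmoid_prime(z) * (a - t) for z, a, t in zip(z_out, a_out, target)]


def hidden_delta(W_next, delta_next, z_hidden):
    """delta_j = (sum_k W_next[k][j] * delta_next[k]) * sigma'(z_j)."""
    return [
        sum(W_next[k][j] * delta_next[k] for k in range(len(delta_next)))
        * sigmoid_prime(z_j)
        for j, z_j in enumerate(z_hidden)
    ]


# Hypothetical values: 2 hidden neurons feeding 1 output neuron.
z_hidden = [0.0, 0.0]
W_next = [[2.0, -4.0]]        # W_next[k][j]: hidden neuron j -> output neuron k
z_out = [0.0]                 # hand-picked weighted input of the output neuron
a_out = [sigmoid(z_out[0])]
delta_out = output_delta(z_out, a_out, target=[1.0])
delta_hid = hidden_delta(W_next, delta_out, z_hidden)
print(delta_out, delta_hid)
```

Note that `hidden_delta` only needs the next layer's weights and errors plus the current layer's weighted inputs, which is why the forward pass typically stores every $z^{(l)}$.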

#### **Updating the Weights and Biases**

The weights and biases are updated using gradient descent, with each parameter moved against its own partial derivative of the cost:

$w^{(l)}_{jk} \leftarrow w^{(l)}_{jk} - \eta \delta^{(l)}_j a^{(l-1)}_k$,

$b^{(l)}_j \leftarrow b^{(l)}_j - \eta \delta^{(l)}_j$,

where:

*   $\eta$ is the learning rate.
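
The gradient expressions behind these updates follow from one more application of the chain rule. A short sketch, assuming $\delta^{(l)}_j$ denotes $\partial C / \partial z^{(l)}_j$ (consistent with the output-layer formula above) and writing the weighted input with the receiving neuron as the first weight index, $z^{(l)}_j = \sum_k w^{(l)}_{jk} a^{(l-1)}_k + b^{(l)}_j$:

$\frac{\partial C}{\partial w^{(l)}_{jk}} = \frac{\partial C}{\partial z^{(l)}_j} \frac{\partial z^{(l)}_j}{\partial w^{(l)}_{jk}} = \delta^{(l)}_j a^{(l-1)}_k$,

$\frac{\partial C}{\partial b^{(l)}_j} = \frac{\partial C}{\partial z^{(l)}_j} \frac{\partial z^{(l)}_j}{\partial b^{(l)}_j} = \delta^{(l)}_j$.

Only $z^{(l)}_j$ depends directly on $w^{(l)}_{jk}$ and $b^{(l)}_j$, so no sum over neurons appears in either expression.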

#### **Key Equations**

The most important equations for back propagation, written in vector form with $\circ$ denoting the element-wise (Hadamard) product, are:

1.  **Error at the output layer:**

$\delta^{(L)} = \sigma'(z^{(L)}) \circ \frac{\partial C}{\partial (a^{(L)})}$

2.  **Error at a hidden layer:**

$\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \circ \sigma'(z^{(l)})$

3.  **Weight update:**

$W^{(l)} \leftarrow W^{(l)} - \eta \delta^{(l)} (a^{(l-1)})^T$

4.  **Bias update:**

$b^{(l)} \leftarrow b^{(l)} - \eta \delta^{(l)}$
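
To see the four equations working together, here is a hedged end-to-end sketch of repeated training steps on a tiny network, written in pure Python. The sigmoid activation, the squared-error cost, the 2-2-1 architecture, the function names, the learning rate, and all numeric values are assumptions chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(params, x):
    """Return (zs, acts): weighted inputs per layer and outputs, with acts[0] = x."""
    acts, zs = [x], []
    for W, b in params:
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        acts.append([sigmoid(v) for v in z])
    return zs, acts


def cost(params, x, t):
    """Squared-error cost C = 0.5 * sum_j (a_j - t_j)^2 (assumed cost function)."""
    _, acts = forward(params, x)
    return 0.5 * sum((a - tj) ** 2 for a, tj in zip(acts[-1], t))


def backprop_step(params, x, t, eta):
    """One gradient-descent step; returns new parameters, leaving `params` untouched."""
    zs, acts = forward(params, x)
    # Equation 1: output-layer error (dC/da = a - t for squared error).
    delta = [sigmoid_prime(z) * (a - tj) for z, a, tj in zip(zs[-1], acts[-1], t)]
    new_params = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        W, b = params[l]
        a_prev = acts[l]  # outputs of the layer feeding layer l
        # Equations 3 and 4: W <- W - eta * delta a_prev^T, b <- b - eta * delta.
        new_W = [[w - eta * d * a for w, a in zip(row, a_prev)]
                 for row, d in zip(W, delta)]
        new_b = [b_j - eta * d for b_j, d in zip(b, delta)]
        new_params[l] = (new_W, new_b)
        if l > 0:
            # Equation 2: propagate the error back using the pre-update weights.
            delta = [
                sum(W[k][j] * delta[k] for k in range(len(delta)))
                * sigmoid_prime(zs[l - 1][j])
                for j in range(len(a_prev))
            ]
    return new_params


# Hypothetical 2-2-1 network and a single training pair.
params = [([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0]),
          ([[0.5, -0.5]], [0.0])]
x, t = [1.0, 0.5], [1.0]
before = cost(params, x, t)
for _ in range(100):
    params = backprop_step(params, x, t, eta=0.5)
after = cost(params, x, t)
print(before, after)
```

The hidden-layer error is computed from the weights as they were before the update, so every gradient in a step refers to the same parameter values.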

#### **Summary**

The back propagation algorithm is a powerful tool for training neural networks. It allows the network to learn complex patterns by iteratively adjusting its weights and biases. The algorithm works by propagating the error from the output layer back to the input layer, and using this error to calculate the gradient of the cost function with respect to each parameter in the network. These gradients are then used to update the parameters using gradient descent.


Here are some ways that the chain rule is used in back propagation, based on the sources provided.

### How the Chain Rule Enables Back Propagation

*   **The chain rule allows the calculation of the gradient of the cost function with respect to each weight and bias in the network.** The cost function measures the difference between the network's predicted output and the actual target values. The gradient points in the direction of steepest increase of the cost, so the weights and biases are moved in the opposite direction, with a step proportional to the gradient's magnitude.
*   **Back propagation is essentially an implementation of the chain rule.** It starts by calculating the error at the output layer, then uses the chain rule to propagate this error back through the network. The error at each layer is used to calculate the gradient of the cost function with respect to the weights and biases in that layer.
*   **The chain rule is applied repeatedly to calculate the gradients of intermediate variables.** The weighted input of each neuron depends on the outputs of the previous layer, and the output of each neuron is a function of its weighted input. The chain rule is used to calculate the gradients of these intermediate variables, which are then used to calculate the gradients of the cost function with respect to the weights and biases.
*   **The chain rule is used in both the forward and reverse modes of automatic differentiation.** In the forward mode, the derivatives are calculated in the same order as the function is evaluated. In the reverse mode, the derivatives are calculated in the reverse order, starting from the output and working backwards to the input. Back propagation uses the reverse mode of automatic differentiation, which is more efficient for neural networks because the cost is a single scalar output that depends on a very large number of parameters: one backward pass yields the derivative with respect to every parameter, whereas the forward mode would need one pass per parameter.
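
The difference between the two modes can be seen on a small scalar chain. The sketch below is purely illustrative: the function $y = \exp(\sin(x^2))$ and all variable names are assumptions, chosen only because each step has a simple local derivative.

```python
import math


def f(x):
    """Hypothetical chain: x -> u = x**2 -> v = sin(u) -> y = exp(v)."""
    return math.exp(math.sin(x ** 2))


x = 0.5
u = x ** 2
v = math.sin(u)
y = math.exp(v)

# Local derivative of each step in the chain.
du_dx = 2 * x
dv_du = math.cos(u)
dy_dv = math.exp(v)

# Forward mode: carry d(.)/dx alongside the evaluation, from input to output.
d_u = du_dx
d_v = dv_du * d_u
grad_forward = dy_dv * d_v

# Reverse mode: carry dy/d(.) from the output back to the input.
bar_v = dy_dv
bar_u = bar_v * dv_du
grad_reverse = bar_u * du_dx

print(grad_forward, grad_reverse)
```

Both orders multiply the same local derivatives and agree on the result; they differ in cost only when there are many inputs or many outputs.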

#### Example: Simple Perceptron

Consider a simple perceptron with one input, one output, and no hidden layers.

*   The input is denoted by $x$.
*   The weight is denoted by $w_1$.
*   The bias is denoted by $b_1$.
*   The activation function is denoted by $\sigma_1$.
*   The output is denoted by $a_1 = \sigma_1(z_1)$, where $z_1 = w_1 x + b_1$.
*   The cost function is the squared error: $C(x; w_1, b_1) = \frac{1}{2} (a_1 - y)^2$, where $y$ here denotes the target value (written **t** in the general setup above).

To update the weight $w_1$ using gradient descent, we need to calculate $\frac{\partial C}{\partial w_1}$. Applying the chain rule:

$\frac{\partial C}{\partial w_1} = \frac{\partial C}{\partial a_1} \frac{\partial a_1}{\partial z_1} \frac{\partial z_1}{\partial w_1}$.

*   $\frac{\partial C}{\partial a_1} = a_1 - y$ (derivative of the cost function)
*   $\frac{\partial a_1}{\partial z_1} = \sigma_1'(z_1)$ (derivative of the activation function)
*   $\frac{\partial z_1}{\partial w_1} = x$ (derivative of the weighted sum of inputs)

Therefore, 

$\frac{\partial C}{\partial w_1} = (a_1 - y) \sigma_1'(z_1) x$.

Similarly, we can calculate $\frac{\partial C}{\partial b_1}$:

$\frac{\partial C}{\partial b_1} = \frac{\partial C}{\partial a_1} \frac{\partial a_1}{\partial z_1} \frac{\partial z_1}{\partial b_1} = (a_1 - y) \sigma_1'(z_1)$.

These gradients are then used to update $w_1$ and $b_1$ using gradient descent.
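
A common way to sanity-check such hand-derived gradients is to compare them with central finite differences. The sketch below does this for the perceptron above, assuming a logistic sigmoid for $\sigma_1$ (so that $\sigma_1'(z_1) = a_1 (1 - a_1)$); the function names and the numeric values of $w_1$, $b_1$, $x$, and $y$ are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w1, b1, x, y):
    """Squared error C = 0.5 * (a1 - y)^2 with a1 = sigmoid(w1 * x + b1)."""
    a1 = sigmoid(w1 * x + b1)
    return 0.5 * (a1 - y) ** 2


def analytic_grads(w1, b1, x, y):
    """Return (dC/dw1, dC/db1) from the chain-rule expressions."""
    a1 = sigmoid(w1 * x + b1)
    common = (a1 - y) * a1 * (1.0 - a1)   # (a1 - y) * sigma'(z1)
    return common * x, common


# Hypothetical parameter values and training pair.
w1, b1, x, y = 0.8, -0.3, 1.5, 1.0
dw, db = analytic_grads(w1, b1, x, y)

# Central finite differences as an independent numerical estimate.
eps = 1e-6
dw_num = (cost(w1 + eps, b1, x, y) - cost(w1 - eps, b1, x, y)) / (2 * eps)
db_num = (cost(w1, b1 + eps, x, y) - cost(w1, b1 - eps, x, y)) / (2 * eps)
print(dw, dw_num, db, db_num)
```

If the analytic and numerical values disagree by much more than the finite-difference error, the derivation or its implementation likely contains a mistake.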

#### General Case

The same principles apply to more complex neural networks with multiple layers and neurons. The chain rule is used to propagate the error from the output layer back to the input layer, calculating the gradients of the cost function with respect to all weights and biases in the network. These gradients are then used to update the parameters using gradient descent.
