<a href="https://colab.research.google.com/github/foxtrotmike/CS909/blob/master/nn_optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Understanding Optimization and Convergence Issues in Neural Networks**
(Fayyaz Minhas)

## **1. Introduction to the Chain Rule**
### **What is the Chain Rule?**
The **chain rule** is a fundamental rule in calculus that allows us to differentiate composite functions. If $z$ is a function $f$ of $u$, and $u$ is in turn a function $g$ of $x$, we can write:

$$
z = f(u), \quad u = g(x)
$$

The chain rule states that:

$$
\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial x}
$$

This tells us that to find the derivative of $z$ with respect to $x$, we must **multiply** the derivative of $z$ with respect to $u$ by the derivative of $u$ with respect to $x$.

### **Why is the Chain Rule Important?**
- It allows us to **compute derivatives of nested functions**.
- It is **fundamental in backpropagation**, as each layer in a neural network depends on the previous one.

### **Example: Applying the Chain Rule**
Let's say:

$$
z = (3x + 1)^2
$$

We define:

$$
u = 3x + 1, \quad z = u^2
$$

Now, applying the chain rule:

$$
\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial x}
$$

Computing the derivatives:

$$
\frac{\partial z}{\partial u} = 2u, \quad \frac{\partial u}{\partial x} = 3
$$

Thus:

$$
\frac{\partial z}{\partial x} = 2(3x + 1) \cdot 3 = 6(3x + 1)
$$

This same principle **applies to neural networks**, where each layer is dependent on the previous one, and **backpropagation uses the chain rule** to compute gradients.
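
As a quick sanity check, the worked example can be verified numerically. The sketch below compares the chain-rule result $6(3x + 1)$ with a central finite difference; the helper names and the test point `x0 = 2.0` are illustrative choices, not part of the original notes.

```python
def z(x):
    """Composite function z = (3x + 1)^2, written as z = u^2 with u = 3x + 1."""
    u = 3 * x + 1
    return u ** 2

def dz_dx(x):
    """Chain-rule derivative: dz/du * du/dx = 2u * 3."""
    u = 3 * x + 1
    return 2 * u * 3

def central_difference(f, x, h=1e-6):
    """Numerical derivative, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 2.0
analytic = dz_dx(x0)                 # 6 * (3 * 2 + 1) = 42
numeric = central_difference(z, x0)
print(analytic, numeric)
```

The two values should agree up to floating-point rounding.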

---

## **2. Neural Network Representation and Evaluation**
We define a deep neural network with $L$ layers, where each layer applies its own activation function $a_k(.)$. The network output is:

$$
f(x; W) = a_L(W_L (...(a_2(W_2 (a_1(W_1 x)))))).
$$

where:
- $W_k$ is the weight matrix at layer $k$.
- $a_k(.)$ is the activation function applied at layer $k$.

Each layer transforms the input recursively:
- The **pre-activation** at layer $k$, with $a_0 = x$ denoting the network input, is:
  
  $$
  z_k = W_k a_{k-1}
  $$

- The **post-activation** (output of the activation function) at layer $k$ is:

  $$
  a_k = a_k(z_k)
  $$

- The final output of the network is:

  $$
  f(x; W) = a_L(W_L a_{L-1}).
  $$

For **Evaluation**, we can use the squared error loss function:

$$
e = \frac{1}{2} (f - y)^2
$$
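
The recursion above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the 2-3-1 layer sizes, the weight values, and the choice of a `tanh` hidden layer with an identity output are all assumptions made for the example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def forward(weights, activations, x):
    """Return the pre-activations z_k and post-activations a_k of every layer.

    post[0] is the input x (a_0 = x); post[-1] is the network output f.
    """
    pre, post = [], [list(x)]
    for W, act in zip(weights, activations):
        z = matvec(W, post[-1])               # z_k = W_k a_{k-1}
        pre.append(z)
        post.append([act(z_i) for z_i in z])  # a_k = a_k(z_k)
    return pre, post

def squared_error(f, y):
    """e = 1/2 * sum (f - y)^2, which reduces to 1/2 (f - y)^2 for scalars."""
    return 0.5 * sum((f_i - y_i) ** 2 for f_i, y_i in zip(f, y))

def identity(z):
    return z

# Hypothetical 2-3-1 network: tanh hidden layer, identity output layer.
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],   # W_1: 3 x 2
    [[0.3, -0.1, 0.2]],                        # W_2: 1 x 3
]
pre, post = forward(weights, [math.tanh, identity], x=[1.0, 2.0])
f = post[-1]
print("output f =", f, "loss e =", squared_error(f, y=[0.5]))
```

Keeping every $z_k$ and $a_k$ from the forward pass is a deliberate choice: the gradient computation needs exactly these quantities.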

---

## **3. Optimization: Compute the Gradient of the Error Using the Chain Rule**

We aim to compute the gradient of the error function with respect to each layer's weight matrix $W_k$:

$$
\frac{\partial e}{\partial W_k}
$$

### **Step 1: Compute the Derivative of the Error Function**

Applying the chain rule:

$$
\frac{\partial e}{\partial W_L} = \frac{\partial e}{\partial f} \cdot \frac{\partial f}{\partial W_L}.
$$

Differentiating:

$$
\frac{\partial e}{\partial f} = (f - y)
$$



Since the final output is:

$$
f = a_L(W_L a_{L-1})
$$

we differentiate with respect to $W_L$:

$$
\frac{\partial f}{\partial W_L} = a_L'(z_L) \, a_{L-1}^T.
$$

Thus, substituting in the chain rule, we get:
$$
\frac{\partial e}{\partial W_L} = \left((f - y) \odot a_L'(z_L)\right) a_{L-1}^T.
$$


We define the **error signal** at the output layer:

$$
\delta_L = (f - y) \odot a_L'(z_L).
$$

Substituting:

$$
\frac{\partial e}{\partial W_L} = \delta_L a_{L-1}^T.
$$


---

### **Step 2: Compute the Gradient for Hidden Layers ($W_k$, for $k = L-1, ..., 1$)**

Using the chain rule:

$$
\frac{\partial e}{\partial W_k} = \frac{\partial e}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial W_k}.
$$

where:



   $$
   \frac{\partial e}{\partial a_k} = W_{k+1}^T \delta_{k+1}.
   $$

   We define:

   $$
   \delta_k = (W_{k+1}^T \delta_{k+1}) \odot a_k'(z_k).
   $$

Leading to the **Final Gradient Expression**:

   $$
   \frac{\partial e}{\partial W_k} = \delta_k a_{k-1}^T.
   $$
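
The two steps can be put together in a short backpropagation sketch and checked against a finite difference. Everything specific here is an assumption made for illustration: the 2-3-2-1 layer sizes, the weight values, and the use of `tanh` at every layer.

```python
import math

def matvec(W, v):
    return [sum(w * v_j for w, v_j in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def outer(u, v):
    return [[u_i * v_j for v_j in v] for u_i in u]

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def forward(weights, x):
    """pre[k] holds z_{k+1}; post[k] holds a_k, with post[0] = x."""
    pre, post = [], [list(x)]
    for W in weights:
        pre.append(matvec(W, post[-1]))
        post.append([math.tanh(z) for z in pre[-1]])
    return pre, post

def loss(weights, x, y):
    f = forward(weights, x)[1][-1]
    return 0.5 * sum((f_i - y_i) ** 2 for f_i, y_i in zip(f, y))

def backprop(weights, x, y):
    """Return [de/dW_1, ..., de/dW_L], each computed as delta_k a_{k-1}^T."""
    pre, post = forward(weights, x)
    # Output layer: delta_L = (f - y) * a_L'(z_L), elementwise.
    delta = [(f_i - y_i) * tanh_prime(z_i)
             for f_i, y_i, z_i in zip(post[-1], y, pre[-1])]
    grads = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        grads[k] = outer(delta, post[k])      # gradient for weights[k]
        if k > 0:
            # Propagate: delta_k = (W_{k+1}^T delta_{k+1}) * a_k'(z_k).
            back = matvec(transpose(weights[k]), delta)
            delta = [b * tanh_prime(z) for b, z in zip(back, pre[k - 1])]
    return grads

weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],   # W_1: 3 x 2
    [[0.3, -0.1, 0.2], [0.2, 0.5, -0.4]],     # W_2: 2 x 3
    [[0.6, -0.3]],                             # W_3: 1 x 2
]
x, y = [1.0, 2.0], [0.5]
grads = backprop(weights, x, y)

# Finite-difference check on one entry of W_1.
h = 1e-6
weights[0][1][0] += h
e_plus = loss(weights, x, y)
weights[0][1][0] -= 2 * h
e_minus = loss(weights, x, y)
weights[0][1][0] += h                          # restore the original value
numeric = (e_plus - e_minus) / (2 * h)
print(grads[0][1][0], numeric)
```

Each gradient has the same shape as its weight matrix, and the analytic and numerical values should match closely.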

---

## **4. Compute the Full Expression for $W_1$**

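Expanding the recursion $\delta_k = (W_{k+1}^T \delta_{k+1}) \odot a_k'(z_k)$ from the output layer down to the first layer, the weight matrices and activation derivatives alternate:

$$
\delta_1 = a_1'(z_1) \odot \left(W_2^T \left(a_2'(z_2) \odot \left(W_3^T \cdots \left(W_L^T \left(a_L'(z_L) \odot (f - y)\right)\right)\right)\right)\right).
$$

Thus:

$$
\frac{\partial e}{\partial W_1} = \delta_1 x^T = \left[a_1'(z_1) \odot \left(W_2^T \left(a_2'(z_2) \odot \left(W_3^T \cdots \left(W_L^T \left(a_L'(z_L) \odot (f - y)\right)\right)\right)\right)\right)\right] x^T.
$$

When every layer has a single unit, all factors are scalars and commute, so the expression collapses to a plain product of the weights, the activation derivatives, the error $f - y$, and the input $x$.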

---

## **5. Simplified Example for $L = 3$ with Single-Dimensional $x$ and $y$**

Let:
- $x$ and $y$ be scalars.
- Each layer has **one weight**: $w_1, w_2, w_3$.

### **Forward Pass:**
$$
z_1 = w_1 x, \quad a_1 = a_1(z_1).
$$

$$
z_2 = w_2 a_1, \quad a_2 = a_2(z_2).
$$

$$
z_3 = w_3 a_2, \quad f = a_3(z_3).
$$

### **Backpropagation:**
$$
\delta_3 = (f - y) a_3'(z_3).
$$

$$
\delta_2 = w_3 \delta_3 a_2'(z_2).
$$

$$
\delta_1 = w_2 \delta_2 a_1'(z_1).
$$

### **Full Expression for $ \frac{\partial e}{\partial w_1} $:**
Expanding $ \delta_1 $:

$$
\delta_1 = w_2 w_3 (f - y) a_3'(z_3) a_2'(z_2) a_1'(z_1).
$$

Thus, the **full gradient for $ w_1 $** is:

$$
\frac{\partial e}{\partial w_1} = w_2 w_3 (f - y) a_3'(z_3) a_2'(z_2) a_1'(z_1) x.
$$

This confirms:
- **The gradient for $w_1$ contains the weights of every later layer** ($w_2$ and $w_3$).
- **Activation function derivatives** from every layer appear as factors.
- **The input $x$ and the output error $f - y$** appear as factors.
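
The closed-form expression for $\frac{\partial e}{\partial w_1}$ can be checked numerically. The sketch below assumes a sigmoid activation at all three layers and arbitrary illustrative values for the weights, input, and target; none of these choices come from the original notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def loss(w1, w2, w3, x, y):
    f = sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x)))
    return 0.5 * (f - y) ** 2

def grad_w1(w1, w2, w3, x, y):
    """de/dw1 = w2 w3 (f - y) a3'(z3) a2'(z2) a1'(z1) x."""
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    a2 = sigmoid(z2)
    z3 = w3 * a2
    f = sigmoid(z3)
    return (w2 * w3 * (f - y)
            * sigmoid_prime(z3) * sigmoid_prime(z2) * sigmoid_prime(z1) * x)

w1, w2, w3, x, y = 0.5, -1.2, 0.8, 1.5, 1.0
analytic = grad_w1(w1, w2, w3, x, y)

h = 1e-6
numeric = (loss(w1 + h, w2, w3, x, y) - loss(w1 - h, w2, w3, x, y)) / (2 * h)
print(analytic, numeric)
```

Setting `x = 0.0` or `w2 = 0.0` in this sketch makes the analytic gradient exactly zero even though $f \neq y$.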

---

## **Understanding Neural Network Convergence Issues**

Neural networks are trained using **gradient-based optimization**, such as **gradient descent**, which updates the weights using:

$$
W_k^{(t+1)} = W_k^{(t)} - \eta \frac{\partial e}{\partial W_k}
$$

where:
- $t$ is the iteration number.
- $\eta$ is the learning rate.
- $\frac{\partial e}{\partial W_k}$ is the gradient.
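
The update rule can be illustrated on the simplest possible case. The sketch below assumes a single scalar weight with an identity activation, so $f = wx$ and $\frac{\partial e}{\partial w} = (f - y)x$; the function name and the numeric values are illustrative.

```python
def gradient_descent(w, x, y, eta, steps):
    """Repeatedly apply w <- w - eta * de/dw for e = 1/2 (w x - y)^2."""
    for _ in range(steps):
        f = w * x
        grad = (f - y) * x      # de/dw by the chain rule
        w = w - eta * grad
    return w

# With x = 2 and y = 6, the error is zero at w = 3.
w_final = gradient_descent(w=0.0, x=2.0, y=6.0, eta=0.1, steps=50)
print(w_final)
```

In this example each step multiplies the distance to the solution by $1 - \eta x^2$, so learning rates above $2 / x^2 = 0.5$ make the iterates diverge instead of converge.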

For proper convergence, we want **the gradient to be zero only when the error is zero**, i.e.,:

$$
f - y = 0 \quad \Rightarrow \quad f = y
$$

However, the **gradient can also be zero** under the following problematic conditions:

### **1. The Input $x$ is Zero**
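Since $\frac{\partial e}{\partial W_1} = \delta_1 x^T$, a zero input makes this gradient zero regardless of the error. The same holds at deeper layers whenever the incoming activation $a_{k-1}$ is zero.  
👉 **Solution**: Ensure good input representation using **feature scaling and normalization** (e.g., **Batch Normalization**).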

### **2. Weight Values Being Close to Zero**
If the weight values are too small, then:

$$
z_k = W_k a_{k-1}
$$

will be close to zero, leading to **low activations**. The backpropagated error $W_{k+1}^T \delta_{k+1}$ shrinks as well, so the result is **weak gradient updates**.  
👉 **Solution**: Proper weight initialization (e.g., **Xavier or He initialization**).
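
A minimal sketch of the two initialization schemes, assuming their common normal-distribution variants: Xavier (Glorot) uses a standard deviation of $\sqrt{2 / (n_{in} + n_{out})}$ and He uses $\sqrt{2 / n_{in}}$. The function names and layer sizes below are hypothetical; deep learning libraries ship their own versions.

```python
import math
import random

def xavier_normal(fan_out, fan_in, rng):
    """Weights ~ N(0, 2 / (fan_in + fan_out)); often paired with tanh/sigmoid."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def he_normal(fan_out, fan_in, rng):
    """Weights ~ N(0, 2 / fan_in); often paired with ReLU-type activations."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def sample_std(W):
    """Empirical standard deviation of all entries, assuming zero mean."""
    values = [w for row in W for w in row]
    return math.sqrt(sum(w * w for w in values) / len(values))

rng = random.Random(0)                 # fixed seed for reproducibility
W_xavier = xavier_normal(64, 128, rng)
W_he = he_normal(64, 128, rng)
print(sample_std(W_xavier), sample_std(W_he))
```

Both schemes scale the weights with the layer width so that signals neither shrink toward zero nor blow up as they pass through many layers.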

### **3. Activation Function Derivatives Being Zero**
If any activation function has a **zero derivative**, the gradient flow **stops**.  
For example, **ReLU** has:

$$
a'(z) =
\begin{cases}
  1, & z > 0 \\
  0, & z \leq 0
\end{cases}
$$

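If $z \leq 0$ for a unit, no gradient flows back through it; a unit that stays in this regime for every input stops learning (the "dying ReLU" problem).  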
👉 **Solution**: Use **Leaky ReLU** or **ELU**.
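
The difference is easy to see by comparing the two derivatives directly. In this sketch the negative-side slope `alpha = 0.01` and the upstream error signal `0.7` are illustrative assumptions.

```python
def relu_prime(z):
    """Derivative of ReLU: 1 for z > 0, otherwise 0."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_prime(z, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for z > 0, otherwise a small slope alpha."""
    return 1.0 if z > 0 else alpha

upstream = 0.7   # hypothetical error signal arriving from the next layer
for z in (-2.0, 0.0, 3.0):
    print(z, upstream * relu_prime(z), upstream * leaky_relu_prime(z))
```

For non-positive $z$, ReLU blocks the error signal completely, while Leaky ReLU lets a small fraction of it through.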

---

## **6. Vanishing and Exploding Gradients**
Since backpropagation involves **multiplying gradients through layers**, two extreme cases can occur:

### **Vanishing Gradient Problem**
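The error signal at the first layer is:

$$
\delta_1 = a_1'(z_1) \odot \left(W_2^T \left(a_2'(z_2) \odot \left(W_3^T \cdots \left(W_L^T \left(a_L'(z_L) \odot (f - y)\right)\right)\right)\right)\right).
$$

If each **$a_k'(z_k)$ is small** (for example, the sigmoid derivative never exceeds $0.25$), the product of many small terms causes **gradients to vanish** in the early layers.  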
👉 **Solution**: Use **ReLU-based activations**, **Batch Normalization** and/or **Layer Normalization**.
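
To get a feel for how quickly this happens, the sketch below multiplies sigmoid derivatives across layers. It assumes, purely for illustration, that every layer sits at $z = 0$, where the sigmoid derivative takes its maximum value of $0.25$, so the numbers are a best case.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def derivative_product(depth, z=0.0):
    """Product of `depth` sigmoid derivatives, all evaluated at the same z."""
    return sigmoid_prime(z) ** depth

for depth in (1, 5, 10, 20):
    print(depth, derivative_product(depth))
```

Even in this best case the factor contributed by the activation derivatives shrinks as $0.25^L$ with depth $L$.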

### **Exploding Gradient Problem**
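If the weights or activation derivatives along the path have magnitudes greater than one, the same product grows exponentially with depth instead, producing very large gradients and **unstable updates**.  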
👉 **Solution**: Use **gradient clipping** and proper **weight initialization**.
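
One common form of gradient clipping rescales the gradient whenever its Euclidean norm exceeds a threshold. The sketch below is a minimal version that treats the gradient as a flat list; the function name and the threshold are illustrative, and deep learning frameworks provide their own utilities for this.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` (a flat list) so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)            # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

# A gradient of norm 50 is rescaled to norm 5; its direction is preserved.
clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
print(clipped)
```

Clipping by norm keeps the update direction and only limits the step length, which is why it is often preferred over clipping each component separately.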


These problems can also be mitigated by **skip connections** (also known as residual connections), which break up the long chain of multiplications and give the gradient a direct path to earlier layers, thus improving optimization.
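
The effect of a skip connection on the gradient can be seen in a scalar toy model. The sketch assumes each layer computes $h \mapsto \tanh(wh)$, optionally with the input added back ($h \mapsto h + \tanh(wh)$); the weight, depth, and input values are illustrative.

```python
import math

def chain_derivative(depth, w, x, residual):
    """d(output)/d(input) through `depth` scalar layers.

    Plain layer:    h -> tanh(w * h),      local derivative w * (1 - tanh^2)
    Residual layer: h -> h + tanh(w * h),  local derivative 1 + w * (1 - tanh^2)
    """
    h, grad = x, 1.0
    for _ in range(depth):
        t = math.tanh(w * h)
        local = w * (1.0 - t * t)
        if residual:
            h, grad = h + t, grad * (1.0 + local)
        else:
            h, grad = t, grad * local
    return grad

plain = chain_derivative(depth=20, w=0.5, x=1.0, residual=False)
skip = chain_derivative(depth=20, w=0.5, x=1.0, residual=True)
print(plain, skip)
```

Without the skip connection every factor is at most $w = 0.5$, so the product decays geometrically; with it, every factor is at least one, so the gradient reaching the input cannot vanish in this toy setting.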

---
