# Introduction to Neural Network Training

## Training Phases

- **Initialize Weights and Biases:** These values control how the network initially processes information

- **Forward-Pass:** Pass the input through the network to get an output (discussed in forward_propagation)

- **Calculate the Error:** Compare the network’s output to the correct answer to measure the difference

- **Back-Propagation:** Use the loss value to adjust the weights and biases to improve the network’s accuracy

<div style="text-align:center">
    <img src="../assets/training_phases.png" alt="training phases visual">
</div>

## Back-Propagation Fundamentals

- The network uses the **loss** to adjust its **weights** and **biases** through a process known as **backpropagation**
- Backpropagation calculates **how much weights should change** to reduce the error

## Problem Setup

- **Given:**  
    The architecture of the network
    - Count of layers
    - Count of neurons of each layer
    - Activation function

- **Training Data**  
    A set of input-output pairs:
    $$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots , (x^{(N)}, y^{(N)})$$

- We want the function $f$:
    - Consider a neural network as a parametric function $f(x;w)$

- We need a **loss function** that measures how much the obtained output $f(x;w)$ is penalized when the desired output is $y$. The total error is the average loss over the training set:
$$E(W) = \frac{1}{N} \sum_{n=1}^N \text{loss} \left(f(x^{(n)};w), y^{(n)} \right)$$

- **Minimize** the **cost function**
$$\hat{W} = \argmin_{W} E(W)$$
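
As a minimal sketch of this setup, the snippet below evaluates the average error $E(w)$ for a few candidate parameters and picks the smallest. The one-parameter model `f(x, w) = w * x`, the squared-error loss, and the toy data are assumptions made only for illustration.

```python
def f(x, w):
    # Hypothetical one-parameter model, used only for illustration
    return w * x

def squared_loss(output, target):
    return 0.5 * (output - target) ** 2

def total_error(w, data):
    # E(w) = (1/N) * sum of per-example losses
    return sum(squared_loss(f(x, w), y) for x, y in data) / len(data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy pairs generated by y = 2x
candidates = [0.0, 1.0, 2.0, 3.0]

# Crude stand-in for argmin: evaluate E(w) on a handful of candidates
best_w = min(candidates, key=lambda w: total_error(w, data))
print(best_w, total_error(best_w, data))
```

Evaluating a fixed list of candidates does not scale beyond toy problems, which is why the next section turns to guided optimization.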

# Optimization Fundamentals

## Weight Update Strategies

**Random Search:**
- Tries values randomly
- Inefficient and impractical

**Gradient Descent:**
- Follows the slope of the loss function
- Efficient and guided

**Why Gradient Descent?**
- It updates weights by following the slope, reducing error with each step.
- Controlled, stepwise updates ensure we move closer to minimizing the loss effectively

## Gradient Descent Algorithm

**Gradient Descent**

In each step, gradient descent moves proportionally to the negative of the gradient vector at the current point.

$$w^{(t+1)} = w^t - \gamma_t \nabla_w J(w^t)$$
$$\nabla_w J(w^t) = [\frac{\partial{J(w)}}{\partial{w_1}}, \frac{\partial{J(w)}}{\partial{w_2}}, ..., \frac{\partial{J(w)}}{\partial{w_d}}]$$

Where:
- $w^t$ is the current point.
- $\gamma_t$ is step size *(learning rate parameter)*.
    - If $\gamma_t$ is small enough, then $J(w^{(t+1)}) \le J(w^{(t)})$
    - When $\gamma$ is too small: gradient descent can be slow.
    - When $\gamma$ is too large: gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
- $J(w)$ decreases from $w^t$ in the direction of $-\gamma_t \nabla_w J(w^t)$.
- **Assumption**: $J(w)$ is defined and differentiable in a neighborhood of a point $w^t$.
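
The update rule and the effect of the step size can be sketched on a simple quadratic. The objective $J(w) = (w_1 - 3)^2 + (w_2 + 1)^2$, the three step sizes, and the function names are assumptions chosen for illustration, not part of the original notes.

```python
def J(w):
    # Toy objective with its minimum at w = (3, -1)
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad_J(w):
    # Vector of partial derivatives of J
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(w0, gamma, steps):
    w = list(w0)
    for _ in range(steps):
        g = grad_J(w)
        # w^(t+1) = w^(t) - gamma * grad J(w^(t))
        w = [wi - gamma * gi for wi, gi in zip(w, g)]
    return w

w_good = gradient_descent([0.0, 0.0], gamma=0.1, steps=100)    # converges
w_slow = gradient_descent([0.0, 0.0], gamma=0.001, steps=100)  # barely moves
w_bad = gradient_descent([0.0, 0.0], gamma=1.1, steps=100)     # diverges

print(J(w_good), J(w_slow), J(w_bad))
```

For this particular quadratic each step multiplies the distance to the minimum by $(1 - 2\gamma)$, so any $\gamma > 1$ makes the iterates grow instead of shrink.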

Since gradient descent is the better strategy, here are some properties that help to use it effectively:


- Define **differentiable loss** or divergence between the output of the network and the desired output for the training instances
    - And a total error, which is the average divergence over all training instances
- Use **continuous activation functions** to enable us to estimate network parameters
- Optimize network parameters to **minimize the total error**

# Weight Initialization Strategies

## Importance of Proper Initialization

- Proper initialization ensures:
    - Faster **convergence**
    - Improved **training stability**
- Prevents issues like
    - **Vanishing gradients:**
        - Gradients become extremely **small**
        - Updates are **negligible**
        - Network learns very **slowly or not at all**
    - **Exploding gradients:**
        - Gradients become extremely **large**
        - Updates are too drastic
        - **Unstable** network training

**Question:** How can we initialize weights to maximize learning efficiency and prevent gradient problems?

## Initialization Methods

### **Zero Initialization**

**Description:**
- Set all weights to zero.

**Key Points:**
- Rarely used
- Leads to identical updates for all neurons, preventing the network from learning distinct features
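
The symmetry problem can be checked on a tiny network. The sketch below assumes a hypothetical model with three tanh hidden units, $h_j = \tanh(w_j x)$, a linear output $o = \sum_j v_j h_j$, and squared-error loss; none of these specifics come from the notes.

```python
import math

def hidden_gradients(w, v, x, y):
    # Tiny network: h_j = tanh(w_j * x), o = sum_j v_j * h_j, loss = 0.5 * (o - y)^2
    h = [math.tanh(wj * x) for wj in w]
    o = sum(vj * hj for vj, hj in zip(v, h))
    # dLoss/dw_j = (o - y) * v_j * (1 - h_j^2) * x
    return [(o - y) * vj * (1.0 - hj ** 2) * x for vj, hj in zip(v, h)]

# All-zero weights: every hidden neuron receives the same (here zero) gradient
zero_grads = hidden_gradients([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], x=1.5, y=1.0)

# Any shared constant: gradients are non-zero but still identical across neurons
const_grads = hidden_gradients([0.5, 0.5, 0.5], [0.5, 0.5, 0.5], x=1.5, y=1.0)

# Distinct starting values: each neuron gets its own gradient
mixed_grads = hidden_gradients([0.1, -0.3, 0.5], [0.2, 0.4, -0.1], x=1.5, y=1.0)

print(zero_grads, const_grads, mixed_grads)
```

Because identical neurons receive identical gradients, they stay identical after every update, so the layer behaves like a single neuron.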

### **Random Initialization**

**Description:**
- Assign **small random** values to weights.

**Distribution:**
- Typically, weights are initialized using a **uniform** or **normal** distribution:
$$w \sim U(-\epsilon, \epsilon) \quad \text{or} \quad w \sim N(0, \sigma^2)$$

**Key Points:**
- **Break symmetry:**
    - If all weights are initialized to the same value (e.g., zeros), every neuron in a layer learns the exact same thing during training (no progress!).
    - Random initialization breaks this symmetry by giving each neuron a slightly different starting point.

- Can still cause issues with **gradient magnitudes**.
    - If $\sigma^2$ is too **small** $\rightarrow$ **Vanishing gradients**
    - If $\sigma^2$ is too **large** $\rightarrow$ **exploding gradients**

### **Xavier Initialization**

**Description:**
- Xavier Initialization is designed to **keep the variance of activations consistent** across layers, ideal for **sigmoid** and **tanh** activations.

**Objective:**
- Prevents the **vanishing** or **exploding** of signal magnitudes during forward and backward propagation.

**Condition:**
$$n_l \, Var[w] = 1$$
Where:
- $n_l$: Number of inputs to a neuron in layer $l$ (fan-in)

**Initialization Scheme:**
$$w \sim U\left(- \sqrt{\frac{3}{n_l}}, \sqrt{\frac{3}{n_l}}\right)$$
A uniform distribution $U(-a, a)$ has variance $\frac{a^2}{3}$, so this bound gives $Var[w] = \frac{1}{n_l}$ and satisfies the condition.

**Key Points**
- **Balancing Variance**
    - If a layer has **many inputs**, make the **weights smaller** (to **avoid exploding** signals).
    - If a layer has **few inputs**, make the **weights larger** (to **avoid vanishing** signals).
- Initialize weights so that the variance of activations and variance of gradients remain roughly the same across layers.

**Note:**  
This formula is a special case of Xavier initialization.  
The full Xavier formula for the uniform distribution is:
$$w \sim U\left(- \sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$$
If $n_{in} = n_{out} = n_l$, this simplifies to:
$$w \sim U\left(- \sqrt{\frac{3}{n_l}}, \sqrt{\frac{3}{n_l}}\right)$$
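
A quick numerical check of the full Xavier uniform formula, as a sketch: sample a square weight matrix and confirm that the sample variance times the fan-in is close to 1. The helper name `xavier_uniform`, the layer size, and the seed are illustrative assumptions.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    # Bound a = sqrt(6 / (n_in + n_out)); Var[U(-a, a)] = a^2 / 3 = 2 / (n_in + n_out)
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200                 # assumed layer width with n_in == n_out
W = xavier_uniform(n, n, rng)

flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
bound = math.sqrt(6.0 / (n + n))

print(n * var)  # close to 1 when n_in == n_out
```

With equal fan-in and fan-out the bound is $\sqrt{3/n}$, and the variance of a uniform distribution on $(-a, a)$ is $a^2/3$, which is where the value near 1 comes from.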

### **He Initialization**

**Description:**
- He Initialization (or Kaiming Initialization) is designed for neural networks with **ReLU** activations, considering the non-linearity of these functions.

**Objective:**
- Aims to prevent the **exponential growth** or **reduction** of input signal magnitudes through layers.

**Condition:**
$$\frac{1}{2} n_l Var[w] = 1$$

**Initialization Scheme:**
$$w_l \sim N\left(0, \frac{2}{n_l}\right)$$

**Key Points**
- ReLU discards half the signals: For any negative input, the output is zero.
    - **Xavier**'s scaling factor ($\frac{1}{n_l}$) is **too small**, causing gradients to **vanish** over time.
- The scheme is a **zero-centered Gaussian** distribution with a variance of $\frac{2}{n_l}$ (standard deviation $\sqrt{\frac{2}{n_l}}$), where biases are initialized to 0.
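
The effect of the factor 2 can be seen by pushing a random input through a stack of ReLU layers. This is a rough sketch: the helper `relu_forward_rms`, the width of 200, the depth of 10, and the seed are assumptions, and weights are drawn on the fly rather than stored.

```python
import math
import random

def relu_forward_rms(weight_std, n, depth, rng):
    # Pass one random input through `depth` fully connected ReLU layers
    # (no biases) and return the root-mean-square of the final activations.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(depth):
        x = [max(0.0, sum(rng.gauss(0.0, weight_std) * xi for xi in x))
             for _ in range(n)]
    return math.sqrt(sum(xi * xi for xi in x) / n)

rng = random.Random(0)
n, depth = 200, 10

he_rms = relu_forward_rms(math.sqrt(2.0 / n), n, depth, rng)      # Var[w] = 2 / n
xavier_rms = relu_forward_rms(math.sqrt(1.0 / n), n, depth, rng)  # Var[w] = 1 / n

print(he_rms, xavier_rms)
```

With variance $\frac{1}{n_l}$ the expected squared activation roughly halves at every ReLU layer, so the signal shrinks geometrically with depth, while variance $\frac{2}{n_l}$ keeps it at about the same scale.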

# Activation Functions

## Importance of Activation Function

**Why Transform Outputs?**  
**Raw outputs** need to be transformed into **meaningful values**, such as probabilities.

Choose an activation function that makes the neuron:
- **Differentiable**
- Have **non-zero derivatives** over much of the input space
    - Because training a network with zero gradients will not guide us to the optimum cost

## Common Activation Functions

### **Step Function**

**Formula:**
$$
\text{Step}(z) = \begin{cases}
1 & \text{if } z \ge 0 \\ 
0 & \text{if } z \lt 0
\end{cases}
$$
<div style="text-align:center">
  <img src="../assets/step_function.png" alt="step function">
</div>
  
- Not differentiable at $z=0$
- Derivative is $0$ elsewhere $\implies$ **Not suitable**

### **Sigmoid**

**Formula:**
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{Where: } z = \sum_i w_ix_i + b$$

**Derivative:**
$$\sigma'(z) = \sigma(z) (1 - \sigma(z))$$

<div style="text-align:center">
    <img src="../assets/sigmoid.png" alt="sigmoid function">
</div>

**Advantages:**
- Squashes the input between 0 and 1, which makes it useful in **probabilistic interpretations** (e.g., logistic regression).
- Often used in output layers for **binary classification** problems.
- **Smooth** & **Differentiable**

**Limitations:**
- **Gradient Saturation:** When $z$ is large in magnitude (very positive or very negative), the gradient becomes nearly zero, causing **slow learning** (**vanishing** gradient problem).
- **Not Zero-Centered:** The output is not zero-centered, which can make optimization more difficult.
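
A small sketch of the sigmoid and its derivative makes the saturation visible. The branch on the sign of `z` is an implementation choice (assumed here) to avoid overflow in `exp` for large negative inputs.

```python
import math

def sigmoid(z):
    # Numerically stable form: never calls exp on a large positive number
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```

The derivative peaks at $0.25$ for $z = 0$ and falls below $10^{-4}$ at $|z| = 10$, which is the saturation described above.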

### **Soft-Max**

Soft-max vector activation is often used at the output of multi-class classifier networks to convert raw outputs to probabilities for each class:
$$
o_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \quad \text{where } z_i = \sum_j w_{ji} x_j + b_i
$$

Where:
- $o_i$: Probability $P(y = i | x)$

<div style="text-align:center">
    <img src="../assets/soft_max.png" alt="applying soft-max function">
  </div>

**Key Points:**
- Soft-max is used in the output layer for **multi-class classification**
- It converts **logits** (raw unbounded numerical scores) into a **probability distribution** across classes.
- The class with the **highest probability** is **selected** as the prediction.
- It ensures all outputs sum to 1, making it ideal for choosing one class out of multiple options.
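
A minimal soft-max sketch follows. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and an addition here, not something the notes prescribe; it leaves the result unchanged because the common factor cancels in the ratio. The example logits are made up.

```python
import math

def softmax(z):
    # Shift by the max logit so that exp never overflows
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative raw scores
probs = softmax(logits)

# Predicted class = index of the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```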

### **Hyperbolic Tangent (Tanh)**

**Formula:**
$$\text{Tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

<div style="text-align:center">
    <img src="../assets/tanh_activation.png" alt="Tanh function">
</div>

**Advantages:**
- **Zero-Centered:** Output ranges from -1 to 1, making optimization easier. (Balanced Updates $\rightarrow$ Reduced Bias in Gradient Descent $\rightarrow$ Faster Convergence)
- Better for **hidden layers** than Sigmoid due to **zero-centered** output.

**Limitations:**
- Similar **saturation issues** as Sigmoid: **Large input** values push **gradients towards zero** (**vanishing** gradient problem).

**Tanh vs. Sigmoid**
- Tanh is **steeper** than Sigmoid around $z = 0$: its derivative there is $1$, compared with $0.25$ for Sigmoid, so it provides a **larger gradient** for backpropagation.

<div style="text-align:center">
    <img src="../assets/tanh_vs_sigmoid.png" alt="Tanh vs Sigmoid function">
</div>
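
The comparison can be checked numerically with a short sketch. It also verifies the identity $\tanh(z) = 2\sigma(2z) - 1$, a standard fact that is not stated in the notes but explains why the two curves have the same shape.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - math.tanh(z) ** 2

print(tanh_grad(0.0), sigmoid_grad(0.0))

# Tanh is a rescaled and shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
z = 0.7
print(math.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```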

### **ReLU**

**Formula:**
$$\text{ReLU}(z) = \max(0, z)$$

<div style="text-align:center">
    <img src="../assets/basic_relu.png" alt="basic ReLU function">
</div>

**Advantages of ReLU**
- **Faster convergence**: Especially for **deep** networks
- Does not saturate for positive inputs, helping to **avoid** the **vanishing** gradient problem
- **Computationally efficient** (Simpler than Sigmoid/Tanh)

**Limitation:**
- **Dead ReLU Problem:** Neurons can become inactive during training, outputting 0 for all inputs if they receive negative values consistently.

### **Leaky ReLU**

Allows a small, non-zero gradient for negative inputs.
$$\text{LeakyReLU}(z) = \max(\alpha z, z), \quad \alpha = 0.01$$

<div style="text-align:center">
    <img src="../assets/leaky_relu.png" alt="leaky relu function">
</div>

**Advantages:**
- Similar to ReLU
- Helps prevent the **dead ReLU** problem, where neurons stop updating.
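
The difference between the two on negative inputs is easiest to see in their gradients. In this sketch the gradient at exactly $z = 0$ is set by convention (an assumption, since ReLU is not differentiable there).

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention: use 0 at z == 0, where the derivative is undefined
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

for z in (-2.0, 3.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For a negative input the ReLU gradient is exactly zero, so no update flows back through that neuron, while Leaky ReLU still passes a gradient of $\alpha$.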

**Parametric ReLU (PReLU)**

Similar to Leaky ReLU, but the slope for negative inputs $\alpha$ is **learned** during training.
$$\text{PReLU}(z) = \max(\alpha z, z), \quad \text{$\alpha$ is learned}$$

<div style="text-align:center">
    <img src="../assets/prelu.png" alt="parametric ReLU function">
</div>

**Advantages:**
- Provides more flexibility by adjusting the slope for negative inputs based on data.

### **Exponential Linear Unit (ELU)**

**Similar to ReLU for positive** values but **smoother for negative** inputs.

$$
\text{ELU}(z) = \begin{cases}
z & \text{if } z \gt 0 \\ 
\alpha(e^z - 1) & \text{if } z \le 0
\end{cases}
$$

<div style="text-align:center">
    <img src="../assets/elu.png" alt="ELU function">
</div>

**Advantages:**
- Provides **faster convergence** and **reduces bias** shift by smoothing negative values.

### **Soft-Plus**

SoftPlus is a **smooth approximation** to the ReLU function and can be used to constrain the output of a machine to always be **positive**.
$$\text{Softplus}(z) = \ln(1 + e^z)$$

<div style="text-align:center">
    <img src="../assets/soft_plus.png" alt="soft-plus function">
</div>

**Advantages:**
- The output is always positive
- The function smoothly increases and is **always differentiable**, **unlike ReLU** (sharp corner at zero)
- For **negative** inputs, the function **approaches zero**, but unlike ReLU, it never exactly reaches zero, **avoiding the problem of dying neurons**
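
A sketch of Softplus and its derivative follows. The rewritten form $\max(z, 0) + \ln(1 + e^{-|z|})$ is algebraically equal to $\ln(1 + e^z)$ and is used here (as an implementation assumption) so that large inputs do not overflow.

```python
import math

def softplus(z):
    # ln(1 + e^z) = max(z, 0) + ln(1 + e^(-|z|)); exp is only called on values <= 0
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softplus_grad(z):
    # The derivative of softplus is the sigmoid function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

for z in (-20.0, 0.0, 50.0):
    print(z, softplus(z), softplus_grad(z))
```

For large positive $z$ the function approaches $z$ itself, matching ReLU, and at $z = 0$ it equals $\ln 2$ with slope $0.5$ instead of having a corner.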

# Loss Functions

## Regression Loss Functions

### **Sum of Squared Errors (SSE) & Mean Squared Error (MSE)**

**SSE**

Also known as $L_2$ divergence

For real-valued output vectors (regression problems).

Squared Euclidean distance between the desired and actual output:
$$
loss(y, o) = \frac{1}{2} \|y - o \|^2 = \frac{1}{2} \sum_k (y_k - o_k)^2
$$

**Mean Squared Error (MSE)**
$$MSE = \frac{1}{N} SSE$$

**Note:** This is differentiable

### **Mean Absolute Error (MAE)**

For real-valued output vectors (regression problems).

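Absolute ($L_1$) distance between the desired and actual output, averaged over the $N$ training instances:
$$
loss(y, o) = \|y - o\|_1 = \sum_k |y_k - o_k|, \quad MAE = \frac{1}{N} \sum_{n=1}^N loss(y^{(n)}, o^{(n)})
$$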

### **MSE vs. MAE**

**MSE (Mean Squared Error):**
- Heavily **penalizes large errors**, promoting smoother outputs.
- Its gradient grows linearly with the error, which leads to **faster convergence** for **large errors**, but makes it **sensitive to outliers**.

**MAE (Mean Absolute Error):**
- Penalizes errors **in proportion to their magnitude**, resulting in sharper outputs and **better handling of outliers**.
- Its constant-magnitude gradient keeps updates bounded but can **slow convergence** with **large errors**.
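
The outlier sensitivity can be illustrated with made-up scalar predictions. Note that `mse` below omits the $\frac{1}{2}$ factor used in the SSE definition, which only rescales the value.

```python
def mse(targets, outputs):
    return sum((y - o) ** 2 for y, o in zip(targets, outputs)) / len(targets)

def mae(targets, outputs):
    return sum(abs(y - o) for y, o in zip(targets, outputs)) / len(targets)

targets = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]     # every prediction off by 0.5
outlier = [1.5, 2.5, 3.5, 14.0]  # last prediction off by 10

print(mse(targets, clean), mae(targets, clean))      # 0.25 0.5
print(mse(targets, outlier), mae(targets, outlier))  # 25.1875 2.875
```

A single bad prediction multiplies the MSE by about 100 but the MAE by less than 6, because squaring amplifies the large error.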

## Classification Loss Function

### **Binary Classification (Logistic Regression)**

**Logistic Regression**

**Sigmoid Function:**
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{Where: } z = \sum_i w_ix_i + b$$

For classification problems:
$$P(Y = 1 | X) = \frac{1}{1 + \exp(- \sum_i w_ix_i - b)}$$

This is a perceptron with a sigmoid activation:
- It computes the probability that the input belongs to class 1

If the desired **output is binary**, the output activation is typically a **sigmoid**:
- Viewed as a probability $P(Y = c | x)$ of class $c$
- Differentiable

**Binary Classifier:**

For binary classifier with scalar output $o \in (0, 1)$:
$$
loss(y, o) = -y \log(o) - (1 - y)\log(1- o)
$$
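
A minimal sketch of this loss for a single example. The clamp parameter `eps` is an illustrative safeguard (not part of the formula) so that `log` never receives exactly 0.

```python
import math

def binary_cross_entropy(y, o, eps=1e-12):
    # Clamp o into [eps, 1 - eps]; eps is an assumed illustrative value
    o = min(max(o, eps), 1.0 - eps)
    return -y * math.log(o) - (1 - y) * math.log(1.0 - o)

print(binary_cross_entropy(1, 0.9))  # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))  # confident and wrong: large loss
print(binary_cross_entropy(1, 0.5))  # undecided: ln 2
```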

### **Multi-Class Classification**

**Multi-Class Classifier**

For a multi-class classifier with $K$ classes, the desired output $y$ uses a one-hot representation
- ($K-1$ zeros and a single $1$)

The network's output will be a probability vector
- $K$ probability values that sum to $1$

**Soft-Max Activation Function:**

Soft-max vector activation is often used at the output of multi-class classifier networks:
$$
o_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \quad \text{where } z_i = \sum_j w_{ji} x_j + b_i
$$

Where:
- $o_i$: Probability $P(class = i | x)$

<div style="text-align:center">
    <img src="../assets/soft_max.png" alt="applying soft-max function">
  </div>

If the desired **output is multinomial**, the output activation is typically a **soft-max**

**Cost function:**

Desired output $y$ is a one-hot vector $[0, 0, \cdots, 1, \cdots, 0, 0]^T$ with the 1 in the $c$-th position (for class $c$)

Actual output will be probability distribution $[o_1, o_2, \cdots, o_K]^T$

The **cross-entropy** between the desired one-hot output $y$ and the actual output $o$, where $c$ is the **actual class**:
$$
loss(y, o) = -\sum_{i=1}^K y_i \log(o_i) = - \log(o_c)
$$
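
As a sketch, the cross-entropy for a one-hot target reduces to the negative log of the probability assigned to the true class. The example vectors are made up for illustration.

```python
import math

def cross_entropy(y, o):
    # -sum_i y_i * log(o_i); terms with y_i == 0 contribute nothing and are skipped
    return -sum(yi * math.log(oi) for yi, oi in zip(y, o) if yi > 0)

y = [0, 0, 1, 0]          # one-hot target, true class c = 2 (0-based index)
o = [0.1, 0.2, 0.6, 0.1]  # illustrative soft-max output

loss = cross_entropy(y, o)
print(loss, -math.log(o[2]))  # both values agree
```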

## Probabilistic View

### **Likelihood and Log-Likelihood Objectives**

**Likelihood:**
$$
p(y^{(1)}, y^{(2)}, \cdots, y^{(N)} | x^{(1)}, x^{(2)}, \cdots, x^{(N)}) = \prod_{n=1}^N p(y^{(n)} | x^{(n)}) = \prod_{n=1}^N p \left(y^{(n)} | f(x^{(n)}; W) \right)
$$

**Log-Likelihood:**
$$
\log \prod_{n=1}^N p \left(y^{(n)} | f(x^{(n)}; W) \right) = \sum_{n=1}^N \log \left(p \left(y^{(n)} | f(x^{(n)}; W) \right) \right)
$$

Maximizing the likelihood corresponds to minimizing the per-example loss $Loss = - \log \left(p \left(y^{(n)} | f(x^{(n)}; W) \right) \right)$


### **Probabilistic Modeling for Regression**

Assume $y$ given $x$ follows a Gaussian (Normal) distribution:
$$p(y|x) = N(f(x;w), \sigma^2)$$
This is equivalent to writing:
$$y = f(x;w) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

That means:
- For a fixed $x$, $y$ is normally distributed
- Mean = $f(x; w)$
- Variance = $\sigma^2$

We model the uncertainty in the predictions:
$$p(y|x,w,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(y - f(x;w))^2}$$

**Apply MLE**

Recall: The likelihood of parameters $w$ and $\sigma^2$
$$L(D;w,\sigma^2) = \prod_{i=1}^n p(y^{(i)} | x^{(i)}, w, \sigma^2)$$

The goal is to find:
$$\hat w = \argmax_{w}L(D;w,\sigma^2)$$

It is easier to maximize the log-likelihood instead:
$$\hat w = \argmax_{w} \ln L(D;w,\sigma^2)$$

$$\ln L(D;w,\sigma^2) = \ln \prod_{i=1}^n p(y^{(i)} | x^{(i)}, w, \sigma^2) = \sum_{i=1}^n \ln p(y^{(i)} | x^{(i)}, w, \sigma^2)$$

Substitute the Gaussian formula:
$$\ln p(y^{(i)} | x^{(i)}, w, \sigma^2) = -\ln \sigma - \frac{1}{2}\ln 2\pi - \frac{1}{2\sigma^2} (y^{(i)} - f(x^{(i)};w))^2$$

Then the full log-likelihood becomes:
$$\ln L(D;w,\sigma^2) = -n \ln \sigma - \frac{n}{2}\ln 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^n (y^{(i)} - f(x^{(i)};w))^2$$

**MLE for $w$**

The goal is:
$$\hat w = \argmax_{w} \ln L(D;w,\sigma^2) = \argmax_{w} \left[ -n \ln \sigma - \frac{n}{2}\ln 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^n (y^{(i)} - f(x^{(i)};w))^2 \right]$$

Which is the same as:
$$\hat w = \argmin_{w} \frac{1}{2\sigma^2}\sum_{i=1}^n (y^{(i)} - f(x^{(i)};w))^2$$

This is exactly minimizing the **Sum of Squared Errors (SSE)**, since the factor $\frac{1}{2\sigma^2}$ is a positive constant that does not depend on $w$.  
It is the standard objective in linear regression.
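
The equivalence can be checked numerically. The sketch below assumes a hypothetical linear model $f(x; w) = w x$, a fixed $\sigma = 0.5$, toy data, and a grid search in place of a real optimizer.

```python
import math

def sse(w, data):
    # Sum of squared errors for the assumed model f(x; w) = w * x
    return sum((y - w * x) ** 2 for x, y in data)

def gaussian_nll(w, data, sigma):
    # Negative log-likelihood of y ~ N(w * x, sigma^2) over the dataset
    n = len(data)
    return (n * math.log(sigma)
            + 0.5 * n * math.log(2.0 * math.pi)
            + sse(w, data) / (2.0 * sigma ** 2))

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]      # made-up noisy samples of y = 2x
grid = [i / 100.0 for i in range(100, 301)]      # candidate w values in [1, 3]

w_mle = min(grid, key=lambda w: gaussian_nll(w, data, sigma=0.5))
w_sse = min(grid, key=lambda w: sse(w, data))
print(w_mle, w_sse)  # the same grid point
```

The negative log-likelihood is the SSE scaled by a positive constant plus terms that do not involve $w$, so both objectives select the same parameter.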

### **Probabilistic Modeling of Classification**

#### **Binary Classification**

**Maximum Log Likelihood:**
$$\hat{w} = \argmax_w \log \left(\prod_{i=1}^N p(y^{(i)} | f(x^{(i)}, W))\right) = \argmax_w \sum_{i=1}^N \log \left(p(y^{(i)} | f(x^{(i)}, W)) \right)$$

**Bernoulli Model**  
- For binary classification:
$$
p(y^{(i)} | f(x^{(i)}, W)) = 
\begin{cases} 
f(x^{(i)};W) & \text{if } y^{(i)} = 1 \\
1 - f(x^{(i)};W) & \text{if } y^{(i)} = 0
\end{cases}
$$
- Compact form:
$$p(y^{(i)} | f(x^{(i)}, W)) = f(x^{(i)};W)^{y^{(i)}} (1 - f(x^{(i)};W))^{(1 - y^{(i)})}$$

**Substitute into the MLE formula:**
$$\sum_{i=1}^N \log \left(p(y^{(i)} | f(x^{(i)}, W))\right) = \sum_{i=1}^N \left[y^{(i)}\log \left(f(x^{(i)};W)\right) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)};W)\right) \right]$$

**Cost Function: Negative Log-Likelihood**

To convert maximization to minimization:
$$J(w) = -\sum_{i=1}^N \log \left(p(y^{(i)} | f(x^{(i)}, W)) \right)$$
$$ = \sum_{i=1}^N - y^{(i)}\log \left(f(x^{(i)};W)\right) - (1 - y^{(i)}) \log \left(1 - f(x^{(i)};W)\right)$$

So:
$$\hat{w} = \argmin_w J(w)$$

**Key Properties:**
- No Closed form solution for $\nabla_wJ(w) = 0$  
- However, for logistic regression (a single sigmoid unit applied to a linear function of $x$), $J(w)$ is **convex**, so any minimum found is global.
- Solution Method: Use iterative optimization (e.g., gradient descent).
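
A minimal sketch of this iterative solution for one-dimensional logistic regression, assuming $f(x; w, b) = \sigma(wx + b)$. The data set, learning rate, and iteration count are illustrative choices; the data are deliberately not linearly separable so the weights stay finite.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, b, data):
    # J(w, b) = -sum [ y log f + (1 - y) log(1 - f) ]
    total = 0.0
    for x, y in data:
        f = sigmoid(w * x + b)
        total -= y * math.log(f) + (1 - y) * math.log(1.0 - f)
    return total

def nll_grad(w, b, data):
    # dJ/dw = sum (f - y) * x,  dJ/db = sum (f - y)
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in data)
    gb = sum(sigmoid(w * x + b) - y for x, y in data)
    return gw, gb

data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

w, b = 0.0, 0.0
initial = nll(w, b, data)
for _ in range(500):
    gw, gb = nll_grad(w, b, data)
    w, b = w - 0.1 * gw, b - 0.1 * gb
final = nll(w, b, data)

print(initial, final, w, b)
```

The gradient has the simple form $(f - y)\,x$ because the derivative of the sigmoid cancels against the derivative of the log terms.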

#### **Multi-Class Classification**

**Recall Multinomial Distribution:**

Parameter Definition:
$$\theta = [\theta_1, \theta_2, ..., \theta_K]$$
Where:
$$\theta_k \in [0, 1] \quad \text{and} \quad \sum_{k=1}^{K} \theta_k = 1$$
$$\theta_k = p(x_k = 1)$$

Likelihood:
$$P(x|\theta) = \prod_{k=1}^K \theta_k^{x_k} = \theta_j \quad \text{(when $x_j = 1$)}$$

Set **cost function** as **negative of log likelihood**.

We need $\hat{W} = \argmin_W J(W)$

$$J(W) = -\log\prod_{i=1}^Np(y^{(i)} | x^{(i)}, W)$$
$$ = -\log \prod_{i=1}^N \prod_{k=1}^K f_k(x^{(i)};W)^{{y_k}^{(i)}}$$
$$ = -\sum_{i=1}^N \sum_{k=1}^K {{y_k}^{(i)}}\log \left(f_k(x^{(i)};W)\right)$$

There is no closed-form solution for $\hat{W}$.  
Use iterative optimization instead.

# Cross Entropy Loss & KL Divergence

**Cross Entropy Loss**

Cross-entropy is the usual cost function in classification problems.

**Kullback–Leibler Divergence Formula:**
$$D_{KL}[q \| p] = \int{q(z) \log \left(\frac{q(z)}{p(z)} \right) dz} = \int{q(z) \log \left(q(z) \right) dz} - \int{q(z) \log \left(p(z) \right) dz}$$
For discrete distributions, the integrals become sums:
$$D_{KL}[q \| p] = \sum_{z} q(z) \log \left(\frac{q(z)}{p(z)} \right) = \sum_{z} q(z) \log \left(q(z) \right) - \sum_{z} q(z) \log \left(p(z) \right)$$

For multi-class classification problem:
- $q(y | x)$: One-hot vector that indicates the target class of $x$
- $p_{\theta}(y | x)$: Output of the parametric model

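Divergence between $y = [y_1, \cdots, y_K]$ and $o = [o_1, \cdots, o_K]$, where $o_k = p_{\theta}(y = k | x)$:
$$D_{KL}[y \| o] = \sum_{k=1}^K y_k \log \left(\frac{y_k}{o_k} \right) = \sum_{k=1}^K y_k \log \left(y_k \right) - \sum_{k=1}^K y_k \log \left(o_k \right) = 0 - \sum_{k=1}^K y_k \log \left(p_{\theta}(y = k | x) \right)$$
The first term is zero for a one-hot $y$ (with the convention $0 \log 0 = 0$), so minimizing the KL divergence is the same as minimizing the cross-entropy.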
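
This relationship can be verified with a short sketch. The function names and example vectors are illustrative; terms with $q_k = 0$ are skipped, which implements the convention $0 \log 0 = 0$.

```python
import math

def kl_divergence(q, p):
    # sum_k q_k * log(q_k / p_k), skipping terms where q_k == 0
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

def cross_entropy(q, p):
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

def entropy(q):
    return -sum(qk * math.log(qk) for qk in q if qk > 0)

y = [0, 1, 0]        # one-hot target
o = [0.2, 0.7, 0.1]  # illustrative model output

# One-hot target: entropy is 0, so KL equals cross-entropy
print(kl_divergence(y, o), cross_entropy(y, o))

# General target: KL = cross-entropy - entropy
q = [0.5, 0.5, 0.0]
print(kl_divergence(q, o), cross_entropy(q, o) - entropy(q))
```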