# Optimization

**Convex Functions**

**Definition:**  
A function is convex if, for any two points on its graph, the line segment (chord) connecting them lies **on or above** the graph.
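
The chord condition is usually written as an inequality. The symbols $f$, $x$, $y$ and $\lambda$ below are generic placeholders, not notation used elsewhere in these notes:

$$
f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \quad \text{for all } x, y \text{ and all } \lambda \in [0, 1]
$$

As a quick check, take $f(x) = x^2$ with $x = -1$, $y = 1$ and $\lambda = \tfrac{1}{2}$: the left side is $f(0) = 0$ and the right side is $\tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 1 = 1$, so the inequality holds.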

**Key Properties:**
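- **Every local minimum is a global minimum** $\rightarrow$ Easier to optimize, since there are no spurious local minima or saddle points. (A strictly convex function has at most one minimizer.)
- **Numerical methods (e.g., Gradient Descent)** converge to the global minimum, provided the step size is chosen appropriately.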

**Non-Convex Functions**

**Definition:**  
Functions with complex surfaces that may contain multiple minima, maxima, and saddle points.

**Challenges in Optimization:**
- **No guaranteed convergence** to a global minimum.
- Algorithms may exhibit unstable behavior:
    - **Jitter:** Oscillations near a minimum due to large gradients or a learning rate that is too high.
        <div style="text-align:center">
        <img src="../assets/jitter.png" alt="jittering">
        </div>
    - **Divergence:** The optimization fails entirely, with the loss increasing uncontrollably.
        <div style="text-align:center">
        <img src="../assets/diverging.png" alt="diverging">
        </div>

**Critical Points in Non-Convex Functions:**
- **Global minimum**
    - Overall lowest point
- **Local minimum**
    - Lower than nearby points, but not the lowest overall.
- **Saddle points**
    - Points where the **gradient is zero**, but the function **increases** in some directions and **decreases** in others.
    - Some of the **eigenvalues** of the **Hessian** are positive; others are negative.
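
The eigenvalue test for a saddle point can be sketched on a small example. The function $f(x, y) = x^2 - y^2$ and the helper `symmetric_2x2_eigenvalues` are illustrative assumptions, not part of the original notes:

```python
import math


def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius


# Assumed example: f(x, y) = x^2 - y^2.
# Its gradient (2x, -2y) vanishes at the origin, and its Hessian is the
# constant matrix [[2, 0], [0, -2]].
lam_min, lam_max = symmetric_2x2_eigenvalues(2.0, 0.0, -2.0)

# Mixed signs: the function curves up along x and down along y.
is_saddle = lam_min < 0.0 < lam_max
print(lam_min, lam_max, is_saddle)
```

If all eigenvalues were positive, the same critical point would be a local minimum; if all were negative, a local maximum.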

*Neural network loss surfaces are typically **non-convex**, making optimization challenging.*

**The Controversial Error Surface**
- An **exponential number of saddle points** in **large networks** (Dauphin et al. (2015))
- For **large networks**, most **local minima** lie in a band and are **equivalent** (Choromanska et al. (2015))
- In **networks of small size**, trained on finite data, you can have **horrible local minima** (Swirszcz et al. (2016))

# Newton's Method

## Newton's Original Method

**Objective**  
Newton's method finds a root of a function $ f(x) $ by repeatedly following the tangent line at the current estimate, starting from an initial point $ x_0 $.

**Tangent Line Equation**  
Given a point $ x_0 $ where $ f(x_0) \neq 0 $ (not already a root) and $ f'(x_0) \neq 0 $, the tangent line at $ x_0 $ is expressed as:
$$ y = mx + c $$

- **Gradient (Slope):**  
  The slope $ m $ is equal to the derivative of $ f(x) $ evaluated at $ x_0 $:
  $$ m = f'(x_0) $$

<div style="text-align:center">
  <img src="../assets/newton_method.png" alt="Newton's method">
</div>

**Finding the Y-Intercept $ c $**  
Substitute the point $ (x_0, f(x_0)) $ into the tangent line equation $ y = mx + c $:
$$ f(x_0) = f'(x_0)x_0 + c $$  
Solving for $ c $:
$$ c = f(x_0) - f'(x_0)x_0 $$

**Final Tangent Line Equation**  
Substitute $ m = f'(x_0) $ and $ c $ back into the tangent line equation:
$$ y = f'(x_0)x + f(x_0) - f'(x_0)x_0 $$  
Simplify to obtain the standard form:
$$ y = f(x_0) + f'(x_0)(x - x_0) $$

**Approximating the Root**  
To approximate the root of $ f(x) $, set $ y = 0 $ in the tangent line equation:
$$ 0 = f(x_0) + f'(x_0)(x_1 - x_0) $$  
Rearrange to solve for the next approximation $ x_1 $:
$$ x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} $$

**Iteration Process**  
Repeat the step above to iteratively refine the approximation of the root:
$$ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} $$
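
A minimal sketch of this iteration in Python follows. The function name `newton_root`, the stopping tolerance, and the test function $f(x) = x^2 - 2$ are assumptions chosen for illustration:

```python
def newton_root(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        slope = f_prime(x)
        if slope == 0.0:
            # A horizontal tangent never crosses the x-axis.
            raise ZeroDivisionError("f'(x) is zero; Newton step undefined")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            break
    return x


# Assumed example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

Starting from $x_0 = 1$, the first step gives $x_1 = 1 - (-1)/2 = 1.5$, and later iterates approach $\sqrt{2}$ rapidly.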

## Newton’s Method for Optimization

**From Root-Finding to Optimization**

- **Root-finding (Original Newton's Method):**
    - Uses a **first-order approximation** (tangent line) to iteratively find roots of $f(w)$.

- **Optimization Adaptation:**
    - Uses a **second-order Taylor approximation** to locate minima/maxima of a function $E(w)$.

**Second-order Taylor expansion** of $E(w)$ around $w = w^{(k)}$:
$$
E(w) \approx E(w^{(k)}) + E'(w^{(k)})(w - w^{(k)}) + \frac{1}{2} E''(w^{(k)}) (w - w^{(k)})^2
$$
**Key Property:** Exact equality holds if $E(w)$ is **quadratic**.

**Optimality Condition:**  
To find $\hat{w} = \operatorname*{arg\,min}_w E(w)$, set the derivative of the approximation to zero:
$$
\frac{\partial E(w)}{\partial w} = E'(w^{(k)}) + E''(w^{(k)}) (w - w^{(k)}) = 0
$$

Multiply through by the inverse Hessian $E''(w^{(k)})^{-1}$:
$$
E''(w^{(k)})^{-1} E'(w^{(k)}) + w - w^{(k)} = 0
$$

Rearrange to obtain the **Newton update rule**:
$$
w = w^{(k)} - E''(w^{(k)})^{-1} E'(w^{(k)})
$$
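
For a scalar parameter, the update rule can be sketched as below. The objective $E(w) = e^w - 2w$ is an assumed example (not from the notes), chosen because $E''(w) = e^w > 0$ everywhere and its minimum at $w = \ln 2$ is known in closed form:

```python
import math


def newton_minimize_1d(grad, hess, w0, tol=1e-12, max_iter=50):
    """Iterate w <- w - E'(w) / E''(w) for a scalar parameter w."""
    w = w0
    for _ in range(max_iter):
        step = grad(w) / hess(w)
        w -= step
        if abs(step) < tol:
            break
    return w


# Assumed example: E(w) = exp(w) - 2w
#   E'(w)  = exp(w) - 2
#   E''(w) = exp(w)
w_hat = newton_minimize_1d(
    grad=lambda w: math.exp(w) - 2.0,
    hess=lambda w: math.exp(w),
    w0=0.0,
)
print(w_hat, math.log(2.0))
```

This is the root-finding iteration applied to $E'(w)$ instead of $f(x)$, which is why the second derivative appears in the denominator.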

**Hessian Matrix $H(\theta)$**

Given a scalar-valued function $ f(\theta) $, where  
$ \theta = [\theta_1, \theta_2, \dots, \theta_n]^T $,  
the **Hessian matrix** $ H(\theta) $ is the square matrix of second-order partial derivatives of $ f $:

$$
H(\theta) = \nabla^2 f(\theta) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial \theta_1^2} & \frac{\partial^2 f}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 f}{\partial \theta_1 \partial \theta_n} \\
\frac{\partial^2 f}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 f}{\partial \theta_2^2} & \cdots & \frac{\partial^2 f}{\partial \theta_2 \partial \theta_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial \theta_n \partial \theta_1} & \frac{\partial^2 f}{\partial \theta_n \partial \theta_2} & \cdots & \frac{\partial^2 f}{\partial \theta_n^2}
\end{bmatrix}
$$
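
When second derivatives are not available analytically, each entry can be approximated numerically. The sketch below uses central differences; the helper name `numerical_hessian`, the step `h`, and the test function are assumptions for illustration, and this is not how large models compute curvature in practice:

```python
def numerical_hessian(f, theta, h=1e-3):
    """Approximate H[i][j] = d^2 f / (d theta_i d theta_j) by central differences."""
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                # Evaluate f with theta_i moved by di and theta_j moved by dj.
                point = list(theta)
                point[i] += di
                point[j] += dj
                return f(point)

            H[i][j] = (
                shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)
            ) / (4.0 * h * h)
    return H


# Assumed example: f(theta) = theta_1^2 * theta_2 + 3 * theta_2^2.
# Analytic Hessian at (1, 2): [[2*theta_2, 2*theta_1], [2*theta_1, 6]] = [[4, 2], [2, 6]].
f = lambda t: t[0] ** 2 * t[1] + 3.0 * t[1] ** 2
H = numerical_hessian(f, [1.0, 2.0])
print(H)
```

The off-diagonal entries agree because mixed partial derivatives of a smooth function do not depend on the order of differentiation, so the Hessian is symmetric.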

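**Note:** for a vector parameter $w$,
- $E''(w^{(k)})$ becomes the Hessian matrix $H_E(w^{(k)})$
- $E'(w^{(k)})$ becomes the gradient vector $\nabla_w E(w^{(k)})$

Substituting these into the Taylor expansion gives:
$$
E(w) \approx E(w^{(k)}) + \nabla_w E(w^{(k)})^T (w - w^{(k)}) + \frac{1}{2} (w - w^{(k)})^T H_E(w^{(k)}) (w - w^{(k)})
$$
Setting the gradient of this approximation to zero yields the multivariate **Newton update rule**:
$$
w = w^{(k)} - H_E(w^{(k)})^{-1} \nabla_w E(w^{(k)})
$$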

If $E(w)$ is quadratic, we can arrive at the optimum in a single step using the optimal step size:
$$
\eta_{opt} = E''(w^{(k)})^{-1} = H_E(w^{(k)})^{-1}
$$
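
The single-step behaviour can be checked on a small quadratic. Everything in this sketch is an assumed example: the objective $E(w) = \tfrac{1}{2} w^T A w - b^T w$, the particular values of `A` and `b`, and the helper `solve_2x2`:

```python
def solve_2x2(H, g):
    """Solve H d = g for a 2x2 matrix H using Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if det == 0.0:
        raise ZeroDivisionError("Hessian is singular")
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]


# Assumed quadratic: E(w) = 0.5 * w^T A w - b^T w, so the gradient is
# A w - b and the Hessian is the constant matrix A.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def gradient(w):
    return [
        A[0][0] * w[0] + A[0][1] * w[1] - b[0],
        A[1][0] * w[0] + A[1][1] * w[1] - b[1],
    ]


def newton_step(w):
    # w_new = w - H^{-1} grad, computed by solving H d = grad instead of
    # forming the inverse explicitly.
    direction = solve_2x2(A, gradient(w))
    return [w[0] - direction[0], w[1] - direction[1]]


w1 = newton_step([4.0, -3.0])
print(w1, gradient(w1))
```

After one step the gradient is zero up to rounding, regardless of the starting point, because the second-order Taylor expansion of a quadratic is exact.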

**With Non-Optimal Step Size**

Gradient descent with a fixed step size $\eta$ to estimate a scalar parameter $w$:
$$
w^{(k + 1)} = w^{(k)} - \eta \frac{\partial E(w^{(k)})}{\partial w}
$$
$$

- For $\eta \lt \eta_{opt}$ the algorithm will converge monotonically

- For $2\eta_{opt} \gt \eta \gt \eta_{opt}$ we have oscillating convergence

- For $\eta \gt 2\eta_{opt}$ we get divergence

<div style="text-align:center">
  <img src="../assets/non_optimal_step_size.png" alt="non-optimal step size">
</div>
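
The three regimes can be reproduced on a one-dimensional quadratic. The objective $E(w) = \tfrac{1}{2} a w^2$ with $a = 2$ is an assumed example, for which $\eta_{opt} = 1/a$ and each update multiplies $w$ by $(1 - \eta a)$:

```python
def gradient_descent(eta, a=2.0, w0=1.0, steps=20):
    """Run fixed-step gradient descent on E(w) = 0.5 * a * w**2 and return the path."""
    w = w0
    path = [w]
    for _ in range(steps):
        w -= eta * a * w  # dE/dw = a * w
        path.append(w)
    return path


a = 2.0
eta_opt = 1.0 / a  # inverse of the second derivative E''(w) = a

monotone = gradient_descent(0.5 * eta_opt)     # multiplier 1 - 0.5 = 0.5
one_step = gradient_descent(eta_opt)           # multiplier 0
oscillating = gradient_descent(1.5 * eta_opt)  # multiplier 1 - 1.5 = -0.5
diverging = gradient_descent(2.5 * eta_opt)    # multiplier 1 - 2.5 = -1.5

for name, path in [
    ("monotone", monotone),
    ("one_step", one_step),
    ("oscillating", oscillating),
    ("diverging", diverging),
]:
    print(name, path[:4])
```

The sign of the multiplier explains the oscillation (the iterate jumps across the minimum each step), and its magnitude exceeding 1 explains the divergence.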

## Advantages & Disadvantages

**Advantages**

- **Faster Convergence**
    - Quadratic convergence near a minimum enables reaching it in far fewer iterations on convex problems.

- **Adaptive Step Sizes**
    - Curvature-based step adjustment avoids slow progress in shallow regions.

- **Reduced Oscillations**
    - Curvature information stabilizes paths in oscillatory regions.

**Disadvantages**

- **Computationally Expensive**
    - Requires Hessian calculation, making it costly in high-dimensional models.

- **Memory Intensive**
    - Storing the Hessian matrix is memory-intensive for models with millions of parameters.

- **Convergence Challenges**
    - May converge to saddle points (or maxima) in the non-convex functions common in machine learning, because the update seeks any point where the gradient is zero.

## Remarks on Second-Order Optimization Methods

**Purpose:**  
Normalize gradient updates across different directions to handle varying curvature, eliminating the need for per-component learning rate tuning.

**Challenges:**
- Requires computing (and inverting) second-derivative matrices (Hessians), which is **computationally infeasible** for large models.
- Unstable in **non-convex regions**, where the Hessian is not positive definite (e.g., near saddle points).

**Workarounds:**
- Approximate methods (e.g., **quasi-Newton** methods such as **L-BFGS**) mitigate these issues but may still be less practical than first-order methods.

**Non-Convex Optimization in Neural Networks**

**Learning Rate Dynamics:**
- A learning rate $\eta > 2\eta_{opt}$ can help escape poor local optima by overshooting shallow minima.
- However, persistently using $\eta > 2\eta_{opt}$ prevents convergence entirely (divergence or oscillations).

**Practical Solutions:**
- **Decaying Learning Rates:**
    - Start with a higher $\eta$ to escape bad minima, then reduce it for stable convergence.
- **Adaptive Methods (e.g., Adam, RMSprop):**
    - Automatically adjust $\eta$ per parameter, balancing exploration and convergence.
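
The effect of a decaying learning rate on convergence can be sketched on a simple quadratic. The objective $E(w) = \tfrac{1}{2} w^2$ (so $\eta_{opt} = 1$), the inverse-time schedule, and its constants are all assumptions for illustration. The sketch shows only the stabilizing half of the idea; a convex quadratic has no bad minima to escape:

```python
def run(schedule, w0=1.0, steps=50):
    """Gradient descent on E(w) = 0.5 * w**2, whose gradient is simply w."""
    w = w0
    for k in range(steps):
        w -= schedule(k) * w
    return w


eta0 = 2.4  # deliberately above 2 * eta_opt = 2

# Fixed step size: every update multiplies w by (1 - 2.4) = -1.4.
fixed = run(lambda k: eta0)

# Assumed inverse-time decay: eta_k = eta0 / (1 + 0.25 * k).
decayed = run(lambda k: eta0 / (1.0 + 0.25 * k))

print(abs(fixed), abs(decayed))
```

With the fixed rate the iterate grows without bound, while the decayed run overshoots for the first few steps and then settles once $\eta_k$ drops below $2\eta_{opt}$.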

# Momentum

## First Momentum

## Nesterov's Accelerated Gradient

## Adagrad

## RMSprop

## Adam