# Optimization

## Introduction

Components:
1. Score function
2. Loss function
3. **Optimization**

The score function (1) varies across architectures (NNs, CNNs, etc.). The loss function (2) and optimization (3) mostly stay the same.

## Loss Function

- In the convex case (e.g. SVM loss on a linear classifier), $L$ is bowl-shaped as a function of $W$. We are trying to reach the bottom of the bowl by adjusting $W$.
- SVM loss is piecewise-linear and convex.

## Optimization

1. Random search: sample a random $W$, then compute the loss. Repeat many times and keep the $W$ that gave the smallest loss.
2. Random local search: sample a random perturbation $\delta W$ and compute the loss at $W+\delta W$. If the loss decreased, use $W+\delta W$ as the new weights.
3. Follow the gradient: instead of a random perturbation $\delta W$, step in the direction of steepest descent, $-\frac{dL}{dW}$, which decreases the loss for a sufficiently small step.
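
The first two strategies can be sketched as follows. This is a minimal illustration, assuming NumPy; `toy_loss` is a hypothetical stand-in for the real loss (a convex bowl with its minimum at $W = 1$), and the function names, `num_tries`, and `step` are likewise assumptions, not part of these notes.

```python
import numpy as np

def toy_loss(W):
    # Hypothetical stand-in for the real loss: a convex bowl with minimum at W = 1.
    return float(np.sum((W - 1.0) ** 2))

def random_search(loss_fn, shape, num_tries=200, seed=0):
    """Strategy 1: sample independent random W, keep the best one seen."""
    rng = np.random.default_rng(seed)
    best_W, best_loss = None, float("inf")
    for _ in range(num_tries):
        W = rng.standard_normal(shape)
        loss = loss_fn(W)
        if loss < best_loss:
            best_W, best_loss = W, loss
    return best_W, best_loss

def random_local_search(loss_fn, shape, num_tries=200, step=0.1, seed=0):
    """Strategy 2: perturb the current W, accept the move only if loss drops."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(shape)
    best_loss = loss_fn(W)
    for _ in range(num_tries):
        W_try = W + step * rng.standard_normal(shape)  # W + delta W
        loss = loss_fn(W_try)
        if loss < best_loss:
            W, best_loss = W_try, loss
    return W, best_loss

W_rs, loss_rs = random_search(toy_loss, (3, 4))
W_ls, loss_ls = random_local_search(toy_loss, (3, 4))
print(loss_rs, loss_ls)
```

Both loops only ever evaluate the loss, which is why they scale poorly: neither uses any information about which direction is downhill.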

### Numerical Gradient
Numerical gradient: approximate each partial derivative with a finite difference, $\frac{f(x+h)-f(x)}{h}$, for a small $h$.
- Note: $f$ outputs a scalar, $x$ is a matrix or vector.
- To compute this, nudge each dimension of $x$ by $h$, one at a time, and measure its (scalar) effect on $f$.
- The resulting gradient matrix (or vector) has the same shape as $x$: one gradient entry for each variable in $x$.

Then, update the weights with $W_{\text{new}} = W - \alpha \frac{dL}{dW}$, where $\alpha$ is the step size (learning rate).
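
As a minimal sketch of the procedure above, assuming NumPy: `numerical_gradient` and `toy_loss` are hypothetical names, and the toy loss (a sum of squares, whose exact gradient is $2W$) is an assumption chosen only so the result is easy to check.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Forward-difference estimate of the gradient of scalar-valued f at x."""
    x = np.array(x, dtype=float)      # work on a float copy
    grad = np.zeros_like(x)           # same shape as x
    fx = f(x)                         # f at the unperturbed point
    for idx in np.ndindex(x.shape):   # visit every entry of x
        old = x[idx]
        x[idx] = old + h              # nudge one dimension by h
        grad[idx] = (f(x) - fx) / h   # scalar effect on f
        x[idx] = old                  # restore before the next nudge
    return grad

# Hypothetical toy loss (an assumption): L(W) = sum(W^2), exact gradient 2W.
def toy_loss(W):
    return float(np.sum(W ** 2))

W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW = numerical_gradient(toy_loss, W)

alpha = 0.1                # step size (learning rate)
W_new = W - alpha * dW     # one gradient descent update
print(toy_loss(W), toy_loss(W_new))
```

Note that the loop calls `f` once per entry of `x`, on top of the one baseline evaluation.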

#### Advantages:
- Easy to implement

#### Disadvantages:
- Approximate: the true gradient is the limit as $h \to 0$, so any finite $h$ gives only an estimate
- Slow to compute: the loss must be recomputed once for each entry of $W$, and each loss evaluation runs over the entire dataset

### Analytical Gradient

- Exact and faster, but more error-prone to implement (the gradient equations are easy to get wrong)
- Gradient check: compare the analytical gradient against the numerical gradient as a sanity check
- Example: SVM gradient

    - $\nabla_{w_{y_i}} L_i = -\left( \sum_{j \neq y_i} \mathbb{1}[w_j^T x_i + \Delta - w_{y_i}^T x_i > 0] \right) x_i$
    - $\nabla_{w_j} L_i = \mathbb{1}[w_j^T x_i + \Delta - w_{y_i}^T x_i > 0] \, x_i$ for $j \neq y_i$
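
The SVM gradient and the gradient check can be sketched together. This is an illustration under stated assumptions, not a reference implementation: it assumes NumPy, stores each $w_j$ as row $j$ of `W`, handles a single example $(x_i, y_i)$, and uses a centered difference for the check (a choice made here for accuracy). All function names are hypothetical.

```python
import numpy as np

def svm_loss_i(W, x, y, delta=1.0):
    """Multiclass SVM loss for one example. W: (C, D) with rows w_j, x: (D,), y: int."""
    scores = W @ x
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                      # the sum runs over j != y_i
    return float(np.sum(margins))

def svm_grad_i(W, x, y, delta=1.0):
    """Analytical gradient of svm_loss_i with respect to W, same shape as W."""
    scores = W @ x
    active = (scores - scores[y] + delta > 0).astype(float)  # indicator per class
    active[y] = 0.0
    grad = np.outer(active, x)            # rows j != y_i: indicator * x_i
    grad[y] = -np.sum(active) * x         # row y_i: -(number of active margins) * x_i
    return grad

def centered_numerical_grad(f, W, h=1e-5):
    """Centered-difference gradient estimate, used only for checking."""
    W = np.array(W, dtype=float)
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                      # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Gradient check on random data (shapes and label are arbitrary assumptions).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 5))
x = rng.standard_normal(5)
y = 2

analytic = svm_grad_i(W, x, y)
numeric = centered_numerical_grad(lambda V: svm_loss_i(V, x, y), W)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(max_abs_diff)
```

A small `max_abs_diff` suggests the analytical gradient is implemented correctly. Because the SVM loss has kinks where a margin is exactly zero, a check that lands very close to a kink can disagree even when the code is right.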

## Gradient Descent

```
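# Vanilla gradient descent: repeat the update until convergence
while True:
    weights_grad = eval_grad(loss_fun, data, weights)
    weights -= step_size * weights_grad  # step against the gradient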
```

### Minibatch Gradient Descent
- Full-batch gradient descent must compute the loss and gradient over the entire training set $(x,y)$ to update the parameters once.
- Idea: estimate the gradient on a small batch of examples instead, so the parameters are updated far more often and the model typically converges faster.
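
A minimal sketch of the idea, assuming NumPy. To stay self-contained it uses a hypothetical least-squares loss on synthetic, noise-free data rather than the SVM loss; `minibatch_gd`, `batch_size`, `num_steps`, and `true_w` are illustrative names, not part of these notes.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, step_size=0.1, num_steps=500, seed=0):
    """Minibatch gradient descent on mean squared error for a linear model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a batch
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size       # gradient on the batch only
        w -= step_size * grad                                # parameter update
    return w

# Synthetic data (an assumption): y is an exact linear function of X.
rng = np.random.default_rng(1)
X = rng.standard_normal((256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = minibatch_gd(X, y)
print(w)
```

Each update touches only `batch_size` examples instead of all 256. Setting `batch_size` to the dataset size recovers full-batch gradient descent, and setting it to 1 gives the single-example case.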

### Stochastic Gradient Descent
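- A special case of minibatch gradient descent where the batch size is one. Rarely used in this strict form, since vectorized code processes a batch of examples far more efficiently than the same examples one at a time.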