# Optimisation

> Throughout the notes below, we investigate techniques for iterative, non-linear optimisation.
> These optimisations aim to find the maximum likelihood (ML) or maximum a posteriori (MAP) parameter estimates.

## Intro

### The problem statement

- Continuous nonlinear optimisations seek the values of a set of params $$ \hat{\theta} $$ that minimise the value of an objective / cost function $$ f[\cdot] $$

$$ \hat{\theta} = \underset{\theta}{\operatorname{argmin}} [ f[\theta] ] $$

- Most optimisation techniques are described in terms of minimising the value of this function
- However, in the context of ML, we are more frequently concerned with finding the parameter values that maximise the log probability
- To turn a problem from one to the other, we simply multiply the function by -1 (ie. we now seek to minimise the negative log prob)
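
As a minimal sketch of that sign flip, assume the model is a normal distribution with unknown mean `mu` and fixed `sigma` (the data values and the grid of candidates are made up for illustration). Maximising the log probability and minimising its negative should pick the same parameter value:

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Log probability of the data under a normal distribution with mean mu."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

def cost(mu, data):
    """Negative log-likelihood: minimising this maximises the log-likelihood."""
    return -log_likelihood(mu, data)

data = [1.0, 2.0, 3.0, 6.0]                    # made-up observations
candidates = [i / 100 for i in range(0, 601)]  # coarse grid of mu values in [0, 6]

mu_max = max(candidates, key=lambda mu: log_likelihood(mu, data))
mu_min = min(candidates, key=lambda mu: cost(mu, data))
```

The grid search is only there to keep the example short; the point is that `mu_max` and `mu_min` coincide (here, at the sample mean).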

### Convexity

> In practice, this property will hold for very few problems that we'll come across

- *Iterative* optimisation techniques are (typically) **local**: the next update will be determined based purely on the properties of the function at its current position (ie. local search around current position on a curve)
- These techniques can therefore only guarantee finding a local minimum, not the global minimum
- One method for mitigating this shortcoming is to initialise the optimisation from multiple places and select the best output
- **If** a function is convex, any local minimum is also a global minimum, so a local method that converges has found the global solution
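
The multi-start idea can be sketched as follows. This is an illustration under assumptions: the non-convex function `f`, the fixed step size and the starting points are all made up, and a plain fixed-step gradient descent stands in for whatever local optimiser is actually used.

```python
def f(x):
    """A made-up non-convex cost with two local minima."""
    return x ** 4 - 3.0 * x ** 2 + x

def df(x):
    """Closed-form derivative of f."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def local_descent(x, step=0.01, iters=2000):
    """Fixed-step gradient descent: only ever looks at the local slope."""
    for _ in range(iters):
        x = x - step * df(x)
    return x

starts = [-2.0, -0.5, 0.5, 2.0]
results = [local_descent(x0) for x0 in starts]

# Different starts end in different minima; keep the one with the lowest cost.
best = min(results, key=f)
```

Starts on the right of the central bump settle in the shallower minimum, starts on the left settle in the deeper one, and taking the best output recovers the deeper one.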

#### What is convexity?

- Intuitively, a function is convex if any chord (straight line drawn between two points on the function) never passes below the function between those two points
- This property can be established algebraically by examining the 2nd derivative at each point on the function: in 1D, if the second derivative is +ve everywhere, then the function is convex

> Put an illustration here

- When working with higher dimensions, we use the [Hessian matrix](hessian-matrix): if the matrix is +ve definite everywhere, then the cost function is convex and any local minimum found is the global minimum
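
A minimal sketch of the positive-definiteness test, assuming the Hessian is supplied as a symmetric list of lists. The helper name `is_positive_definite` is hypothetical; it attempts a Cholesky factorisation, which succeeds only for positive definite matrices. The two example Hessians belong to quadratics, so they are constant and a single check covers every point; in general the test would have to hold everywhere.

```python
import math

def is_positive_definite(H):
    """Return True if the symmetric matrix H is positive definite.

    Tries to build the Cholesky factor L with H = L L^T; a non-positive
    pivot means the factorisation fails and H is not positive definite.
    """
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

# Hessian of x^2 + x*y + y^2: positive definite, so the function is convex.
bowl = [[2.0, 1.0], [1.0, 2.0]]
# Hessian of x^2 - y^2: a saddle, so not positive definite.
saddle = [[2.0, 0.0], [0.0, -2.0]]

bowl_is_convex = is_positive_definite(bowl)
saddle_is_convex = is_positive_definite(saddle)
```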

## Overview

- The cost function is a surface over the parameter space, with a dimension for each parameter of the cost function $$ f $$
- 2 attributes need to be considered when deciding how best to update the parameter values with each iteration:
    - the **Direction** of the update: $$ s $$
    - the **Distance/Magnitude** of the update: $$ \lambda $$. Finding this distance is known as the *line search*

We want $$ \lambda $$ such that:

$$ \hat{\lambda} = \underset{\lambda}{\operatorname{argmin}} [ f[\theta^{[t]} + \lambda s] ] $$

> ie. the distance that minimises the cost

where the update each iteration will then be:

$$ \theta^{[t+1]} = \theta^{[t]} + \hat{\lambda}s $$
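
The direction / distance split can be sketched in code. Everything specific here is an assumption for illustration: the quadratic cost `f`, the choice of the negative gradient as the direction `s`, the search interval `[0, lam_max]`, and the use of a golden-section search as the line search.

```python
import math

def f(theta):
    """Made-up convex cost with its minimum at theta = [1, -2]."""
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2

def grad(theta):
    """Closed-form gradient of f."""
    return [2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 2.0)]

def line_search(f, theta, s, lam_max=1.0, tol=1e-8):
    """Golden-section search for the lambda in [0, lam_max] minimising f[theta + lambda * s]."""
    along = lambda lam: f([t_j + lam * s_j for t_j, s_j in zip(theta, s)])
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if along(a) < along(b):
            hi = b   # minimum lies in [lo, b]
        else:
            lo = a   # minimum lies in [a, hi]
    return 0.5 * (lo + hi)

theta = [0.0, 0.0]
for _ in range(50):
    s = [-g for g in grad(theta)]                 # direction (assumed: downhill)
    lam_hat = line_search(f, theta, s)            # distance along that direction
    theta = [t_j + lam_hat * s_j for t_j, s_j in zip(theta, s)]
```

Golden-section search assumes the cost is unimodal along the search direction within the interval, which holds for this quadratic but not in general.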


## Direction of search

There are 2 general methods for selecting the search direction:
1. Steepest descent
2. Gauss-Newton method (an approximation to the Newton method for least-squares cost functions)

Both approaches rely on computing derivatives of the cost function wrt the params at the current position.
They therefore assume the function is smooth, so that the derivatives are **well-behaved**.

### Calculation of derivatives

1. A closed-form expression where possible
2. If not, estimates can be made using finite differences:

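The 1st derivative of $$ f[\cdot] $$ wrt the jth element of $$ \theta $$ can be estimated with, for example, a forward difference:

$$ \frac{\partial f}{\partial \theta_j} \approx \frac{f[\theta + a e_j] - f[\theta]}{a} $$

where $$ a $$ is a small step size and $$ e_j $$ is the unit vector in the jth direction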
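
A forward-difference estimate of the gradient can be sketched as below. The helper name `finite_difference_gradient`, the step size `a = 1e-6` and the test function are assumptions for illustration; each parameter is nudged by `a` in turn and the change in cost is divided by `a`.

```python
def finite_difference_gradient(f, theta, a=1e-6):
    """Forward-difference estimate of the derivative of f wrt each element of theta."""
    f0 = f(theta)
    estimate = []
    for j in range(len(theta)):
        shifted = list(theta)
        shifted[j] += a                    # step a small distance along dimension j
        estimate.append((f(shifted) - f0) / a)
    return estimate

def f(theta):
    """Made-up cost whose gradient is known in closed form."""
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

point = [1.0, 2.0]
approx = finite_difference_gradient(f, point)
exact = [2.0 * point[0] + 3.0 * point[1], 3.0 * point[0]]  # closed-form gradient
```

The step size is a trade-off: too large and the estimate is biased by curvature, too small and floating-point rounding dominates. Comparing against a closed-form gradient, as here, is a common sanity check.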

## Distance of search

## Reparameterisation