<img src=../figures/Brown_logo.svg width=30%>

# Project 1: Learning to Choose Optimizers


### Martin van der Schelling | <a href="mailto:martin_van_der_schelling@brown.edu">martin_van_der_schelling@brown.edu</a> | PhD candidate

## Outline of today

By the end of this lecture you will have learned:

* Why we need optimization algorithms
* Why we need to make an informed decision on our optimizer choice
* How we might get free lunch anyway!



## Optimization problems are everywhere

Optimization is the process of finding the **best solution** or outcome from a set of possible options, judged by a certain **criterion**.

<!-- * learning goal: understand that optimization is used in machine learning but also in other disciplines: -->

<img src=./img/examples_opt.png width=60%>

## Different kinds of optimization problems

### Time complexity
* *"How much does the time to solve increase when we make the problem bigger?"*
* Time complexity is a measure of the **amount of time an algorithm takes to run as a function of the size of the input ($n$)**

* This is often expressed using **"Big O" notation**: $\mathcal{O}(n)$, $\mathcal{O}(n \log n)$

### Why is this important?
* It gives us a sense of whether the current algorithm can solve the problem at all
* It lets us compare the **efficiency** of algorithms

<img src=./img/time-complexity-examples.png width=40%>

What happens when we **scale the problem** and try to solve an optimization problem whose time complexity is $\mathcal{O}(2^n)$?

**We run out of memory/time to solve the problem exactly**
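To make the scaling concrete, here is a minimal sketch that tabulates rough operation counts for three complexity classes. The function name `operation_counts` and the chosen input sizes are illustrative assumptions, and constant factors are ignored.

```python
import math


def operation_counts(n):
    """Rough operation counts for input size n (constant factors ignored)."""
    return {
        "O(n)": n,
        "O(n log n)": n * math.log2(n),
        "O(2^n)": 2 ** n,
    }


# Double the problem size a few times and watch the exponential column explode.
for n in (10, 20, 40, 80):
    counts = operation_counts(n)
    print(n, {name: f"{value:.3g}" for name, value in counts.items()})
```

Doubling $n$ roughly doubles the first two columns, but it squares the last one.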

## Well, how do we solve those problems?

🫱🏽‍🫲🏻 Make a compromise: get a good-enough solution in polynomial time
* Trade-off between **complexity** and **solution quality**

### Iterative procedure

<img src=./img/schematic_optimization.png width=80%>

Step 1) Choose an **initial guess** ($\mathbf{x}_0$)

Step 2) Update your current solution with an **optimization algorithm**

 $\mathbf{x}_{t+1} = \mathbf{x}_t + \omega$
 
Step 3) Repeat step 2 until a **stopping criterion** is met

Step 4) Take your optimized value and hope for the best :)
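The four steps above can be sketched as a short loop. This is only an illustration: the names `optimize` and `make_hill_climber`, the one-dimensional objective, and the stopping rule (stop when the response barely changes) are all assumptions, not a prescribed implementation.

```python
def optimize(objective, x0, update, max_iterations=100, tolerance=1e-8):
    """Generic loop: x_{t+1} = x_t + omega, where omega comes from `update`."""
    x = x0                      # Step 1: initial guess
    y = objective(x)
    for _ in range(max_iterations):
        omega = update(x, y)    # Step 2: the optimizer proposes a step
        x_next = x + omega
        y_next = objective(x_next)
        converged = abs(y_next - y) < tolerance
        x, y = x_next, y_next
        if converged:           # Step 3: stopping criterion
            break
    return x, y                 # Step 4: take the optimized value


def make_hill_climber(objective, step=0.1):
    """Hypothetical optimizer: step left or right, whichever improves y."""
    def update(x, y):
        for omega in (step, -step):
            if objective(x + omega) < y:
                return omega
        return 0.0              # no improving move found
    return update


def objective(x):
    return (x - 2.0) ** 2


x_best, y_best = optimize(objective, x0=0.0, update=make_hill_climber(objective))
print(x_best, y_best)
```

Swapping in a different `update` function changes the optimizer without touching the loop itself.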

## For now, you can treat the optimization as a black box

*More in-depth information about 'opening the black-box' of optimization in a later lecture!*

<img src=./img/opt_blackbox.png width=40%>

Your next iteration $\mathbf{x}_{t+1}$ depends on:
* The **current solution** ($\mathbf{x}_{t}$) and **response** ($y_{t}$) that you have
* (Optionally) other information like **gradients** or **history of evaluations** ($\mathbf{X}_{0 .. t}$ and $\mathbf{y}_{0..t}$)
* The **choice** of optimization algorithm
* Any **hyperparameters** of the optimizer (e.g. the learning rate $\alpha$)
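Seen from the outside, the black box only needs these inputs. The sketch below uses a hypothetical `RandomSearchOptimizer` class (not an optimizer from a specific library) that takes the current solution and response, keeps a history of evaluations, and has one hyperparameter, `step_size`.

```python
import random


class RandomSearchOptimizer:
    """Hypothetical black-box optimizer: perturbs the best point seen so far."""

    def __init__(self, step_size=0.5, seed=0):
        self.step_size = step_size        # hyperparameter of the optimizer
        self.rng = random.Random(seed)    # fixed seed for reproducibility
        self.history = []                 # all (x, y) pairs evaluated so far

    def next_iterate(self, x_t, y_t):
        """Return x_{t+1} given the current solution x_t and response y_t."""
        self.history.append((x_t, y_t))
        best_x, _ = min(self.history, key=lambda pair: pair[1])
        return best_x + self.rng.uniform(-self.step_size, self.step_size)


def objective(x):
    return (x - 2.0) ** 2


optimizer = RandomSearchOptimizer(step_size=0.5)
x = -3.0
for _ in range(200):
    x = optimizer.next_iterate(x, objective(x))

best_x, best_y = min(optimizer.history, key=lambda pair: pair[1])
print(best_x, best_y)
```

The caller never looks inside `next_iterate`; it only passes in a solution and a response and gets the next point back.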

## There are many, many different optimization algorithms ..

Each field has its own collection of optimizers: 

| Engineering Application | Optimization Algorithms |
|-------------------------|--------------------------|
| Structural design      | Genetic algorithms, simulated annealing, gradient descent |
| Topology optimization  | Method of Moving Asymptotes (MMA), Interior Point Optimizer (IPOPT), Optimality Criteria (OC) |
| Process control        | Linear programming, quadratic programming, nonlinear programming |
| Supply chain management| Linear programming, mixed integer programming |
| Machine learning       | Stochastic gradient descent (SGD), Adam, Conjugate gradient (CG) |
| Computer vision        | Gradient descent, stochastic gradient descent (SGD), coordinate descent |
| Robotics               | Trajectory optimization, model predictive control |
| Protein docking        | Monte Carlo with minimization (MCM), conformational space annealing (CSA), particle swarm optimization (PSO) |
 
Basically, every optimizer defines its own update $\omega$ for computing the next iterate $\mathbf{x}_{t+1}$.

For example: Gradient descent: 

$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \alpha \cdot \nabla y(\mathbf{x}_{t})$

$\omega = - \alpha \cdot \nabla y(\mathbf{x}_{t})$
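As a minimal sketch of this update rule, the code below runs gradient descent on the sphere function $y = \sum_i x_i^2$, whose gradient is $2\mathbf{x}$. The objective, the starting point, and the learning rate $\alpha = 0.1$ are assumptions chosen for illustration.

```python
def gradient_descent(gradient, x0, learning_rate=0.1, iterations=100):
    """Repeatedly apply x_{t+1} = x_t - alpha * gradient(x_t)."""
    x = list(x0)
    for _ in range(iterations):
        grad = gradient(x)
        # omega = -alpha * gradient, applied coordinate by coordinate
        x = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
    return x


def sphere_gradient(x):
    """Gradient of the sphere function y = sum(x_i ** 2)."""
    return [2.0 * xi for xi in x]


x_final = gradient_descent(sphere_gradient, x0=[3.0, -4.0])
print(x_final)
```

With these settings every step shrinks each coordinate by a factor of $0.8$, so the iterates move steadily toward the minimum at the origin.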

But why so many? Why don't we have **one optimizer to rule them all**?

<img src=./img/one_optimizer.png width=40%, align='center'>

## ❌🥪 No Free Lunch Theorem [1]

> "Any elevated performance over one class of problems is offset by performance over another class."


*[1] Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82. https://doi.org/10.1109/4235.585893*

In other words, some optimization algorithms work well on specific problems but might not work for others!

In the limit: **averaged over the entire space of optimization problems, no optimization algorithm performs better than random search**

## Example: gradient-based optimizer
<img src=./img/Adam.gif width=40%, align='right'>



<img src=./img/Sphere.png width=20%, align='left'><img src=./img/Schwefel.png width=20%, align='left'>


* Gradient-based optimizers work well in convex landscapes
* They can get stuck in a local minimum when the landscape has several of them!

Optimizers are designed to **exploit different problem characteristics** in order to gain an advantage over random search
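The contrast can be reproduced in a few lines. The sketch below assumes one-dimensional versions of the Sphere and Schwefel functions, a finite-difference gradient, and hand-picked starting points and learning rate. It illustrates the behaviour and is not the code behind the figures.

```python
import math


def sphere(x):
    """Convex landscape with a single minimum at x = 0."""
    return x ** 2


def schwefel(x):
    """1-D Schwefel function: many local minima, global minimum near x = 420.97."""
    return 418.9829 - x * math.sin(math.sqrt(abs(x)))


def gradient_descent(objective, x0, learning_rate=0.1, iterations=2000, h=1e-5):
    """Gradient descent with a central finite-difference gradient estimate."""
    x = x0
    for _ in range(iterations):
        slope = (objective(x + h) - objective(x - h)) / (2.0 * h)
        x -= learning_rate * slope
    return x


x_sphere = gradient_descent(sphere, x0=100.0)    # convex: reaches the minimum
x_stuck = gradient_descent(schwefel, x0=100.0)   # slides into a nearby local minimum
x_lucky = gradient_descent(schwefel, x0=400.0)   # starts inside the global basin

for label, func, x in [("sphere", sphere, x_sphere),
                       ("schwefel, x0=100", schwefel, x_stuck),
                       ("schwefel, x0=400", schwefel, x_lucky)]:
    print(f"{label}: x = {x:.3f}, y = {func(x):.3f}")
```

On the Schwefel function the outcome depends entirely on the starting point: the same optimizer with the same hyperparameters ends up in different minima.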

## How do we choose an optimizer?

* Choose one based on the **knowledge you have** about your problem
* Or you can try a bunch of them out (architecture search)
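Trying several optimizers can be as simple as running each one on the same problem and keeping the best result. The sketch below makes several assumptions: the two optimizers, the 1-D Schwefel objective, the starting point and the iteration budget are illustrative. The comparison is also not strictly fair, since gradient descent spends two function evaluations per iteration.

```python
import math
import random


def schwefel(x):
    """1-D Schwefel function, used here as an example problem."""
    return 418.9829 - x * math.sin(math.sqrt(abs(x)))


def gradient_descent(objective, x0, budget, learning_rate=0.1, h=1e-5):
    """Follow a finite-difference estimate of the slope downhill."""
    x = x0
    for _ in range(budget):
        slope = (objective(x + h) - objective(x - h)) / (2.0 * h)
        x -= learning_rate * slope
    return x


def random_search(objective, x0, budget, lower=-500.0, upper=500.0, seed=0):
    """Sample uniformly in [lower, upper] and keep the best point."""
    rng = random.Random(seed)
    best = x0
    for _ in range(budget):
        candidate = rng.uniform(lower, upper)
        if objective(candidate) < objective(best):
            best = candidate
    return best


candidates = {"gradient descent": gradient_descent, "random search": random_search}
results = {name: schwefel(run(schwefel, 100.0, 500)) for name, run in candidates.items()}
best_name = min(results, key=results.get)
print(results, "-> best:", best_name)
```

This brute-force comparison works, but its cost grows with every optimizer and every hyperparameter setting you add to `candidates`.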

But what if we **learn our optimizer choice from data?**

## Learning to Optimize (L2O)

Adjust the optimizer based on the response of the problem

<img src=./img/l2o.png width=50%>

*[2] Li, K., & Malik, J. (2017). Learning to optimize. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings.*

From **constant** update step to **trainable model**:

* 'Classic' optimizer: $\mathbf{x}_{t+1} = \mathbf{x}_t + \omega$
* Learning to optimize: $\mathbf{x}_{t+1} = \mathbf{x}_t + m(\omega ; \phi)$
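A toy sketch of the idea is shown below. It is not the method of [2]: here $m(\omega; \phi) = \phi \cdot \omega$ is just a learned rescaling of a gradient step, and "training" is a grid search over a few candidate values of $\phi$ on hypothetical quadratic problems $y = c \, x^2$. All names and numbers are assumptions for illustration.

```python
def base_update(x, curvature):
    """Classic omega: a unit gradient step on y = curvature * x**2."""
    return -2.0 * curvature * x


def learned_update(omega, phi):
    """m(omega; phi): here simply a rescaling of omega by the parameter phi."""
    return phi * omega


def final_loss(phi, curvature, x0=1.0, steps=20):
    """Run x_{t+1} = x_t + m(omega; phi) and return the final response."""
    x = x0
    for _ in range(steps):
        x = x + learned_update(base_update(x, curvature), phi)
    return curvature * x ** 2


training_curvatures = [0.5, 1.0, 2.0]               # hypothetical training problems
candidate_phis = [0.01, 0.05, 0.1, 0.25, 0.4, 0.6]  # hypothetical search grid


def training_loss(phi):
    """Total final response over all training problems."""
    return sum(final_loss(phi, c) for c in training_curvatures)


# "Training": pick the phi that performs best across the training problems.
phi_star = min(candidate_phis, key=training_loss)
print("learned phi:", phi_star, "training loss:", training_loss(phi_star))
```

The chosen $\phi$ is tuned to the training problems. A value that is too large makes the steepest problem diverge, which is why the data, rather than a fixed rule, decides the update here.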