<img src=./img/Brown_logo.svg width=30%>

# Project 1: Learning to Choose Optimizers


### Martin van der Schelling | <a href="mailto:martin_van_der_schelling@brown.edu">martin_van_der_schelling@brown.edu</a>  | PhD candidate

## Outline of today

By the end of this lecture you will have learned:

* Why we need optimization algorithms
* Why we need to make an informed decision on our optimizer choice
* How we might get free lunch anyway!
* How to get started on project 1: L2O



## Optimization problems are everywhere

Optimization is the process of finding the **best solution** or outcome from a set of possible options based on a certain **criterion**.

<!-- * learning goal: understand that optimization is used in machine learning but also in other disciplines: -->

<img src=./img/examples_opt.png width=60%>

## Different kinds of optimization problems

### Time complexity
* *"How much does the time to solve increase when we make the problem bigger?"*
* Time complexity is a measure of the **amount of time an algorithm takes as a function of the size of the input ($n$)**

* This is often expressed using **"Big O" notation**: $\mathcal{O}(n)$, $\mathcal{O}(n \log n)$

### Why is this important?
* It gives us a sense of whether an algorithm can solve the problem at all
* It lets us compare the **efficiency** of algorithms

<img src=./img/time-complexity-examples.png width=40%>

What happens when we **scale the problem** and try to solve an optimization problem with $\mathcal{O}(2^n)$ time complexity?

**We run out of memory or time before we can solve the problem exactly**
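To get a feel for these growth rates, the short sketch below tabulates idealized operation counts for a few problem sizes. The cost functions are an assumption for illustration (constants and lower-order terms are ignored); they are not measurements of any real algorithm.

```python
import math

# Idealized operation counts for three complexity classes
# (illustrative only: constants and lower-order terms are ignored).
complexities = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(2^n)": lambda n: 2 ** n,
}

def operation_counts(sizes):
    """Return {complexity name: [operation count for each n in sizes]}."""
    return {name: [cost(n) for n in sizes] for name, cost in complexities.items()}

sizes = [10, 20, 40, 80]
counts = operation_counts(sizes)
for name, values in counts.items():
    print(f"{name:>10}: " + ", ".join(f"{v:.3g}" for v in values))
```

Doubling $n$ doubles the linear count, but squares the exponential one, which is why exact solutions quickly become out of reach.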

## Well, how do we solve those problems?

🫱🏽‍🫲🏻 Making a compromise: Get a good enough solution within polynomial time complexity
* Trade-off between **complexity** and **solution quality**

### Iterative procedure

<img src=./img/schematic_optimization.png width=80%>

Step 1) Choose an **initial guess** ($\mathbf{x}_0$)

Step 2) Update your current solution with an **optimization algorithm**

 $\mathbf{x}_{t+1} = \mathbf{x}_t + \omega$
 
Step 3) Repeat step 2 until a **stopping criterion** is met

Step 4) Take your optimized value and hope for the best :)
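The four steps above can be sketched as a generic loop. This is a minimal illustration, not the implementation of any particular library: the names `iterative_optimize`, `sphere` and `shrink_step` are hypothetical, and the update rule is a deliberately simple stand-in for $\omega$.

```python
import numpy as np

def iterative_optimize(func, x0, update, max_iterations=200, tolerance=1e-8):
    """Generic loop: start at x0 and apply `update` until a stopping criterion is met."""
    x = np.asarray(x0, dtype=float)            # Step 1: initial guess
    y = func(x)
    history = [(x.copy(), y)]
    for _ in range(max_iterations):            # stopping criterion 1: iteration budget
        x_new = x + update(func, x)            # Step 2: x_{t+1} = x_t + omega
        y_new = func(x_new)
        history.append((x_new.copy(), y_new))
        converged = abs(y - y_new) < tolerance # stopping criterion 2: no more progress
        x, y = x_new, y_new
        if converged:
            break
    return x, y, history                       # Step 4: take the final value

def sphere(x):
    """Simple convex test function with its minimum at the origin."""
    return float(np.sum(x ** 2))

def shrink_step(func, x):
    """Hypothetical update rule: move 10% of the way toward the origin."""
    return -0.1 * x

x_best, y_best, history = iterative_optimize(sphere, x0=[1.0, -2.0], update=shrink_step)
print(x_best, y_best, len(history))
```

Swapping `shrink_step` for another update function changes the optimizer without touching the loop itself.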

## For now, you can treat the optimization as a black box

*More in-depth information about 'opening the black-box' of optimization in a later lecture!*

<img src=./img/opt_blackbox.png width=40%>

Your next iteration $\mathbf{x}_{t+1}$ can depend on:
* The **current solution** ($\mathbf{x}_{t}$) and **response** ($y_{t}$) that you have
* (Optionally) other information like **gradients** or **history of evaluations** ($\mathbf{X}_{0 .. t}$ and $\mathbf{y}_{0..t}$)
* The **choice** of optimization algorithm
* Any **hyperparameters** of the optimizer (e.g. the learning rate $\alpha$)

## There are many, many different optimization algorithms ..

Each field has its own collection of optimizers: 

| Engineering Application | Optimization Algorithms |
|-------------------------|--------------------------|
| Structural design      | Genetic algorithms, simulated annealing, gradient descent |
| Topology optimization  | Method of moving Asymptotes (MMA), Interior-point line-search (IPOPT), Optimality Criteria (OC) |
| Process control        | Linear programming, quadratic programming, nonlinear programming |
| Supply chain management| Linear programming, mixed integer programming |
| Machine learning       | Stochastic gradient descent (SGD), Adam, Conjugate gradient (CG) |
| Computer vision        | Gradient descent, stochastic gradient descent (SGD), coordinate descent |
| Robotics               | Trajectory optimization, model predictive control |
| Protein docking        | Monte Carlo with minimization (MCM), conformational space annealing (CSA), particle swarm optimization (PSO) |
 
Basically, every optimizer defines an operation $\omega$ that determines the next iterate $\mathbf{x}_{t+1}$

For example, gradient descent:

$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \alpha \cdot \nabla y(\mathbf{x}_{t})$

$\omega = - \alpha \cdot \nabla y(\mathbf{x}_{t})$
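The gradient descent update above can be tried out on a simple convex function. The sketch below is illustrative only: the sphere function, its analytical gradient, and the values of the learning rate and iteration count are assumptions chosen for the example.

```python
import numpy as np

def sphere(x):
    """Convex test function y = sum(x_i^2) with its minimum at the origin."""
    return float(np.sum(x ** 2))

def sphere_gradient(x):
    """Analytical gradient of the sphere function."""
    return 2.0 * x

def gradient_descent(gradient, x0, learning_rate=0.1, iterations=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(iterations):
        omega = -learning_rate * gradient(x)   # omega = -alpha * gradient
        x = x + omega                          # x_{t+1} = x_t + omega
    return x

x_final = gradient_descent(sphere_gradient, x0=[0.8, -0.5])
print(x_final, sphere(x_final))
```

With $\alpha = 0.1$ every step multiplies $\mathbf{x}$ by $0.8$, so the iterates shrink geometrically toward the minimum.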

But why so many? Why don't we have **one optimizer to rule them all**?

<img src=./img/one_optimizer.png width=40%, align='center'>

## ❌🥪 No Free Lunch Theorem [1]

> "Any elevated performance over one class of problems is offset by performance over another class."


*[1] Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82. https://doi.org/10.1109/4235.585893*

In other words, some optimization algorithms work well on specific problems, but might not work for others!

In the limit: **averaged over the entire space of optimization problems, no optimization algorithm performs better than random search**

## Example: gradient-based optimizer
<img src=./img/Adam.gif width=40%, align='right'>



<img src=./img/Sphere.png width=20%, align='left'><img src=./img/Schwefel.png width=20%, align='left'>


* Gradient-based optimizers work well in convex landscapes
* They can get stuck in a local minimum when the landscape has multiple minima!

Optimizers are designed to **exploit different problem characteristics** in order to gain an advantage over random search
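A small one-dimensional experiment makes the local-minimum problem concrete. The function $f(x) = x^4 - 3x^2 + x$ used below is an assumption chosen for illustration (it is not one of the benchmark functions shown in the figures); it has a global minimum near $x \approx -1.30$ and a shallower local minimum near $x \approx 1.13$.

```python
def f(x):
    """Toy function with two minima of different depth."""
    return x ** 4 - 3.0 * x ** 2 + x

def f_prime(x):
    """Analytical derivative of f."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0

def gradient_descent(x0, learning_rate=0.01, iterations=500):
    x = float(x0)
    for _ in range(iterations):
        x = x - learning_rate * f_prime(x)
    return x

x_left = gradient_descent(-2.0)    # starts in the basin of the global minimum
x_right = gradient_descent(2.0)    # starts in the basin of the local minimum
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs converge, but only the one that starts in the right basin finds the global minimum: the outcome depends entirely on the initial guess.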

## How do we choose an optimizer?

* Choose one based on the **knowledge** you have about your problem
* Or you can try a bunch of them out (architecture search)

But what if we **learn our optimization choice from data?**

## Learning to Optimize (L2O)

Adjust the optimizer based on the response of the problem

<img src=./img/l2o.png width=40%>

*[2] Li, K., & Malik, J. (2017). Learning to optimize. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings.*

From **constant** update step to **trainable model**:

* 'Classic' optimizer: $\mathbf{x}_{t+1} = \mathbf{x}_t + \omega$
* Learning to optimize: $\mathbf{x}_{t+1} = \mathbf{x}_t + m(\omega ; \phi)$
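As a toy illustration of the difference, the sketch below takes $m(\omega; \phi) = \phi \cdot \omega$, a single learnable scaling of the gradient descent step, and "trains" $\phi$ by a grid search over a handful of quadratic problems. This is a strong simplification made up for this example; it is not the method of Li & Malik, where $m$ is a far richer trainable model.

```python
import numpy as np

def quadratic(x, curvature):
    return float(np.sum(curvature * x ** 2))

def quadratic_gradient(x, curvature):
    return 2.0 * curvature * x

def run(phi, curvature, x0, iterations=20, alpha=0.05):
    """Apply x_{t+1} = x_t + m(omega; phi) with m(omega; phi) = phi * omega."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iterations):
        omega = -alpha * quadratic_gradient(x, curvature)  # classic update step
        x = x + phi * omega                                 # learned rescaling
    return quadratic(x, curvature)

# "Training": pick the phi with the lowest average final loss on sample problems.
rng = np.random.default_rng(0)
training_curvatures = rng.uniform(0.5, 2.0, size=10)
candidates = np.linspace(0.5, 10.0, 20)
mean_losses = [
    np.mean([run(phi, c, x0=[1.0, 1.0]) for c in training_curvatures])
    for phi in candidates
]
phi_best = float(candidates[int(np.argmin(mean_losses))])
print(phi_best, min(mean_losses))
```

Setting $\phi = 1$ recovers the classic optimizer, so any improvement over that baseline comes purely from what was learned on the training problems.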

<!-- ### Classic optimizer -->

<img src=./img/schematic_opt_chen2021.png width=45%, align='left'>

<!-- ### Learning to optimize -->

<img src=./img/schematic_l2o_chen2021.png width=45%, align='right'>


 
*T. Chen, X. Chen, W. Chen, H. Heaton, J. Liu, Z. Wang, and W. Yin, Learning to Optimize: A Primer and A Benchmark (2021)*

## Project 1: L2O

**Provided data and resources**: Two datasets with the training dynamics of a set of benchmark functions, optimized with a variety of hand-engineered optimizers.

**Aim of the project**: Design a general *data-driven optimization strategy* with machine learning that competes with classic hand-engineered optimizers

### How is the dataset generated?

... with `f3dasm` of course ! 🌟🌟🌟🌟🌟

<img src="./img/blocks_project.png" width="100%">

In [1]:
from f3dasm import ExperimentData
from f3dasm.design import Domain, make_nd_continuous_domain
import numpy as np

2023-10-27 12:01:12,845 - f3dasm - INFO - Imported f3dasm (version: 1.4.3)


First, we create a `Domain` object with continuous input variables bounded by $[0.0, 1.0]^{d}$, where $d$ is the number of input parameters (=`dimensionality`):

In [2]:
domain = make_nd_continuous_domain(bounds=[[0.0, 1.0], [0.0, 1.0]], dimensionality=2)
domain

Domain(space={'x0': ContinuousParameter(lower_bound=0.0, upper_bound=1.0, log=False), 'x1': ContinuousParameter(lower_bound=0.0, upper_bound=1.0, log=False)})

*`make_nd_continuous_domain` is a helper function to create n-dimensional continuous design spaces*

Then, we sample $30$ initial points from the domain with Latin Hypercube sampling:

In [3]:
experimentdata = ExperimentData(domain)
experimentdata.sample(sampler='latin', n_samples=30, seed=2023)
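To see what Latin Hypercube sampling does, here is a minimal NumPy sketch of the idea. It is not the sampler that `f3dasm` uses internally, and the same seed will not reproduce the same points; the function name `latin_hypercube` is hypothetical.

```python
import numpy as np

def latin_hypercube(n_samples, dimensionality, seed=None):
    """Draw n_samples points in [0, 1]^d with one point per stratum in every dimension."""
    rng = np.random.default_rng(seed)
    samples = np.empty((n_samples, dimensionality))
    for j in range(dimensionality):
        strata = rng.permutation(n_samples)      # shuffled stratum indices 0..n-1
        jitter = rng.uniform(size=n_samples)     # random position inside each stratum
        samples[:, j] = (strata + jitter) / n_samples
    return samples

points = latin_hypercube(n_samples=30, dimensionality=2, seed=2023)
print(points.shape)
```

Each axis is cut into 30 equal intervals and every interval receives exactly one point, which spreads the initial samples more evenly than plain uniform sampling.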

We evaluate the samples with one of the benchmark functions (in this case the `Ackley` function).
We can provide the following options:
* `noise`: Gaussian noise with a certain standard deviation
* `seed`: If the function is noisy, the seed for the pseudo-random number generator
* `scale_bounds`: Scales the function to some other boundaries, in this case the box constraints of the `Domain`

> Note: Each benchmark function is **offset by a random vector**. As most of the benchmark functions have their global minimum in the center of the design space, we want to prevent optimizers from "cheating" their way to the optimum by immediately jumping to the middle.



This is what a 2D representation of these analytical functions looks like:

<img src=./img/func_cropped.gif width=30%>

Each of the generated functions has **labels** that describe the **characteristics** of the loss landscape

In [4]:
experimentdata.evaluate(data_generator='Ackley', 
                        kwargs={'seed': 2023, 'noise': 0.0, 'scale_bounds': domain.get_bounds()})
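As a rough picture of what an offset benchmark function computes, the sketch below implements the standard Ackley function in NumPy and shifts it by a random vector. This is not the `f3dasm` implementation: the helper `make_offset_function`, the offset range, and the absence of bound scaling and noise are all assumptions made for illustration.

```python
import numpy as np

def ackley(x):
    """Standard Ackley function; global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.sum(x ** 2) / d))
    term2 = -np.exp(np.sum(np.cos(2.0 * np.pi * x)) / d)
    return float(term1 + term2 + 20.0 + np.e)

def make_offset_function(func, offset):
    """Shift the landscape so the minimum moves from the origin to `offset`."""
    offset = np.asarray(offset, dtype=float)
    return lambda x: func(np.asarray(x, dtype=float) - offset)

rng = np.random.default_rng(2023)
offset = rng.uniform(-1.0, 1.0, size=2)       # hypothetical offset range
shifted_ackley = make_offset_function(ackley, offset)
print(shifted_ackley(offset), shifted_ackley(np.zeros(2)))
```

After the shift, the origin is no longer the optimum, so an optimizer cannot score well just by guessing the center of the domain.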

We **optimize** the benchmark function with an optimizer for $2000$ iterations (our `budget`) with the default hyper-parameters:

In [5]:
experimentdata.optimize(data_generator='Ackley', optimizer='CMAES', iterations=2000, hyperparameters={'seed': 2023},
                       kwargs={'seed': 2023, 'noise': 0.0, 'scale_bounds': domain.get_bounds()})

In [6]:
experimentdata

|      | jobs     | input: x0 | input: x1 | output: y |
|------|----------|-----------|-----------|-----------|
| 0    | finished | 0.181879  | 0.857659  | 21.650953 |
| 1    | finished | 0.377353  | 0.790722  | 21.957123 |
| 2    | finished | 0.945234  | 0.643587  | 21.709284 |
| 3    | finished | 0.629493  | 0.603660  | 21.623877 |
| 4    | finished | 0.313033  | 0.238736  | 20.218512 |
| ...  | ...      | ...       | ...       | ...       |
| 2025 | finished | 0.588052  | 0.126597  | 0.000129  |
| 2026 | finished | 0.588052  | 0.126596  | 0.000064  |
| 2027 | finished | 0.588052  | 0.126596  | 0.000066  |
| 2028 | finished | 0.588052  | 0.126596  | 0.000024  |


We redo this optimization for **different initial conditions** ($\mathbf{x}_{0}$) and save the **entire history of iterations**.


We run this optimization process with a **set of different optimizers**
* Adam
* Covariance matrix adaptation evolution strategy (CMAES)
* Particle Swarm Optimization (PSO)
* Limited-memory BFGS with box constraints (L-BFGS-B)
* Random Search

Each experiment is one instantiation of a benchmark function that is optimized for multiple realizations with a set of optimizers:

In [7]:
small_dataset = ExperimentData.from_file('/home/martin/Documents/GitHub/3dasm_course/Projects/L2O/STUFF_NOT_TO_UPLOAD/data/small_dataset/small_dataset')
small_dataset

|      | jobs     | input: budget | input: dimensionality | input: function_name | input: noise | input: seed | output: path_raw | output: path_post |
|------|----------|---------------|-----------------------|----------------------|--------------|-------------|------------------|-------------------|
| 2198 | finished | 2000.0        | 2                     | Thevenot             | 0.1          | 100895      | raw/2198.nc      | post/2198.nc      |
| 1478 | finished | 2000.0        | 2                     | Quartic              | 0.1          | 91077       | raw/1478.nc      | post/1478.nc      |
| 130  | finished | 2000.0        | 2                     | Levy                 | 0.0          | 554703      | raw/130.nc       | post/130.nc       |
| 2951 | finished | 2000.0        | 100                   | Shubert              | 0.0          | 152730      | raw/2951.nc      | post/2951.nc      |
| 2547 | finished | 2000.0        | 2                     | Schwefel2_20         | 0.1          | 204304      | raw/2547.nc      | post/2547.nc      |
| ...  | ...      | ...           | ...                   | ...                  | ...          | ...         | ...              | ...               |
| 1423 | finished | 2000.0        | 2                     | XinSheYang           | 0.0          | 328933      | raw/1423.nc      | post/1423.nc      |
| 713  | finished | 2000.0        | 10                    | Ackley               | 0.0          | 1357        | raw/713.nc       | post/713.nc       |
| 1954 | finished | 2000.0        | 10                    | Powell               | 0.0          | 276639      | raw/1954.nc      | post/1954.nc      |
| 691  | finished | 2000.0        | 10                    | Ackley               | 0.0          | 504520      | raw/691.nc       | post/691.nc       |


The data for each experiment is stored in two ways:
* `raw`: file with the **entire history of iterations** for multiple realizations and optimizers
* `post`: file with already post-processed data containing a **performance metric** and the **problem-specific characteristics**

The datasets are described in more detail in the [assignment description](https://github.com/bessagroup/3dasm_course/tree/main/Projects). 

### 👋 Say hi to the multidimensional pandas: `xarray`

<img src="./img/hi_xarray.png" title="hi xarray" width="30%" align="right">
<img src="./img/datastructure.png" width="60%" align="left">

* `pandas` is great when working with two coordinates (rows and columns) 
* However, when you have structured data that uses more coordinates it becomes complicated ..
* `xarray`: a data-analysis library for multi-coordinate data with a similar API to `pandas`!

In [8]:
import xarray as xr
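To illustrate why named dimensions help, the sketch below builds a small synthetic `xarray.DataArray` and reduces it along named axes. The dimension names (`optimizer`, `realization`, `iteration`) and the random values are assumptions for this example and do not necessarily match the layout of the project files.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
optimizers = ["Adam", "CMAES", "PSO"]
# Synthetic objective values with shape (optimizer, realization, iteration)
values = rng.uniform(size=(len(optimizers), 5, 100))

history = xr.DataArray(
    values,
    dims=("optimizer", "realization", "iteration"),
    coords={"optimizer": optimizers},
    name="y",
)

# Best value found in each run, then averaged over realizations per optimizer
best_per_run = history.min(dim="iteration")
mean_best = best_per_run.mean(dim="realization")
print(mean_best.sel(optimizer="CMAES").item())
```

Selecting and reducing by dimension name avoids having to remember which axis number holds which quantity.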

On the [3dasm GitHub page](https://github.com/bessagroup/3dasm_course) you can download a Python file (`l2o.py`) with functions that help you on the project:

In [9]:
import sys

# Insert the helper functions file in the system path so that Python can import it
sys.path.insert(0, '/home/martin/Documents/GitHub/3dasm_course/Projects/L2O/')
import l2o

As an example, we can inspect the `raw` data of experiment \#280 from the small dataset with the `open_one_dataset_raw()` function:

In [10]:
# Go out of presentation mode to see this cell in its entirety

l2o.open_one_dataset_raw(small_dataset, 280)

### Aim of the project

Design a general *data-driven optimization strategy* with machine learning that competes with classic hand-engineered optimizers

You will have to fill in the template code provided in the `l2o.py` file:
* Implement the `predict()` function of your custom strategy!

In [11]:
class MyStrategy(l2o.CustomStrategy):
    name: str = "custom_strategy"

    def predict(self, features):
        ...

<img src="./img/predict.png" width="35%">
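To make the idea of "predicting an optimizer from problem features" concrete, here is a standalone, rule-based toy. Everything in it is hypothetical: the `ProblemFeatures` container, the function `choose_optimizer`, the feature names, and the rules themselves are made up for illustration and do not reflect the actual `l2o.CustomStrategy` interface or the features in the datasets. In the project, a trained model would take the place of the hand-written rules.

```python
from dataclasses import dataclass

@dataclass
class ProblemFeatures:
    """Hypothetical feature container; the real project features may differ."""
    dimensionality: int
    noise: float
    multimodal: bool

def choose_optimizer(features: ProblemFeatures) -> str:
    """Toy stand-in for a learned predict(): map features to an optimizer name."""
    if features.noise > 0.0:
        return "CMAES"       # assumed heuristic: population-based search for noisy problems
    if features.multimodal:
        return "PSO"         # assumed heuristic: global search for many local minima
    return "L-BFGS-B"        # assumed heuristic: gradient-based for smooth, unimodal problems

print(choose_optimizer(ProblemFeatures(dimensionality=10, noise=0.0, multimodal=True)))
```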

The project consists of answering 4 questions:

* The first 3 questions let you explore the datasets and teach you how to train a basic classification model.
* In the last question you can show your creativity by fitting a machine learning model of your choice!

## [Black-Box Optimization Benchmarking BBOB](https://numbbo.github.io/workshops/BBOB-2022/index.html) competition
