<img src=../figures/Brown_logo.svg width=50%>

# Data-Driven Design & Analyses of Structures & Materials (3dasm)

## Lecture 24

### Miguel A. Bessa | <a href = "mailto: miguel_bessa@brown.edu">miguel_bessa@brown.edu</a>  | Associate Professor

### Suryanarayanan M. S. | <a href = "mailto: s.manojsanu@tudelft.nl">s.manojsanu@tudelft.nl</a>  | PhD Candidate

# Outline for today's lecture

<!-- <img align=right src=./figures/grr.png width=60%> -->

A brief introduction to optimization (for machine learning):

* Optimization problem formulation: Objective, variables and constraints
* Characteristics of optimization problems: Modality, and convexity
* Solving optimization problems: Broad classification of optimizers; recall Taylor series expansion

**References:**
* J. R. R. Martins & Andrew Ning, Engineering Design Optimization, 2021 - Chapters 1 & 4
* For practical details of algorithms (Extra):
    * Nocedal, Jorge, and Stephen J. Wright. Numerical optimization. Springer Science & Business Media, 2006.

# Why optimization?

Optimization is a central field in Engineering. It appears in multiple parts of the data-driven process.

<img src=./figures/blocks.png width=100%>

In Machine Learning, we saw that optimization is used when finding Point Estimates at 3 levels:

1. Level 1: Finding a Point Estimate of parameters of a model
    *  $\hat{\mathbf{z}} \in \underset{\mathbf{z}}{\mathrm{argmin}} \left[\mathcal{L}(\mathbf{z})\right]$, where $\mathcal{L}(\mathbf{z})$ is the loss function (the negative log posterior) and $\mathbf{z}$ are the model parameters.

2. Level 2: Finding a Point Estimate of the hyperparameters of a model
    * Hyperparameters are usually *continuous*, but they can also be *discrete* (or a mixture of both).

3. Level 3: Finding the best model or model structure
    * Typically this is a *discrete* search: choose best among $\mathcal{M}_1$, $\mathcal{M}_2$, etc.

**Note**: Sometimes it is difficult to distinguish between (discrete) "hyperparameters" and "model structures". Often, that distinction becomes clear in context. Recall that:

* "Hyperparameter" is a parameter associated with a particular model or model structure. All hyperparameters should be at the same "level" (Level 2, i.e. a level above the parameters of a model).

* "Model structure" implies that this search happens at Level 3, i.e. each model structure is usually associated with a unique set of hyperparameters.

Let's recall some practical examples of the different levels of Model Selection:

* **Example 1**: Consider a **Linear Regression Model** ($\mathcal{M}_1$) whose **parameters are the weights** and whose **hyperparameters** are the **degree of the polynomial** (integer hyperparameter) and the **regularization strength** (continuous hyperparameter). And consider a **Gaussian process model** ($\mathcal{M}_2$), which has no parameters and whose **hyperparameters** are the **kernel parameters**. Then:

    * **Level 3** is the selection of the best model between $\mathcal{M}_1$ and $\mathcal{M}_2$.
    * **Level 2** is the hyperparameter optimization for each model, leading to $\hat{\boldsymbol{\theta}}$ (note that in this case the hyperparameters are different for each model).
    * **Level 1** only requires optimization for the Linear Regression Model (to find the "optimal" weights $\hat{\mathbf{z}}$ for each $\boldsymbol{\theta}$ of $\mathcal{M}_1$). The Gaussian Process model is obtained by Bayesian inference at Level 1, so this model ($\mathcal{M}_2$) does not require optimization at this level.

Note: **In this case**, Level 2 requires an optimization algorithm capable of simultaneously optimizing *discrete* and *continuous* variables! There are many optimizers that **cannot** do that well (the ones that can are usually called *Mixed-Integer Programming* solvers).

* **Example 2**: Alternatively, it is useful to understand that **we can define different levels** even when only considering Linear Regression. The parameters would be the weights ($\mathbf{z}$), but the **hyperparameter would now be the regularization strength** ($\theta$), while the **polynomial degree could be considered a model structure** ($\mathcal{M}$). Then:

    * **Level 3** would be the choice of each model structure $\mathcal{M}_1$ (e.g. Linear Regression with polynomial basis of degree 2), $\mathcal{M}_2$ (Linear regression with degree 3), etc.
    * **Level 2** would be the hyperparameter optimization for each model, i.e. finding the "best" regularization strength $\hat{\theta}$ for each model structure.
    * **Level 1** is the optimization of the weights $\mathbf{z}$ (for each hyperparameter $\theta$ of each model structure $\mathcal{M}_i$)

Note: In this case, Level 2 only requires simpler (and more common) optimization algorithms that deal with continuous variables. Level 3 may not even require an optimization algorithm, as it simply amounts to choosing the best model structure by its performance (e.g. from the ExperimentData table of f3dasm, from which you picked the best model in the Homework). Level 1 is usually built into machine learning packages, as most methods already come with common optimizers that can be chosen to find their parameters.
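To make the three levels of Example 2 concrete, here is a minimal sketch written as three nested loops. Everything in it is an assumption made for illustration: the synthetic data, the candidate degrees and regularization strengths, and the helper names `features`, `fit_weights` and `val_mse`. Level 2 is done with a plain grid search to keep the code short, where a continuous optimizer could be used instead, and Level 1 uses the closed-form ridge regression solution.

```python
import numpy as np

# Synthetic 1D regression data (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(60)
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

def features(x, degree):
    # Polynomial basis [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

def fit_weights(x, y, degree, lam):
    # Level 1: weights z minimizing the regularized squared loss (closed form)
    Phi = features(x, degree)
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ y)

def val_mse(z, degree):
    # Validation error used to compare hyperparameters and model structures
    return float(np.mean((features(x_va, degree) @ z - y_va) ** 2))

results = {}
for degree in (1, 3, 5):                  # Level 3: model structure
    best = None
    for lam in (1e-3, 1e-1, 1e1):         # Level 2: hyperparameter
        z = fit_weights(x_tr, y_tr, degree, lam)   # Level 1: parameters
        err = val_mse(z, degree)
        if best is None or err < best[0]:
            best = (err, lam, z)
    results[degree] = best

best_degree = min(results, key=lambda d: results[d][0])
print("best degree:", best_degree, "with lambda:", results[best_degree][1])
```

Moving the polynomial degree into the inner loop, next to the regularization strength, would turn this into the setup of Example 1, where Level 2 mixes a discrete and a continuous variable.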

I am sharing this with you just to highlight that there are different ways of setting up your Model Selection strategy, as it depends on the problem you are trying to solve, the optimization algorithms you are familiar with, and the computational resources you have available...

Ideally, becoming an expert in machine learning should imply becoming an expert in optimization...

In practice... That's not always the case...

This course will only give you a very basic introduction to optimization algorithms, and it will not cover mixed-integer programming. So, we will just focus on some simple gradient-based optimizers, and you will also explore some gradient-free optimizers.

## Optimization: formalism

General formulation of an optimization  problem:

$\begin{align}
\text{minimize} \quad & f(\mathbf{x}) \\
\end{align}
$

$\begin{align}
\text{by varying} \quad & \underline{x}_i \leq \ x_i \leq \overline{x}_i  & i=1,...,D_{x}
\end{align}
$

$\begin{align}
\text{subject to} \quad & g_j(\mathbf{x}) \leq 0  & j=1,...,N_g \\
                        & h_l(\mathbf{x}) = 0  & l=1,...,N_h
\end{align}
$

* $\mathbf{x}$ are decision variables.
    * $\underline{x}_i$ and $\overline{x}_i$ are the lower and upper bounds of input $x_i$ for each of the $D_{x}$ input dimensions.
* $f(\mathbf{x})$ is the objective function (e.g. the loss function)
* $\mathbf{g}$ is the vector with all $N_g$ inequality constraint functions, and $\mathbf{h}$ the $N_h$ equality constraint functions.

**Note:** Sometimes the constraints can be written differently in different software packages and even for different algorithms.

The formulation we wrote above is called the LHS form. Different textbooks follow different conventions.

For example, you can see how [SciPy](https://docs.scipy.org/doc/scipy/tutorial/optimize.html#constrained-minimization-of-multivariate-scalar-functions-minimize) summarizes its optimization algorithms. There you can see that `SLSQP` uses a combined LHS and RHS formulation, while `trust-constr` uses the RHS formulation.
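As a small illustration of why the convention matters, the sketch below solves a toy problem with `scipy.optimize.minimize` and `method="SLSQP"`. The objective `f`, the constraint `g`, the bounds and the starting point are all made up for this example. The one detail to notice is the sign: the lecture writes inequality constraints as $g(\mathbf{x}) \leq 0$, whereas the dictionary-based constraints of SciPy treat `"ineq"` as `fun(x) >= 0`, so we pass $-g$.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Objective function (toy example)
    return (x[0] - 1.0) ** 2 + (x[1] - 2.5) ** 2

def g(x):
    # Inequality constraint in the lecture convention: g(x) <= 0
    return x[0] + x[1] - 2.0

bounds = [(0.0, 3.0), (0.0, 3.0)]   # box constraints on x_1 and x_2

# SciPy's dict-based "ineq" constraints mean fun(x) >= 0, hence the minus sign.
constraints = [{"type": "ineq", "fun": lambda x: -g(x)}]

res = minimize(f, x0=np.array([0.5, 0.5]), method="SLSQP",
               bounds=bounds, constraints=constraints)
print(res.x, res.fun, g(res.x))
```

For this toy problem the unconstrained minimum $(1, 2.5)$ violates $g$, so the solution sits on the constraint boundary at $(0.25, 1.75)$, i.e. the constraint is active.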

### A. Decision variables or Design variables
        
<img style="float: right;" src="./figures/dvs.png" width=20%>

* Represent the space of possibilities (a.k.a. design space)
    * This is the space that will be searched for the best solution
* Variables can be:
    * continuous or discrete
    * Bounded or unbounded
    * No. of design variables = dimensionality
    * Physically meaningful or abstract
        * E.g. thickness of a beam or parameters of a model


### B. Objective function

<img style="float: right;" src="./figures/objective.png" width=30%>

* Performance metric to be optimized

* Single or multiple objectives are possible

    * E.g. Cost vs quality, Weight vs stiffness etc.

* Can be linear or non-linear (in the design variables)

* May have different meanings
    * E.g. weight of a beam or loss of a model

**Note**: Recall that optimization is often formulated as a minimization problem. Maximization problems can be converted to minimization problems by multiplying the objective function by -1.
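A quick sanity check of this note, using a made-up function `profit` and a simple grid of candidate points (both are assumptions for illustration): minimizing $-f$ returns the same point as maximizing $f$.

```python
def profit(x):
    # Toy objective we would like to maximize (peak at x = 3)
    return -(x - 3.0) ** 2 + 5.0

def neg_profit(x):
    # Same problem written as a minimization
    return -profit(x)

grid = [0.01 * i for i in range(601)]      # candidate points in [0, 6]
x_from_max = max(grid, key=profit)         # maximize f
x_from_min = min(grid, key=neg_profit)     # minimize -f
print(x_from_max, x_from_min)
```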

### C. Constraints

<img style="float: right;" src="./figures/constraint.png" width=30%>


* Functions that limit the design space.
    * Linear or non-linear
    * E.g. Stress in a beam $\leq$ yield stress of the material.

* Make optimization much more difficult.
    * Solution has to be **feasible** (not violate any constraints)
    * Solution has to be **optimal** (best possible solution)

* Equality or inequality
    * Inequality is more difficult to handle!
    * They can be on or off (active or inactive)
    
Note: Bounds on the design variables are often also called **box constraints**. However, these are trivial to handle.
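The ideas of feasibility and of active or inactive constraints can be sketched in a few lines. The helper `constraint_status`, the tolerance and the two example constraints are hypothetical and only serve to illustrate the definitions, assuming inequality constraints written as $g_j(\mathbf{x}) \leq 0$.

```python
def constraint_status(x, inequality_constraints, tol=1e-8):
    """Classify each constraint g_j(x) <= 0 at the point x."""
    status = []
    for g in inequality_constraints:
        value = g(x)
        if value > tol:
            status.append("violated")   # x is infeasible
        elif abs(value) <= tol:
            status.append("active")     # x sits on the constraint boundary
        else:
            status.append("inactive")   # constraint does not restrict x locally
    return status

def is_feasible(x, inequality_constraints, tol=1e-8):
    return "violated" not in constraint_status(x, inequality_constraints, tol)

# Two toy constraints: x_1 + x_2 - 2 <= 0 and -x_1 <= 0
constraints = [lambda x: x[0] + x[1] - 2.0, lambda x: -x[0]]

print(constraint_status([0.25, 1.75], constraints))  # on the boundary of the first
print(constraint_status([2.0, 2.0], constraints))    # violates the first
```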

### Narrowing our focus

- Optimization is extremely vast

- Fortunately, the vast majority of optimization scenarios needed in machine learning involve:
    - Discrete or continuous decision variables
    - Single objective
    - Trivial constraints (just box constraints)

**Good news**: Most optimization scenarios in machine learning are formulated as

$\begin{align}
\text{minimize} \quad & f(\mathbf{x}) \\
\end{align}
$

$\begin{align}
\text{by varying} \quad & \underline{x}_i \leq \ x_i \leq \overline{x}_i  & i=1,...,D_x
\end{align}
$
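One simple way to respect box constraints, shown here only as a sketch, is to clip every iterate back into the bounds after each update (often called projected gradient descent). The function names, the toy objective $f(\mathbf{x}) = (x_1 - 2)^2 + (x_2 + 1)^2$, the step size and the number of steps are assumptions chosen for illustration.

```python
def clip(value, lower, upper):
    # Project a single variable back into [lower, upper]
    return max(lower, min(value, upper))

def projected_gradient_descent(grad, x, bounds, step=0.1, n_steps=200):
    for _ in range(n_steps):
        g = grad(x)
        # Gradient step followed by projection onto the box
        x = [clip(xi - step * gi, lo, hi)
             for xi, gi, (lo, hi) in zip(x, g, bounds)]
    return x

def grad_f(x):
    # Gradient of f(x) = (x_1 - 2)^2 + (x_2 + 1)^2
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] + 1.0)]

bounds = [(0.0, 1.0), (0.0, 1.0)]
x_opt = projected_gradient_descent(grad_f, [0.5, 0.5], bounds)
print(x_opt)
```

The unconstrained minimum $(2, -1)$ lies outside the box, so the iterates end up at the corner $(1, 0)$ of the design space.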

- What about the **bad news**?

Most Engineering applications involve nontrivial constraints.

* Your Final project will illustrate this (quite simplified).

But we will not explicitly cover constrained optimization in the lectures. You'll have to get creative!

There are some simple ways to handle constraints that all of you will immediately think about. But there are more sophisticated ways...

# Characteristics of an optimization problem

* Optimization problems usually have some key attributes that:
    * Help categorize the problem and candidate optimization algorithms to consider
    * Eventually help you get better solutions with fewer resources!

* Examples of important attributes are:
    * Smoothness of the functions involved
    * Linearity
    * Stochasticity
    * **Modality**
    * **Convexity**

### Modality

<img align=right src="./figures/modality.png" width=30%>

**Modes** (i.e. minima) are points that are no worse than their neighbours

Formally:

$\mathbf{x}^*$ is a (local) minimum if there exists $\epsilon > 0$ such that $f(\mathbf{x}^*) \leq f(\mathbf{x})$ for every $\mathbf{x}$ with $||\mathbf{x} - \mathbf{x}^*|| < \epsilon$

Global minima vs local minima:

* Multi-modal functions are much more difficult to optimize (just like multi-modal distributions are difficult to integrate)
    * Risks of getting stuck in local minima
    * Every global minimum is also a local minimum...

* **So, how do we know we found the global optimum?**

In general, we don't know... 😮‍💨
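A minimal sketch of what "getting stuck" looks like, assuming a toy multi-modal function $f(x) = x^4 - 3x^2 + x$ and a plain gradient descent with a fixed step (the function, the step size and the starting points are all illustrative choices): the minimum that is found depends on where the search starts.

```python
def f(x):
    # Toy function with two minima: one global, one local
    return x**4 - 3.0 * x**2 + x

def df(x):
    # Derivative of f
    return 4.0 * x**3 - 6.0 * x + 1.0

def gradient_descent(x, step=0.01, n_steps=2000):
    for _ in range(n_steps):
        x = x - step * df(x)
    return x

starts = [-2.0, -0.5, 0.5, 2.0]
minima = [gradient_descent(x0) for x0 in starts]
for x0, x_star in zip(starts, minima):
    print(f"start {x0:5.2f} -> x* = {x_star:.4f}, f(x*) = {f(x_star):.4f}")

# Keeping the best of several runs (multi-start) is a common workaround,
# but it still gives no guarantee of having found the global minimum.
best = min(minima, key=f)
```

The two starts on the left end near $x \approx -1.30$ (the global minimum), while the two on the right end near $x \approx 1.13$ (a local minimum with a higher function value).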

### Convexity

* Purely mathematical
* **Golden rule: If an optimization problem is convex, all local minima are global minima**


* What is a convex optimization problem ?
    * If all functions involved are convex
    * ...

* Convex functions should meet 2 criteria

**A** : *The line segment joining any two points of the graph of the function has to lie on or above the function itself.*
* For any two points $\mathbf{x}_i, \mathbf{x}_j$ in the domain of $f$ and variable $0 \leq \alpha \leq 1$ :
    
$$f(\alpha \mathbf{x}_i + (1 - \alpha) \mathbf{x}_j) \leq \alpha f(\mathbf{x}_i) + (1 - \alpha) f(\mathbf{x}_j)$$
<img align=centre src="./figures/convexity.png" width=60%>

**B** : *The domain of the function should not have holes*
* The line segment joining two points of a set should also lie in the set
* For continuous design variables in unconstrained settings, this is always valid

<img align=centre src="./figures/convexity.png" width=60%>
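Criterion **A** can be probed numerically. The sketch below (the helper `violates_convexity`, the sample points and the tolerance are assumptions for illustration) searches a finite set of point pairs for a violation of the inequality. Note the asymmetry: finding a violation proves that a function is not convex, but finding none on a finite sample does not prove convexity.

```python
import math

def violates_convexity(f, points, alphas, tol=1e-9):
    """Return True if the convexity inequality fails for some sampled pair."""
    for xi in points:
        for xj in points:
            for a in alphas:
                lhs = f(a * xi + (1.0 - a) * xj)
                rhs = a * f(xi) + (1.0 - a) * f(xj)
                if lhs > rhs + tol:
                    return True
    return False

points = [-3.0 + 0.25 * i for i in range(25)]   # samples in [-3, 3]
alphas = [0.1 * i for i in range(11)]           # alpha in [0, 1]

print(violates_convexity(lambda x: x**2, points, alphas))               # convex
print(violates_convexity(math.sin, points, alphas))                     # not convex
print(violates_convexity(lambda x: x**4 - 3.0 * x**2, points, alphas))  # not convex
```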

* The **bad news**?

Except for some simple models (e.g. linear regression), **most ML models have (severely) non-convex loss** functions (i.e. negative log posterior, or negative log likelihood if there is no prior)...

# Solving optimization problems

* Searching the design space = optimizers
* Choice of optimizer is dependent on problem characteristics
    * No free lunch theorem!

## Classifying optimization algorithms

<img align=centre src="./figures/optimizers.png" width=50%>

1. **Search strategy**
    * Local: Search in the vicinity of their initialization
        * **Exploitative** nature
        * Finds the nearest minimum
        
    * Global: Search the entire design space (or try)
        * **Exploratory** in nature

2. **Algorithm design**
    * Heuristics
        * Nature inspired or based on rules-of-thumb
        * Robust (fewer assumptions) & general
    * Mathematically designed
        * Strictly converge to an optimal point*.
        * Works well when the assumptions match the problem
        * Very efficient
        * Difficult to design and implement

3. **Order of information used**
    * Practically, the most important classification
    * Let's dive a bit deeper!

## Taylor series and order of information


* Approximating a function $f$ at a given point $x_0$
    
$$f(x + \alpha)|_{x=x_0} = f(x_0) + \frac{df}{dx}\Big\lvert_{x=x_0} \alpha + \frac{1}{2}\frac{d^2 f}{d x^2}\Big\lvert_{x=x_0}\alpha^2 + \mathcal{O}(\alpha^3)$$ 

* $\alpha$ is the perturbation
    * In one dimension there is only one direction, so the perturbation is always along the $x$ axis
* More terms imply a better approximation
    * The error in the approximation scales as $\alpha^{n+1}$ if we include all terms up to order $n$
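The error scaling can be checked numerically. The sketch below assumes $f(x) = e^x$ expanded around $x_0 = 0$ (chosen because all its derivatives at 0 equal 1) and a hypothetical helper `taylor_exp`. Halving $\alpha$ should reduce the error by a factor of about $2^{n+1}$ when terms up to order $n$ are kept.

```python
import math

def taylor_exp(alpha, order):
    """Taylor polynomial of exp around x0 = 0, including terms up to `order`."""
    return sum(alpha**k / math.factorial(k) for k in range(order + 1))

ratios = {}
for order in (1, 2):
    err_big = abs(math.exp(0.1) - taylor_exp(0.1, order))
    err_small = abs(math.exp(0.05) - taylor_exp(0.05, order))
    ratios[order] = err_big / err_small
    print(f"order {order}: error ratio when halving alpha = {ratios[order]:.2f}")
```

The ratios come out close to $2^2 = 4$ for the first-order approximation and close to $2^3 = 8$ for the second-order one.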
        


If we have $D_x$ dimensions for the input variables:

$$ f(\mathbf{x} + \alpha \mathbf{p})|_{\mathbf{x}=\mathbf{x}_0} \approx f(\mathbf{x}_0) + \alpha \nabla f(\mathbf{x}_0)^T \mathbf{p} + \frac{1} {2}\alpha^2 \mathbf{p}^T   \mathbf{H}(\mathbf{x}_0)   \mathbf{p}$$

* The main difference is that we perturb along a given direction!
* This is the form used by many optimization algorithms 

* Three terms in the series
    * **Zeroth order term**: value of the function at $\mathbf{x}_0$ (our point of interest)
    
    * **First order term**: gradient of $f$ with respect to $\mathbf{x}$, i.e. the vector of partial derivatives
        - Generalization of the first derivative to $D_x$ dimensions:
         $$ \nabla f(\mathbf{x}) = \Big[\frac{\partial f}{ \partial x_1}, \frac{\partial f}{ \partial x_2}, \ldots, \frac{\partial f}{ \partial x_{D_x}} \Big]^T$$
        - The gradient points in the direction of steepest **ascent** of the function, i.e. the direction in which the function increases the fastest
            * (if you want to minimize, you want to go in the opposite direction of the gradient, i.e. the direction of steepest **descent**)

    * **Second order term**: Hessian of $f$ with respect to $\mathbf{x}$, i.e. the matrix of second-order derivatives
        - Generalization of the second derivative to $D_x$ dimensions. It is a symmetric matrix of size $D_x\times D_x$:
         $$ \mathbf{H}(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_{D_x}} \\ \vdots &  \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_{D_x} \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_{D_x}^2} \end{bmatrix} $$

        - This matrix represents the curvature of the function (since it describes how the gradient changes!).
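The three terms can be put together in a short sketch. It assumes a toy function $f(\mathbf{x}) = x_1^2 x_2 + e^{x_2}$ with a hand-derived gradient and Hessian, and a hypothetical helper `quadratic_model` that evaluates the second-order expansion along a direction $\mathbf{p}$ from $\mathbf{x}_0$.

```python
import math

def f(x):
    return x[0] ** 2 * x[1] + math.exp(x[1])

def grad(x):
    # Vector of first partial derivatives
    return [2.0 * x[0] * x[1], x[0] ** 2 + math.exp(x[1])]

def hess(x):
    # Symmetric matrix of second partial derivatives
    return [[2.0 * x[1], 2.0 * x[0]],
            [2.0 * x[0], math.exp(x[1])]]

def quadratic_model(x0, p, alpha):
    g, H = grad(x0), hess(x0)
    slope = sum(gi * pi for gi, pi in zip(g, p))                  # grad^T p
    curvature = sum(p[i] * H[i][j] * p[j]
                    for i in range(2) for j in range(2))          # p^T H p
    return f(x0) + alpha * slope + 0.5 * alpha**2 * curvature

x0 = [1.0, 0.0]
p = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]   # unit direction

def model_error(alpha):
    x = [x0[i] + alpha * p[i] for i in range(2)]
    return abs(f(x) - quadratic_model(x0, p, alpha))

# Halving alpha should shrink the error of the quadratic model about 8 times.
ratio = model_error(0.1) / model_error(0.05)

# A small step against the gradient decreases the function value.
g0 = grad(x0)
x_down = [x0[i] - 0.01 * g0[i] for i in range(2)]
print(ratio, f(x0), f(x_down))
```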

* Zeroth-order optimizers (also called black-box optimizers)
    * Use only function evaluations
    * Common in hyperparameter tuning
* First-order optimizers
    * Gradient-based optimizers
    * Overwhelming majority of ML optimizers
* Second-order optimizers
    * Theoretically faster than first-order, but memory intensive and computationally costly
    * Unpopular in ML. Could this ever change?
        * E.g. K-FAC, Shampoo etc.
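To see what each order of information buys, here is a sketch with one very simple representative per class, applied to a toy convex quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$. The matrix, the step size, the sample budget and the function names are assumptions for illustration, and practical optimizers are far more refined than these three.

```python
import random

A = [[3.0, 1.0], [1.0, 2.0]]   # Hessian of the quadratic (constant)
b = [1.0, 1.0]                 # the minimizer solves A x = b, i.e. x = [0.2, 0.4]

def f(x):
    Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return 0.5 * (x[0] * Ax[0] + x[1] * Ax[1]) - (b[0] * x[0] + b[1] * x[1])

def grad(x):
    return [A[0][0] * x[0] + A[0][1] * x[1] - b[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b[1]]

def random_search(n_samples=2000, seed=0):
    # Zeroth order: only function evaluations
    rng = random.Random(seed)
    candidates = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
                  for _ in range(n_samples)]
    return min(candidates, key=f)

def gradient_descent(x, step=0.2, n_steps=100):
    # First order: function gradient
    for _ in range(n_steps):
        g = grad(x)
        x = [x[0] - step * g[0], x[1] - step * g[1]]
    return x

def newton_step(x):
    # Second order: gradient and Hessian (solve H d = g for the 2x2 case)
    g = grad(x)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    d = [(A[1][1] * g[0] - A[0][1] * g[1]) / det,
         (-A[1][0] * g[0] + A[0][0] * g[1]) / det]
    return [x[0] - d[0], x[1] - d[1]]

x_rs = random_search()
x_gd = gradient_descent([1.0, 1.0])
x_nt = newton_step([1.0, 1.0])
print(x_rs, x_gd, x_nt)
```

On a quadratic, a single Newton step lands on the minimizer, gradient descent needs many cheap iterations, and random search only gets close after many function evaluations. In $D_x$ dimensions the Hessian has $D_x^2$ entries, which is where the memory and cost of second-order methods come from.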

## Summary

* Optimizers are algorithms for searching for solutions within a design (or decision) space
* General form of an optimization problem
* Constraints make optimization difficult
* Convex optimization problems are super nice... But not common in ML...

**Next lecture**
* More about optimizers
* Optimality criteria (to identify minima)
* Automatic differentiation

### See you next class

Have fun!