# Bayesian Optimization

#### Bayesian Hyperparameter Optimization
* Why auto-tuning matters:
    * Humans are bad at tuning hyperparameters by hand
    * A simple model with well-tuned hyperparameters can outperform a more complex, state-of-the-art model whose hyperparameters are poorly tuned
* Tuning tips:
    * Keep an open mind (explore the full space from the beginning)
        * Don't prejudge hyperparameter possibilities
    * Don't use grid search as your hyperparameter search method
        * In practice you'll have many hyperparameters; some will matter, and some will end up being mostly irrelevant. A grid spends most of its budget re-testing the same few values of the ones that matter while it varies the irrelevant ones
    * Try to eliminate irrelevant hyperparameters where possible
        * The more hyperparameters you have, the harder tuning becomes
    * Seeing a clear pattern can take much longer than you expect
        * If you want really good tuning, be prepared to spend a lot of time on it
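
The grid search tip can be illustrated with a small counting experiment. This is a minimal sketch, assuming two hyperparameters scaled to [0, 1] of which only the first one matters; the helper names `grid_points` and `random_points` are made up for this example. With the same budget of nine evaluations, the grid tries only three distinct values of the important hyperparameter, while random sampling tries nine.

```python
import itertools
import random

def grid_points(n_per_axis):
    """All combinations of n_per_axis evenly spaced values in [0, 1] on two axes."""
    axis = [i / (n_per_axis - 1) for i in range(n_per_axis)]
    return list(itertools.product(axis, axis))

def random_points(n, seed=0):
    """n points drawn uniformly at random from the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

grid = grid_points(3)    # 3 x 3 grid: 9 evaluations
rand = random_points(9)  # 9 random evaluations

# Count how many distinct values of the first (important) hyperparameter get tested.
distinct_grid = len({x for x, _ in grid})
distinct_rand = len({x for x, _ in rand})
print(distinct_grid, distinct_rand)
```

If the second hyperparameter turns out to be irrelevant, the grid has effectively run only three informative experiments for the price of nine.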
        
**Bayesian parameter estimation for automatically tuning hyperparameters:**
* Neural nets have hyperparameters (e.g. learning rate, number of layers) that are set before training rather than learned by it
* You can evaluate them using a validation set, but there's still the problem of which values to try:
    * Brute force search (e.g. grid search, random search) is very expensive and time-consuming
* Hyperparameter tuning is a kind of black-box optimization: you want to minimize a function, but you can only query its value at chosen points, not compute gradients
* Each evaluation is expensive, so we want to use as few evaluations as possible
* You want to query a point which:
    * you expect to be good
    * you are uncertain about
* $\Rightarrow$ **Bayesian regression allows us to predict not just a value, but a distribution.** $\Leftarrow$

**Bayesian Linear Regression**
* We're interested in the uncertainty
* Bayesian Linear Regression considers various plausible explanations for how the data were generated 
* It makes predictions using all possible regression weights, weighted by their posterior probability
* We can turn this into non-linear regression using basis functions (e.g. Gaussian basis functions)
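
As a rough illustration of the points above, here is a minimal sketch of Bayesian linear regression with Gaussian basis functions in plain Python. It assumes a zero-mean isotropic Gaussian prior on the weights with precision `alpha` and a known noise precision `beta`; those values, the basis centers and width, and all function names are illustrative choices, not taken from the notes. The posterior precision is $A = \alpha I + \beta \Phi^\top \Phi$, and the predictive variance at $x$ is $1/\beta + \phi(x)^\top A^{-1} \phi(x)$.

```python
import math

def gaussian_basis(x, centers, width=0.5):
    """Feature vector: a bias term plus one Gaussian bump per center."""
    return [1.0] + [math.exp(-0.5 * ((x - c) / width) ** 2) for c in centers]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit_posterior(xs, ys, centers, alpha=1.0, beta=25.0):
    """Return (A, m): posterior precision A = alpha*I + beta*Phi^T Phi and mean m."""
    Phi = [gaussian_basis(x, centers) for x in xs]
    d = len(Phi[0])
    A = [[(alpha if i == j else 0.0) + beta * sum(p[i] * p[j] for p in Phi)
          for j in range(d)] for i in range(d)]
    rhs = [beta * sum(p[i] * y for p, y in zip(Phi, ys)) for i in range(d)]
    return A, solve(A, rhs)

def predict(x, A, m, centers, beta=25.0):
    """Posterior predictive mean and variance at x."""
    phi = gaussian_basis(x, centers)
    mean = sum(p * w for p, w in zip(phi, m))
    var = 1.0 / beta + sum(p * v for p, v in zip(phi, solve(A, phi)))
    return mean, var

centers = [0.5 * i for i in range(7)]          # 0.0, 0.5, ..., 3.0
xs, ys = [0.0, 0.5, 1.0], [0.0, 0.48, 0.84]    # roughly sin(x) at three points
A, m = fit_posterior(xs, ys, centers)
mean_near, var_near = predict(0.5, A, m, centers)  # close to the observed data
mean_far, var_far = predict(3.0, A, m, centers)    # far from the observed data
print(var_near, var_far)
```

The predictive variance is small near the observed inputs and much larger at `x = 3.0`, where no data constrain the nearby basis function. That per-point uncertainty is the quantity the optimizer needs.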

**Bayesian Optimization**
* Applying all of this to black-box optimization: let's review the technique called **Bayesian optimization.**
* The actual function we're trying to optimize (e.g. validation error as a function of hyperparameters) is really complicated. So let's approximate it with a simple function, called the surrogate function.
* After we've queried a certain number of points, we can condition on these to infer the posterior over the **surrogate function** using Bayesian linear regression.
* To choose the next point to query, we must define an **acquisition function**, which tells us how promising each candidate point is.
    * Desiderata ([see MW](https://www.merriam-webster.com/dictionary/desideratum)):
        * high for points we expect to be good
        * high for points we're uncertain about
        * low for points we've already tried
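* One candidate is **Probability of Improvement (PI)**: the probability that a point beats the best value found so far. The problem with PI is that it favors points it is highly confident will yield a small improvement; usually, these are right next to ones we've already evaluated.
* A better choice: **Expected Improvement (EI).**
    * The idea: measure the expected size of the improvement over the best value so far, counting any non-improvement as zero. If the new value is much better, we win by a lot; if it's much worse, we haven't lost anything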
    * There is an explicit formula for this if the posterior predictive distribution is Gaussian.
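
To make the PI versus EI contrast concrete, here is a minimal sketch assuming minimization, a Gaussian posterior predictive $\mathcal{N}(\mu, \sigma^2)$ with $\sigma > 0$, and best observed value $f^*$. Under those assumptions the standard closed forms are $\mathrm{PI} = \Phi(z)$ and $\mathrm{EI} = (f^* - \mu)\,\Phi(z) + \sigma\,\phi(z)$ with $z = (f^* - \mu)/\sigma$, where $\Phi$ and $\phi$ are the standard normal CDF and PDF. The function names and the two candidate points are invented for illustration.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, best):
    """P(f(x) < best) under a Gaussian posterior predictive (minimization)."""
    return norm_cdf((best - mu) / sigma)

def expected_improvement(mu, sigma, best):
    """E[max(best - f(x), 0)] under a Gaussian posterior predictive (minimization)."""
    z = (best - mu) / sigma
    return (best - mu) * norm_cdf(z) + sigma * norm_pdf(z)

best = 1.0  # best (lowest) validation error observed so far

# Candidate A: right next to an evaluated point, predicted slightly better, very certain.
pi_a = probability_of_improvement(0.99, 0.01, best)
ei_a = expected_improvement(0.99, 0.01, best)

# Candidate B: unexplored region, predicted no better on average, very uncertain.
pi_b = probability_of_improvement(1.0, 1.0, best)
ei_b = expected_improvement(1.0, 1.0, best)

print(pi_a, pi_b)  # PI ranks A above B
print(ei_a, ei_b)  # EI ranks B above A
```

PI prefers the near-certain but tiny gain at candidate A, while EI prefers candidate B because a large improvement there is plausible and a bad outcome costs nothing beyond the evaluation.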
    
    

#### How does Bayesian optimization work?
* Bayesian optimization works by constructing a posterior distribution over functions (a Gaussian process) that best describes the function you want to optimize. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not.
* As you iterate, the algorithm balances exploration and exploitation while taking into account what it knows about the target function. At each step, a Gaussian process is fitted to the known samples (points previously explored), and the posterior distribution is combined with an exploration strategy, such as UCB (Upper Confidence Bound) or EI (Expected Improvement), to determine the next point to explore.
* These are the main parameters of the Bayesian Optimizer:
    * **`n_iter`:** This is how many steps of Bayesian optimization you want to perform. The more steps, the more likely you are to find a good maximum.
    * **`init_points`:** This is how many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
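
The loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the implementation of any particular library: a one-dimensional search space, a zero-mean Gaussian process with an RBF kernel of fixed length scale, a UCB acquisition with weight `kappa`, and a fixed grid of candidate points. The function `bayes_opt` and every parameter other than `init_points` and `n_iter` are hypothetical names chosen for this example.

```python
import math
import random

def rbf(a, b, length=0.2):
    """RBF kernel with unit variance (rbf(x, x) == 1)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def bayes_opt(f, lo, hi, init_points=3, n_iter=10, kappa=2.0, noise=1e-4, seed=0):
    """Maximize f on [lo, hi]; returns (best_x, best_y, number_of_evaluations)."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(init_points)]  # random exploration
    ys = [f(x) for x in xs]
    candidates = [lo + (hi - lo) * i / 100 for i in range(101)]
    for _ in range(n_iter):
        # Fit the GP to all points seen so far (small noise term keeps K well conditioned).
        K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
             for i, a in enumerate(xs)]
        weights = solve(K, ys)  # K^{-1} y, shared by every candidate

        def ucb(x):
            k = [rbf(x, a) for a in xs]
            mean = sum(ki * wi for ki, wi in zip(k, weights))
            var = 1.0 - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
            return mean + kappa * math.sqrt(max(var, 0.0))

        x_next = max(candidates, key=ucb)  # most promising candidate under UCB
        xs.append(x_next)
        ys.append(f(x_next))
    best = max(range(len(ys)), key=lambda i: ys[i])
    return xs[best], ys[best], len(ys)

def objective(x):
    """Toy target with its maximum at x = 0.3."""
    return -(x - 0.3) ** 2

x_best, y_best, n_evals = bayes_opt(objective, 0.0, 1.0, init_points=3, n_iter=10)
print(x_best, y_best, n_evals)
```

The total number of evaluations of the expensive function is `init_points + n_iter`. A larger `kappa` pushes the search toward uncertain regions (exploration), and a smaller one toward points with a high predicted mean (exploitation).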