# Problem 1: Bayesian Optimization By Hand

While a variety of libraries can perform Bayesian optimization, it is important to understand the underlying process. In this problem, you will build a Bayesian optimization pipeline by hand and use it to find 15 of the top 5% of materials in a dataset.

Taken as a whole, this problem may seem daunting. However, when broken into its components, it is quite manageable. Recall the Bayesian optimization flow chart from the lectures, which details the core components of a Bayesian optimization pipeline. Rather than trying to solve this problem all in one shot, we recommend building the pipeline one component at a time and making sure you understand each component before moving on to the next.

You are given the crossed barrel dataset, which shows the toughness of materials as a function of several 3D printer parameters. For learning purposes, we are going to pretend that we don't know the optimal parameters in advance and perform a simulated optimization campaign to find them. Your task is to start with 5 random samples from the data and use Bayesian optimization to find candidates in the top 5% of the data. Use the Gaussian process model included in the `scikit-learn` package as your surrogate model. We would like you to perform the optimization with two different acquisition functions: expected improvement (EI) and upper confidence bound (UCB). You will need to implement these acquisition functions by hand (don't worry, it's easy). Your termination criterion is finding at least 15 candidates from the top 5% of the materials in the dataset.
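The campaign described above can be sketched as a loop: fit the surrogate, score the unmeasured rows, "measure" the best-scoring one, and repeat until enough top candidates are found. This is only a minimal sketch under stated assumptions, not a reference solution: `run_campaign`, its parameters, the toy pool, the kernel settings, and the placeholder acquisition are all hypothetical, and the toy data stands in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C


def run_campaign(pool, target, acquisition, n_init=5, n_target=15,
                 max_iter=150, seed=0):
    """Simulate a campaign on a fully labeled pool (hypothetical helper)."""
    features = [c for c in pool.columns if c != target]
    threshold = pool[target].quantile(0.95)  # cutoff for the top 5%
    measured = pool.sample(n=n_init, random_state=seed)
    found_history = []
    for _ in range(max_iter):
        # Check the termination criterion on everything measured so far
        n_found = int((measured[target] >= threshold).sum())
        found_history.append(n_found)
        if n_found >= n_target:
            break
        # Fit the surrogate to the measured points only
        gp = GaussianProcessRegressor(kernel=C(1.0) * RBF(1.0),
                                      normalize_y=True, random_state=seed)
        gp.fit(measured[features].to_numpy(), measured[target].to_numpy())
        # Score every unmeasured row and "measure" the highest-scoring one
        remaining = pool.drop(index=measured.index)
        mu, sigma = gp.predict(remaining[features].to_numpy(), return_std=True)
        scores = acquisition(mu, sigma, measured[target].max())
        next_index = remaining.index[int(np.argmax(scores))]
        measured = pd.concat([measured, pool.loc[[next_index]]])
    return measured, found_history


# Toy pool: a 15 x 15 grid with a single smooth peak (not the real dataset)
grid = np.linspace(0.0, 10.0, 15)
xx, yy = np.meshgrid(grid, grid)
toy_pool = pd.DataFrame({"x": xx.ravel(), "y": yy.ravel()})
toy_pool["score"] = -((toy_pool["x"] - 3.0) ** 2 + (toy_pool["y"] - 7.0) ** 2)

# Placeholder acquisition: predicted mean plus two standard deviations
measured, found_history = run_campaign(
    toy_pool, "score", lambda mu, sigma, best: mu + 2.0 * sigma,
    n_target=5, max_iter=30)
```

Passing the acquisition function as an argument keeps the loop identical for both acquisition functions, so only the scoring rule changes between the two campaigns.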

Your code will likely get very messy as you experiment and learn how to implement these components. However, clean code is an expectation in professional settings. Therefore, we require that your final code be clean and commented (this includes function explanations) upon submission. 3 points will be assigned to this. If you are unsure whether your code is clean enough, ask a TA or instructor for feedback.

**Specific Tasks**

Please show the following for both UCB and EI cases:
- How many iterations it took to find 15 candidates (should be fewer than 150)
- A plot of the 15 candidates you found relative to the entire dataset
- A plot of the number of candidates you found as a function of iteration count
- A plot of the best candidate you found as a function of the iteration count
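
The two per-iteration curves requested above can be derived from the sequence of measured values in the order they were sampled. The following is a minimal sketch, assuming a hypothetical helper `campaign_traces` and made-up toughness values and threshold used purely for illustration.

```python
import numpy as np


def campaign_traces(measured_values, threshold):
    """Return the cumulative count of top candidates and the running best value."""
    values = np.asarray(measured_values, dtype=float)
    # Number of measured points at or above the top-5% cutoff, per step
    n_found = np.cumsum(values >= threshold)
    # Best value seen so far, per step
    best_so_far = np.maximum.accumulate(values)
    return n_found, best_so_far


# Made-up values in sampling order (not results from the dataset)
values = [9.3, 25.4, 17.2, 31.0, 28.5]
n_found, best_so_far = campaign_traces(values, threshold=25.0)
```

Both arrays can be passed directly to `plt.plot` against the iteration index.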

Finally, provide some commentary on the performance of your optimization campaigns. What surprised you about the results? What would you do differently if you were to do this in a real experimental setting?

In [1]:
# Import statements for the rest of the blocks
import pandas as pd
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
import matplotlib.pyplot as plt

# set my random seed here 
rnd_seed = 42

In [3]:
# Start by setting up my data pool of 5 random points from the initial dataset
data_full = pd.read_csv('data/crossed_barrel_dataset.csv')
data_pool = data_full.sample(n=5,random_state=rnd_seed)
print(data_pool)

       n  theta    r     t  toughness
1591  10    125  2.0  1.05  25.992194
943   10     50  2.1  1.05  18.128545
869    8    175  1.6  1.40  17.153946
162    8      0  2.3  0.70  25.358264
1271   6    100  1.7  1.40   9.274256


Define a function that generates the Gaussian process model from a given `data_pool`.


In [11]:

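def generateGP(data_pool):
    """Fit a Gaussian process surrogate model to the sampled data.

    Parameters:
        data_pool (pd.DataFrame): sampled points. The 'toughness' column is
            the target; all other columns are used as input features.

    Returns:
        GaussianProcessRegressor: the fitted Gaussian process model.
    """
    if not isinstance(data_pool, pd.DataFrame):
        raise TypeError("data_pool must be a pandas DataFrame.")

    # Split the pool into input features (X) and the target (y)
    X = data_pool.drop(columns=['toughness'])
    y = data_pool['toughness']

    # Constant kernel (signal variance) times an RBF kernel (length scale)
    kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))

    # Restart the hyperparameter optimizer several times to avoid poor local
    # optima; seed the restarts so results are reproducible
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9,
                                  random_state=rnd_seed)

    # Fit the model to the sampled data
    gp.fit(X, y)
    return gp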



In [10]:
gp = generateGP(data_pool)
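
Both acquisition functions need the surrogate's predicted mean and uncertainty at every unmeasured candidate. In `scikit-learn`, `GaussianProcessRegressor.predict` returns both when called with `return_std=True`. The sketch below illustrates this on made-up one-dimensional data; the names `toy_gp`, `X_train`, and `X_candidates` and the kernel settings are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# Made-up training data: 8 noisy-free samples of a sine curve
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(8, 1))
y_train = np.sin(X_train).ravel()

kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
toy_gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2,
                                  random_state=0)
toy_gp.fit(X_train, y_train)

# Predicted mean and standard deviation at each candidate point
X_candidates = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
mu, sigma = toy_gp.predict(X_candidates, return_std=True)
```

When the model was fitted on a DataFrame, as in `generateGP`, passing candidates as a DataFrame with the same feature columns avoids a feature-name warning from `scikit-learn`.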

## Upper Confidence Bound

In [2]:
#your code goes here
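
As a starting point, the upper confidence bound for a maximization problem is commonly written as the predicted mean plus a multiple of the predicted standard deviation. The sketch below assumes that form; the function name, the `kappa` parameter, its default of 2.0, and the example numbers are illustrative choices, not values prescribed by the assignment.

```python
import numpy as np


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Score candidates as mu + kappa * sigma (higher is better).

    kappa is an assumed exploration weight: larger values favor
    uncertain candidates, smaller values favor high predicted means.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return mu + kappa * sigma


# Made-up predictions for three candidates
mu = np.array([10.0, 12.0, 11.0])
sigma = np.array([0.5, 0.1, 2.0])
scores = upper_confidence_bound(mu, sigma, kappa=2.0)
best_index = int(np.argmax(scores))  # index of the next candidate to sample
```

Here the third candidate wins despite its lower predicted mean, because its large uncertainty leaves more room for a high true value.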

## Expected Improvement

In [3]:
#your code goes here
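
One common closed form of expected improvement for maximization is EI = (mu - best - xi) * Phi(z) + sigma * phi(z), with z = (mu - best - xi) / sigma, where Phi and phi are the standard normal CDF and PDF. The sketch below assumes that form; the function name, the `xi` exploration offset and its default, and the handling of zero-uncertainty points are illustrative choices.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu, sigma, best_y, xi=0.01):
    """Expected improvement over best_y for a maximization problem.

    xi is an assumed exploration offset. Where sigma is zero, the value
    falls back to max(mu - best_y - xi, 0), the limit as sigma goes to zero.
    """
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_1d(np.asarray(sigma, dtype=float))
    improvement = mu - best_y - xi
    ei = np.maximum(improvement, 0.0)
    positive = sigma > 0
    z = improvement[positive] / sigma[positive]
    ei[positive] = (improvement[positive] * norm.cdf(z)
                    + sigma[positive] * norm.pdf(z))
    return ei


# Made-up predictions for three candidates; best measured value so far is 11.5
mu = np.array([10.0, 12.0, 11.0])
sigma = np.array([0.5, 0.0, 2.0])
ei = expected_improvement(mu, sigma, best_y=11.5, xi=0.0)
next_index = int(np.argmax(ei))  # index of the next candidate to sample
```

Unlike UCB, this score depends on the best value measured so far, so it must be recomputed with an updated `best_y` after every new sample.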