# Optimization techniques Lab. 6: Bayesian Optimization
## Introduction
**Goal.** The goal of this lab is to study the behavior of Bayesian optimization on a regression problem and on a classification problem.
Bayesian optimization is a probabilistic approach based on Bayes' theorem, $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$. Briefly, we use the prior information $P(A)$ (random samples of the objective) to optimize a surrogate function, $P(B|A)$.
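
To make the idea of a surrogate concrete, here is a minimal sketch, separate from the lab code, that fits a Gaussian process to a handful of noisy samples and queries its mean and uncertainty. The objective `sin(10x)`, the sample count, the noise level, and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# a few noisy observations of an assumed 1-D objective, sin(10x)
X_obs = rng.random((5, 1))
y_obs = np.sin(10 * X_obs[:, 0]) + rng.normal(0.0, 0.1, size=5)

# fit the surrogate: the GP conditions its prior on the observations
# alpha is the assumed noise variance (0.1 ** 2)
gp = GaussianProcessRegressor(alpha=1e-2, random_state=0)
gp.fit(X_obs, y_obs)

# query the surrogate on a grid: mu is the estimate, std the uncertainty
X_grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
mu, std = gp.predict(X_grid, return_std=True)
print(mu.shape, std.shape)
```

The standard deviation is what the acquisition function later uses to trade off exploring uncertain regions against exploiting promising ones.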

**Getting started.** The following cells contain the implementation of the methods that we will use throughout this lab, together with utilities. 


In [None]:
from typing import Tuple
from warnings import catch_warnings, simplefilter
from matplotlib import pyplot
from numpy import arange, ndarray, sin, argmax, asarray, mean, vstack
from numpy.random import normal, random
from sklearn.datasets import make_blobs
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from skopt import gp_minimize
from skopt.space import Integer
from skopt.utils import use_named_args

In [None]:
def surrogate(model: GaussianProcessRegressor, X: ndarray[float]) -> Tuple[ndarray[float], ndarray[float]]:
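    """
    Surrogate (approximation) of the objective function.

    :param model: fitted Gaussian process regressor
    :param X: points at which to evaluate the surrogate, shape (n, 1)
    :return: predicted mean and standard deviation at each point
    """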
    # catch any warning generated when making a prediction
    with catch_warnings():
        # ignore generated warnings
        simplefilter("ignore")
        return model.predict(X, return_std=True)


def acquisition(X: ndarray[float], Xsamples: ndarray[float], model: GaussianProcessRegressor) -> float:
    # calculate the best surrogate score found so far
    yhat, _ = surrogate(model, X)
    best = max(yhat)
    # calculate mean and stdev via surrogate function
    mu, std = surrogate(model, Xsamples)
    # mu = mu[:, 0]
    # calculate the probability of improvement
    probs = acquisition_function()
    return probs


def opt_acquisition(X: ndarray[float], y: ndarray[float], model: GaussianProcessRegressor) -> float:
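    """
    Optimize the acquisition function by random search.

    :param X: points sampled so far, shape (n, 1)
    :param y: objective values observed at X (not used by the random search)
    :param model: fitted Gaussian process regressor
    :return: the random candidate in [0, 1) with the highest acquisition score
    """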
    # random search, generate random samples
    Xsamples: ndarray[float] = random(100)
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    # calculate the acquisition function for each sample
    scores = acquisition(X, Xsamples, model)
    # locate the index of the largest scores
    ix = argmax(scores)
    return Xsamples[ix, 0]


def plot(X: ndarray[float], y: ndarray[float], model: GaussianProcessRegressor) -> None:
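    """
    Plot the real observations against the surrogate function.

    :param X: sampled points, shape (n, 1)
    :param y: objective values observed at X
    :param model: fitted Gaussian process regressor
    :return: None
    """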
    # scatter plot of inputs and real objective function
    pyplot.scatter(X, y)
    # line plot of surrogate function across domain
    Xsamples = asarray(arange(0, 1, 0.001))
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    ysamples, _ = surrogate(model, Xsamples)
    pyplot.plot(Xsamples, ysamples)
    # show the plot
    pyplot.show()
    pyplot.close()


def bayesian_optimization(generation: int) -> Tuple[ndarray[float], ndarray[float], GaussianProcessRegressor]:
    # sample the domain sparsely with noise
    xs: ndarray[float]
    ys: ndarray[float]
    xs, ys = initial_point()
    # reshape into rows and cols
    xs = xs.reshape(len(xs), 1)
    ys = ys.reshape(len(ys), 1)
    # define the model
    model: GaussianProcessRegressor = GaussianProcessRegressor()  # you can set the kernel and the optimizer
    # fit the model
    model.fit(xs, ys)
    # perform the optimization process
    for i in range(generation):
        # select the next point to sample
        x = opt_acquisition(xs, ys, model)
        # sample the point
        actual: float = objective(x)
        # summarize the finding
        est, _ = surrogate(model, [[x]])
        # add the data to the dataset
        xs = vstack((xs, [[x]]))
        ys = vstack((ys, [[actual]]))
        # update the model
        model.fit(xs, ys)
    return xs, ys, model

# Implementation
Your first step will be to implement the following functions:

1. `objective()` is the function to optimize.
2. `initial_point()` returns the initial set of points (a priori knowledge).
3. `acquisition_function()` implements the acquisition function.
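
As a starting point for the third item, the sketch below shows one common choice, probability of improvement, which matches the comment in `acquisition()`. The helper name `probability_of_improvement` and its parameters are hypothetical; this is one possible implementation under a Gaussian posterior assumption, not the required solution.

```python
import numpy as np
from scipy.stats import norm


def probability_of_improvement(mu, std, best):
    """Hypothetical helper: P(f(x) > best) under a Gaussian posterior."""
    mu = np.asarray(mu, dtype=float).ravel()
    std = np.asarray(std, dtype=float).ravel()
    # the small constant avoids division by zero where the surrogate is certain
    return norm.cdf((mu - best) / (std + 1e-9))


# three candidates: below, equal to, and above the current best estimate
scores = probability_of_improvement(mu=[0.0, 0.5, 1.0], std=[0.2, 0.2, 0.2], best=0.5)
print(scores)
```

To plug something like this into the lab code, `acquisition()` would need to pass `mu`, `std`, and `best` to `acquisition_function()`.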




In [None]:
# objective function
def objective(x: float, noise: float = 0.1) -> float:
    return sin(x * 10) + normal(loc=0, scale=noise)


# remember to return the values in the right order and with the right types
def initial_point(size: int = 2) -> Tuple[ndarray[float], ndarray[float]]:
    xs: ndarray[float] = random(size)
    ys: ndarray[float] = asarray([objective(x) for x in xs])
    return xs, ys


# you have to add the parameters that you need
def acquisition_function() -> float:
    return 0

# Regression
---
## Questions:
- How does the prior knowledge change the optimization?
- How does the kernel change the optimization? (see the scikit-learn documentation on [kernels](https://scikit-learn.org/stable/modules/gaussian_process.html#kernels-for-gaussian-processes))
- How does the acquisition function affect the optimization?
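
To explore the kernel question, one option is to fit the same observations with different kernels and compare the resulting surrogates. The sketch below is independent of the lab functions; the kernels, length scales, noise levels, and sample data are assumptions picked for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

rng = np.random.default_rng(1)
X_obs = rng.random((8, 1))
y_obs = np.sin(10 * X_obs[:, 0]) + rng.normal(0.0, 0.1, size=8)

# two candidate kernels, each with a WhiteKernel term to model observation noise
kernels = {
    "RBF": RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01),
    "Matern(nu=1.5)": Matern(length_scale=0.1, nu=1.5) + WhiteKernel(noise_level=0.01),
}

X_grid = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
mean_std = {}
for name, kernel in kernels.items():
    gp = GaussianProcessRegressor(kernel=kernel, random_state=0)
    gp.fit(X_obs, y_obs)
    _, std = gp.predict(X_grid, return_std=True)
    # average predictive uncertainty over the domain, one number per kernel
    mean_std[name] = float(std.mean())
    print(name, gp.kernel_, round(mean_std[name], 3))
```

In the lab code, the same idea amounts to passing `kernel=...` where `bayesian_optimization()` constructs the `GaussianProcessRegressor`.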

In [None]:
def regression() -> None:
    X: ndarray[float]
    y: ndarray[float]
    model: GaussianProcessRegressor
    X, y, model = bayesian_optimization(10)
    plot(X, y, model)
    # best result
    # argmax returns a single index; X and y are column vectors of shape (n, 1)
    ix: int = int(argmax(y))
    print('Best Result: x=%.3f, y=%.3f' % (X[ix, 0], y[ix, 0]))


regression()

# Classifier
---
## Questions:
- Try different ranges of hyperparameters. How do the results change?
- Does the model influence the choice of the hyperparameters?
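
Because the search space here is small, an exhaustive sweep is a cheap way to sanity-check what the Bayesian search reports. The sketch below uses only scikit-learn; the dataset size, the `random_state`, and the ranges are assumptions for illustration and can be widened to study the first question.

```python
from itertools import product

from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# fixed seed so that repeated runs evaluate the same dataset
X, y = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

# evaluate every (n_neighbors, p) pair with 5-fold cross validation
scores = {}
for n_neighbors, p in product(range(1, 6), (1, 2)):
    model = KNeighborsClassifier(n_neighbors=n_neighbors, p=p)
    scores[(n_neighbors, p)] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

best = max(scores, key=scores.get)
print("best (n_neighbors, p):", best, "accuracy: %.3f" % scores[best])
```

For larger spaces the sweep becomes expensive, which is where a model-based search such as `gp_minimize` is expected to help.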

In [None]:
def classifier() -> None:
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=500, centers=3, n_features=2)
    # define the model

    model = KNeighborsClassifier()
    # define the space of hyperparameters to search
    search_space = [Integer(1, 5, name='n_neighbors'), Integer(1, 2, name='p')]

    # define the function used to evaluate a given configuration
    @use_named_args(search_space)
    def evaluate_model(**params):
        # configure the model with the sampled hyperparameters
        model.set_params(**params)
        # calculate 5-fold cross validation
        with catch_warnings():
            # ignore generated warnings
            simplefilter("ignore")
            result = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='accuracy')
            # calculate the mean of the scores
            estimate = mean(result)
            return 1.0 - estimate

    # perform optimization
    result = gp_minimize(evaluate_model, search_space)
    # summarize the findings
    print('Best Accuracy: %.3f' % (1.0 - result.fun))
    print('Best Parameters: n_neighbors=%d, p=%d' % (result.x[0], result.x[1]))


classifier()

# BONUS

In the classifier you saw the effect of hyperparameter tuning.
Now change the acquisition function in the regression problem by adding a slack variable as a hyperparameter. How does this variable affect the optimization?
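
One way to read the slack variable is as a margin `xi` that a candidate must beat on top of the current best. The sketch below adds it to probability of improvement; the helper name `pi_with_slack`, the parameter name `xi`, and the toy numbers are assumptions for illustration, not part of the lab code.

```python
import numpy as np
from scipy.stats import norm


def pi_with_slack(mu, std, best, xi=0.0):
    """Hypothetical helper: P(f(x) > best + xi) under a Gaussian posterior."""
    mu = np.asarray(mu, dtype=float).ravel()
    std = np.asarray(std, dtype=float).ravel()
    # the small constant avoids division by zero where the surrogate is certain
    return norm.cdf((mu - best - xi) / (std + 1e-9))


# candidate 0: slightly better mean, low uncertainty (exploitation)
# candidate 1: worse mean, high uncertainty (exploration)
mu = np.array([1.05, 0.80])
std = np.array([0.05, 0.60])
best = 1.0

# index of the candidate selected for each value of the slack
choices = {xi: int(np.argmax(pi_with_slack(mu, std, best, xi))) for xi in (0.0, 0.5)}
print(choices)
```

With no slack the confident candidate wins; with a larger slack only the uncertain candidate has a realistic chance of clearing the margin, so the search shifts toward exploration.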