<div class="alert alert-block alert-info">
Section of the book chapter: <b>5.3 Model Selection, Optimization and Evaluation</b>
</div>

# 5. Model Selection and Evaluation

**Table of Contents**

* [5.1 Hyperparameter Optimization](#5.1-Hyperparameter-Optimization)
* [5.2 Model Evaluation](#5.2-Model-Evaluation)

**Learning objectives:**

- how to optimize the hyperparameters of machine learning (ML) models with grid search, randomized search and Bayesian optimization,
- how to evaluate ML models with common regression metrics.



### Packages

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.ensemble import RandomForestRegressor

import utils

### Read in Data

**Dataset:** Felix M. Riese and Sina Keller, "Hyperspectral benchmark dataset on soil moisture", Dataset, Zenodo, 2018. [DOI:10.5281/zenodo.1227836](http://doi.org/10.5281/zenodo.1227836) and [GitHub](https://github.com/felixriese/hyperspectral-soilmoisture-dataset)

**Introducing paper:** Felix M. Riese and Sina Keller, “Introducing a Framework of Self-Organizing Maps for Regression of Soil Moisture with Hyperspectral Data,” in IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 2018, pp. 6151-6154. [DOI:10.1109/IGARSS.2018.8517812](https://doi.org/10.1109/IGARSS.2018.8517812)

In [None]:
X_train, X_test, y_train, y_test = utils.get_xy_split()

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

### Fix Random State

In [None]:
np.random.seed(42)

***

## 5.1 Hyperparameter Optimization

Content:

- [5.1.1 Grid Search](#5.1.1-Grid-Search)
- [5.1.2 Randomized Search](#5.1.2-Randomized-Search)
- [5.1.3 Bayesian Optimization](#5.1.3-Bayesian-Optimization)

### 5.1.1 Grid Search

In [None]:
# NBVAL_IGNORE_OUTPUT

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# example model: support vector regressor
model = SVR(kernel="rbf")

# define parameter grid to be tested
params = {
    "C": np.logspace(-4, 4, 9),
    "gamma": np.logspace(-4, 4, 9)}


# set up grid search and run it on the data
gs = GridSearchCV(model, params)
%timeit gs.fit(X_train, y_train)
print("R2 score = {0:.2f} %".format(gs.score(X_test, y_test)*100))
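
Grid search evaluates every combination of the listed values, so the cost grows multiplicatively with each added hyperparameter. The following minimal sketch enumerates the same 9 x 9 grid in plain Python and counts the resulting model fits. It assumes the 5-fold cross-validation and final refit that recent scikit-learn versions use by default; `n_folds` and the other variable names are illustrative only.

```python
from itertools import product

# same values as np.logspace(-4, 4, 9): 1e-4, 1e-3, ..., 1e4
exponents = range(-4, 5)
params = {
    "C": [10.0 ** e for e in exponents],
    "gamma": [10.0 ** e for e in exponents],
}

# one dict per candidate, covering every combination of the grid values
candidates = [dict(zip(params, values)) for values in product(*params.values())]

n_candidates = len(candidates)
n_folds = 5  # assumption: default cv of GridSearchCV in recent scikit-learn versions
n_fits = n_candidates * n_folds + 1  # +1 for the final refit on the full training set

print(n_candidates, n_fits)
```

Since `%timeit` repeats the statement several times, the measured search is executed more than once.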

### 5.1.2 Randomized Search

In [None]:
# NBVAL_IGNORE_OUTPUT

from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

# example model: support vector regressor
model = SVR(kernel="rbf")

# define candidate values to sample from
params = {
    "C": np.logspace(-4, 4, 9),
    "gamma": np.logspace(-4, 4, 9)}

# set up randomized search (15 sampled candidates) and run it on the data
gsr = RandomizedSearchCV(model, params, n_iter=15, refit=True)
%timeit gsr.fit(X_train, y_train)
print("R2 score = {0:.2f} %".format(gsr.score(X_test, y_test)*100))
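
Randomized search is not limited to a fixed list of values. scikit-learn also accepts distribution objects with an `rvs` method (for example `scipy.stats.loguniform`), so candidates can be drawn from a continuous range. The sketch below only illustrates the sampling step with NumPy. The variable names and the log-uniform range from 1e-4 to 1e4 are assumptions chosen to match the grid above, and this is not how `RandomizedSearchCV` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(42)  # local generator, independent of np.random.seed
n_iter = 15                      # same number of candidates as in the cell above
log_low, log_high = -4, 4        # exponents of the search range

# draw exponents uniformly, then exponentiate: values are log-uniformly distributed
samples = 10.0 ** rng.uniform(log_low, log_high, size=(n_iter, 2))
candidates = [{"C": c, "gamma": g} for c, g in samples]

for candidate in candidates[:3]:
    print(candidate)
```

Sampling the exponent rather than the value itself gives every order of magnitude the same chance of being tried.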

### 5.1.3 Bayesian Optimization

Implementation: [github.com/fmfn/BayesianOptimization](https://github.com/fmfn/BayesianOptimization)

In [None]:
# NBVAL_IGNORE_OUTPUT

from sklearn.svm import SVR
from bayes_opt import BayesianOptimization

# define function to be optimized
# note: scoring on the test set means the optimum is tuned to the test data
def opt_func(C, gamma):
    model = SVR(C=C, gamma=gamma)
    return model.fit(X_train, y_train).score(X_test, y_test)

# set bounded region of parameter space
pbounds = {'C': (1e-5, 1e4), 'gamma': (1e-5, 1e4)}

# define optimizer
optimizer = BayesianOptimization(
    f=opt_func,
    pbounds=pbounds,
    random_state=1)

# optimize
%time optimizer.maximize(init_points=2, n_iter=15)
print("R2 score = {0:.2f} %".format(optimizer.max["target"]*100))
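
The bounds above span nine orders of magnitude on a linear scale, so most of the explored region lies at large values of `C` and `gamma`. One possible alternative is to let the optimizer work on the exponents instead. The sketch below shows such a wrapper under stated assumptions: `make_log_objective`, `log_pbounds` and `dummy_score` are hypothetical names, and `dummy_score` is only a stand-in for the SVR scoring function so that the example runs without the dataset.

```python
import math

def make_log_objective(score_func):
    """Wrap score_func(C, gamma) so that the search runs on a log10 scale."""
    def objective(log10_C, log10_gamma):
        return score_func(C=10.0 ** log10_C, gamma=10.0 ** log10_gamma)
    return objective

# bounds in log10 space covering the same range as pbounds (1e-5 to 1e4)
log_pbounds = {"log10_C": (-5, 4), "log10_gamma": (-5, 4)}

# stand-in objective for illustration only: best at C = 10 and gamma = 0.01
def dummy_score(C, gamma):
    return -abs(math.log10(C) - 1.0) - abs(math.log10(gamma) + 2.0)

objective = make_log_objective(dummy_score)

best = objective(log10_C=1.0, log10_gamma=-2.0)
worse = objective(log10_C=4.0, log10_gamma=4.0)
print(best > worse)
```

Because `BayesianOptimization` passes parameters to `f` by keyword, the argument names of `objective` have to match the keys of `log_pbounds`. To avoid tuning on the test data, `score_func` could return a cross-validated score on the training set instead of `score(X_test, y_test)`.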

***

## 5.2 Model Evaluation

Content:

- [5.2.1 Generate Exemplary Data](#5.2.1-Generate-Exemplary-Data)
- [5.2.2 Plot the Data](#5.2.2-Plot-the-Data)
- [5.2.3 Evaluation Metrics](#5.2.3-Evaluation-Metrics)

In [None]:
import sklearn.metrics as me

### 5.2.1 Generate Exemplary Data

In [None]:
# generate example data
np.random.seed(1)

# define x grid
x_grid = np.linspace(0, 10, 11)
y_model = x_grid*0.5

# define first dataset without outlier
y1 = np.array([y + np.random.normal(scale=0.2) for y in y_model])

# define second dataset with outlier
y2 = np.copy(y1)
y2[9] = 0.5

# define third dataset with higher variance
y3 = np.array([y + np.random.normal(scale=1.0) for y in y_model])

### 5.2.2 Plot the Data

In [None]:
# plot example data
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(12,4))
fontsize = 18
titleweight = "bold"
titlepad = 10

scatter_label = "Data"
scatter_alpha = 0.7
scatter_s = 100
ax1.scatter(x_grid, y1, label=scatter_label, alpha=scatter_alpha, s=scatter_s)
ax1.set_title("(a) Low var.", fontsize=fontsize, fontweight=titleweight, pad=titlepad)

ax2.scatter(x_grid, y2, label=scatter_label, alpha=scatter_alpha, s=scatter_s)
ax2.set_title("(b) Low var. + outlier", fontsize=fontsize, fontweight=titleweight, pad=titlepad)

ax3.scatter(x_grid, y3, label=scatter_label, alpha=scatter_alpha, s=scatter_s)
ax3.set_title("(c) Higher var.", fontsize=fontsize, fontweight=titleweight, pad=titlepad)

for i, ax in enumerate([ax1, ax2, ax3], start=1):
    
    # red line
    ax.plot(x_grid, y_model, label="Model", c="tab:red", linestyle="dashed", linewidth=4, alpha=scatter_alpha)
    
    # x-axis cosmetics
    ax.set_xlabel("x in a.u.", fontsize=fontsize)
    ax.tick_params(axis="x", labelsize=fontsize)
    
    # y-axis cosmetics
    if i != 1:
        ax.set_yticklabels([])
    else:
        ax.set_ylabel("y in a.u.", fontsize=fontsize, rotation=90)
        ax.tick_params(axis="y", labelsize=fontsize)
    ax.set_xlim(-0.5, 10.5)
    ax.set_ylim(-0.5, 6.5)
    # ax.set_title("Example "+str(i), fontsize=fontsize)
    if i == 2:
        ax.legend(loc=2, fontsize=fontsize*1.0, frameon=True)

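plt.tight_layout()

# make sure the output directory exists before saving the figure
import os
os.makedirs("plots", exist_ok=True)
plt.savefig("plots/metrics_plot.pdf", bbox_inches="tight")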

### 5.2.3 Evaluation Metrics

In [None]:
# calculate the metrics (scikit-learn convention: y_true first, then y_pred)
for i, y in enumerate([y1, y2, y3]):
    print("Example", i+1)
    print("- MAE = {:.2f}".format(me.mean_absolute_error(y, y_model)))
    print("- MSE = {:.2f}".format(me.mean_squared_error(y, y_model)))
    print("- RMSE = {:.2f}".format(np.sqrt(me.mean_squared_error(y, y_model))))
    print("- R2 = {:.2f}%".format(me.r2_score(y, y_model)*100))
    print("-"*20)
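
For reference, the four metrics can be written out with NumPy. The following is a minimal sketch of the standard definitions, not the scikit-learn implementation. The function names and the small arrays are made up for illustration. It also shows that MAE and MSE are symmetric in their arguments while R2 is not, because R2 normalizes by the variance of the true values.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as y."""
    return np.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# illustrative values only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])

print("MAE  =", mae(y_true, y_pred))
print("MSE  =", mse(y_true, y_pred))
print("RMSE =", rmse(y_true, y_pred))
print("R2   =", r2(y_true, y_pred))
print("R2 with swapped arguments =", r2(y_pred, y_true))
```

The last two lines print different values, which is why the order of the arguments matters for R2.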