# Exercise set 6: Partial least squares regression and model evaluation

The main goals of this exercise are to perform Partial Least Squares (PLS) regression and use training and testing sets. Using training and testing sets allows us to assess the model's ability to generalise to unseen data and avoid overfitting. 

**Learning Objectives:**

After completing this exercise set, you will be able to:

- Create a PLS regression model.
- Create and use training and test sets.
- Assess your regression model by calculating root mean squared errors.

**To get the exercise approved, complete the following problems:**

- [6.1(a)](#6.1(a)) and [6.1(d)](#6.1(d)): To show that you can create a training set and a test set, create a PLS regression model, and evaluate your PLS model. You might find it helpful to also do [6.1(b)](#6.1(b)) and [6.1(c)](#6.1(c)).

**Files required for this exercise:**
* [Exercise 6.1](#Exercise-6.1): [egg-storage.csv](egg-storage.csv)
* [Exercise 6.2](#Exercise-6.2): [forbes.csv](forbes.csv)

Please ensure that these files are saved in the same directory as this notebook.

**Note:** [Exercise 6.2](#Exercise-6.2) is optional. It shows how to use cross-validation when there are too few samples to split into a training set and a test set.

## Exercise 6.1 

In this exercise, we will use a data set that contains [NIR spectra of poultry eggs at different storage days](https://data.mendeley.com/datasets/6hn67h2trb/1). We will use these spectra to create a regression model for predicting the number of days an egg has been stored. The data is given in the file [egg-storage.csv](egg-storage.csv) and it has already been preprocessed so you can use it directly.

You can load and visualise the spectra as follows:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline
sns.set_theme(style="ticks", context="notebook", palette="colorblind")

In [None]:
data = pd.read_csv("egg-storage.csv")
data.head()

In [None]:
# Extract the days:
y = data["days"]
# Extract the spectra:
xvars = [i for i in data.columns if i not in ("days",)]
X = data[xvars].to_numpy()
# Extract the wavelengths for plotting:
wavelengths = np.array([float(i.split("nm")[0]) for i in xvars])

To visualise the data, you can do the following:

In [None]:
from matplotlib.colors import Normalize
from matplotlib.cm import ScalarMappable

norm = Normalize(y.min(), y.max())
y_normed = norm(y)
cmap = sns.color_palette("Spectral", as_cmap=True)
color_days = cmap(y_normed)

fig, ax1 = plt.subplots(constrained_layout=True)
for i, spec in enumerate(X):
    ax1.plot(wavelengths, spec, color=color_days[i])
    
sm = ScalarMappable(cmap=cmap, norm=norm)
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax1, label='Days')
tick_locations = np.arange(0, y.max()+1)
cbar.set_ticks(tick_locations)

### 6.1(a)

**Task: Create a training set and a testing set by running the code below. How many spectra are in the training set, and how many are in the test set?**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=2026  # Make the data splitting process reproducible
)
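
A minimal sketch of how to inspect a split, using a small made-up array rather than the egg data (the names `X_demo` and `y_demo` are only for illustration). The first element of `.shape` gives the number of samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 4 variables each (not the egg spectra).
X_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2026
)
# The first element of .shape is the number of samples (rows):
n_train = X_demo_train.shape[0]
n_test = X_demo_test.shape[0]
print(f"Training samples: {n_train}, test samples: {n_test}")
```

The same attribute can be checked on `X_train` and `X_test` from the cell above.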

In [None]:
# Your code here

#### Your answer to question 6.1(a): How many spectra are in the training and test set, respectively?
*Double click here*

### 6.1(b)

**Task: Create a least squares model using the training data. Calculate the R² value and the root mean squared error for the calibration (RMSEC).**

**Hint:** Use [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) from scikit-learn to create the regression model, e.g.,
```python
from sklearn.linear_model import LinearRegression

model1 = LinearRegression()
model1.fit(X_train, y_train)
```

The R² and RMSEC can be calculated using:
```python
from sklearn.metrics import r2_score, root_mean_squared_error

y_pred_train = model1.predict(X_train)  # Predict using the model
r2_train = r2_score(y_train, y_pred_train)
rmsec = root_mean_squared_error(y_train, y_pred_train)
```
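
`root_mean_squared_error` was added in scikit-learn 1.4. If it is not available in your installation, the RMSE can be computed directly from its definition. The sketch below uses made-up numbers (`y_true_demo` and `y_pred_demo` are illustrative names) to show the calculation:

```python
import numpy as np

# Made-up observed and predicted values:
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.0, 2.0, 3.0, 6.0])

# RMSE = square root of the mean of the squared residuals:
residuals_demo = y_true_demo - y_pred_demo
rmse_demo = np.sqrt(np.mean(residuals_demo**2))
print(f"RMSE = {rmse_demo}")
```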

In [None]:
# Your code here

#### Your answer to question 6.1(b): What values did you get for R² and RMSEC?
*Double click here*

### 6.1(c)

**Task: Evaluate your linear regression model by calculating R² and the root mean squared error of prediction (RMSEP) for the test set. How do you interpret these values?**

In [None]:
# Your code here

#### Your answer to question 6.1(c): What values did you get for R² and RMSEP for the test set? How do you interpret these?
*Double click here*

### 6.1(d)

**Task: Create a partial least squares (PLS) regression model by running the code below. Calculate the R² for the training and testing set, RMSEC and RMSEP. How do you interpret these values?**

**Note:** If you have time, try generating the y vs. ŷ (observed vs. predicted) plot for the training set and for the test set. Visualising the results this way may help you spot outliers or non-linear trends.

In [None]:
from sklearn.cross_decomposition import PLSRegression

model_pls = PLSRegression(
    n_components=25,  # Use 25 latent variables, more on this later.
    scale=False  # Do not scale the X and y, since the spectra have been preprocessed.
)

model_pls.fit(X_train, y_train)
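
For the observed vs. predicted plot mentioned in the note, one possible approach is sketched below. The helper `plot_observed_vs_predicted` is a hypothetical function written for this illustration, and the values are made up; for the exercise, you would pass, e.g., `y_train` and `model_pls.predict(X_train)` instead.

```python
import numpy as np
from matplotlib import pyplot as plt


def plot_observed_vs_predicted(y_obs, y_pred, ax, title):
    """Scatter predicted against observed values with a y = x reference line."""
    # Flatten the inputs in case the predictions are stored as a column:
    y_obs = np.ravel(y_obs)
    y_pred = np.ravel(y_pred)
    ax.scatter(y_obs, y_pred, alpha=0.7)
    # Points on this diagonal are predicted perfectly:
    low = min(y_obs.min(), y_pred.min())
    high = max(y_obs.max(), y_pred.max())
    ax.plot([low, high], [low, high], ls="--", color="black")
    ax.set(xlabel="Observed y", ylabel="Predicted ŷ", title=title)
    return ax


# Made-up example values (not results from the egg data):
rng = np.random.default_rng(0)
y_obs_demo = rng.uniform(0, 20, size=30)
y_pred_demo = y_obs_demo + rng.normal(scale=1.0, size=30)

fig, ax = plt.subplots(constrained_layout=True)
plot_observed_vs_predicted(y_obs_demo, y_pred_demo, ax, "Demo data")
```

Points far from the dashed diagonal are candidates for outliers, and a curved point cloud hints at a non-linear trend.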

In [None]:
# Your code here

#### Your answer to question 6.1(d): What values did you get for R², RMSEC, and RMSEP, and how do you interpret them?
*Double click here*

### 6.1(e)

The PLS regression model you made above uses 25 latent variables. That sounds like a high number! We will check whether it is a reasonable choice by running cross-validation.

**Task: Run the code below and determine the number of latent variables to use for the PLS regression model.**

In [None]:
# We use a grid search with cross-validation to look for the
# best number of latent variables:
from sklearn.model_selection import GridSearchCV

# The parameter we are going to optimise is the number
# of PLS components. We assume that this is in the range
# from 1 to 50.
parameters = {"n_components": range(1, 51)}

# Next, we set up the grid search:
grid = GridSearchCV(
    PLSRegression(scale=False),  # This is the model we will make.
    parameters,  # The parameters we optimise.
    scoring="neg_root_mean_squared_error",  # We use a negative RMSE as a scoring for models.
    cv=10,  # Use 10 splits.
    refit=True,  # Refit the best model using all the data passed to fit().
    n_jobs=4,  # Run in parallel, using 4 processes.
    verbose=1,  # Print out slightly more while fitting.
)
# Why do we use a negative RMSE? It is because the GridSearchCV is maximising
# the score, and to make a smaller RMSE better, we turn it into a negative
# value.
# Run the grid search:
grid.fit(X_train, y_train)
# Store the optimised model:
model_pls_opt = grid.best_estimator_
print(model_pls_opt)

The grid search above might have picked many components (25) just because the error is slightly less than with fewer components. We can check where the error levels off to see if we should use fewer components. We can do this visually:

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
score = -1.0 * grid.cv_results_["mean_test_score"]
score_std = grid.cv_results_["std_test_score"]

ax.errorbar(
    parameters["n_components"],
    score,
    yerr=score_std,
    marker="o",
    markerfacecolor="none",
    ms=10,
)
ax.set(xlabel="PLS components", ylabel="RMSE")
ax.set_title("Results from cross-validation (grid search)", loc="left")
sns.despine(fig=fig)
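
One way to make "where the error levels off" more concrete is a simple heuristic in the spirit of the one-standard-error rule: pick the smallest number of components whose mean error is within one standard deviation of the lowest mean error. This is an assumption added here as an illustration, not a method prescribed by this exercise, and the numbers below are made up. For the real data, you would use `score` and `score_std` from the cell above.

```python
import numpy as np

# Made-up cross-validation results for 1 to 8 components:
n_components_demo = np.arange(1, 9)
score_demo = np.array([3.0, 2.0, 1.2, 1.0, 0.95, 0.93, 0.92, 0.94])
score_std_demo = np.array([0.3, 0.25, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1])

# Location of the lowest mean error:
idx_min = np.argmin(score_demo)
# Accept any model with a mean error at or below this threshold:
threshold = score_demo[idx_min] + score_std_demo[idx_min]
# Pick the first (smallest) number of components that qualifies:
idx_pick = np.flatnonzero(score_demo <= threshold)[0]
print(f"Lowest error at {n_components_demo[idx_min]} components")
print(f"Heuristic choice: {n_components_demo[idx_pick]} components")
```

Treat the result as a suggestion to compare with the plot, not as a definitive answer.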

#### Your answer to question 6.1(e): How many PLS components do you recommend using?
*Double click here*

## Exercise 6.2

It is not always feasible to do the split into training and test sets when we have few samples. Another option then is to use something called **Leave-one-out cross-validation** (LOOCV). LOOCV involves training the model on all but one data point and using the remaining point for testing, repeating this process for each data point. We will use that method in this exercise.

We will use the data of [Forbes](https://doi.org/10.1017/S0080456800032075) (from Exercise 5), who investigated the
relationship between the boiling point of water and the atmospheric pressure using data collected in the Alps and Scotland. Forbes' goal was to estimate altitudes from the boiling point alone.

LOOCV is described in [appendix A](#A.-Leave-one-out-cross-validation). It can be a good idea to read this before starting this exercise.

### 6.2(a)

**Task: Load the data from Forbes (data file [forbes.csv](forbes.csv)), plot it, and create a linear regression model (use scikit-learn)
that predicts the atmospheric pressure from the temperature. Report the R² and [root mean
squared error (RMSE)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html) for your model.**

In [None]:
# Your code here

#### Your answer to question 6.2(a): What value did you get for R² and the RMSE?
*Double click here*

### 6.2(b)

**Task: Estimate the error you can expect to make if you use your model for predicting the pressure.
Do this by LOOCV and calculate the root mean squared error of cross-validation (RMSECV).**

**Note:** LOOCV is a special case of **training** and **testing**, and you can find a short description of it
in [appendix A](#A.-Leave-one-out-cross-validation) with example code for running LOOCV. The code example for LOOCV is concise, so make sure you understand what goes on here (that is, what LOOCV is doing). If you are working with someone, try explaining testing/training and how LOOCV works to them.

In [None]:
# Your code here

#### Your answer to question 6.2(b): What value did you get for RMSECV?
*Double click here*

# Appendix

## A. Leave-one-out cross-validation

In Leave-one-out cross-validation (LOOCV), we first pick one sample,
measurement number $j$, and we fit the model using the $n-1$ other points
(all points except $j$). After the fitting, we check how well the model can predict
measurement $j$ by calculating the difference between the
measured ($y_j$) and predicted ($\tilde{y}_j$) value. This difference, $r_j = y_{j} - \tilde{y}_j$, is
called the predicted residual, and it tells us the error we just made.

There is nothing special about picking point $j$, and we can try all possibilities
of leaving one point out, fitting the model using the remaining $n-1$
measurements, and predicting the value we left out.
After doing this for all possibilities, we have fitted the model
$n$ times and calculated $n$ predicted residuals. The mean squared error (obtained from the squared
residuals), $\mathrm{MSE}_{\mathrm{CV}}$, can then be used
to estimate the error in the model,

\begin{equation}
\mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n} \sum_{i=1}^{n} r_i^2 =  \frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2,
\end{equation}

where $y_i$ is the measured $y$ in experiment $i$, and $\tilde{y}_i$ is the
predicted $y$, using a model which was fitted using all points *except* $y_i$.

For polynomial fitting (and other models fitted by ordinary least squares), there is an alternative
to refitting the model $n$ times. In fact, we can show that the mean squared error can
be obtained by,

\begin{equation}
\mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 =
\frac{1}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2,
\end{equation}

where the $\hat{y}_i$'s are predicted values using the
model fitted with *all data points*,
and $h_{ii}$ is the $i$'th diagonal element of the
$\mathbf{H}$ matrix (the projection matrix,
see Eq.(4.49) on page 49 in our textbook),

\begin{equation}
\mathbf{H} =
\mathbf{X} 
\left( 
  \mathbf{X}^\mathrm{T} \mathbf{X}
\right)^{-1}
\mathbf{X}^\mathrm{T} = \mathbf{X} \mathbf{X}^+.
\end{equation}

Here, $\mathbf{X}^+$ is the pseudo-inverse of $\mathbf{X}$.
Note the difference between $\hat{y}_i$ and $\tilde{y}_i$, and the
fact that we do not have to refit the model to obtain the $\mathrm{MSE}_{\mathrm{CV}}$.

When you calculate $\mathrm{MSE}_{\mathrm{CV}}$, use one of the two approaches above or both
if you want to see if they give the same answer.

In [None]:
# The examples below assume:
# - that the matrix X is called X_temp (a 2D NumPy array with one column),
# - that y is stored in the variable pressure (a 1D NumPy array).

# Example 1 of LOOCV:
from sklearn.linear_model import LinearRegression

# scikit-learn has a method to pick out samples for leave-one-out:
from sklearn.model_selection import LeaveOneOut


loo = LeaveOneOut()
error = []
# Split the X-data in X_temp into training and testing:
for train_index, test_index in loo.split(X_temp):
    # train_index = index of samples to use for training
    # test_index = index of samples to use for testing
    # Pick out samples (for training and testing):
    X_train, X_test = X_temp[train_index], X_temp[test_index]
    y_train, y_test = pressure[train_index], pressure[test_index]
    # Fit a new model with the training set:
    model = LinearRegression(fit_intercept=True).fit(X_train, y_train)
    # Predict y for the test set:
    y_hat = model.predict(X_test)
    # Compare the predicted y values in the test set with the measured ones:
    error.append((y_test - y_hat) ** 2)
rmsecv_1 = np.sqrt(np.mean(error))
print(f"RMSECV = {rmsecv_1}")

In [None]:
# Example 2 of LOOCV:

# scikit-learn has a method for leave-one-out selection, and a method for
# cross-validation. And these two can be combined:
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Create "empty" model for fitting:
model = LinearRegression(fit_intercept=True)
# Run cross-validation, where we select testing and training with LeaveOneOut:
scores = cross_val_score(
    model, X_temp, pressure, scoring="neg_mean_squared_error", cv=LeaveOneOut()
)
rmsecv_2 = np.sqrt(np.mean(-scores))
print(f"RMSECV = {rmsecv_2}")

In [None]:
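# Example 3 of LOOCV:

# We calculate the H matrix and use that. This example also assumes:
# - that temperature is a 1D array with the temperatures (the column in X_temp),
# - that pressure_hat contains the predictions from a model fitted to ALL points, e.g.:
#   pressure_hat = LinearRegression().fit(X_temp, pressure).predict(X_temp)
# Note! A detail that is easy to miss: the X used for H includes the column of ones!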
X_matrix = np.column_stack((np.ones_like(temperature), temperature))
H = X_matrix @ np.linalg.pinv(X_matrix)
hii = np.diagonal(H)
residuals_loo = (pressure - pressure_hat) / (1 - hii)
rmsecv_3 = np.sqrt(np.mean(residuals_loo**2))
print(f"RMSECV = {rmsecv_3}")
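
To check that refitting and the $\mathbf{H}$ matrix shortcut give the same answer, the following self-contained sketch runs both on made-up data (a noisy straight line, not Forbes' data). All names ending in `_demo` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Made-up data following a noisy straight line (not Forbes' data):
rng = np.random.default_rng(1)
x_demo = np.linspace(90.0, 100.0, 12)
y_demo = 2.0 * x_demo - 150.0 + rng.normal(scale=0.5, size=x_demo.size)
X_demo = x_demo.reshape(-1, 1)

# Approach 1: refit the model n times with leave-one-out.
scores_demo = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_squared_error", cv=LeaveOneOut(),
)
rmsecv_refit = np.sqrt(np.mean(-scores_demo))

# Approach 2: fit once and use the diagonal of the H matrix.
y_hat_demo = LinearRegression().fit(X_demo, y_demo).predict(X_demo)
X_matrix_demo = np.column_stack((np.ones_like(x_demo), x_demo))
H_demo = X_matrix_demo @ np.linalg.pinv(X_matrix_demo)
residuals_loo_demo = (y_demo - y_hat_demo) / (1 - np.diagonal(H_demo))
rmsecv_h = np.sqrt(np.mean(residuals_loo_demo**2))

print(f"RMSECV (refitting) = {rmsecv_refit}")
print(f"RMSECV (H matrix)  = {rmsecv_h}")
```

The two printed values should agree up to numerical round-off.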