# Exercise set 5: Partial Least Squares and training and testing

The main goals of this exercise are to perform Partial Least Squares (PLS) regression and use training and testing sets. Using training and testing sets allows us to assess the model's ability to generalize to unseen data and avoid overfitting. 

**Learning Objectives:**

After completing this exercise set, you will be able to:

- Create a PLS regression model.
- Create and use training and test sets.
- Assess your regression model by calculating root mean squared errors.

**To get the exercise approved, complete the following problems:**

* [5.1(a)](#5.1(a)), [5.1(b)](#5.1(b)), and [5.1(c)](#5.1(c)): To show that you can train a Partial Least Squares regression model and calculate RMSEC (Root Mean Squared Error of Calibration) and RMSEP (Root Mean Squared Error of Prediction).

**Note:**
Exercises [5.1(d)](#5.1(d)), [5.1(e)](#5.1(e)), [5.2(b)](#5.2(b)), [5.3(b)](#5.3(b)), [5.3(c)](#5.3(c)), and [5.3(d)](#5.3(d)) involve programming of a complexity beyond what you are expected to handle independently, particularly under exam time pressure. Understanding the results and the underlying concepts is more important. To help you practice this (without focusing on the technical programming part), you can find partial solutions (without interpretation) in [appendix B](#B.-Partial-answers-to-some-of-the-exercises), specifically:

* [5.1(e)](#5.1(e)): You can use the figure in [appendix B.1](#B.1-Results-for-5.1(e)) to compare the performance of the model for the training and test sets. Make sure that you understand what the figure is showing.

* [5.3(b)](#5.3(b)): The X-scores and PLS X-rotations can be found in the figure in [appendix B.2](#B.2-Results-for-5.3(b)).

* [5.3(c)](#5.3(c)): The Y-scores and PLS Y-rotations can be found in the figure in [appendix B.3](#B.3-Results-for-5.3(c)).

* [5.3(d)](#5.3(d)): The PLS loadings for X and Y can be found in the figure in [appendix B.4](#B.4-Results-for-5.3(d)).

For [5.1(d)](#5.1(d)) and [5.2(b)](#5.2(b)), ensure you understand the process and purpose of cross-validation. One way to check your understanding is to explain it to a classmate.

## Exercise 5.1 Partial Least Squares with training and testing

[Windig and Stephenson](https://doi.org/10.1021/ac00046a015) measured near-infrared spectra
for 140 mixtures of the solvents methylene chloride, 2-butanol, methanol,
dichloropropane, and acetone. Here, we will predict the compositions of the mixtures from the spectra.
Each spectrum was sampled at 700 wavelengths
between 1100 and 2500 nm. The file
[`windig.csv`](windig.csv) contains the raw data:
Each row in this file
contains a spectrum (the columns starting with `wavelength.`) and the
corresponding concentrations (the columns starting with `conc.`).

**The goal of exercise 5.1 is to make a model for predicting the composition of a mixture from its spectrum.**

You can inspect the raw data by:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline
sns.set_theme(style="ticks", context="notebook", palette="colorblind")


data = pd.read_csv("windig.csv")
X = data.filter(like="wavelength", axis=1).values  # NIR spectra
Y = data.filter(like="conc", axis=1).values  # Concentrations
print(f"No. of spectra: {X.shape[0]}")
print(f"No. of wavelengths: {X.shape[1]}")
print(f"No. of concentration samples: {Y.shape[0]}")
print(f"No. of species in each sample: {Y.shape[1]}")

And the individual spectra can be visualised with:

In [None]:
# To plot the spectra:
fig, ax = plt.subplots(constrained_layout=True)
for spectrum in X:
    ax.plot(spectrum)
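# Each spectrum is plotted against the index of the wavelength (0, 1, 2, ...), not the wavelength in nm:
ax.set(xlabel="Wavelength index", ylabel="Absorbance")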
sns.despine(fig=fig)

**Note:** The spectra have been processed so you can use them directly.

### 5.1(a)

To develop and assess your model, you will create and make use of a training and testing data set.

**Explain what the purposes of these two sets are and how they can be created.**

#### Your answer to question 5.1(a): What is the purpose of the training and testing data sets, and how are they created?

*Double click here*

### 5.1(b)

**Split the raw data into a training set and a test set. Use 33% of the data for the test set. How many samples do you have in the training set and the test set?**

**Hint:** With scikit-learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), splitting the data can be done with
```python
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X,
    Y,
    test_size=0.33,  # Use 33 % of the data (one-third) for the test set.
    shuffle=True,  # Randomly shuffle the data
)
```
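
As a small illustration (on made-up data, not the Windig spectra), the sketch below shows how the sizes of the two sets follow from `test_size`, and how the optional `random_state` argument makes the shuffle reproducible. The array sizes, the variable names ending in `_demo`, and the seed are arbitrary choices for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 90 "samples" with 4 features and 2 responses.
X_demo = np.arange(90 * 4).reshape(90, 4)
Y_demo = np.arange(90 * 2).reshape(90, 2)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo,
    Y_demo,
    test_size=0.33,
    shuffle=True,
    random_state=0,  # Fixed seed: the same split every time the cell is run
)
print(f"Training samples: {X_tr.shape[0]}, test samples: {X_te.shape[0]}")

# Repeating the split with the same seed selects the same samples:
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, Y_demo, test_size=0.33, shuffle=True, random_state=0
)
print(np.array_equal(X_te, X_te2))
```

Without `random_state`, a new random split is drawn each time, so the RMSEC and RMSEP values you calculate later can change slightly from run to run.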

In [None]:
# Your code here

#### Your answer to question 5.1(b): How many samples do you have in the training set and the test set?
*Double click here*

### 5.1(c)

**Task: Create a Partial Least Squares (PLS) regression model for predicting the concentrations from the spectra. Use 2 latent variables for the PLS model and evaluate your model by calculating the RMSEC (root mean squared error of calibration) and RMSEP (root mean squared error of prediction) for each of the five concentrations.**

**Hints:**

1.  **Create a PLS regression model:**
    ```python
    from sklearn.cross_decomposition import PLSRegression
    # Set up a PLS model:
    model = PLSRegression(
        n_components=2,  # Use two components (latent variables)
        scale=False,  # Do not scale X and Y (we will do this separately, if needed)
    )
    ```

2.  **Fit the model to the training data:**
    ```python
    model.fit(X_train, Y_train)  # Fit/make the model
    ```

3.  **Calculate the RMSEC (root mean squared error of calibration):**
    * When we use the training set to create our model, we are doing a *calibration*. If we calculate the RMSE (root mean squared error) based on the training set, we refer to this as the RMSEC. This quantifies the error we get in connection with making (calibrating) the model.
    ```python
    from sklearn.metrics import root_mean_squared_error

    y_hat_train = model.predict(X_train)
    rmsec = root_mean_squared_error(
        Y_train, y_hat_train, multioutput="raw_values"
    )  # One RMSEC per concentration (the default is a single averaged value)
    ```

4.  **Calculate the RMSEP (root mean squared error of prediction):**
    * When we use the test set to test our model, we are checking how well our model *predicts* "new" samples (that is, samples not used when making the model). If we calculate RMSE based on the test set, we refer to this as the RMSEP. This quantifies the error we can expect to make when using our model for predicting new samples.
    ```python
    y_hat_test = model.predict(X_test)
    rmsep = root_mean_squared_error(
        Y_test, y_hat_test, multioutput="raw_values"
    )  # One RMSEP per concentration (the default is a single averaged value)
    ```
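
To make clear what "one RMSE for each concentration" means, here is a minimal sketch with made-up numbers. The helper `rmse_per_column` is a hypothetical name introduced only for this illustration; it takes the mean of the squared errors down each column before taking the square root, which should correspond to what `root_mean_squared_error` returns when called with `multioutput="raw_values"`.

```python
import numpy as np


def rmse_per_column(y_true, y_pred):
    """Return one RMSE per column (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Average the squared errors over the samples (rows), separately per column:
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))


# Made-up "measured" and "predicted" values for two responses:
y_measured = np.array([[1.0, 2.0], [3.0, 4.0]])
y_predicted = np.array([[1.0, 0.0], [3.0, 0.0]])

rmse_columns = rmse_per_column(y_measured, y_predicted)
print(rmse_columns)  # The first column is predicted exactly, the second is not
```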

In [None]:
# Your code here

#### Your answer to question 5.1(c): What values did you get for RMSEC and RMSEP?
*Double click here*

### 5.1(d)

**Task: Optimize the number of PLS components by performing cross-validation on a grid where you vary the number of components. Calculate RMSEC, RMSECV (root mean squared error of cross-validation), and RMSEP for your new model. Report the optimal number of components.**

**Hints:**

1. **Optimize the number of components by using cross-validation on a grid of possible parameters, for instance, by using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) from scikit-learn:**
```python
from sklearn.model_selection import GridSearchCV

parameters = {"n_components": range(1, 11)}  # Test 1 through 10 components
# Set up a search over the parameter space:
grid_search = GridSearchCV(
    PLSRegression(scale=False),  # The base model
    parameters,  # The parameters we will consider,
    cv=5,  # The number of splits for the cross-validation
    scoring="neg_mean_squared_error",  # How we score how well the model is performing
    refit=True,  # Refit using the best-found parameters on the whole training set.
)
# Run the cross-validation
grid_search.fit(X_train, Y_train)
# Get the best number of components:
best_components = grid_search.best_params_["n_components"]
# Get the best-performing model:
best_model = grid_search.best_estimator_
# Get the results per parameter considered:
mean_score = grid_search.cv_results_["mean_test_score"]  # The mean score
error_score = grid_search.cv_results_["std_test_score"]  # The uncertainty in the score
# These two can be plotted as a function of the number of components considered;
# this can help us see the best number of components while considering the error.
```

2. **Recalculate RMSECV using the optimized model, for instance, by using [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) from scikit-learn:**
```python
from sklearn.model_selection import cross_val_score

cvscore = cross_val_score(
    best_model,  # The model we are going to score (the optimized model from the grid search)
    X_train,  # Give the X-training set
    Y_train,  # Give the y-training set
    scoring="neg_mean_squared_error",  # select scoring method
    cv=5,  # Number of splits to make
)
cvscore = np.sqrt(-cvscore)  # Account for the negative sign.
rmsecv = cvscore.mean()
rmsecv_std = np.std(cvscore)
print(f"\nRMSECV: {rmsecv} ± {rmsecv_std}")
```

**Note:** We use a *negative* mean squared error for the grid search and for calculating RMSECV. This is because the methods `cross_val_score` and `GridSearchCV` are often used in connection with optimization where we want to *maximize* something. If we *maximize the negative* of the mean squared error, we can *minimize the error*.

In [None]:
# Your code here

#### Your answer to question 5.1(d): What is the optimal number of components?
*Double click here*

### 5.1(e)

**Task: Show the results for the training data and the testing data graphically by plotting the predicted vs. the observed values for all 5 chemical components for the optimized model. Include RMSEC, R² for the training set, RMSEP and R² for the test set as labels in your figure (calculate these for each component separately). Further, report the metrics you calculated in a table.**

**Hint:** The R² can be calculated using:
```python
from sklearn.metrics import r2_score

component_number = 1  # Index of the column (one chemical component) to select from Y
# y_predicted_train and y_predicted_test are the predictions from your optimized model.
r_squared_train = r2_score(Y_train[:, component_number], y_predicted_train[:, component_number])
r_squared_test = r2_score(Y_test[:, component_number], y_predicted_test[:, component_number])
```
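
If you prefer to get R² for all columns in one call, `r2_score` also accepts `multioutput="raw_values"`. The sketch below uses made-up values and placeholder component names ("A", "B", "C"), and collects the result in a pandas table; it is meant as an illustration of the mechanics, not as the solution for the Windig data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

# Made-up measured and predicted values for three responses:
rng = np.random.default_rng(1)
y_measured = rng.normal(size=(50, 3))
y_predicted = y_measured + 0.1 * rng.normal(size=(50, 3))

# One R² per column instead of a single averaged value:
r_squared = r2_score(y_measured, y_predicted, multioutput="raw_values")

table = pd.DataFrame(
    {
        "Component": ["A", "B", "C"],  # Placeholder names
        "R²": r_squared,
    }
)
print(table.to_string(index=False))
```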

In [None]:
# Your code here

#### Your answer to question 5.1(e): Report the metrics you found:
*Double click here*

| Component                        | RMSEC | RMSEP | R² (train) | R² (test) |
|----------------------------------|-------|-------|------------|-----------|
| methylene chloride (component 1) |       |       |            |           |
| 2-butanol (component 2)          |       |       |            |           |
| methanol (component 3)           |       |       |            |           |
| dichloropropane (component 4)    |       |       |            |           |
| acetone (component 5)            |       |       |            |           |

## Exercise 5.2 Use of cross-validation when we have few samples

It is not always feasible to do the split into training and test sets when we have few samples. Another option then is to use **leave-one-out cross-validation** (LOOCV). LOOCV involves training the model on all but one data point and using the remaining point for testing, repeating this process for each data point. We will use this method in this exercise.

We will use the data of [Forbes](https://doi.org/10.1017/S0080456800032075) who investigated the
relationship between the boiling point of water and the atmospheric pressure, and collected data in the Alps and Scotland. Forbes' goal was to estimate altitudes from the boiling point alone.

### 5.2(a)

**Task: Load the data from Forbes (data file [forbes.csv](forbes.csv)), plot it, and create a linear regression model
that predicts the atmospheric pressure from the temperature. Report the R² and [root mean
squared error (RMSE)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html) for your model.**

In [None]:
# Your code here

#### Your answer to question 5.2(a): What value did you get for R² and the RMSE?
*Double click here*

### 5.2(b)

**Task: Estimate the error you can expect to make if you use your model for predicting the pressure.
Do this by LOOCV and calculate the root mean squared error of cross-validation (RMSECV).**

**Note:** LOOCV is a special case of **training** and **testing**, and you can find a short description of it
in [appendix A](#A.-Leave-one-out-cross-validation) with example code for running LOOCV. The code example for LOOCV is concise, so make sure you understand what goes on here (that is, what LOOCV is doing). If you are working with someone, try explaining testing/training and how LOOCV works to them.

In [None]:
# Your code here

#### Your answer to question 5.2(b): What value did you get for RMSECV?
*Double click here*

## Exercise 5.3 Partial Least Squares and interpretation of scores and loadings

The file [elements.csv](elements.csv) contains information about the elements of the periodic table. This dataset includes various physical and chemical properties, allowing us to explore the underlying relationships between these elements. The columns in the file are as follows:


| **Column**                      | **Description**                                         | **Unit** |
|:--------------------------------|:--------------------------------------------------------|:---------|
| name                            | The name of the element                                 |          |
| symbol                          | The symbol for the element (e.g. H, He, etc.)           |          |
| atomic_radius                   | Atomic radius                                           | Å        |
| atomic_weight                   | Atomic weight                                           | u        |
| covalent_radius                 | Covalent radius                                         | pm       |
| density                         | Density at 295 K                                        | g/cm³    |
| dipole_polarizability           | Dipole polarizability                                   | bohr³    |
| electrons                       | The number of electrons in the element                  |          |
| mass_number                     | Mass number of the most abundant isotope                |          |
| neutrons                        | The number of neutrons in the element                   |          |
| protons                         | The number of protons in the element                    |          |
| zeff                            | Effective nuclear charge                                |          |
| vdw_radius                      | Van der Waals radius                                    | pm       |
| first_ionization                | First ionization energy                                 | eV       |
| electronegativity allred-rochow | Allred and Rochow’s scale of electronegativity          | e²/pm²   |
| electronegativity gordy         | Gordy’s scale of electronegativity                      | e/pm     | 
| atomic_radius_wikipedia         | Atomic radius from [Wikipedia](https://en.wikipedia.org/wiki/Atomic_radii_of_the_elements_(data_page)) | pm |

### 5.3(a)

**Task: Create a PLS regression model for predicting `first_ionization`, `density`, `protons`, and `atomic_radius` from the other variables.**

**Notes:**

1. Remove the non-numeric columns like 'name' and 'symbol' before creating the PLS model.

2. Use two components for the PLS regression model. Do not do a split into a training and test set in this exercise (this is not so crucial here since we will focus on interpreting scores and loadings).

3. Consider if you should normalize (scale) the data using a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) before performing the PLS.
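
To help with the last point: PLS looks for directions with large covariance, so variables with numerically large variance (for instance, because of their unit) can dominate unless the data are scaled. The sketch below uses made-up radii in two different units to show what `StandardScaler` does to each column; the numbers and variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the same kind of quantity in two very different units
# (column 0 in Å, column 1 in pm), so the raw variances differ a lot.
radius_angstrom = np.array([0.5, 1.0, 1.5, 2.0])
radius_pm = np.array([40.0, 110.0, 140.0, 210.0])
X_demo = np.column_stack((radius_angstrom, radius_pm))
print("Variance before scaling:", X_demo.var(axis=0))

# Subtract the mean and divide by the standard deviation, column by column:
X_scaled = StandardScaler().fit_transform(X_demo)
print("Mean after scaling:", X_scaled.mean(axis=0))
print("Variance after scaling:", X_scaled.var(axis=0))
```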

In [None]:
# Your code here

### 5.3(b)

**Task: Inspect the scores and rotations for X by creating 2D scatter plots. Are there any trends/groupings or outliers in the scores? If groups/trends are present, which of the original X-variables can be used to interpret them?**

**Hints:**

1. Assuming that `pls` is the fitted object containing the PLS model, and `X` is our raw data, we can get the scores and rotations by:
```python
x_scores = pls.transform(X)
x_rotations = pls.x_rotations_
```

2. For the scatter plots, you have two options: plot the scores and rotations for the two PLS components in two different plots, or in the same plot (as a biplot). The biplot can sometimes help interpretation, but can be crowded if there are many samples and features.


**Notes:**

1. We use the rotations here instead of the loadings (or weights). This is because the X-rotations apply directly to the X data to create the scores. Thus, the rotations tell us more directly the relation between the original variables and the calculated scores.

In [None]:
# Your code here

#### Your answer to question 5.3(b): Are there any trends/groupings or outliers in the scores and what variables can be used to interpret them?

*Double click here*

### 5.3(c)

**Task: Inspect the scores and rotations for Y by creating 2D scatter plots. Are there any trends/groupings or outliers in the scores? If groups/trends are present, which of the original Y-variables can be used to interpret them?**

**Hints:**

1. Assuming that `pls` is the fitted object containing the PLS model, and `X` and `Y` are our raw data, we can get the scores and rotations by:
```python
x_scores, y_scores = pls.transform(X, y=Y)
y_rotations = pls.y_rotations_
```

2. For the scatter plots, you have two options: plot the scores and rotations for the two PLS components in two different plots, or in the same plot (as a biplot). The biplot can sometimes help interpretation, but can be crowded if there are many samples and features.


**Notes:**

1. We use the rotations here instead of the loadings (or weights). This is because the Y-rotations apply directly to the Y data to create the scores. Thus, the rotations tell us more directly the relation between the original variables and the calculated scores.

In [None]:
# Your code here

#### Your answer to question 5.3(c): Are there any trends/groupings or outliers in the scores and what variables can be used to interpret them?

*Double click here*

### 5.3(d)

**Task: Inspect the X-rotations and Y-loadings together. Which of the X-variables could be important for predicting the different Y-variables? Answer this by exploring the correlations between the X-variables and the Y-variables.**

**Hints:** 

1. Assuming that `pls` is the fitted object containing the PLS model, we can get the loadings for Y by:
```python
y_loadings = pls.y_loadings_
```

**Notes:**

1. We use the X-rotations ($R$) and Y-loadings ($Q$) because they help us understand the relationships between the X-variables and the Y-variables in the PLS model. The PLS model $Y = XB = XRQ^T$ shows that the regression coefficients ($B$) can be decomposed into the product of $R$ and $Q^T$. Therefore, plotting $R$ and $Q$ together helps us investigate correlations captured by the PLS model.

In [None]:
# Your code here

#### Your answer to question 5.3(d): What X-variables seem important for predicting Y and what variables are correlated?
*Double click here*

# Appendix

## A. Leave-one-out cross-validation

In Leave-one-out cross-validation (LOOCV), we first pick one sample,
measurement number $j$, and we fit the model using the $n-1$ other points
(all points except $j$). After the fitting, we check how well the model can predict
measurement $j$ by calculating the difference between the
measured ($y_j$) and predicted ($\tilde{y}_j$) value. This difference, $r_j = y_{j} - \tilde{y}_j$, is
called the predicted residual, and it tells us the error we just made.

There is nothing special about picking point $j$, and we can try all possibilities
of leaving one point out, fitting the model using the remaining $n-1$
measurements, and predicting the value we left out.
After doing this for all possibilities, we have fitted the model
$n$ times and calculated $n$ predicted residuals. The mean squared error (obtained from the squared
residuals), $\mathrm{MSE}_{\mathrm{CV}}$, can then be used
to estimate the error in the model,

\begin{equation}
\mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n} \sum_{i=1}^{n} r_i^2 =  \frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2,
\end{equation}

where $y_i$ is the measured $y$ in experiment $i$, and $\tilde{y}_i$ is the
predicted $y$, using a model which was fitted using all points *except* point $i$.

For polynomial fitting, there is an alternative to refitting the model $n$ times. In fact,
we can show that for polynomial fitting, the mean squared error can
be obtained from

\begin{equation}
\mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 =
\frac{1}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2,
\end{equation}

where the $\hat{y}_i$'s are predicted values using the
model fitted with *all data points*,
and $h_{ii}$ is the $i$'th diagonal element of the
$\mathbf{H}$ matrix (the projection matrix,
see Eq.(4.49) on page 49 in our textbook),

\begin{equation}
\mathbf{H} =
\mathbf{X} 
\left( 
  \mathbf{X}^\mathrm{T} \mathbf{X}
\right)^{-1}
\mathbf{X}^\mathrm{T} = \mathbf{X} \mathbf{X}^+.
\end{equation}

Here, $\mathbf{X}^+$ denotes the pseudoinverse of $\mathbf{X}$.
Note the difference between $\hat{y}_i$ and $\tilde{y}_i$, and the
fact that we do not have to refit the model
to obtain the $\mathrm{MSE}_{\mathrm{CV}}$.

When you calculate $\mathrm{MSE}_{\mathrm{CV}}$, use one of the two approaches above or both
if you want to see if they give the same answer.

In [None]:
# The examples below assume that the matrix X is called X_temp
# and that y is stored in the variable pressure.

# Example 1 of LOOCV:
from sklearn.linear_model import LinearRegression

# scikit-learn has a method to pick out samples for leave-one-out:
from sklearn.model_selection import LeaveOneOut


loo = LeaveOneOut()
error = []
# Split the X-data in X_temp into training and testing:
for train_index, test_index in loo.split(X_temp):
    # train_index = index of samples to use for training
    # test_index = index of samples to use for testing
    # Pick out samples (for training and testing):
    X_train, X_test = X_temp[train_index], X_temp[test_index]
    y_train, y_test = pressure[train_index], pressure[test_index]
    # Fit a new model with the training set:
    model = LinearRegression(fit_intercept=True).fit(X_train, y_train)
    # Predict y for the test set:
    y_hat = model.predict(X_test)
    # Compare the predicted y values in the test set with the measured ones:
    error.append((y_test - y_hat) ** 2)
rmsecv_1 = np.sqrt(np.mean(error))
print(f"RMSECV = {rmsecv_1}")

In [None]:
# Example 2 of LOOCV:

# scikit-learn has a method for leave-one-out selection, and a method for
# cross-validation. And these two can be combined:
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Create "empty" model for fitting:
model = LinearRegression(fit_intercept=True)
# Run cross-validation, where we select testing and training with LeaveOneOut:
scores = cross_val_score(
    model, X_temp, pressure, scoring="neg_mean_squared_error", cv=LeaveOneOut()
)
rmsecv_2 = np.sqrt(np.mean(-scores))
print(f"RMSECV = {rmsecv_2}")

In [None]:
# Example 3 of LOOCV:

# We calculate the H matrix and use that.
# Note! A detail that is easy to miss: the X used for H includes the column of ones!
# Here, temperature is the 1D array of temperatures (the single column in X_temp).
X_matrix = np.column_stack((np.ones_like(temperature), temperature))
H = X_matrix @ np.linalg.pinv(X_matrix)
hii = np.diagonal(H)
# Predictions from the model fitted with all data points (y_hat = H y):
pressure_hat = H @ pressure
residuals_loo = (pressure - pressure_hat) / (1 - hii)
rmsecv_3 = np.sqrt(np.mean(residuals_loo**2))
print(f"RMSECV = {rmsecv_3}")

## B. Partial answers to some of the exercises

### B.1 Results for 5.1(e)

![Exercise 5.1(e)](5.1.e.png)

|    | Component          |    RMSEC |    RMSEP |   R² (train) |   R² (test) |
|---:|:-------------------|---------:|---------:|-------------:|------------:|
|  0 | methylene chloride | 0.774702 | 0.913428 |     0.996022 |    0.994106 |
|  1 | 2-butanol          | 0.56474  | 0.666846 |     0.997418 |    0.997723 |
|  2 | methanol           | 0.686723 | 0.871563 |     0.996762 |    0.994905 |
|  3 | dichloropropane    | 0.829081 | 0.954528 |     0.99573  |    0.992305 |
|  4 | acetone            | 0.416455 | 0.541473 |     0.998559 |    0.99849  |

### B.2 Results for 5.3(b)

![Exercise 5.3(b)](5.3.b.png)

**Note:** The elements have been colored according to their period to help the interpretation. The text labels for `electrons`, `neutrons`, `atomic_weight`, and `mass_number` are overlapping, making it difficult to see the individual labels. This is because their rotations are close together.

### B.3 Results for 5.3(c)

![Exercise 5.3(c)](5.3.c.png)

**Note:** The elements have been colored according to their period to help the interpretation.

### B.4 Results for 5.3(d)

![Exercise 5.3(d)](5.3.d.png)

**Note:** In the plot, the text labels for `electrons`, `neutrons`, `atomic_weight`, and `mass_number` are overlapping, making it difficult to see the individual labels. This is because their rotations are close together.

## Your feedback for Exercise 5

1. **Time & Difficulty:**
* Length (1=too short, 5=too long): 1  2  3  4  5
* Difficulty (1=too easy, 5=too difficult): 1  2  3  4  5
* Most challenging part: ________________________

2. **Code Examples:**
* More or less example code?  More  Less  About Right
* Areas where more examples would be helpful: ________________________

3. **Errors/Inconsistencies:** Did you encounter any?  Yes  No  If yes, please describe: ________________________
    
4. **Suggestions:** How could this exercise be improved? ________________________