# Exercise set 9

> *  The goal of the first part of the exercise is to become familiar with partial least
> squares regression. To do this, we will build
> a model that predicts the concentrations in a mixture from near-infrared spectra.
> 
> *  In the second part of the exercise, you will perform a principal component
> analysis (PCA) of gene expression data. The goal
> of this part is to gain more experience with PCA, and in particular
> with interpreting PCA results.

## Exercise 9.1

[Windig and Stephenson](https://doi.org/10.1021/ac00046a015) have measured near-infrared spectra
for 140 mixtures of the solvents methylene chloride, 2-butanol, methanol,
dichloropropane, and acetone. In this exercise, we will see whether we can
predict the compositions of the mixtures from the spectra.
Each of the $140$ spectra has been sampled at $700$ wavelengths
between $1100$ and $2500$ nm. The raw data containing the spectra
and the corresponding concentrations can be found in the file
[`Data/windig.csv`](Data/windig.csv).

The raw data for this exercise can
be loaded as shown below.
Each row in the $\mathbf{X}$ matrix
contains a spectrum (700 intensities -- each column is
a specific wavelength) and each row in the
$\mathbf{Y}$ matrix contains the measured concentrations
of the solvents in the order given above.

In [None]:
import pandas as pd

data = pd.read_csv("Data/windig.csv")
X = data.filter(like="data", axis=1).values  # NIR spectra
Y = data.filter(like="concentrations", axis=1).values  # Concentrations
print(f"No. of spectra: {X.shape[0]}")
print(f"No. of wavelengths: {X.shape[1]}")
print(f"No. of concentration samples: {Y.shape[0]}")
print(f"No. of species in each sample: {Y.shape[1]}")

**(a)**  Create a partial least squares regression (PLSR) model for predicting
the concentrations. Use 1 PLS component for your first model and
assess it using $R^2$, RMSEC, RMSECV and RMSEP. An example
of how this can be done is given below.

These values (RMSEC, RMSECV, and RMSEP) are all based on calculating the
root mean squared error (RMSE) given by,

\begin{equation}
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2},
\tag{1}\end{equation}

where $n$ is the number of samples, $y_i$ are our measured $y$-values, and $\hat{y}_i$ are the
values predicted by our model. The difference between RMSEC,
RMSEP, and RMSECV lies in the part of the data we use to
calculate them. This is based on first splitting the data into
a *training* and *test* set, and then
performing what we call
*cross-validation* using
the training set:


*  When we use the training set to create our model, we are doing
   a *calibration*. If we calculate RMSE based on using
   the training set, we refer to this as the RMSEC (root mean squared
   error of calibration).
   This number
   quantifies the error we get in connection with making (calibrating)
   the model.


*  When we use the test set to test our model, we are
   checking how well our model *predicts* "new" samples
   (that is, samples that were not used when making the model). If
   we calculate RMSE based on the test set, we refer to
   this as the RMSEP (root mean squared error of prediction). This
   number quantifies the error we can expect to make when using
   our model for predicting samples that were not used when making
   the model.


*  Cross-validation is based on doing a further split
   of the training set. Typically, we split the training set into
   $k$ smaller subsamples and we repeat the fitting of the
   model $k$ times.
   For each time we repeat the fitting, we retain a single
   subsample for validation, and we fit the model using the
   $k-1$ other subsamples. For the subsample we retained
   for validation, we can calculate the RMSE value of how
   well this is predicted by our model. Since we repeat this $k$ times,
   we can make it so that each of the $k$ subsamples is
   used exactly once for validation. Finally, we can
   obtain the average RMSE of the $k$ fittings and we
   refer to this value as the RMSECV (root mean squared error
   of cross-validation). This number indicates how well
   our model predicts samples that come from the calibration
   data but were left out of the fit. In addition to obtaining
   the average, we can also obtain the standard deviation of the $k$ RMSE values.
   We can use this to see how sensitive the model
   is to the composition of the training set.
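
The three bullet points above can be sketched by hand before turning to library routines. The block below is a minimal illustration, not part of the original exercise: the synthetic data (`X_train_toy`, `y_train_toy`), the helper `rmse`, and the use of an ordinary least squares fit in place of PLSR are all assumptions made to keep the example small. The toy data plays the role of the training set, which is split into $k = 5$ subsamples.

```python
import numpy as np


def rmse(y, y_hat):
    """Root mean squared error, following equation (1)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))


rng = np.random.default_rng(0)

# Hypothetical training set: 50 samples, 3 variables, linear response + noise.
X_train_toy = rng.normal(size=(50, 3))
y_train_toy = X_train_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(
    scale=0.1, size=50
)

k = 5
# Shuffle the sample indices and split them into k subsamples (folds):
folds = np.array_split(rng.permutation(len(y_train_toy)), k)

fold_rmse = []
for i in range(k):
    val_idx = folds[i]  # Subsample retained for validation
    fit_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit a least squares model on the k-1 other subsamples:
    coef, *_ = np.linalg.lstsq(
        X_train_toy[fit_idx], y_train_toy[fit_idx], rcond=None
    )
    # RMSE for the subsample that was left out of the fit:
    fold_rmse.append(rmse(y_train_toy[val_idx], X_train_toy[val_idx] @ coef))

fold_rmse = np.array(fold_rmse)
print(f"RMSECV: {fold_rmse.mean():.3f} ± {fold_rmse.std():.3f}")
```

Each sample ends up in the validation subsample exactly once, and the reported value is the average of the $k$ RMSE values together with their standard deviation.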
   

Luckily, methods for splitting our data into training and test
sets, calculating RMSE, and doing cross-validation are already
available in scikit-learn. There is also
a nice illustration of cross-validation in the
[scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html):

![cross](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

In [None]:
"""Example for RMSEC, RMSEP and RMSECV"""
import numpy as np
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
)
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes

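# Load example data. We use separate names here so that we do not
# overwrite the X and Y loaded from Data/windig.csv:
X_example, y_example = load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X_example,
    y_example,
    test_size=0.2,  # Use 20 % of the data for the test set.
    shuffle=True,  # Randomly shuffle the data
)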

# Make a model using the training set:
# PLS regression with one PLS component/latent variable:
model = PLSRegression(n_components=1)
model.fit(X_train, y_train)
# Predict using the training set:
y_hat_train = model.predict(X_train)
# Predict using the test set:
y_hat_test = model.predict(X_test)

# For training set: calculate R^2 and RMSEC:
r2_train = r2_score(y_train, y_hat_train)
rmsec = np.sqrt(mean_squared_error(y_train, y_hat_train))
print(f"R² (training): {r2_train}")
print(f"RMSEC: {rmsec}")

# For test set: calculate R^2 and RMSEP:
r2_test = r2_score(y_test, y_hat_test)
rmsep = np.sqrt(mean_squared_error(y_test, y_hat_test))
print(f"\nR² (test): {r2_test}")
print(f"RMSEP: {rmsep}")

# Run cross-validation:
cvscore = cross_val_score(
    model,  # Select the model we are going to score
    X_train,  # Give the X-training set
    y_train,  # Give the y-training set
    scoring="neg_mean_squared_error",  # select scoring method
    cv=5,  # Number of splits to make
)

# Note: the scoring is here "neg_mean_squared_error".
# This is the negative of the MSE!
# The cross_val_score method is often used in
# connection with optimization where we would like to
# maximize something, and the score can be used to pick
# the best value. Since we usually do not want to
# maximize the error, this method is made so that it
# calculates the negative of the error.

cvscore = np.sqrt(-cvscore)  # Account for the negative sign.
rmsecv = cvscore.mean()
rmsecv_std = np.std(cvscore)
print(f"\nRMSECV: {rmsecv} ± {rmsecv_std}")

In [None]:
# Your code here

**Your answer to question 9.1(a):** *Double click here*

**(b)** Improve your PLSR model by including more
PLS components. Try components in the
range from 2 up to 15 and compare the different models. How many
PLS components are you satisfied with? In the following, we will refer
to the model you are most satisfied with as "model A".

In [None]:
# Your code here

**Your answer to question 9.1(b):** *Double click here*

**(c)**  Plot the regression coefficients for model A (see point **(b)**).



In [None]:
# Your code here

**Your answer to question 9.1(c):** *Double click here*

**(d)**  If you are given a new spectrum of a mixture of methylene chloride,
2-butanol, methanol, dichloropropane, and acetone, how well would
your model A predict the concentrations of the different solvents
in the mixture?

In [None]:
# Your code here

**Your answer to question 9.1(d):** *Double click here*

**(e)**  Create a least squares model for predicting the concentrations.
Assess it using $R^2$, RMSEC, RMSECV and RMSEP. Does this model
perform as you expect?

In [None]:
# Your code here

**Your answer to question 9.1(e):** *Double click here*

## Exercise 9.2

[Schummer *et al.*](https://doi.org/10.1016/S0378-1119(99)00342-X) studied ovarian cancer by measuring gene expression
values for $1536$ genes in both normal and tumor tissues. One of their goals was
to find genes that were overexpressed in tumor samples compared with normal samples.
This knowledge may be used for tumor diagnosis and we will here see if we
can find such genes by performing a PCA. The raw data can be
found in the file [`Data/ovo.csv`](Data/ovo.csv) and can be loaded as
follows:

In [None]:
"""Load the data set."""
import pandas as pd

data = pd.read_csv("Data/ovo.csv")
classes = data["objlabels"]  # Classification of samples.
X = data.filter(like="X.", axis=1)  # Gene expressions for samples.

Each row in the matrix
$\mathbf{X}$ contains the gene expression (for 1536 genes) for a
tissue sample. Each column corresponds to a specific gene.
The classification of tissue as normal ("N") or cancer ("C") can
be found in the variable "classes" defined above.

**(a)**   Perform a principal component analysis (PCA) on the gene expression data,
and obtain the explained variance when using 1, 2, 5, and 10
components.

Center the data before performing the PCA. This can be
done as follows with the `scale` function
from `sklearn.preprocessing`: `X = scale(X, with_std=False)`.
Here, all the variables are in the same units, so we do not need
to scale the variance (we set `with_std=False`).
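
As an illustration of the centering step and of how the explained variance accumulates with the number of components, here is a small sketch. It uses hypothetical random data (`X_toy`, 30 samples and 50 variables) instead of the gene expression data, so the numbers it prints say nothing about the actual exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Hypothetical data: 30 samples, 50 variables with decreasing spread.
X_toy = rng.normal(size=(30, 50)) * np.linspace(5.0, 0.5, 50)

# Center each column (variable), but do not scale the variance:
X_centered = scale(X_toy, with_std=False)

pca = PCA()  # Keep all components
pca.fit(X_centered)

# Cumulative fraction of the variance explained by the first n components:
cumulative = np.cumsum(pca.explained_variance_ratio_)
for n in (1, 2, 5, 10):
    print(f"{n:2d} component(s): {cumulative[n - 1]:.1%} of the variance")
```

The attribute `explained_variance_ratio_` gives the fraction explained by each component separately, so the cumulative sum is what answers "how much is explained when using $n$ components".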

In [None]:
# Your code here

**Your answer to question 9.2(a):** *Double click here*

**(b)** Inspect the data by plotting the scores and loadings for
principal component
number 1 and principal component number 2:


* (i)  Can you observe any clustering
  of the samples? Here, it is helpful to color the samples
  according to their classification as normal or cancer.


* (ii) Are there any outliers among the samples?


* (iii) Can you identify some
  genes which are overexpressed in tumors? 


* (iv) Can you identify some
  genes which are underexpressed in tumors? 
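
To show how scores and loadings relate to each other, the sketch below uses a small hypothetical data set (`X_toy`, with made-up class labels `labels`) in which variable 0 is raised and variable 1 is lowered in the "C" samples. It is only meant to show the mechanics. Note that the sign of a principal component is arbitrary, so the loadings are oriented here such that the "C" samples have the higher average score along the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical data: 10 "N" and 10 "C" samples, 8 variables.
labels = np.array(["N"] * 10 + ["C"] * 10)
X_toy = rng.normal(size=(20, 8))
X_toy[labels == "C", 0] += 5.0  # Variable 0 is higher in "C" samples
X_toy[labels == "C", 1] -= 5.0  # Variable 1 is lower in "C" samples
X_centered = X_toy - X_toy.mean(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_centered)  # Shape: (samples, components)
loadings = pca.components_.T  # Shape: (variables, components)

# Orient PC1 so that "C" samples have the higher average score:
mean_c = scores[labels == "C", 0].mean()
mean_n = scores[labels == "N", 0].mean()
sign = 1.0 if mean_c > mean_n else -1.0
oriented = sign * loadings[:, 0]

# Variables sorted from most negative to most positive oriented loading:
order = np.argsort(oriented)
print("Most positive loading (higher in C): variable", order[-1])
print("Most negative loading (lower in C): variable", order[0])
```

Variables with large positive oriented loadings pull samples toward the "C" side of the score plot, and variables with large negative loadings pull them toward the "N" side.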

In [None]:
# Your code here

**Your answer to question 9.2(b):** *Double click here*

**(c)**  Based on your answer in **(b)**, can
you identify some pairs of genes that seem to distinguish between
normal and tumor tissues? Support your findings by plotting the raw data.

In [None]:
# Your code here

**Your answer to question 9.2(c):** *Double click here*