**Exercise set 7**
==============

> The goal of this exercise is to run PCA and PLSR on a real
data set in order to show how these methods can be used in practice.
We consider data published by
[Platikanov et al.](https://doi.org/10.1016/j.watres.2012.10.040) and aim to
reproduce some of the results found in their work. You may find it useful to
inspect this article when interpreting your results.

**Exercise 7.1**

In this exercise, we will consider two data sets: one set contains data for bottled mineral water
(we will call this "data set 1"),
and the other set contains data for tap water and bottled water (we will call this "data set 2").

The different water samples have been blind-tasted
by trained panelists, who have rated the different water samples
according to their liking. Also, the chemical compositions of the
samples have been determined. All measured quantities are given
in Table 1.

Our end goal is to predict the ratings (the "mean liking" in
Table 1) of the water samples, given
measurements of the chemical composition. We will define this rating as the
$y$-variable we are going to predict.

|Variable                     | Unit       |
|:----------------------------|-----------:|
|Conductivity                 | $\mu$S/cm  |
|TDS (total dissolved solids) | mg/L       |
|Cl$^{-}$                     | mg/L       |
|SO$_4^{2-}$                  | mg/L       |
|NO$_3^-$                     | mg/L       |
|HCO$_3^-$                    | mg/L       |
|Ca$^{2+}$                    | mg/L       |
|Mg$^{2+}$                    | mg/L       |
|Na$^{+}$                     | mg/L       |
|K$^{+}$                      | mg/L       |
|pH                           |            |
|Si                           | mg/L       |
|Cl$_2$                       | mg/L       |
|Mean liking                  |            |

**Table 1:** *Data columns present in the data sets: [Data/table1.csv](Data/table1.csv) and [Data/table2.csv](Data/table2.csv)*

**(a)** Begin by exploring the raw data. The [original article](https://doi.org/10.1016/j.watres.2012.10.040)
      shows correlation heat maps in Fig. $1$.
      Create such correlation maps yourself for data sets 1 and 2
      and compare them with Fig. $1$ in the article. Does the mean liking
      seem to be correlated with some of the variables?
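
As a starting point, a correlation heat map can be sketched as below. This is only an illustration under stated assumptions: the data frame is filled with random placeholder numbers (so that the block runs on its own), the column subset is arbitrary, and the layout of the CSV files (separator, header row) is not known here, so the `pd.read_csv` call is left as a comment that you will have to adapt.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data (random numbers, NOT the real measurements).
# Replace with something like: data = pd.read_csv("Data/table1.csv")
# after checking the separator and header of the file.
rng = np.random.default_rng(0)
columns = ["Conductivity", "TDS", "Cl-", "Na+", "Mean liking"]
data = pd.DataFrame(rng.normal(size=(20, len(columns))), columns=columns)

# Pearson correlation coefficient between all pairs of columns.
corr = data.corr()

# Show the correlation matrix as a heat map with a fixed [-1, 1] color scale.
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(columns)))
ax.set_xticklabels(columns, rotation=90)
ax.set_yticks(range(len(columns)))
ax.set_yticklabels(columns)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

The row (or column) labelled "Mean liking" is the one to inspect when judging which variables correlate with the rating.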

In [None]:
# Your code here

**Your answer to question 7.1(a):** *Double click here*

**(b)** We will further explore the raw data with PCA. Perform a principal
      component analysis for data set 1 and for data set 2. How much of the variance is
      explained by the first two principal components? Also, plot the scores
      and the loadings for principal components 1 and 2, and for principal components 1 and 3.
      Do any of the variables seem to be correlated?
      Which variables seem to influence the mean liking most
      for the two data sets?
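
One possible way to obtain the explained variance, the scores, and the loadings is sketched below with `sklearn.decomposition.PCA`. The matrix `X_demo` is a random placeholder (an assumption made so the block runs on its own); in the exercise it would hold the columns of the data set you decide to include, and whether to include the mean liking as a column is left to you.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Placeholder data (random numbers, NOT the real measurements).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))

# Standardize the columns so that variables with large units do not dominate.
X_scaled = scale(X_demo)

pca = PCA(n_components=3)
scores = pca.fit_transform(X_scaled)  # shape (n_samples, 3)
loadings = pca.components_.T          # shape (n_variables, 3)

# Fraction of the total variance explained by each principal component.
explained = pca.explained_variance_ratio_
print("Variance explained by PC1 + PC2:", explained[:2].sum())
```

The columns `scores[:, 0]` and `scores[:, 1]` can then be plotted against each other for the PC1/PC2 score plot, and likewise the columns of `loadings` for the loading plot.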

In [None]:
# Your code here

**Your answer to question 7.1(b):** *Double click here*

**(c)** Let us start the modeling by creating a linear least-squares model:

* (i) Create linear least-squares models for the two data sets in which you use all available data in the fitting. Calculate $R^2$ and the root mean squared error (RMSE) of your models. Comment on the values you have obtained.
          

* (ii)  Evaluate the root mean squared error of cross-validation (RMSECV) for your two models. Compare the RMSECV values with the previously obtained RMSE values.
          
  
* (iii) Using the two models you have created, what variables seem to be most important for predicting a high mean liking? Here, you can inspect the regression coefficients (assuming that you have scaled the variables). If you were to create a new brand of bottled water, what chemical components would you focus on to maximize the mean liking?
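
The quantities asked for in (i) and (ii) could be computed along the following lines. This is a sketch, not a reference solution: the data are synthetic placeholders, and leave-one-out cross-validation is an assumption made here for illustration (check the article for the scheme used there).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Placeholder data (synthetic, NOT the real measurements).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
true_coef = np.array([1.0, -0.5, 0.0, 0.2, 0.0])
y_demo = X_demo @ true_coef + 0.3 * rng.normal(size=20)

# (i) Fit on all data and evaluate on the same data.
model = LinearRegression().fit(X_demo, y_demo)
y_fit = model.predict(X_demo)
r2 = r2_score(y_demo, y_fit)
rmse = np.sqrt(mean_squared_error(y_demo, y_fit))

# (ii) Leave-one-out cross-validation: each sample is predicted by a
# model that was fitted without that sample.
y_cv = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=LeaveOneOut())
rmsecv = np.sqrt(mean_squared_error(y_demo, y_cv))

print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}, RMSECV = {rmsecv:.3f}")
```

For (iii), `model.coef_` holds the regression coefficients; they are only comparable across variables if the columns of the input matrix have been scaled first.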

In [None]:
# Your code here

**Your answer to question 7.1(c):** *Double click here*

**(d)** We will now consider partial least-squares regression (PLSR) models.
Before we do the actual modeling, let us repeat the fundamental
equations for PLSR. The model itself is based on the following two
equations:
\begin{equation}
         \begin{split}
         \mathbf{X} &= \mathbf{T} \mathbf{P}^\top, \\
         \mathbf{Y} &= \mathbf{U} \mathbf{Q}^\top, \\
         \end{split}
         \tag{1}
      \end{equation}
      where $\mathbf{T}$ contains the $x$-scores, $\mathbf{P}$ the $x$-loadings,
      $\mathbf{U}$ the $y$-scores, and $\mathbf{Q}$ the $y$-loadings.
      The linear relation between $\mathbf{X}$ and $\mathbf{Y}$ is in this
      case given by,
      \begin{equation}
          \mathbf{Y} = \mathbf{T} \mathbf{Q}^\top.
          \label{eq:plsrreg} \tag{2}
      \end{equation}

When we wish to *predict* new $\mathbf{Y}$-values from new
      $\mathbf{X}$-values, we need to calculate new $x$-scores. This is
      done by introducing an additional matrix, $\mathbf{R}$, so that
      the following is satisfied:
      \begin{equation}
            \mathbf{T} = \mathbf{X} \mathbf{R}. 
          \label{eq:plsrscores} \tag{3}
      \end{equation}
      Here, we can think of $\mathbf{R}$ as a matrix we can use to invert
      the relation $\mathbf{X} = \mathbf{T} \mathbf{P}^\top$. We can then predict
      new $\mathbf{Y}$-values, by combining Eq. \eqref{eq:plsrreg}
      and Eq. \eqref{eq:plsrscores}:
      \begin{equation}
        \mathbf{Y} = \mathbf{T} \mathbf{Q}^\top =
                     \mathbf{X} \mathbf{R} \mathbf{Q}^\top =
                     \mathbf{X} \mathbf{B}_\text{PLS},
                     \tag{4}
      \end{equation}
      where the regression coefficients $\mathbf{B}_\text{PLS}$ are given by:
      \begin{equation}
        \mathbf{B}_\text{PLS}  = \mathbf{R} \mathbf{Q}^\top. \tag{5}
      \end{equation}

In `sklearn` we can run PLSR by using `PLSRegression`
which is found in the module `sklearn.cross_decomposition`.
After running the regression, we can access the matrices given above with the Python code below.

In [None]:
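from sklearn.preprocessing import scale
from sklearn.cross_decomposition import PLSRegression

# X (chemical composition) and y (mean liking) are assumed to be loaded already.
X = scale(X)  # center each column and scale it to unit variance
y = scale(y)
plsr = PLSRegression(n_components=2)
plsr.fit(X, y)
T = plsr.transform(X)  # x-scores
P = plsr.x_loadings_  # x-loadings
Q = plsr.y_loadings_  # y-loadings
R = plsr.x_rotations_  # maps X to the x-scores, Eq. (3)
B = plsr.coef_  # regression coefficients, Eq. (5)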

* (i) Create PLSR models for the two data sets in which you use all available data in the fitting. Use 
  only two components when you create the model, that is, set `PLSRegression(n_components=2)` when
  you set up the models. Calculate $R^2$ and the root mean squared error (RMSE) of your models.
  Comment on the values you have obtained, and compare them with the corresponding values from your
  linear least-squares models.


* (ii) Calculate RMSECV for your two PLSR models. Compare the RMSECV values with the previously obtained
  RMSE values. Would you say that your PLSR models perform better or worse than the least-squares 
  models?


* (iii) Plot the $x$- and $y$-loadings for the two components. These loadings are available as 
  `P = plsr.x_loadings_` and `Q = plsr.y_loadings_`. What variables seem to be most important for predicting the mean liking?


* (iv) Inspect the weights (the $\mathbf{R}$ matrix) for PLS component 1 and PLS component 2.
  Compare your results to the results given by 
  [Platikanov et al.](https://doi.org/10.1016/j.watres.2012.10.040) in Fig. $6$.


* (v) Based on the PLSR results: If you were to create
  a new brand of bottled water, what chemical components
  would you focus on to maximize the mean liking?


In [None]:
# Your code here

**Your answer to question 7.1(d):** *Double click here*