# Exercise set 5


>The first goal of this exercise is to perform
>**least squares regression** for a case where we have
>**correlations** between variables. We will also see how
>we can use a **training set** and a **test set** to
>assess the quality of a model.
>
>The second goal of this exercise is to analyze the results from an experimental
>design. After this exercise, you should have gained familiarity
>with a full factorial design and know how to calculate
>effects.

## Exercise 5.1 

In this exercise, we will make a least squares model for a case where we have several
variables which may be correlated.
The file [bloodpress.txt](Data/bloodpress.txt) (path: `Data/bloodpress.txt`) contains data about 20 individuals with high blood pressure.
The data columns present in the file are given in Table 1.


| Column | Description              |             Unit |
|:-------|:-------------------------|-----------------:|
| PT     | Label for person         |              --- |
| BP     | Blood pressure           |             mmHg |
| Age    | Age                      |            years |
| Weight | Weight                   |               kg |
| BSA    | Body surface area        |            m$^2$ |
| DUR    | Duration of hypertension |            years |
| Pulse  | Basal heart rate         | beats per minute |
| Stress | Stress index             |              --- |

| |
|:---|
|**Table 1:** *Data columns present in the file [bloodpress.txt](Data/bloodpress.txt).*|


**(a)** Begin by exploring the data. Plot pairs of variables against each other
to visually inspect which variables might be correlated. Based on your plots, which pairs
of variables seem to be correlated? Hint: The [scatter plot matrix](https://seaborn.pydata.org/examples/scatterplot_matrix.html) from [seaborn](https://seaborn.pydata.org/index.html) might help with these plots.

In [None]:
# Your code here

**Your answer to 5.1(a):** (double click here)

**(b)** To quantitatively investigate possible correlations, we can calculate
the [Pearson correlation
coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) ($\rho_{X,Y}$),

\begin{equation*}
\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{S_X S_Y},
\end{equation*}

where $\mathrm{cov}(X,Y)$ is the covariance between the variables $X$ and $Y$, and
$S_i$ is the estimated standard deviation for variable $i$.
In `scipy`, this coefficient is available as the function [pearsonr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)
in the `stats` package, which can be imported by:
```Python
from scipy.stats import pearsonr
```
This function returns two values:
 * The correlation coefficient itself. It lies between $-1$ and $1$: close to $1$ for strongly
   positively correlated variables, close to $-1$ for strongly negatively correlated variables,
   and close to $0$ when there is no linear correlation.
 * A "p-value" which (to quote the documentation of `scipy`):

> roughly indicates the probability of an uncorrelated system
> producing datasets that have a Pearson correlation at least as extreme
> as the one computed from these datasets.


Calculate this coefficient for each of the pairs of variables and
use it to judge if the variables
are correlated. Note: You can also create a heat map of correlations to visualize them.
This can be done with the [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
method from [seaborn](https://seaborn.pydata.org/index.html) in combination
with the [corr](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)
method from [pandas](https://pandas.pydata.org/docs/index.html).
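
As a small illustration of the definition above, the following sketch evaluates $\rho_{X,Y}$ directly from the covariance and the standard deviations and cross-checks it against `numpy.corrcoef`. The numbers are made up for the illustration and are not taken from `bloodpress.txt`.

```python
import numpy as np

# Made-up data for illustration only (not from bloodpress.txt).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample covariance and sample standard deviations (both with ddof=1).
cov_xy = np.cov(x, y, ddof=1)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)

# Pearson correlation coefficient from the definition.
rho = cov_xy / (s_x * s_y)

# Cross-check with the off-diagonal element of NumPy's correlation matrix.
rho_numpy = np.corrcoef(x, y)[0, 1]
print(f"rho (definition) = {rho:.4f}, rho (corrcoef) = {rho_numpy:.4f}")
```

The two values should agree to numerical precision. Unlike `pearsonr`, this calculation does not give a p-value.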

In [None]:
# Your code here

**Your answer to 5.1(b):** (double click here)

**(c)** Create a linear model in which you predict the blood pressure ($y$) from
all 6 available variables (Age, Weight, BSA, DUR, Pulse, and Stress).
For this, make use of the `statsmodels` package and its
ordinary least squares (OLS) estimation. (For a usage example, see the [statsmodels OLS notebook](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html).)
It might be beneficial to scale your variables so that they have a
mean of $0$ and a variance of $1$. This can, for instance, be done using the
[scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)
method from [sklearn](https://scikit-learn.org/stable/).
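
To make the scaling concrete, here is a minimal sketch of the same standardization done by hand with NumPy: subtract the column mean and divide by the column standard deviation. The array `X` below is made up, and the sketch assumes the population standard deviation (`ddof=0`), which is also what the `scale` method in `sklearn` uses.

```python
import numpy as np

# Made-up data: 4 samples, 2 variables on very different scales.
X = np.array([
    [50.0, 1.8],
    [60.0, 1.9],
    [70.0, 2.1],
    [80.0, 2.2],
])

# Standardize each column: zero mean and unit variance (ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print("column means:    ", X_scaled.mean(axis=0))
print("column variances:", X_scaled.var(axis=0))
```

After scaling, the regression coefficients of the different variables are on a comparable footing, which makes it easier to judge their relative importance.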

**Note:** One motivation for using the `statsmodels` package is that it will do
some additional statistics in connection with the fitting. To print out
the results of a fitting, we can do the following:

```Python
import statsmodels.api as sm
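# load X data...
# load y data...
# Note: sm.OLS does not add an intercept automatically. If you need
# one (e.g., when y is not centered), use: X = sm.add_constant(X)
# Do fitting:
model = sm.OLS(y, X).fit()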
# Print out statistics:
print(model.summary())
```

Some information on the interpretation of this summary can be found [here](https://blog.datarobot.com/ordinary-least-squares-in-python).

In [None]:
# Your code here

**Your answer to 5.1(c):** (double click here)

**(d)** Your results so far should indicate that some of the variables
are correlated. Create a new model where you remove variables
that are correlated. That is, if a variable, say $u$, is correlated with
another variable, say $v$, pick either $u$ or $v$ for your new model.

1. Fit your new model to the blood pressure data
   and compare it with the model where you included all variables.
2. How many variables can you remove and still get a good model?
3. Which variables seem
   to be more important for predicting if a person will have high blood pressure?

In [None]:
# Your code here

**Your answer to 5.1(d):** (double click here)

**(e)** One strategy to assess the quality of a model is to split the raw data
into a **training set** and a **test set**. The **training set**
is then used to fit the model, and the **test set** is used to test the quality of
the fitted model.

Consider a data set containing $n$ samples. If the data set is
relatively small, we can use the "Leave-one-out cross-validation (LOOCV)"
approach. Here, we pick one sample, say measurement number $j$, which we
keep out of the fitting and we fit the model using the $n-1$ other points.
Now, we can *predict* the value of measurement $j$ using the fitted model,
let us call the predicted value $\tilde{y}_j$. The difference between the predicted
value from the model and the measured value $y_j$ can then be taken as a measure
of the error,

\begin{equation*}
r_j = y_{j} - \tilde{y}_j,
\end{equation*}

where $r_j$ is the so-called predicted residual.
There is nothing special about picking point $j$, and we can try all possibilities
of leaving one point out, and refitting the model using the remaining $n-1$
measurements. After doing this for all possibilities, we have fitted the model
$n$ times and the mean squared error (obtained from the squared
residuals), $\mathrm{MSE}_{\mathrm{CV}}$, can then be used
to estimate the error in the model,

\begin{equation*}
\mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n} \sum_{i=1}^{n} r_i^2 =  \frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2,
\end{equation*}

where $y_i$ is the measured $y$ in experiment $i$, and $\tilde{y}_i$ is the
predicted $y$, using a model which was fitted using all points *except* point $i$.

1. Implement the LOOCV approach and calculate the mean squared error for a linear fit
   using all variables, and for a linear fit using just the weight and age as
   variables.

   **Hint:** The splitting into training and test sets can be done with the method
   [LeaveOneOut](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html)
   from [sklearn](https://scikit-learn.org/stable/):

   ```Python
   from sklearn.model_selection import LeaveOneOut
   # load X data... (as NumPy arrays; for pandas objects, index with .iloc[...] below)
   # load y data...
   loo = LeaveOneOut()
   for train_index, test_index in loo.split(X):
       X_train, X_test = X[train_index], X[test_index]
       y_train, y_test = y[train_index], y[test_index]
       # Do fitting using X_train and y_train.
       # Calculate error using y_test and predicted y from the model.
   ```

2. One can show that for linear least squares models (polynomial fitting included),
   the mean squared error can be obtained in a simpler way,

   \begin{equation*}
   \mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 =
   \frac{1}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2,
   \end{equation*}

   where the $\hat{y}_i$'s are predicted values using the
   model fitted with *all data points*,
   and $h_{ii}$ is the $i$'th diagonal element of the
   $\mathbf{H}$ matrix (the projection matrix,
   see Eq. (2) in exercise 3),
   
   \begin{equation}
   \mathbf{H} =
   \mathbf{X} 
   \left( 
     \mathbf{X}^\mathrm{T} \mathbf{X}
   \right)^{-1}
   \mathbf{X}^\mathrm{T} = \mathbf{X} \mathbf{X}^+
   \end{equation}

   Using the equations given above, recalculate $\mathrm{MSE}_{\mathrm{CV}}$ and compare
   with your previous answers.

   Note the difference between $\hat{y}_i$ and $\tilde{y}_i$, and the
   fact that we actually don't have to do the
   refitting(!) in order to obtain the $\mathrm{MSE}_{\mathrm{CV}}$.
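
The equivalence of the two expressions for $\mathrm{MSE}_{\mathrm{CV}}$ can be checked numerically. The sketch below does this on synthetic data (randomly generated, not the blood pressure data), with an assumed model of an intercept plus two variables. It uses plain NumPy least squares rather than `statsmodels` or `sklearn`, so treat it as an illustration of the idea and not as the required implementation.

```python
import numpy as np

# Synthetic data for illustration only: y depends linearly on x1 and x2.
rng = np.random.default_rng(0)
n = 12
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Model matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Explicit LOOCV: leave out point j, refit, and predict point j.
residuals = np.empty(n)
for j in range(n):
    keep = np.arange(n) != j
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    residuals[j] = y[j] - X[j] @ beta
mse_loop = np.mean(residuals**2)

# Shortcut: a single fit with all points and the diagonal of H = X X^+.
H = X @ np.linalg.pinv(X)
y_hat = H @ y
mse_shortcut = np.mean(((y - y_hat) / (1.0 - np.diag(H))) ** 2)

print(f"MSE_CV (n refits) = {mse_loop:.6f}")
print(f"MSE_CV (shortcut) = {mse_shortcut:.6f}")
```

The two numbers should agree to numerical precision, even though the shortcut only needs one fit.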

In [None]:
# Your code here

**Your answer to 5.1(e):** (double click here)

## Exercise 5.2

The growth rate of a certain bacterial species depends
on the concentration of nutrients such as phosphate,
sucrose, and nitrate. We have conducted a set of
experiments where we have investigated how the
growth is influenced by the concentration of
phosphate ($P$), the concentration of sucrose ($S$), and
the concentration of nitrate ($N$). The design
matrix and the measured growth rate are given in Table 2.


|$P$  | $S$ | $N$ | **Growth rate**  |
|:---:|:---:|:---:|:---:|
|$+$ | $-$ | $-$ | $7$ |
|$-$ | $+$ | $-$ | $10$ |
|$+$ | $-$ | $+$ | $8$ |
|$-$ | $+$ | $+$ | $11$ |
|$-$ | $-$ | $-$ | $11$ |
|$+$ | $+$ | $+$ | $12$ |
|$+$ | $+$ | $-$ | $7$ |
|$-$ | $-$ | $+$ | $7$ |

| |
|:---|
|**Table 2:** *Experimental design matrix for the growth rate of the investigated bacteria. The factors are the concentration of phosphate ($P$), the concentration of sucrose ($S$), and the concentration of nitrate ($N$).*|

**(a)**  Compute all the main effects.

In [None]:
# Your code here

**Your answer to question 5.2(a):** *Double click here*

**(b)**  Extend the design matrix with the possible $2$-factor and $3$-factor
interaction effects. Compute these interaction effects.
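
As a hedged illustration of the mechanics, the sketch below uses a smaller, hypothetical two-factor design with made-up responses (factors `A` and `B`, not the data in Table 2). It assumes the usual definition of an effect as the mean response at the high level minus the mean response at the low level, and it builds the interaction column as the element-wise product of the factor columns.

```python
import numpy as np

# Hypothetical 2^2 design in coded units (-1 = low level, +1 = high level).
A = np.array([-1, 1, -1, 1])
B = np.array([-1, -1, 1, 1])
# Made-up responses, one per run.
y = np.array([2.0, 4.0, 6.0, 12.0])

# The interaction column is the element-wise product of the factor columns.
AB = A * B


def effect(column, response):
    """Mean response at the high level minus mean response at the low level."""
    return float(response[column == 1].mean() - response[column == -1].mean())


effects = {name: effect(col, y) for name, col in [("A", A), ("B", B), ("AB", AB)]}
print(effects)  # {'A': 4.0, 'B': 6.0, 'AB': 2.0}
```

The same pattern extends to three factors: the two-factor columns are pairwise products, and the three-factor column is the product of all three factor columns.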



In [None]:
# Your code here

**Your answer to question 5.2(b):** *Double click here*

**(c)** What factors and interactions seem
to increase the growth rate?



In [None]:
# Your code here

**Your answer to question 5.2(c):** *Double click here*

**(d)**  Make two least-squares models of the data
given in Table 2 (convert "$+$" to $1$ and
"$-$" to $-1$):

* (i)  Model 1, which only includes the main effects.

* (ii)  Model 2, which includes the main effects and
  the interactions.
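
A least-squares fit in coded units can be sketched as follows, again for a hypothetical two-factor design with made-up responses and not for the data in Table 2. The coefficient names (`a0`, `a_A`, `a_B`, `a_AB`) are chosen for this illustration only, and `numpy.linalg.lstsq` is used as one possible solver; `statsmodels` would work equally well.

```python
import numpy as np

# Hypothetical 2^2 design in coded units, with made-up responses.
A = np.array([-1.0, 1.0, -1.0, 1.0])
B = np.array([-1.0, -1.0, 1.0, 1.0])
y = np.array([2.0, 4.0, 6.0, 12.0])

# Model matrix: intercept, the two main effects, and the interaction.
X = np.column_stack([np.ones_like(A), A, B, A * B])

# Solve the least-squares problem y ~ X @ coef.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(["a0", "a_A", "a_B", "a_AB"], coef):
    print(f"{name} = {value:.3f}")
```

Because this model has as many coefficients as there are runs, it reproduces the responses exactly. Dropping the interaction column gives the corresponding main-effects model.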

In [None]:
# Your code here

**Your answer to question 5.2(d):** *Double click here*

**(e)** Compare the two models with the effects you have calculated, and the conclusions you
made in point (c). You can fill out this table for your comparison:


| **Factor**   | **Regression coefficient**   | **Effect**        |        **Effect/regression coefficient** |
|--------------|------------------------------|-------------------|------------------------------------------|
| Average      |   $a_0 = ...$                | Average y = ...   |  Average y / $a_0$                       |
| $P$          |   $a_1 = ...$                | Effect($P$) = ... |  Effect($P$) / $a_1$                     |
| $S$          |                              |                   |                                          |
| $N$          |                              |                   |                                          |
| $PS$         |                              |                   |                                          |
| $PN$         |                              |                   |                                          |
| $SN$         |                              |                   |                                          |
| $PSN$        |                              |                   |                                          |

| |
|:---|
|**Table 3:** *Table for regression coefficients, effects, and the ratio effect/(regression coefficient).*|


(There is a nicer table in the PDF-version of the exercise!)

In [None]:
# Your code here

**Your answer to question 5.2(e):** *Double click here*