**Exercise 4.1:** 

In this exercise, we will make a least squares model for a case where we have several
variables which may be correlated.
The file [bloodpress.txt](Data/bloodpress.txt) contains data about 20 individuals with high blood pressure.
The data columns present in the file are given in table 1.


| Label  | Description              |             Unit |
|:-------|:-------------------------|-----------------:|
| PT     | Label for person         |              --- |
| BP     | Blood pressure           |             mmHg |
| Age    | Age                      |            years |
| Weight | Weight                   |               kg |
| BSA    | Body surface area        |            m$^2$ |
| DUR    | Duration of hypertension |            years |
| Pulse  | Basal heart rate         | beats per minute |
| Stress | Stress index             |              --- |
||**Table 1:** *Data columns present in the file [bloodpress.txt](Data/bloodpress.txt)*|


**(a)** Begin by exploring the data. Plot pairs of variables against each other
to visually inspect which variables might be correlated. Based on your plots, which pairs
of variables seem to be correlated?
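
One possible way to get all pairwise plots at once is a scatter matrix. The sketch below uses small synthetic stand-in data (the column names `a`, `b`, and `c` are made up for illustration and are not the blood pressure columns), assuming `pandas` and `matplotlib` are installed:

```Python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the blood pressure data):
# "b" is constructed from "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.3, size=50),
    "c": rng.normal(size=50),
})

# One scatter panel per pair of columns, with histograms on the diagonal.
# In a notebook the figure is rendered inline; in a script, call plt.show().
axes = pd.plotting.scatter_matrix(demo, figsize=(6, 6), diagonal="hist")
```

In such a figure, a correlated pair (here `a` and `b`) shows up as a narrow, tilted band of points, while an uncorrelated pair looks like a shapeless cloud.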

In [None]:
# Your code here

**Your answer to 4.1(a):** (double click here)

**(b)** To quantitatively investigate possible correlations, we can calculate
the [Pearson correlation
coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) ($\rho_{X,Y}$),
\begin{equation*}
\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{S_X S_Y},
\end{equation*}
where $\mathrm{cov}(X,Y)$ is the covariance between the variables $X$ and $Y$, and
$S_i$ is the estimated standard deviation for variable $i$.
In `scipy`, this coefficient is calculated by the function [`pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) in the `stats` package,
which can be imported by:
```Python
from scipy.stats import pearsonr
```
This function returns two values:
 * The correlation coefficient itself. It is close to $1$ for positively correlated distributions
   and close to $-1$ for negatively correlated distributions.
 * A "p-value" which (to quote the documentation of `scipy`):

> roughly indicates the probability of an uncorrelated system
> producing datasets that have a Pearson correlation at least as extreme
> as the one computed from these datasets.



Calculate this coefficient for each of the pairs of variables and
use it to judge if the variables
are correlated. Does this seem reasonable?
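
As an illustration of what the two returned values look like, here is a minimal sketch on synthetic data (the arrays `u`, `v`, and `w` are made up for this example and are not part of the data set):

```Python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
u = rng.normal(size=200)
v = 0.8 * u + rng.normal(scale=0.5, size=200)  # constructed to correlate with u
w = rng.normal(size=200)                        # generated independently of u

# pearsonr returns the coefficient and the p-value, in that order.
r_uv, p_uv = pearsonr(u, v)
r_uw, p_uw = pearsonr(u, w)
print(f"u vs v: r = {r_uv:.2f}, p = {p_uv:.2g}")
print(f"u vs w: r = {r_uw:.2f}, p = {p_uw:.2g}")
```

If the data is stored in a `pandas` data frame, `DataFrame.corr()` gives the full matrix of Pearson coefficients in one call, but without the p-values.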



In [None]:
# Your code here

**Your answer to 4.1(b):** (double click here)

**(c)** Create a linear model in which you predict the blood pressure ($y$) from
all 6 available variables (Age, Weight, BSA, DUR, Pulse, and Stress).
For this, make use of the `statsmodels` package and the
ordinary least squares (OLS) estimation. (For an example of its usage, see the [`statsmodels` OLS example](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html).)
Here, it might be beneficial to scale your variables so that they have a
mean of $0$ and a variance of $1$. This can, for instance, be done using the `sklearn` package (see the [`scale` function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)).
Compare the results of the fitting using both scaled and non-scaled variables.

**Note:** One motivation for using the `statsmodels` package is that it will do
some additional statistics in connection with the fitting. To print out
the results of a fitting, we can do the following:
```Python
import statsmodels.api as sm
# load X data...
# load y data...
# Do fitting (sm.OLS adds no intercept on its own;
# use sm.add_constant(X) if the model should include one):
model = sm.OLS(y, X).fit()
# Print out statistics:
print(model.summary())
```
Some information on the interpretation of this summary can be found [here](https://blog.datarobot.com/ordinary-least-squares-in-python).

In [None]:
# Your code here

**Your answer to 4.1(c):** (double click here)

**(d)** Your results so far should indicate that some of the variables
are correlated. Create a new model where you remove variables
that are correlated. That is, if a variable, say $u$, is correlated with
another variable, say $v$, pick either $u$ or $v$ for your new model.

Fit your new model to the blood pressure data
and compare it with the model where you included all variables.
How many variables can you remove and still get a good model?
Which variables seem
to be most important for predicting the blood pressure?

In [None]:
# Your code here

**Your answer to 4.1(d):** (double click here)

**(e) (Optional exercise)** One strategy to assess the quality of a model is to split the raw data
into a **training set** and a **test set**. The **training set**
is then used to fit the model, and the **test set** is used to test the quality of
the fitted model.

Consider a data set containing $n$ samples. If the data set is
relatively small, we can use the "Leave-one-out cross-validation (LOOCV)"
approach. Here, we pick one sample, say measurement number $j$, which we
keep out of the fitting and we fit the model using the $n-1$ other points.
Now, we can *predict* the value of measurement $j$ using the fitted model,
let us call the predicted value $\tilde{y}_j$. The difference between the predicted
value from the model and the measured value $y_j$ can then be taken as a measure
of the error,
\begin{equation*}
r_j = y_{j} - \tilde{y}_j,
\end{equation*}
where $r_j$ is the so-called predicted residual.
There is nothing special about picking point $j$, and we can try all possibilities
of leaving one point out, and refitting the model using the remaining $n-1$
measurements. After doing this for all possibilities, we have fitted the model
$n$ times and the mean squared error (obtained from the squared
residuals), $\mathrm{MSE}_{\mathrm{CV}}$, can then be used
to estimate the error in the model,
\begin{equation*}
\mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n} \sum_{i=1}^{n} r_i^2 =  \frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2,
\end{equation*}
where $y_i$ is the measured $y$ in experiment $i$, and $\tilde{y}_i$ is the
predicted $y$, using a model which was fitted using all points *except* $y_i$.

*(i)* Implement the LOOCV approach and calculate the mean squared error for a linear fit
using all variables, and for a linear fit using just the weight and age as
variables.

**Hint:** The splitting into training and test sets can be done with the class `LeaveOneOut`
from `sklearn`:
```Python
from sklearn.model_selection import LeaveOneOut
# load X data...
# load y data...
loo = LeaveOneOut()
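for train_index, test_index in loo.split(X):
    # The indexing below assumes NumPy arrays. For pandas objects,
    # use X.iloc[train_index] and y.iloc[train_index] instead.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]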
    # Do fitting using X_train and y_train.
    # Calculate error using y_test and predicted y from the model.
```

*(ii)* One can show that for linear least squares fitting (which includes polynomial fitting), the mean squared error can
be obtained in a simpler way,
\begin{equation*}
\mathrm{MSE}_{\mathrm{CV}} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 =
\frac{1}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2,
\end{equation*}
where the $\hat{y}_i$'s are predicted values using the
model fitted with *all data points*,
and $h_{ii}$ is the $i$'th diagonal element of the
$\mathbf{H}$ matrix (the projection matrix,
see Eq. (2) in exercise 2),
\begin{equation}
\mathbf{H} =
\mathbf{X} 
\left( 
  \mathbf{X}^\mathrm{T} \mathbf{X}
\right)^{-1}
\mathbf{X}^\mathrm{T}
\end{equation}
Using the equations given above, recalculate $\mathrm{MSE}_{\mathrm{CV}}$ and compare
with your previous answers.
Note the difference between $\hat{y}_i$ and $\tilde{y}_i$, and the
fact that we actually don't have to do the
refitting(!) in order to obtain the $\mathrm{MSE}_{\mathrm{CV}}$.
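
To make the equivalence concrete, the sketch below evaluates both expressions for $\mathrm{MSE}_{\mathrm{CV}}$ on a small synthetic data set. The data, the names `X_demo` and `y_demo`, and the use of `np.linalg.lstsq` for the fitting are choices made for this illustration only:

```Python
import numpy as np

rng = np.random.default_rng(3)
n = 25
# Design matrix with an explicit column of ones for the intercept.
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y_demo = X_demo @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Explicit leave-one-out: refit n times and predict the held-out point.
residuals = np.empty(n)
for j in range(n):
    keep = np.arange(n) != j
    beta_j, *_ = np.linalg.lstsq(X_demo[keep], y_demo[keep], rcond=None)
    residuals[j] = y_demo[j] - X_demo[j] @ beta_j
mse_refit = np.mean(residuals**2)

# Shortcut: a single fit on all points and the diagonal of H.
H = X_demo @ np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T
y_hat = H @ y_demo
mse_hat = np.mean(((y_demo - y_hat) / (1.0 - np.diag(H))) ** 2)

print(mse_refit, mse_hat)
```

The two numbers should agree to within floating point precision. Note that the design matrix used to build $\mathbf{H}$ must be the same one used in the fit, including the column of ones if the model has an intercept.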

In [None]:
# (Optional) Your code here

**(Optional) Your answer to 4.1(e):** (double click here)

**Exercise 4.2**

In this exercise, we will investigate a data set using PCA and
our aim is to classify objects.
The data set is contained in the file ["Data/data_exercise4.txt"](Data/data_exercise4.txt)
where you will find columns as described in table 2.

| Label | Description                       |    Unit |
|:------|:----------------------------------|--------:|
| $x$   | Values of a measured quantity $x$ | Unknown |
| $y$   | Values of a measured quantity $y$ | Unknown |
| class | Classification of objects         |     --- |
||**Table 2:** *Data columns present in the file [data_exercise4.txt](Data/data_exercise4.txt)*|

Here, we have done a series of measurements of the variables
$x$ and $y$ for some objects, and we have labeled these objects as
belonging to one of two possible classes: "foo" or "bar".
Your task is now to investigate if you can separate
data points into these two classes, by performing a principal component analysis.

**(a)** As stated above, we are attempting to separate the data into two
classes. Should you include the labels (i.e. the "class" column)
in the data matrix you will run PCA on?

**Your answer to 4.2(a):** (double click here)

**(b)** Begin by plotting the raw data.
Below you will find some Python
code to get you started:

In [None]:
from matplotlib import pyplot as plt  # Plotting
import numpy as np  # Matrix operations
import pandas as pd  # Reading the data set
from sklearn.preprocessing import StandardScaler  # Scaling (if needed)
from sklearn.decomposition import PCA  # Performing PCA
# Load data (the columns are separated by whitespace):
data = pd.read_csv('Data/data_exercise4.txt', sep=r'\s+')
# Extract the two classes for plotting:
class1 = data[data['class'] == 'foo']
class2 = data[data['class'] == 'bar']
# We can plot the raw data as follows:
# Note: Here you could style the plot to your liking.
# Adding x-labels and y-labels is, for instance, a good idea.
plt.scatter(class1['x'], class1['y'], label='Class foo')
plt.scatter(class2['x'], class2['y'], label='Class bar')
plt.legend()
plt.show()

**(c)** Run the principal component analysis. Should you scale the data in this case?
Below you will find some Python
code to get you started:

In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
plt.style.use('seaborn-v0_8-talk')  # Called 'seaborn-talk' in matplotlib < 3.6
# Load data (the columns are separated by whitespace):
data = pd.read_csv('Data/data_exercise4.txt', sep=r'\s+')
variables = ['x', 'y']  # Insert the variables you will use here
X = data[variables]
# Uncomment the following lines in order to scale the data:
# scaler = StandardScaler()  # Initialise the scaler
# scaler.fit(X)  # "Fit" the scaler with the X data
# X = scaler.transform(X)  # Transform the X data with the fit
# To run PCA:
pca = PCA()
scores = pca.fit_transform(X)

**Your answer to 4.2(c):** (double click here)

**(d)** The `pca` object defined above contains the
results of the principal component analysis.

**(i)** How many principal components were
used here? Hint: Inspect `pca.n_components_`

**(ii)** Plot the explained variance by these components.
Hint: Inspect `pca.explained_variance_ratio_`
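
As a rough sketch of how these two attributes can be used, the example below runs PCA on a synthetic point cloud (`X_demo` and `pca_demo` are made-up names, and the data is not the exercise data) and shows the explained variance as a bar plot:

```Python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

# Synthetic 2D cloud that is stretched along the first axis.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(100, 2)) * np.array([3.0, 0.5])

pca_demo = PCA().fit(X_demo)
print(pca_demo.n_components_)
print(pca_demo.explained_variance_ratio_)

# One bar per principal component.
fig, ax = plt.subplots()
labels = [f"PC{i + 1}" for i in range(pca_demo.n_components_)]
ax.bar(labels, pca_demo.explained_variance_ratio_)
ax.set_ylabel("Explained variance ratio")
```

When all components are kept, as is the case for `PCA()` without arguments, the ratios sum to one.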

In [None]:
# Your code here

**Your answer to 4.2(d):** (double click here)

**(e)** The principal components themselves are contained as row vectors in
the matrix `pca.components_`:

*(i)* Obtain the principal component vectors.

*(ii)* Verify that they are normalized (Hint: dot products can be done with `np.dot`)

*(iii)* Verify that they are orthogonal to each other.

*(iv)* Plot them together with the raw data.
Do they point in the directions you would expect?
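
The checks in (ii) and (iii) could, for instance, be organised as in the following sketch, which uses a synthetic data set (`X_demo` is invented for this example) in place of the exercise data:

```Python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated 2D data.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])

pca_demo = PCA().fit(X_demo)
pc1, pc2 = pca_demo.components_  # each row is one principal component

print(np.dot(pc1, pc1), np.dot(pc2, pc2))  # squared lengths of the vectors
print(np.dot(pc1, pc2))                    # overlap between the two vectors

# All pairwise dot products at once: rows times rows.
gram = pca_demo.components_ @ pca_demo.components_.T
print(gram)
```

For plotting, the vectors can be drawn as arrows starting at the mean of the data, which is available as `pca_demo.mean_`. Use equal axis scaling (for example `ax.set_aspect("equal")`), otherwise orthogonal vectors will not look orthogonal in the figure.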


In [None]:
# Your code here

**Your answer to 4.2(e):** (double click here)

**(f)** Next, we will investigate the contributions from the
original variables to the principal components. The contribution from
a variable can be found as a column vector in the matrix `pca.components_`.
Make a plot where you show the contributions from each of the original variables to principal components 1 and 2.

In [None]:
# Your code here

**(g)** Plot the scores of the data points for
principal components 1 and 2. Color these points by the
class information.

*(i)* Which of the principal components seems more important for
separating between the two classes?

*(ii)* Which class would you predict for the following two points?

 * Point 1: $x = 4$, $y=6$
 * Point 2: $x = 6$, $y=4$

*(iii)* Can you, based on this plot, make a simple
rule for determining if a point will belong to class "foo"  or class "bar"?
What would your rule be?

*(iv)* Can you "translate" your rule from the principal
component space to the original ($x$, $y$) space? What would the corresponding
rule be in the original variable space?
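
Moving between the two spaces relies on how the scores are computed: the data is centered and then projected onto the principal components. The sketch below checks this on synthetic data (`X_demo` is invented for this example, and `PCA` is used with its default settings and without prior scaling):

```Python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated 2D data centered around (5, 5).
rng = np.random.default_rng(6)
X_demo = (rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
          + np.array([5.0, 5.0]))
pca_demo = PCA().fit(X_demo)

new_points = np.array([[4.0, 6.0], [6.0, 4.0]])

# Scores from sklearn, and the same scores calculated by hand:
scores_sklearn = pca_demo.transform(new_points)
scores_manual = (new_points - pca_demo.mean_) @ pca_demo.components_.T
print(scores_sklearn)
print(scores_manual)
```

Each score is thus a linear expression in $x$ and $y$, so a threshold on one score corresponds to a straight line in the original variable space. If the data was scaled before the PCA, new points must be passed through the same scaler first.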

In [None]:
# Your code here

**Your answer to 4.2(g):** (double click here)

**(h)** PCA belongs to a class of methods that are called latent variable methods.
Latent variable methods typically discover new
variables using the original ones, intending to uncover
"hidden" relations in the data.
In PCA, we find
such latent variables by taking linear combinations of our
original variables, and we make the new latent variables point
in the directions of the largest variance in our data.

Another example of a latent variable method is 
Linear discriminant analysis (LDA).
LDA is similar to PCA, and in LDA we also
find new variables as linear combinations of our original variables.
However, the way we find them is different. In LDA, we do not look for
directions in which the variance is largest, but we rather look for
directions that *separate* the classes best. This means that class information
is needed as input when training the model.
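
The difference in required input is visible in how the two methods are called in `sklearn`. The following is a minimal sketch on synthetic data (the two point clouds and the array `labels` are invented for this example):

```Python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic classes: wide along x, narrow and offset along y.
rng = np.random.default_rng(7)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[3.0, 0.3], size=(50, 2)),
    rng.normal(loc=[0.0, 2.0], scale=[3.0, 0.3], size=(50, 2)),
])
labels = np.array(["foo"] * 50 + ["bar"] * 50)

# PCA is fitted with the measurements only:
pca_scores = PCA(n_components=1).fit_transform(X_demo)

# LDA is fitted with the measurements and the class labels:
lda = LinearDiscriminantAnalysis(n_components=1)
lda_scores = lda.fit_transform(X_demo, labels)
print(lda.score(X_demo, labels))  # fraction of correctly classified points
```

In this constructed example, the direction of largest variance (along $x$) is not the direction that separates the classes (along $y$), so the two methods pick different latent variables.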

Based on this, would you say that PCA is a supervised or an unsupervised method?
What about LDA? Is LDA a supervised or unsupervised method?

**Your answer to 4.2(h):** (double click here)