# Exercise set 6

The goal of this exercise is to go through some of the steps required
to make a predictive regression model. We will try different types
of models, and we will assess them in greater detail than in the
previous exercises. In particular, this exercise will introduce the use of a
*training set*, a *test set*, *cross-validation*, and different
measures of the errors in our models.


**Exercise 6.1:** 

Concrete is one of the most important materials in civil engineering, and it
has a rich chemical composition. The strength of concrete is a
function of its ingredients and age.
We will in this exercise investigate to what extent we can predict the
strength of concrete from the ingredients and its age, with linear models.

The [data set](https://doi.org/10.1016/S0008-8846(98)00165-3) we
will consider here contains $1030$ samples, and the
following variables have been measured:

|Variable|Unit|
|:-------|---:|
|Cement (component 1)                 | kg/m$^3$ |
|Blast Furnace Slag (component 2)     | kg/m$^3$ |
|Fly Ash (component 3)                | kg/m$^3$ |
|Water (component 4)                  | kg/m$^3$ |
|Superplasticizer (component 5)       | kg/m$^3$ |
|Coarse Aggregate (component 6)       | kg/m$^3$ |
|Fine Aggregate (component 7)         | kg/m$^3$ | 
|Age                                  | days     |
|Concrete compressive strength        | MPa      |
|**Table 1:** *Data columns present in the [Data file](Data/concrete_data.csv)*||

**(a)**
Begin by exploring the raw data. Here, it is a good idea to make scatter plots,
in particular of the strength as a function of the other variables. In addition,
you may find it useful to investigate correlations between the variables. Below,
you will find some Python code to get you started.

After looking at the raw data, do any of the variables seem to be
correlated with the strength of the concrete?

In [None]:
%matplotlib notebook
from pandas.plotting import scatter_matrix
from matplotlib import pyplot as plt
import pandas as pd

# Load the raw data:
data = pd.read_csv('Data/concrete_data.csv')
# Print out variables:
print(data.columns)
# Rename the variables to have shorter names:
rename = {
    'Cement (component 1)(kg in a m^3 mixture)': 'cement',
    'Blast Furnace Slag (component 2)(kg in a m^3 mixture)': 'slag',
    'Fly Ash (component 3)(kg in a m^3 mixture)': 'ash',
    'Water  (component 4)(kg in a m^3 mixture)': 'water',
    'Superplasticizer (component 5)(kg in a m^3 mixture)': 'super',
    'Coarse Aggregate  (component 6)(kg in a m^3 mixture)': 'coarse',
    'Fine Aggregate (component 7)(kg in a m^3 mixture)': 'fine',
    'Age (day)': 'age',
    'Concrete compressive strength(MPa, megapascals)': 'strength',
}
data = data.rename(columns=rename)
# Remove the ID of samples:
data = data.drop(columns=['Sample ID'])
# Print out information about the data:
print(data.describe())
# Investigate correlations:
corr = data.corr()
# Sort correlations for the strength:
print(corr['strength'].sort_values(ascending=False))
# Make scatter plots of the raw data (this will be a large figure...):
scatter_matrix(data, figsize=(14, 12), diagonal='kde')
plt.tight_layout()
# If some of the variables seem more interesting, they can be
# inspected in greater detail as follows:
data.plot(kind='scatter', x='age', y='strength', alpha=0.5, s=100)
data.plot(kind='scatter', x='cement', y='strength', alpha=0.5, s=100)
plt.tight_layout()

In [None]:
# Your code here

**Your answer to 6.1(a):** *(double click here)*

**(b)** Before we start the modeling, we will create a *training set* and a *test set*. In `sklearn`, the function
`train_test_split` does this, as shown below.

Here, we create a test set using $20$% of the samples from `X` and `y`. Modify this code for
your Python script, and use it to create a training set and a test set.

What is the purpose of doing this split? That is, what is the *training set* and the *test set*
used for respectively?

In [None]:
# Create training set:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
xvars = ['cement', 'slag', 'ash', 'water', 'super', 'coarse', 'fine', 'age']
X = data[xvars]
y = data['strength']
X = scale(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

In [None]:
# Your code here

**Your answer to 6.1(b):** *(double click here)*

**(c)** Our first model will be a linear least-squares model. 
In the rest of the exercise, we will refer to this
model as "model 1". For creating model 1, we will use the
`LinearRegression` class from `sklearn.linear_model`. 

Below, you will find some Python code that can be used as a starting point for
creating a linear model.

Modify your Python code to do the linear least-squares regression and plot 
the measured strengths ($y_i$) vs. the predicted strengths ($\hat{y}_i$) for
your training data.
Also, plot the residuals, $y_i - \hat{y}_i$, as a function of $y_i$. Do the
residuals seem to be homoscedastic or heteroscedastic? What does this plot
indicate?

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

xvars = ['cement', 'slag', 'ash', 'water', 'super', 'coarse',
         'fine', 'age']
X = data[xvars]
y = data['strength']
X = scale(X)

# Create a test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

    
# Do a linear regression:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_hat = linear_model.predict(X_train)
fig, ax1 = plt.subplots(constrained_layout=True)
ax1.scatter(y_train, y_hat)
ax1.set(xlabel='y (measured, training)', ylabel='y (predicted, training)');

In [None]:
# Your code here

**Your answer to 6.1(c):** *(double click here)*

**(d)** 
We have now created a linear model (model 1), and you may have seen that
it does not seem to perform very well.
In this part of the exercise, we will define some ways to
assess the performance, and
we will use the same assessments for the other models we
create in the rest of the exercise.

* In part **(b)** of this exercise, you created a test set. This test set can be used to check
  how well the model performs for data which was not included when making it.
  Predict the strength using model 1 and the test set data. Plot the predicted
  strengths vs. the measured strengths for the test set, and compare this with the
  plot you made when creating/training model 1.

* We have, in previous exercises, used the 
  [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) $R^2$,
  as a measure of how well a model performs. This will also be the first metric we will use here.
  With `sklearn`, this is [available](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) as the function `r2_score` from the module `sklearn.metrics`. 
  Here, we can calculate two different $R^2$ values:
  1. Using the training set (let us call this $R^2$ in the continuation)
  2. Using the test set (let us call this $R_\text{p}^2$ in the continuation).
  
  Calculate the $R^2$ and $R_\text{p}^2$ values for model 1 and compare them.
  
* Another set of metrics for the performance is based on the mean squared error (MSE).
  This error is calculated from the difference between the measured y-values ($y_i$) and the predicted y-values ($\hat{y}_i$):
  \begin{equation*}
    \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 ,
  \end{equation*}
  where $N$ is the number of samples. It is common to report the root mean squared error (RMSE) which
  is obtained by just taking the square root: $\text{RMSE} = \sqrt{\text{MSE}}$.
  Here, we can again calculate two of these values, one for the training set and one for the test set.
  We will give these two different values unique names:
  1. **RMSEC**: Root mean squared error of *calibration*, which is obtained by calculating the RMSE
     for the *training set*.
  2. **RMSEP**: Root mean squared error of *prediction*, which is obtained by calculating the RMSE
     for the *test set*.
     
  Calculate these two values for model 1.
  
  Hint: Here you can use the function [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) from
  the [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
  module of `sklearn`.
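
As a minimal sketch of the definitions above, the following computes the MSE, RMSE, and $R^2$ both directly from their formulas and with `sklearn.metrics`. The arrays `y_true` and `y_pred` are hypothetical values made up for illustration; they are not taken from the concrete data. For your models, you would use the measured and predicted strengths of the training set (or test set) instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical measured and predicted values (illustration only):
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# MSE from the definition, and the same quantity from sklearn:
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
# RMSE is the square root of the MSE:
rmse = np.sqrt(mse_sklearn)

# R^2 = 1 - (residual sum of squares) / (total sum of squares):
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print('MSE (manual, sklearn):', mse_manual, mse_sklearn)
print('RMSE:', rmse)
print('R^2 (manual, sklearn):', r2_manual, r2_sklearn)
```

Note that `r2_score` and `mean_squared_error` both take the measured values as the first argument and the predicted values as the second.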

In [None]:
# Your code here

**Your answer to 6.1(d):** *(double click here)*

**(e)**
When we are doing the actual training (or fitting) of the model,
we can also do *cross-validation*. This is particularly
useful if we have additional parameters
we need to optimize. Such additional parameters could,
for instance, be the number
of variables to include in a least-squares model,
the number of components to use in a PCR model,
or the extra parameters in regularized
regression techniques. We could then optimize these
parameters in a cross-validation step, and test the
full model using the test set.

In cross-validation, we split the training set into $k$ smaller sets.
For each of these sets, we do the following:

1. We fit/train the model using the $k-1$ other sets.
2. We evaluate the performance using the set we kept out of the fitting.

Essentially, we have trained the model $k$ times,
and we have $k$ evaluations of
the fitting. The overall performance of the fitting
can then be obtained as the
average over all the $k$ performance measures.
See Fig. 1 for a graphical overview of this approach.
<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="500">
  
**Fig. 1:** Illustration of cross-validation where we split the training set
into $5$ smaller sets for cross-validation.
This illustration has been taken from the `sklearn`
[homepage](https://scikit-learn.org/stable/modules/cross_validation.html).
Here, we first split our original data into a training set and a test set.
We then split the training set into $5$ smaller sets in the cross-validation
step, and use this to obtain the parameters. Finally, we check the full
model using the test set.
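
The two steps listed above can be sketched as an explicit loop. This is only an illustration under stated assumptions: the data (`X_demo`, `y_demo`) are synthetic and generated here, not the concrete data, and the choice of $k = 5$ simply mirrors Fig. 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data for illustration only (not the concrete data):
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

k = 5
fold_rmse = []
for train_idx, val_idx in KFold(n_splits=k).split(X_demo):
    # 1. Fit the model using the k - 1 other sets:
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # 2. Evaluate the performance on the set we kept out of the fitting:
    y_val_hat = model.predict(X_demo[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], y_val_hat)))

fold_rmse = np.array(fold_rmse)
print('RMSE for each of the k sets:', fold_rmse)
print('Average over the k sets:', fold_rmse.mean())
```

Each sample is used for evaluation exactly once, and the model is never evaluated on samples it was fitted to.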
  
We will now add a cross-validation step to our fitting.
In `sklearn`, there is a function that will do this for us,
[cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html).
This function can be used as follows:

In [None]:
# Create a least-squares model:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# Run cross-validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(linear_model, X_train, y_train,
                         scoring='neg_mean_squared_error',
                         cv=10)
# Take the square root of -scores:
scores = np.sqrt(-scores)
print('Average score:', scores.mean())
print('Standard deviation for score:', scores.std())

Here, we set the following two parameters:

* `scoring`: Defines how we evaluate the model.
  Here we select the *negative* mean squared error.
  The `cross_val_score` function expects
  that a higher score corresponds to a better
  model, which is the opposite of the meaning of the
  mean squared error. This is the reason
  for using the *negative* mean squared error.

* `cv`: Defines how many subsets (folds)
  we split the training data into.
  In this case, we split the training data into $10$ folds, so the model is fitted and evaluated $10$ times.

Update your script to include a cross-validation step.
It is common to report the average score as
a so-called "root mean squared error of cross-validation"
(abbreviated RMSECV). Calculate RMSECV for model 1,
and compare it with the RMSEC and RMSEP values you obtained previously.

In [None]:
# Your code here

**Your answer to 6.1(e):** *(double click here)*

**(f)**
We are not too happy with the performance of the model we have so far.
In the lectures, we have briefly mentioned regularized
regression methods as an alternative to least-squares regression.
We will here see if using a regularized fitting method, the so-called
[Ridge regression method](https://en.wikipedia.org/wiki/Tikhonov_regularization), will improve things.

In `sklearn`, this method is available from
[sklearn.linear_model.Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge).

When using this method, we have to specify one additional
parameter, $\alpha$, which determines how strongly we penalize
large coefficients. This is an unknown parameter, and we need to
find the "best" one to use. One approach to finding the best $\alpha$
is to just try different values and look for the $\alpha$ value
that gives the lowest RMSECV. Luckily, this process can be automated
in `sklearn` by using a method
called [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). This method will automatically
try different $\alpha$ values from a range we specify, and locate
the best parameter by scoring each parameter with cross-validation.
This can be implemented as follows:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
ridge = Ridge()
# We will look for alpha parameters from 0 up to (but not including) 1.5, in steps of 0.01:
parameters = [{'alpha': np.arange(0, 1.5, 0.01)}]
grid = GridSearchCV(ridge, parameters,
                    cv=10,
                    scoring='neg_mean_squared_error',
                    return_train_score=True)
grid.fit(X_train, y_train)
print(np.sqrt(-grid.best_score_))
print(grid.best_params_)
print(grid.best_estimator_)
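
To illustrate how $\alpha$ penalizes large coefficients, the sketch below fits Ridge models with a few $\alpha$ values and prints the length of the coefficient vector for each. Ridge regression in `sklearn` minimizes $\lVert y - Xb \rVert_2^2 + \alpha \lVert b \rVert_2^2$, so a larger $\alpha$ shrinks the coefficients. The data (`X_demo`, `y_demo`) and the three $\alpha$ values are assumptions chosen for illustration only; they are not related to the concrete data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration only (not the concrete data):
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
# Make two of the columns nearly collinear:
X_demo[:, 1] = X_demo[:, 0] + rng.normal(scale=0.05, size=50)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

alphas = [0.01, 1.0, 100.0]
norms = []
for alpha in alphas:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Length (Euclidean norm) of the coefficient vector:
    norms.append(np.linalg.norm(ridge_demo.coef_))
    print('alpha = {:>6}: |coefficients| = {:.4f}'.format(alpha, norms[-1]))
```

A smaller coefficient vector does not by itself mean better predictions, which is why the value of $\alpha$ is selected by cross-validation.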

Create a new model, "model 2", in which you use Ridge regression.
Use the Python code given above as
a starting point for determining $\alpha$ and fit
the model using this $\alpha$ value. Further,
calculate $R^2$, $R_\text{p}^2$, RMSEC, RMSEP, and RMSECV for model 2.
How does model 2 compare with model 1?

In [None]:
# Your code here

**Your answer to 6.1(f):** *(double click here)*

**(g)**
A powerful alternative for doing regression is partial least squares regression (PLSR).
We will in this step investigate if PLSR can improve our ability
to predict the strength. Below, you will find some code for
running PLSR. Implement this in your script.

In [None]:
# Create a PLSR model:
from sklearn.cross_decomposition import PLSRegression
max_components = 8  # Maximum number = number of original variables.
# Run cross-validation:
results = []
for i in range(1, max_components + 1):
    print('Trying with {} PLS components'.format(i))
    plsr_model = PLSRegression(n_components=i)
    scores = cross_val_score(plsr_model, X_train, y_train,
                             scoring='neg_mean_squared_error',
                             cv=10)
    rmsecv = np.average(np.sqrt(-scores))
    results.append((i, rmsecv))
results = np.array(results)
fig, axi = plt.subplots(constrained_layout=True)
axi.plot(results[:, 0], results[:, 1], marker='o')
axi.set(xlabel='Number of components', ylabel='RMSECV')
plt.show()

Note that we here use cross-validation for determining
the number of PLS components we use. What seems to be the best
number of components in this case?
Create a new model, "model 3", in which you use
*only* $2$ PLS components. Calculate
$R^2$, $R_\text{p}^2$, RMSEC, RMSEP, and RMSECV for model 3.
How does this model compare with model 1 and model 2?

In [None]:
# Your code here

**Your answer to 6.1(g):** *(double click here)*

**(h)**
So far, we have not found a good model for
predicting the strength. One option now is to try to use
even more advanced regression methods. However, it
is often a good idea to try to understand the problem we are dealing
with better, before using more complex methods.
Luckily, we have a colleague who is
an expert on concrete. From that colleague, we learn that the strength
depends on the variables we have measured in a highly non-linear way! Our
coworker also tells us that they often find that the strength depends on:

* The water to cement ratio.
* The logarithm of the age, $\ln(\text{age})$.

Motivated by this, create a new least-squares model, "model 4",
in which you include the water-to-cement ratio
and $\ln(\text{age})$ as variables.
Calculate $R^2$, $R_\text{p}^2$, RMSEC, RMSEP, and RMSECV for model 4, and
compare this with your previous models.
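
As a hedged sketch of how the two derived variables could be constructed with `pandas`, consider the small example below. The data frame `demo` and its values are hypothetical, and the new column names `water_cement` and `log_age` are arbitrary choices made here; only the column names `cement`, `water`, and `age` follow the renaming used earlier in this exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only (not read from the data file):
demo = pd.DataFrame({
    'cement': [540.0, 332.5, 198.6],
    'water': [162.0, 228.0, 192.0],
    'age': [28, 270, 360],
})

# Water-to-cement ratio:
demo['water_cement'] = demo['water'] / demo['cement']
# Natural logarithm of the age (requires age > 0):
demo['log_age'] = np.log(demo['age'])
print(demo)
```

Whether the derived variables should replace or supplement the original variables in model 4 is left for you to decide.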

In [None]:
# Your code here

**Your answer to 6.1(h):** *(double click here)*