# Exercise on the use of OLS, Lasso, Ridge and PCR regression

In this exercise we compare several linear regression algorithms: ordinary least squares (OLS), Lasso, Ridge and principal component regression (PCR).

The goal is to predict a measure of the progression of diabetes from some input features, such as age, body mass index and blood pressure.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

data = load_diabetes()
features = data.feature_names
X, y = data.data, data.target

print(data.DESCR)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# create the test and train datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# scale and center the data
scalerX = StandardScaler()

X0_train = scalerX.fit_transform(X_train)
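
The scaling step above can be sketched by hand with NumPy. This is only an illustration on synthetic data: the arrays `X_tr` and `X_te` are hypothetical stand-ins for the train and test sets. The key point is that the mean and standard deviation are estimated on the training data only and then reused for the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical train/test arrays standing in for X_train and X_test
X_tr = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics are estimated on the training data only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same shift and scale are applied to both sets
X0_tr = (X_tr - mu) / sigma
X0_te = (X_te - mu) / sigma

print(np.round(X0_tr.mean(axis=0), 3), np.round(X0_tr.std(axis=0), 3))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: it is treated as unseen data.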

## Least-Squares regression

A general regression problem can be written as $y = f(\mathbf{x})$, with $y \in \mathbb{R}$, $\mathbf{x} \in \mathbb{R}^d$ and $f: \mathbb{R}^d \to \mathbb{R}$.
In linear regression, the function is represented by a vector of weights: $y = \mathbf{w}^T \mathbf{x} = \sum_{i=1}^d w_i x_i$.

We need to fit the weights to our process, so we collect $n$ observations of the inputs $\mathbf{x}$ and the target $y$, stacked as the rows of $\mathbf{X} \in \mathbb{R}^{n \times d}$ and the entries of $\mathbf{y} \in \mathbb{R}^n$. The objective is to find the weights that minimize the squared Euclidean distance between the observations $\mathbf{y}$ and the predictions $\mathbf{X} \mathbf{w}$:

$\mathbf{w} = \underset{\mathbf{w}}{\mathrm{arg\,min}} \, ||\mathbf{X}\mathbf{w} - \mathbf{y}||^2_2$
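
As a minimal sketch of this minimization, the weights can be computed directly with NumPy on synthetic data. The data and `w_true` below are made up for illustration, and no intercept is fitted, unlike the default `LinearRegression`. The sketch also checks that the least-squares solver agrees with the normal equations $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])  # hypothetical weights used to generate y
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Least-squares solution of min_w ||Xw - y||^2
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same solution from the normal equations X^T X w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(np.round(w_lstsq, 2))
print(np.round(w_normal, 2))
```

In practice `np.linalg.lstsq` is usually preferred over forming $\mathbf{X}^T\mathbf{X}$ explicitly, because it is better conditioned when features are strongly correlated.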

In [None]:
from sklearn.linear_model import LinearRegression

# Create the linear regression object
OLS_reg = LinearRegression().fit(X0_train, y_train)

# To test the regression, we also need to scale and center the test data (using the training statistics)
X0_test = scalerX.transform(X_test)
y_pred_OLS = OLS_reg.predict(X0_test)

plt.scatter(y_test, y_pred_OLS)
plt.plot(y_test, y_test, c='r', alpha=0.6, ls='--')
plt.xlim(y_test.min()-5, y_test.max()+5)
plt.ylim(y_test.min()-5, y_test.max()+5)
plt.xlabel('Observed target')
plt.ylabel('Predicted target')
plt.show()


## Metrics

To assess the quality of the regression model, we need to compare the predictions and observations for some points that were not used to train the model.
Many different metrics can be used, and the right choice depends on the application.
Some popular ones are the coefficient of determination $R^2$, the mean squared error (MSE) and the root mean squared error (RMSE):
\begin{equation}
R^2(\mathbf{y}, \hat{\mathbf{y}}) = 1-\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}
\end{equation}

\begin{equation}
\mathrm{MSE}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
\end{equation}

\begin{equation}
\mathrm{RMSE}(\mathbf{y}, \hat{\mathbf{y}}) = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
\end{equation}

An overview of error metrics can be found here: https://scikit-learn.org/stable/modules/model_evaluation.html#which-scoring-function-should-i-use
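
The three formulas above can be checked by hand on a tiny example. The arrays `y_true` and `y_hat` below are made-up values chosen so that the arithmetic is easy to follow.

```python
import numpy as np

# Made-up observations and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat                        # [1, 0, -1, 0]
ss_res = np.sum(residuals ** 2)                   # 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 20

mse = ss_res / len(y_true)
rmse = np.sqrt(mse)
r2 = 1.0 - ss_res / ss_tot

print(f'MSE = {mse:.2f}, RMSE = {rmse:.2f}, R2 = {r2:.2f}')
# MSE = 0.50, RMSE = 0.71, R2 = 0.90
```

Note that the RMSE is expressed in the same units as the target, while $R^2$ is dimensionless and equals 1 for a perfect fit.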

In [None]:
# Note: root_mean_squared_error requires scikit-learn >= 1.4
from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error

r2_ols = r2_score(y_test, y_pred_OLS)
mse_ols = mean_squared_error(y_test, y_pred_OLS)
rmse_ols = root_mean_squared_error(y_test, y_pred_OLS)

print(f'R2 for OLS is: {r2_ols:.2f}')
print(f'MSE for OLS is: {mse_ols:.2f}')
print(f'RMSE for OLS is: {rmse_ols:.2f}')


## Lasso regression

In the OLS regression model we have included all the input features. However, features that carry little information about the output add variance to the estimated weights, which can decrease the accuracy of the model on new data. The Lasso regression model penalizes the absolute value of the coefficients, so a weight stays different from zero only if it improves the fit enough to offset the penalty.

The objective function of the Lasso regression problem is:

$\mathbf{w} = \underset{\mathbf{w}}{\mathrm{arg\,min}} \, \frac{1}{2 n} ||\mathbf{X}\mathbf{w} - \mathbf{y}||^2_2 + \alpha ||\mathbf{w}||_1$

where the coefficient $\alpha$ controls how much we regularize the model.
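
To see why the $l_1$ penalty produces exact zeros, consider the special case in which the features are uncorrelated and scaled so that $\mathbf{X}^T\mathbf{X} = n\,\mathbf{I}$ (an assumption made here only for illustration; it does not hold exactly for the diabetes data). In that case the objective above decouples per coefficient, and each Lasso weight is the OLS weight pulled towards zero by $\alpha$ and clipped at zero. The sketch below applies this soft-thresholding rule to a made-up vector of OLS weights.

```python
import numpy as np

def soft_threshold(w, alpha):
    """Shrink each entry of w towards zero by alpha, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

# Made-up OLS weights for four features
w_ols = np.array([3.0, -0.5, 0.2, -2.0])

# Under the orthogonality assumption, these would be the Lasso weights
w_lasso = soft_threshold(w_ols, alpha=1.0)

print(w_lasso)
print('non-zero weights:', np.count_nonzero(w_lasso))
```

Weights whose magnitude is below $\alpha$ are set exactly to zero, and the remaining ones are shrunk by $\alpha$.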

In [None]:
from sklearn.linear_model import Lasso
Lasso_reg = Lasso(alpha=1.).fit(X0_train, y_train)
y_pred_lasso = Lasso_reg.predict(X0_test)

print('OLS coefficients: ')
print(np.round(OLS_reg.coef_, 3))

print('Lasso coefficients: ')
print(np.round(Lasso_reg.coef_, 3))

r2_ols = r2_score(y_test, y_pred_OLS)
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f'R2 for OLS is: {r2_ols:.2f}')
print(f'R2 for Lasso is: {r2_lasso:.2f}')

The penalty on the L1 norm is used to promote the sparsity of the regression weights.

To choose a suitable value of $\alpha$ we can use cross-validation.
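
The idea behind cross-validation can be sketched with NumPy alone: split the training data into $k$ folds, fit on $k-1$ of them, measure the error on the held-out fold, and keep the $\alpha$ with the lowest average error. The helpers `lasso_cd` and `cv_mse` below are hypothetical names written for this illustration. They use a plain coordinate-descent solver without intercept on synthetic data, and they are not how `LassoCV` is implemented internally.

```python
import numpy as np

def soft_threshold(v, alpha):
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)||Xw - y||^2 + alpha*||w||_1 (no intercept)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with the contribution of feature j removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

def cv_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE over k folds for a given alpha."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = lasso_cd(X[train], y[train], alpha)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Synthetic data: only three of the six features are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
w_true = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 2.0])
y = X @ w_true + rng.normal(scale=1.0, size=200)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
scores = {a: cv_mse(X, y, a) for a in alphas}
best_alpha = min(scores, key=scores.get)

for a in alphas:
    print(f'alpha = {a:6.3f}  CV MSE = {scores[a]:.3f}')
print('best alpha:', best_alpha)
```

A very large $\alpha$ drives all weights to zero and gives a poor fit, while a very small one leaves the model essentially unregularized; cross-validation picks a value between these extremes based on held-out error.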

In [None]:
from sklearn.linear_model import LassoCV

LassoCV_reg = LassoCV(cv=5, random_state=42).fit(X0_train, y_train)
y_pred_lassoCV = LassoCV_reg.predict(X0_test)

plt.scatter(y_test, y_pred_lassoCV)
plt.plot(y_test, y_test, c='r', alpha=0.6, ls='--')
plt.xlim(y_test.min()-1, y_test.max()+1)
plt.ylim(y_test.min()-1, y_test.max()+1)
plt.xlabel('Observed target')
plt.ylabel('Predicted target')
plt.show()

print('OLS coefficients: ')
print(np.round(OLS_reg.coef_, 3))

print('LassoCV coefficients: ')
print(np.round(LassoCV_reg.coef_, 3))

r2_lassoCV = r2_score(y_test, y_pred_lassoCV)

print(f'R2 for OLS is: {r2_ols:.2f}')
print(f'R2 for LassoCV is: {r2_lassoCV:.2f}')

print(f'alpha = {LassoCV_reg.alpha_:.2f}')

## Ridge regression

In Ridge regression, the regularization is applied to the squared $l_2$ norm of the weights. Shrinking the magnitude of the weights makes the model less sensitive to noise.
The objective function of a Ridge regression problem is:

$\mathbf{w} = \underset{\mathbf{w}}{\mathrm{arg\,min}} \, ||\mathbf{X}\mathbf{w} - \mathbf{y}||^2_2 + \alpha ||\mathbf{w}||^2_2$
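
When the penalty is the squared $l_2$ norm (the form used by scikit-learn's `Ridge`), the problem has the closed-form solution $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. The sketch below uses a hypothetical helper `ridge_closed_form` on synthetic data, without intercept, to show how the norm of the weights shrinks as $\alpha$ grows.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha*I) w = X^T y (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

alphas = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in alphas]

for a, nrm in zip(alphas, norms):
    print(f'alpha = {a:6.1f}  ||w||_2 = {nrm:.3f}')
```

With $\alpha = 0$ the solution coincides with OLS. Unlike Lasso, Ridge shrinks all the weights smoothly and does not set them exactly to zero.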

In [None]:
from sklearn.linear_model import RidgeCV

RidgeCV_reg = RidgeCV(alphas=(0.1, 0.5, 1, 5, 10, 50), cv=5).fit(X0_train, y_train)
y_pred_RidgeCV = RidgeCV_reg.predict(X0_test)

plt.scatter(y_test, y_pred_RidgeCV)
plt.plot(y_test, y_test, c='r', alpha=0.6, ls='--')
plt.xlim(y_test.min()-1, y_test.max()+1)
plt.ylim(y_test.min()-1, y_test.max()+1)
plt.xlabel('Observed target')
plt.ylabel('Predicted target')
plt.show()

print('OLS coefficients: ')
print(np.round(OLS_reg.coef_, 3))

print('RidgeCV coefficients: ')
print(np.round(RidgeCV_reg.coef_, 3))

r2_ridgeCV = r2_score(y_test, y_pred_RidgeCV)

print(f'R2 for OLS is: {r2_ols:.2f}')
print(f'R2 for RidgeCV is: {r2_ridgeCV:.2f}')

print(f'alpha = {RidgeCV_reg.alpha_:.2f}')

## Principal components regression

Principal component regression (PCR) is the same as OLS regression, with an extra step: PCA is applied to the $\mathbf{X}$ matrix, and the linear regression is performed on the projected data (the PC scores).

In [None]:
from sklearn.decomposition import PCA

pca = PCA().fit(X0_train)

plt.scatter(np.arange(X.shape[1]), pca.explained_variance_ratio_*100)
plt.xlabel('PCs')
plt.ylabel('Explained variance (%)')
plt.show()

In [None]:
A_train = pca.components_.T
Z_train = X0_train @ A_train
Z_test = X0_test @ A_train

# The regression has to be applied to the PC scores
PCR_reg = LinearRegression().fit(Z_train, y_train)
y_pred_PCR = PCR_reg.predict(Z_test)

plt.scatter(y_test, y_pred_PCR)
plt.plot(y_test, y_test, c='r', alpha=0.6, ls='--')
plt.xlim(y_test.min()-1, y_test.max()+1)
plt.ylim(y_test.min()-1, y_test.max()+1)
plt.xlabel('Observed target')
plt.ylabel('Predicted target')
plt.show()

print('OLS coefficients: ')
print(np.round(OLS_reg.coef_, 3))

print('PCR coefficients: ')
print(np.round(PCR_reg.coef_, 3))

r2_pcr = r2_score(y_test, y_pred_PCR)

print(f'R2 for OLS is: {r2_ols:.2f}')
print(f'R2 for PCR is: {r2_pcr:.2f}')
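
When all the principal components are kept, PCR is only a rotation of the features, so its predictions should coincide with those of OLS. The PCR coefficients refer to the PCs rather than to the original features, which is why the two coefficient vectors printed above are not directly comparable. A minimal NumPy sketch on synthetic data (no intercept, names chosen for illustration) shows how the PCR weights can be mapped back to the feature space:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X = X - X.mean(axis=0)  # center the columns, as PCA does
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.3, size=60)

# Principal directions: right singular vectors of the centered data
_, _, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt.T   # columns are the principal directions
Z = X @ A  # PC scores

gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)  # regression on the scores
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # regression on the features

# PCR weights mapped back to the original feature space
w_pcr = A @ gamma

print(np.round(w_ols, 3))
print(np.round(w_pcr, 3))
```

The two printed vectors agree because `A` is orthogonal. The models start to differ only when some components are discarded.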

This added step has two benefits:

* The features become mutually uncorrelated.
* The dimensionality of the feature matrix can be reduced.
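
Both points can be illustrated with a small NumPy sketch. The synthetic data below is built, purely for illustration, from two strongly correlated features and one independent feature. After projection, the covariance matrix of the PC scores is diagonal, and the last component explains almost no variance, so it could be dropped.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two strongly correlated features plus one independent feature
base = rng.normal(size=500)
X = np.column_stack([base + 0.1 * rng.normal(size=500),
                     base + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # center and scale

# PCA through the SVD of the standardized data
_, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T  # PC scores

corr_X = np.corrcoef(X, rowvar=False)  # original features: correlated
cov_Z = np.cov(Z, rowvar=False)        # PC scores: diagonal covariance
explained = s ** 2 / np.sum(s ** 2)    # explained variance ratio

print(np.round(corr_X, 2))
print(np.round(cov_Z, 2))
print(np.round(explained, 3))
```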

In [None]:
# We can test the regression with fewer features
q = 5
Z_train = X0_train @ A_train[:,:q]
Z_test = X0_test @ A_train[:,:q]

PCR_reg = LinearRegression().fit(Z_train, y_train)
y_pred_PCR = PCR_reg.predict(Z_test)

plt.scatter(y_test, y_pred_PCR)
plt.plot(y_test, y_test, c='r', alpha=0.6, ls='--')
plt.xlim(y_test.min()-1, y_test.max()+1)
plt.ylim(y_test.min()-1, y_test.max()+1)
plt.xlabel('Observed target')
plt.ylabel('Predicted target')
plt.show()

r2_pcr = r2_score(y_test, y_pred_PCR)

print(f'R2 for OLS is: {r2_ols:.2f}')
print(f'R2 for PCR is: {r2_pcr:.2f}')