# Exercise on the use of LS, Lasso, Ridge and PCR regression

In this exercise we compare how different regression algorithms behave when applied to the same dataset.

The goal is to predict the prostate-specific antigen (PSA) level (the `lpsa` column) from several clinical measures that are potentially associated with PSA, in men who were about to receive a radical prostatectomy.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('cancer_data.txt', sep='\t', index_col=0) # The entire dataset
X = np.array(data.iloc[:,:-2]) # Feature matrix: every column except the last two (lpsa, train)
y = np.array(data.iloc[:,-2]) # Target vector: the lpsa column

print(data)

      lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45  \
1  -0.579818  2.769459   50 -1.386294    0 -1.386294        6      0   
2  -0.994252  3.319626   58 -1.386294    0 -1.386294        6      0   
3  -0.510826  2.691243   74 -1.386294    0 -1.386294        7     20   
4  -1.203973  3.282789   58 -1.386294    0 -1.386294        6      0   
5   0.751416  3.432373   62 -1.386294    0 -1.386294        6      0   
..       ...       ...  ...       ...  ...       ...      ...    ...   
93  2.830268  3.876396   68 -1.386294    1  1.321756        7     60   
94  3.821004  3.896909   44 -1.386294    1  2.169054        7     40   
95  2.907447  3.396185   52 -1.386294    1  2.463853        7     10   
96  2.882564  3.773910   68  1.558145    1  1.558145        7     80   
97  3.471966  3.974998   68  0.438255    1  2.904165        7     20   

        lpsa train  
1  -0.430783     T  
2  -0.162519     T  
3  -0.162519     T  
4  -0.162519     T  
5   0.371564     T  
..       

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# To do: 
# - Split the dataset into a train and test portion. The test portion has to be 20% of the total size.
# - Center the feature matrix and scale it to unit variance.

# Hints:
# - To split the dataset, use the function train_test_split (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
# - To center and scale, use the function StandardScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
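
One possible way to approach the two steps above is sketched here. It is only an illustration: the arrays are synthetic stand-ins with the same shape as the notebook's `X` and `y` (97 samples, 8 features), and the `random_state` value and the names `X_train_s` and `X_test_s` are arbitrary choices. The scaler is fitted on the training portion only, so no information from the test portion leaks into the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (assumption: 97 samples, 8 features)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(97, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=97)

# Hold out 20% of the samples for testing; random_state is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training portion only, then apply it to both portions
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
print(X_train_s.mean(axis=0).round(6))  # column means of the scaled training set
print(X_train_s.std(axis=0).round(6))   # column standard deviations
```

With the real data, the first three lines after the imports would be dropped and the `X` and `y` defined in the first cell used instead.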


## Least-Squares regression

The least-squares solution of the problem $Ax = b$ is the $x$ that minimizes the squared norm of the residual:

$\hat{x} = \underset{x}{\mathrm{arg\,min}} \; ||Ax - b||^2_2$

In [3]:
from sklearn.linear_model import LinearRegression

# To do: 
# - use the function LinearRegression to fit the regression model to the data.
# - Plot the predicted values against the observed ones (parity plot).

# Hint:
# - The function documentation is in: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
# - A simple user guide on the different linear regression models can be found here: https://scikit-learn.org/stable/modules/linear_model.html
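
A minimal sketch of the fit and the parity plot follows, again on synthetic stand-in data rather than the prostate dataset. The helper `parity_plot` is a hypothetical name introduced here for illustration; it is not part of scikit-learn or matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split and scaled as in the previous step
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=97)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Ordinary least squares: fit on the training set, evaluate on the test set
ls = LinearRegression().fit(X_train_s, y_train)
y_pred = ls.predict(X_test_s)
r2_ls = ls.score(X_test_s, y_test)  # coefficient of determination R2


def parity_plot(y_true, y_hat, title):
    """Scatter predicted against observed values, with the y = x reference line."""
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_hat)
    lims = [min(y_true.min(), y_hat.min()), max(y_true.max(), y_hat.max())]
    ax.plot(lims, lims, "k--")
    ax.set_xlabel("Observed")
    ax.set_ylabel("Predicted")
    ax.set_title(title)
    return ax


ax = parity_plot(y_test, y_pred, f"LS regression, R2 = {r2_ls:.3f}")
print("weights:", ls.coef_.round(3))
print("intercept:", round(float(ls.intercept_), 3), "R2:", round(r2_ls, 3))
```

Points close to the dashed diagonal are well predicted. The same helper can be reused for the Lasso, Ridge and PCR models in the later exercises.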


## Lasso regression

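The Lasso adds an L1 penalty on the weights to the least-squares objective:

$\hat{x} = \underset{x}{\mathrm{arg\,min}} \; \frac{1}{2 n_s} ||Ax - b||^2_2 + \alpha ||x||_1$

where $n_s$ is the number of samples and $\alpha \geq 0$ sets the strength of the penalty.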

In [4]:
from sklearn.linear_model import Lasso

# To do: 
# - use the function Lasso to fit the regression model to the data. Use your choice for the alpha parameter.
# - Plot the predicted values against the observed ones (parity plot).
# - Compare the Lasso weights and the R2 to the LS regression.

# Hint:
# - The function documentation is in: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
# - Use the fitted model's methods and attributes to obtain the weights and R2.

The L1 penalty promotes sparsity in the regression weights: as $\alpha$ grows, more weights are driven exactly to zero.
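
The effect can be seen in a small experiment. The sketch below uses synthetic data in which only three of the eight true weights are nonzero (an assumption made purely for illustration), and the `alpha` values are arbitrary. R2 is computed on the training data here, only to show the trend.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with a sparse true weight vector (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
true_w = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=97)
X_s = StandardScaler().fit_transform(X)

# Reference: plain least squares keeps all eight weights nonzero in general
ls = LinearRegression().fit(X_s, y)
print("LS weights:", ls.coef_.round(3))

# Lasso for increasing alpha: count how many weights remain nonzero
nonzero_counts = {}
for alpha in (0.01, 0.1, 0.5, 5.0):
    lasso = Lasso(alpha=alpha).fit(X_s, y)
    nonzero_counts[alpha] = int(np.count_nonzero(lasso.coef_))
    print(
        f"alpha={alpha}: weights={lasso.coef_.round(3)}, "
        f"nonzero={nonzero_counts[alpha]}, "
        f"training R2={lasso.score(X_s, y):.3f}"
    )
```

Larger penalties trade goodness of fit for a smaller set of active features, which is why the value of `alpha` has to be chosen with some care.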

To choose a suitable value of $\alpha$, we can use cross-validation.

In [5]:
from sklearn.linear_model import LassoCV

# To do: 
# - use the function LassoCV to fit the regression model to the data.
# - Compare the alpha selected by LassoCV to the one you chose in the previous exercise.

# Hint:
# - The function documentation is in: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
# - Use the fitted model's attribute to obtain the optimal alpha.
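
A sketch of how this could look, on synthetic stand-in data: `cv=5` and `random_state=0` are arbitrary choices, and the grid of candidate alphas is left to the estimator's defaults.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data with a sparse true weight vector (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
true_w = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=97)
X_s = StandardScaler().fit_transform(X)

# 5-fold cross-validation over an automatically generated grid of alphas
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_s, y)

print("alpha selected by CV:", lasso_cv.alpha_)
print("number of alphas tried:", len(lasso_cv.alphas_))
print("weights at the selected alpha:", lasso_cv.coef_.round(3))
```

After fitting, `alpha_` holds the selected penalty, `alphas_` the grid that was searched, and `coef_` the weights of the model refitted with the selected penalty.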

## Ridge regression

Ridge regression adds a squared L2 penalty on the weights to the least-squares objective:

$\hat{x} = \underset{x}{\mathrm{arg\,min}} \; ||Ax - b||^2_2 + \alpha ||x||^2_2$

In [6]:
from sklearn.linear_model import RidgeCV

# To do: 
# - use the function RidgeCV to fit the regression model to the data.
# - Plot the predicted values against the observed ones (parity plot).
# - Compare the RidgeCV weights and the R2 to the LS regression.

# Hint:
# - The function documentation is in: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
# - Use the fitted model's methods and attributes to obtain the weights and R2.
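
A minimal sketch, assuming synthetic stand-in data and an arbitrary logarithmic grid of candidate penalties. Unlike the Lasso, Ridge shrinks the weights toward zero without setting them exactly to zero, so the comparison with LS is made on the norm of the weight vector.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split and scaled as in the earlier steps
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=97)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Candidate penalties: an arbitrary logarithmic grid
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas).fit(X_train_s, y_train)
ls = LinearRegression().fit(X_train_s, y_train)

print("alpha selected by CV:", ridge_cv.alpha_)
print("Ridge weights:", ridge_cv.coef_.round(3))
print("LS weights:   ", ls.coef_.round(3))
print("weight norms (Ridge, LS):",
      np.linalg.norm(ridge_cv.coef_), np.linalg.norm(ls.coef_))
print("test R2 (Ridge, LS):",
      ridge_cv.score(X_test_s, y_test), ls.score(X_test_s, y_test))
```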


## Principal components regression

Principal components regression (PCR) is the same as LS regression with one extra step: PCA is first applied to the X matrix, and the linear regression is then performed on the projected data (the PC scores).

In [7]:
from sklearn.decomposition import PCA

# To do: 
# - Use the function PCA to transform the feature matrix.
# - Plot the explained variance ratio.

# Hint:
# - The function documentation is in: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
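
A sketch of the PCA step on synthetic data follows. The features are built from three latent factors plus noise (an assumption, chosen so that the first few components carry most of the variance). The ratios are printed rather than plotted; the `ratio` and `cumulative` arrays are what one would pass to a bar or line plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Correlated synthetic features: 3 latent factors mixed into 8 columns (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(97, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.3 * rng.normal(size=(97, 8))
X_s = StandardScaler().fit_transform(X)

# Keep all components so that the ratios sum to one
pca = PCA().fit(X_s)
Z = pca.transform(X_s)  # PC scores

ratio = pca.explained_variance_ratio_
cumulative = np.cumsum(ratio)
for k, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{k}: explained variance ratio = {r:.3f}, cumulative = {c:.3f}")
```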


In [8]:
# To do: 
# - Calculate the PC matrix (A) and the PC score matrix (Z).
# - Apply the linear regression to Z and y.
# - Plot the predicted values against the observed ones (parity plot).
# - Compare the PCR weights and the R2 to the LS regression.
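
One way to read the steps above is sketched below, under the assumption that A holds the principal directions as columns (the transpose of scikit-learn's `components_`) and Z is the projection of the centered features onto them. The data are synthetic stand-ins. To compare with LS, the weights fitted on Z are mapped back to the original feature space as `A @ gamma`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split and scaled as in the earlier steps
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=97)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# PC matrix A (one principal direction per column) and score matrices Z
pca = PCA().fit(X_train_s)
A = pca.components_.T                # shape (n_features, n_components)
Z_train = pca.transform(X_train_s)   # same as (X_train_s - pca.mean_) @ A
Z_test = pca.transform(X_test_s)

# Linear regression on the scores, then map the weights back to feature space
pcr = LinearRegression().fit(Z_train, y_train)
gamma = pcr.coef_
w_pcr = A @ gamma

ls = LinearRegression().fit(X_train_s, y_train)
print("PCR weights (feature space):", w_pcr.round(3))
print("LS weights:                 ", ls.coef_.round(3))
print("test R2 (PCR, LS):", pcr.score(Z_test, y_test), ls.score(X_test_s, y_test))
```

When every component is kept, the projection is only a rotation of the feature space, so PCR reproduces the LS weights and R2. Differences appear once some components are dropped.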


This added step has two benefits:

* The transformed features (the PC scores) are mutually uncorrelated.
* The dimensionality of the feature matrix can be reduced.
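
Both points can be checked numerically. The sketch below uses deliberately correlated synthetic features (an assumption) and keeps three components, an arbitrary choice: the covariance matrix of the scores comes out diagonal, and the score matrix has fewer columns than the feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Correlated synthetic features: 3 latent factors mixed into 8 columns (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(97, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.3 * rng.normal(size=(97, 8))
X_s = StandardScaler().fit_transform(X)

# The original features are correlated with each other
corr_X = np.corrcoef(X_s, rowvar=False)
print("largest |correlation| between two features:",
      np.abs(corr_X - np.eye(8)).max())

# Keep only the first 3 components (arbitrary choice)
pca = PCA(n_components=3).fit(X_s)
Z = pca.transform(X_s)

# Covariance of the scores: off-diagonal entries vanish up to rounding error
cov_Z = np.cov(Z, rowvar=False)
off_diag = cov_Z - np.diag(np.diag(cov_Z))
print("feature matrix shape:", X_s.shape, "score matrix shape:", Z.shape)
print("largest |off-diagonal covariance| of the scores:", np.abs(off_diag).max())
```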

In [9]:
# To do: 
# - Modify the number of PC scores used in the regression and track the resulting R2 score.
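
A sketch of such a sweep, assuming synthetic data and using a scikit-learn pipeline so that scaling, PCA and regression are refitted together for each number of components. The names `train_r2` and `test_r2` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Correlated synthetic features with a target driven by the latent factors (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(97, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.3 * rng.normal(size=(97, 8))
y = latent @ np.array([1.5, -1.0, 0.5]) + rng.normal(scale=0.5, size=97)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Refit the whole pipeline for each number of retained components
train_r2, test_r2 = [], []
for k in range(1, X.shape[1] + 1):
    model = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    model.fit(X_train, y_train)
    train_r2.append(model.score(X_train, y_train))
    test_r2.append(model.score(X_test, y_test))
    print(f"{k} components: train R2 = {train_r2[-1]:.3f}, test R2 = {test_r2[-1]:.3f}")
```

The training R2 cannot decrease as components are added, because each model contains the previous one. The test R2 carries no such guarantee, which makes it the more informative curve for choosing the number of components.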
