# CS342 Machine Learning
# Lab 4: Linear regression

## Department of Computer Science, University of Warwick

This lab focuses on the use of regularization for linear regression.

# Data files for the lab

If working on one of the DCS machines, the data may be found here:

```/modules/cs342/2020/lab4/data/prostate_data.csv ```

You may load the data directly from that directory.

If you are using your own machine, copy the data across by running the following command in a terminal window, connecting to the remote node that corresponds to your username. The node name uses the last two digits of your username in the form remote-nn; for example, if your username is u1234567, you would connect to remote-67.dcs.warwick.ac.uk. Remember to substitute your own USERNAME and the corresponding REMOTE_NN:

```scp USERNAME@REMOTE_NN.dcs.warwick.ac.uk:/modules/cs342/2020/lab4/data/prostate_data.csv .```

After entering your DCS password, this will copy the data to your current working directory. You should now have the following file:
```
├──[your working directory]
   └── prostate_data.csv
```
**Please make sure to use the correct path to these files when working on your own machine. The scripts below assume you are working on the DCS machines. Remember that the `.ipynb` notebook (this file) should be in your working directory.**

The prostate dataset (file *prostate_data.csv*; see https://web.stanford.edu/~hastie/ElemStatLearn//datasets/prostate.data) will be used to predict the numerical target variable *lpsa* from 8 features (*lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45*). There are 97 observations in total. The last column, *train*, is a flag that separates the observations into two subsets. The *train = T* subset will be used for model fitting and cross-validation (CV), while the *train = F* subset will be used for testing after model selection and training.

Import the data into a Pandas data frame and standardize the features. 

In [1]:
import pandas as pd

#import prostate dataset
prostate = pd.read_csv("./prostate_data.csv")

#standardise features
features = prostate.drop(["lpsa", "train"], axis=1)
standardised = (features - features.mean()) / features.std()

print(standardised)

      lcavol   lweight       age      lbph       svi       lcp   gleason  \
0  -1.637356 -2.006212 -1.862426 -1.024706 -0.522941 -0.863171 -1.042157   
1  -1.988980 -0.722009 -0.787896 -1.024706 -0.522941 -0.863171 -1.042157   
2  -1.578819 -2.188784  1.361163 -1.024706 -0.522941 -0.863171  0.342627   
3  -2.166917 -0.807994 -0.787896 -1.024706 -0.522941 -0.863171 -1.042157   
4  -0.507874 -0.458834 -0.250631 -1.024706 -0.522941 -0.863171 -1.042157   
..       ...       ...       ...       ...       ...       ...       ...   
92  1.255920  0.577607  0.555266 -1.024706  1.892548  1.073572  0.342627   
93  2.096506  0.625489 -2.668323 -1.024706  1.892548  1.679542  0.342627   
94  1.321402 -0.543304 -1.593794 -1.024706  1.892548  1.890377  0.342627   
95  1.300290  0.338384  0.555266  1.004813  1.892548  1.242632  0.342627   
96  1.800367  0.807764  0.555266  0.232904  1.892548  2.205279  0.342627   

       pgg45  
0  -0.864467  
1  -0.864467  
2  -0.155348  
3  -0.864467  
4  -0.864467
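The cell above standardises each feature with the mean and standard deviation computed over all 97 rows. A common variation is to estimate those statistics on the *train = T* rows only and reuse them for the *train = F* rows, so that no information from the test subset enters the preprocessing. The sketch below illustrates the idea on a small made-up frame; `toy` and its values are hypothetical stand-ins, not the prostate data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for prostate_data.csv: two features, a target, a T/F flag
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lcavol": rng.normal(1.3, 1.2, size=12),
    "age": rng.normal(64.0, 7.0, size=12),
    "lpsa": rng.normal(2.5, 1.0, size=12),
    "train": ["T"] * 8 + ["F"] * 4,
})

toy_features = toy.drop(["lpsa", "train"], axis=1)
is_train = toy["train"] == "T"

# Estimate the mean and standard deviation on the training rows only ...
train_mean = toy_features[is_train].mean()
train_std = toy_features[is_train].std()

# ... and apply the same transformation to every row (training and test)
toy_standardised = (toy_features - train_mean) / train_std

# On the training rows each feature now has mean ~0 and standard deviation 1
print(toy_standardised[is_train].mean())
print(toy_standardised[is_train].std())
```

The test rows will generally not have exactly zero mean or unit variance after this transformation, which is expected.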

### Non-regularized linear regression 

Scikit-learn provides a wide range of linear models: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. The module includes every implementation needed for this lab, including estimators that use cross-validation to select the hyper-parameters for regularization. 

Fit a non-regularized linear regression model to the *train = T* subset. Use your fitted model to predict the target variable in the *train = F* subset. 

In [2]:
from sklearn.linear_model import LinearRegression

#Fit and test a non-regularized linear regression model

X = standardised[prostate["train"] == "T"]
y = prostate[prostate["train"] == "T"]["lpsa"]

reg = LinearRegression()
reg.fit(X, y)

X_test = standardised[prostate["train"] == "F"]
y_test = prostate[prostate["train"] == "F"]["lpsa"]

reg.predict(X_test)

array([1.96903844, 1.16995577, 1.26117929, 1.88375914, 2.54431886,
       1.93275402, 2.04233571, 1.83091625, 1.99115929, 1.32347076,
       2.93843111, 2.20314404, 2.166421  , 2.79456237, 2.67466879,
       2.18057291, 2.40211068, 3.02351576, 3.21122283, 1.38441459,
       3.41751878, 3.70741749, 2.54118337, 2.72969658, 2.64055575,
       3.48060024, 3.17136269, 3.2923494 , 3.11889686, 3.76383999])
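The predictions on their own do not say how well the model performs. To compare models later in the lab, it helps to summarise the test predictions with a single number such as the mean squared error (MSE). The following is a minimal sketch on synthetic data; the `X_demo`/`y_demo` arrays are made up for illustration. In the lab, the analogous call would presumably take `y_test` and `reg.predict(X_test)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression problem (illustrative only, not the prostate data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
true_w = np.array([1.5, 0.0, -2.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=60)

# First 40 rows play the role of train = T, the last 20 of train = F
X_demo_train, X_demo_test = X_demo[:40], X_demo[40:]
y_demo_train, y_demo_test = y_demo[:40], y_demo[40:]

demo_reg = LinearRegression().fit(X_demo_train, y_demo_train)
demo_pred = demo_reg.predict(X_demo_test)

# Mean squared error on the held-out rows (lower is better)
test_mse = mean_squared_error(y_demo_test, demo_pred)
print("Test MSE:", test_mse)

# score() reports the coefficient of determination R^2 (higher is better)
print("Test R^2:", demo_reg.score(X_demo_test, y_demo_test))
```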

### L2-regularized linear regression (ridge regression)

Use 3-fold CV to tune the regularization hyper-parameter, then fit an L2-regularized linear regression model to the *train = T* subset. Use the model fitted with the best hyper-parameter to predict the target variable in the *train = F* subset. 

In [3]:
import math
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Fit and test an L2-regularized linear regression model


def error(predicted: np.ndarray, true: pd.Series) -> float:
    # Root of the summed squared errors over one fold (no division by the fold size)
    errors = (predicted - true) ** 2
    return math.sqrt(sum(errors))


def ridge(alpha: float, X: pd.DataFrame, y: pd.Series, test: pd.DataFrame) -> np.ndarray:
    # Fit ridge regression with the given alpha and predict on the held-out rows
    clf = Ridge(alpha=alpha)
    clf.fit(X, y)
    return clf.predict(test)


def threefold(alpha: float, X: pd.DataFrame, y: pd.Series) -> float:
    # Average validation error over 3 consecutive (unshuffled) folds
    kf = KFold(n_splits=3, random_state=None)
    errors = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        prediction = ridge(alpha, X_train, y_train, X_test)
        errors.append(error(prediction, y_test))
    return sum(errors) / len(errors)


minimum = np.inf
param = None
for alpha in range(50):
    value = threefold(alpha, X, y)
    print(f"alpha = {alpha}:", value)
    if value < minimum:
        minimum = value
        param = alpha

print("Best Error:", minimum)
print("Best Hyper-Parameter:", param)


alpha = 0: 6.99484229135647
alpha = 1: 6.910620619160588
alpha = 2: 6.847810511130106
alpha = 3: 6.799303327633505
alpha = 4: 6.760904753682314
alpha = 5: 6.729940094625884
alpha = 6: 6.704607553167296
alpha = 7: 6.6836425963121835
alpha = 8: 6.666128958415913
alpha = 9: 6.651385437101944
alpha = 10: 6.638894725825529
alpha = 11: 6.628256894306021
alpha = 12: 6.619157986566301
alpha = 13: 6.611348243002422
alpha = 14: 6.604626644299121
alpha = 15: 6.5988297207843045
alpha = 16: 6.593823307234456
alpha = 17: 6.589496373309037
alpha = 18: 6.585756343106307
alpha = 19: 6.582525500251362
alpha = 20: 6.57973819572553
alpha = 21: 6.5773386570446135
alpha = 22: 6.575279253243868
alpha = 23: 6.573519109079748
alpha = 24: 6.572022989431187
alpha = 25: 6.570760394667133
alpha = 26: 6.56970482211949
alpha = 27: 6.568833159362072
alpha = 28: 6.568125182839343
alpha = 29: 6.567563141270651
alpha = 30: 6.5671314077072465
alpha = 31: 6.5668161875169515
alpha = 32: 6.566605272185548
alpha = 33: 6.5664
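The manual search above selects a hyper-parameter but stops before refitting on the full training subset and predicting on the test subset. As noted earlier, scikit-learn also ships estimators that run the cross-validation internally. The sketch below uses `RidgeCV` on synthetic data to show the whole select, refit and predict sequence. The data, the `alphas` grid and the MSE scoring are illustrative assumptions, so the selected value need not match the one found by the loop above, which averages a root-sum-of-squares error instead.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data (illustrative only, not the prostate data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=90)
X_demo_train, X_demo_test = X_demo[:60], X_demo[60:]
y_demo_train = y_demo[:60]

# Candidate regularisation strengths on a log scale (assumed grid, all positive)
alphas = np.logspace(-3, 2, 30)

# cv=3 requests 3-fold cross-validation; after the search the estimator
# is refitted on all training rows with the selected alpha
ridge_cv = RidgeCV(alphas=alphas, cv=3, scoring="neg_mean_squared_error")
ridge_cv.fit(X_demo_train, y_demo_train)

print("Selected alpha:", ridge_cv.alpha_)

# Predict the held-out rows with the refitted model
ridge_pred = ridge_cv.predict(X_demo_test)
print(ridge_pred[:5])
```

A logarithmic grid is often preferred over integer steps because the effect of the regularisation strength tends to vary across orders of magnitude.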

### L1-regularized linear regression (lasso regression)

Use 3-fold CV to tune the regularization hyper-parameter, then fit an L1-regularized linear regression model to the *train = T* subset. Use the model fitted with the best hyper-parameter to predict the target variable in the *train = F* subset. 

In [4]:
#Fit and test an L1-regularized linear regression model
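One possible starting point for this cell is `LassoCV`, which builds its own grid of regularisation strengths and selects among them by cross-validation. The sketch below runs on synthetic data rather than the prostate data; the arrays and the true weights are made up for illustration, and the default alpha grid is an assumption that may not suit every dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (illustrative only): two of the five true weights are zero
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=90)
X_demo_train, X_demo_test = X_demo[:60], X_demo[60:]
y_demo_train, y_demo_test = y_demo[:60], y_demo[60:]

# 3-fold CV over an automatically generated alpha grid, then a refit
# on all training rows with the selected alpha
lasso_cv = LassoCV(cv=3)
lasso_cv.fit(X_demo_train, y_demo_train)
print("Selected alpha:", lasso_cv.alpha_)

# Evaluate on the held-out rows
lasso_pred = lasso_cv.predict(X_demo_test)
lasso_mse = np.mean((lasso_pred - y_demo_test) ** 2)
print("Test MSE:", lasso_mse)
```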


1. Which model performs best on this dataset?
2. Which features are irrelevant for this task? **Hint:** display the learned coefficients (weights) for each model and recall which type of regularization allows for feature selection.
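Following the hint, one way to inspect the learned weights is to collect each model's `coef_` into a single data frame indexed by feature name. The sketch below does this on synthetic data where only two of four made-up features (`f1`, `f2`) influence the target; the feature names and the fixed `alpha` values are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (illustrative only): f3 and f4 have no effect on the target
rng = np.random.default_rng(0)
names = ["f1", "f2", "f3", "f4"]
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=names)
y_demo = 3.0 * X_demo["f1"] - 2.0 * X_demo["f2"] + rng.normal(scale=0.1, size=200)

# Fixed, untuned alphas purely for the comparison
models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.3),
}

# One column of coefficients per model, one row per feature
coefs = pd.DataFrame(
    {label: model.fit(X_demo, y_demo).coef_ for label, model in models.items()},
    index=names,
)
print(coefs.round(3))
```

In this toy setting the lasso column sets the weights of `f3` and `f4` to exactly zero, whereas ridge only shrinks them towards zero.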