# Topic IV: Shrinkage and Variable Selection

**Information:**  
We use the book G. James et al., *An Introduction to Statistical Learning (with Applications in Python)*. A free copy is available [here](https://www.statlearning.com/).

In this exercise, we will predict the number of applications received using the other variables in the `College` data set.

## Import modules, packages and libraries

First, we import some useful modules, packages and libraries. These are needed for carrying out the computations and for plotting the results.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

# scikit-learn specifics
# We use scikit-learn (sklearn) to fit the ridge regression and lasso models.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

## Load the `College` data set

In [2]:
college = pd.read_csv('College10.csv', index_col = 0)

# Display information about the data set
college.info()

# Return summary statistics for each column
college.describe()

# Return first five rows of the data set
college.head()

<class 'pandas.core.frame.DataFrame'>
Index: 78 entries, Abilene Christian University to Wofford College
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Private      78 non-null     object 
 1   Apps         78 non-null     int64  
 2   Accept       78 non-null     int64  
 3   Enroll       78 non-null     int64  
 4   Top10perc    78 non-null     int64  
 5   Top25perc    78 non-null     int64  
 6   F.Undergrad  78 non-null     int64  
 7   P.Undergrad  78 non-null     int64  
 8   Outstate     78 non-null     int64  
 9   Room.Board   78 non-null     int64  
 10  Books        78 non-null     int64  
 11  Personal     78 non-null     int64  
 12  PhD          78 non-null     int64  
 13  Terminal     78 non-null     int64  
 14  S.F.Ratio    78 non-null     float64
 15  perc.alumni  78 non-null     int64  
 16  Expend       78 non-null     int64  
 17  Grad.Rate    78 non-null     int64  
dtypes: float64(1), in

| | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| Alfred University | Yes | 1732 | 1425 | 472 | 37 | 75 | 1830 | 110 | 16548 | 5406 | 500 | 600 | 82 | 88 | 11.3 | 31 | 10932 | 73 |
| Antioch University | Yes | 713 | 661 | 252 | 25 | 44 | 712 | 23 | 15476 | 3336 | 400 | 1100 | 69 | 82 | 11.3 | 35 | 42926 | 48 |
| Augustana College | Yes | 761 | 725 | 306 | 21 | 58 | 1337 | 300 | 10990 | 3244 | 600 | 1021 | 66 | 70 | 10.4 | 30 | 6871 | 69 |
| Beaver College | Yes | 1163 | 850 | 348 | 23 | 56 | 878 | 519 | 12850 | 5400 | 400 | 800 | 78 | 89 | 12.2 | 30 | 8954 | 73 |


In [3]:
### PREPROCESSING HERE
# Convert the "Private" column into a binary variable (1 = Yes, 0 = No)
college['Private'] = college['Private'].apply(lambda x: 1 if x == 'Yes' else 0)


**(a) Normalize the data and split it into a training set and a test set.**

In [4]:
### YOUR CODE HERE
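# Standardize the predictors (zero mean, unit variance); 'Apps' is the response
# Note: the scaler is fitted on all rows before the split, so the test rows
# influence the scaling statistics. Fitting it on the training rows only
# would avoid this leakage.
scaler = StandardScaler()
X = college.drop('Apps', axis = 1)
X = scaler.fit_transform(X)
y = college['Apps']

# Split the data: 80% training, 20% test, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)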
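
The cell above standardizes all rows before splitting. As a minimal sketch of the alternative, the block below splits first and fits the scaler on the training rows only. It runs on synthetic stand-in data, and the names `X_demo`, `y_demo` and `scaler_demo` are hypothetical, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and response (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=100)

# Split first, so the test rows play no part in estimating the scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Means and standard deviations come from the training rows only.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_std = scaler_demo.transform(X_tr)
# The test rows reuse the training statistics.
X_te_std = scaler_demo.transform(X_te)
```

With this ordering the training columns have exactly zero mean and unit variance, while the test columns are only approximately standardized, which is the intended behaviour.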


**(b) Fit a linear model using least squares on the training set, and report the test error obtained.**

In [5]:
### YOUR CODE HERE
# Linear regression fitted by ordinary least squares (OLS)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
print('OLS MSE:', mean_squared_error(y_test, y_pred))


OLS MSE: 2916275.3351350143


**(c) Fit a ridge regression model on the training set, with $ \lambda $ chosen by cross-validation. Report the test error obtained.**

In [6]:
### YOUR CODE HERE
# Ridge regression with lambda (called `alpha` in sklearn) chosen by cross-validation
ridge = Ridge()
# Candidate penalties: 100 values spaced logarithmically between 1e-5 and 1e5
parameters = {'alpha': np.logspace(-5, 5, 100)}
# GridSearchCV fits the model for every candidate alpha and keeps the one with the
# best 5-fold cross-validation score (by default the estimator's own score, R^2 for Ridge).
ridge_cv = GridSearchCV(ridge, parameters, cv = 5)
ridge_cv.fit(X_train, y_train)
y_pred = ridge_cv.predict(X_test)
print('Ridge MSE:', mean_squared_error(y_test, y_pred))


Ridge MSE: 1123150.419091357
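
The cell above reports only the test error, not which penalty the search chose. A minimal sketch of how the chosen value could be inspected follows. It uses synthetic data from `make_regression` rather than the `College` data, so the printed values say nothing about this exercise. Passing `scoring='neg_mean_squared_error'` is an assumption made here so that the search optimizes MSE directly; the notebook cell relies on the default score instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (hypothetical stand-in data).
X_demo, y_demo = make_regression(
    n_samples=80, n_features=10, noise=10.0, random_state=0
)

alphas = np.logspace(-5, 5, 100)
search = GridSearchCV(
    Ridge(), {'alpha': alphas}, cv=5, scoring='neg_mean_squared_error'
)
search.fit(X_demo, y_demo)

# best_params_ holds the selected penalty; best_score_ is the mean CV score.
best_alpha = search.best_params_['alpha']
cv_mse = -search.best_score_  # undo the sign flip of the "neg" scorer
print('Selected alpha:', best_alpha)
print('Cross-validated MSE at that alpha:', cv_mse)
```

In the notebook, the same attribute would be read as `ridge_cv.best_params_`.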


**(d) Fit a lasso model on the training set, with $ \lambda $ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.**

In [8]:
### YOUR CODE HERE
# Lasso regression with lambda (called `alpha` in sklearn) chosen by cross-validation
lasso = Lasso()
# Candidate penalties: 100 values spaced logarithmically between 1e-5 and 1e5
parameters = {'alpha': np.logspace(-5, 5, 100)}
# GridSearchCV fits the model for every candidate alpha and keeps the one with the
# best 5-fold cross-validation score (by default the estimator's own score, R^2 for Lasso).
lasso_cv = GridSearchCV(lasso, parameters, cv = 5)
lasso_cv.fit(X_train, y_train)
y_pred = lasso_cv.predict(X_test)
print('Lasso MSE:', mean_squared_error(y_test, y_pred))

Lasso MSE: 1725606.852975984
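
Part (d) also asks for the number of non-zero coefficient estimates, which the cell above does not print. A minimal sketch of one way to count them is given below, again on synthetic data, so the resulting count is illustrative only. The grid and `max_iter=10000` are assumptions chosen to keep the example quick and to help the solver converge.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data in which only 3 of the 10 predictors carry signal (hypothetical).
X_demo, y_demo = make_regression(
    n_samples=80, n_features=10, n_informative=3, noise=5.0, random_state=0
)

search = GridSearchCV(
    Lasso(max_iter=10000), {'alpha': np.logspace(-3, 2, 30)}, cv=5
)
search.fit(X_demo, y_demo)

# The refitted best model exposes its coefficients through best_estimator_.
coef = search.best_estimator_.coef_
n_nonzero = int(np.count_nonzero(coef))
print('Non-zero coefficients:', n_nonzero, 'of', coef.size)
```

Applied to the notebook, the count would come from `np.count_nonzero(lasso_cv.best_estimator_.coef_)`.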


**(g) Comment on the results obtained. How accurately can we predict the number of college applications received?**

In [9]:
### YOUR CODE HERE
# R^2 on the test set: 1 - (MSE of the model / MSE of a baseline that always predicts the mean of y_test)
print('OLS R^2:', lm.score(X_test, y_test))
print('Ridge R^2:', ridge_cv.score(X_test, y_test))
print('Lasso R^2:', lasso_cv.score(X_test, y_test))


OLS R^2: 0.5856596801300409
Ridge R^2: 0.8404243596954996
Lasso R^2: 0.7548281923802778
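
The relation between R^2 and MSE used above can be checked on a small example. The sketch below uses four made-up values (`y_true` and `y_hat` are hypothetical) and compares the manual formula with `r2_score`. It also converts the test MSEs reported in parts (b) to (d) into root mean squared errors, which are in units of applications and therefore easier to interpret for part (g).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (hypothetical values).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# Baseline: always predict the mean of the observed targets.
mse_model = mean_squared_error(y_true, y_hat)
mse_baseline = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))
r2_manual = 1 - mse_model / mse_baseline
print('Manual R^2:', r2_manual, '| r2_score:', r2_score(y_true, y_hat))

# Test MSEs as printed in parts (b), (c) and (d) of this notebook.
reported_mse = {
    'OLS': 2916275.3351350143,
    'Ridge': 1123150.419091357,
    'Lasso': 1725606.852975984,
}
# RMSE is the square root of MSE, expressed in number of applications.
rmse = {name: float(np.sqrt(mse)) for name, mse in reported_mse.items()}
for name, value in rmse.items():
    print(name, 'RMSE:', round(value, 1))
```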


\### YOUR COMMENTS HERE