# 2-27: Intro to Model Selection Techniques


---
<img src = "http://izquotes.com/quotes-pictures/quote-essentially-all-models-are-wrong-but-some-are-useful-george-e-p-box-212711.jpg" style = "width: 500px; height: 275px;" />

**Model Selection** is a key step in the machine learning process. Generally, you will have several candidate models that you fit to your data, and must select the one that you will use on out-of-sample data. These techniques help you determine which model is the "best."

*Estimated Time: 30 minutes*

---

**Dependencies:**

In [1]:
import numpy as np
import scipy
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

## The Data: Bike Sharing

We'll be using the same bike sharing data as last week!

In [13]:
bike = pd.read_csv('data/Bike-Sharing-Dataset/day.csv')

# reformat the date column to integers representing the day of the year, 1-366
bike['dteday'] = pd.to_datetime(bike['dteday']).dt.strftime('%j').astype(int)

# drop the row with index label 0 (this removes the first observation, not a column)
bike = bike.drop(0)

bike.head(4)


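| Unnamed: 0 | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 3 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 4 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 5 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |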


## 1. Test-Train-Validation Split

Recall that we typically split our data into training, validation, and test sets so that we can develop and tune our machine learning models without touching the data used for the final evaluation. Below we reproduce the code from the last lab:

In [14]:
# the features used to predict riders
X = bike.drop(columns=['casual', 'registered', 'cnt'])

# the number of riders
y = bike['cnt']

# set the random seed
np.random.seed(10)

# first split: hold out 20% of the data as the test set
# train_test_split returns 4 values: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80, test_size=0.20)

# second split: hold out 25% of the remaining data as the validation set
# returns 4 values: X_train, X_validate, y_train, y_validate
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train,
                                                    train_size=0.75, test_size=0.25)
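
The two-stage split above leaves 60% of the rows for training (0.80 × 0.75), 20% for validation, and 20% for testing. The following is a minimal sketch that checks those proportions on stand-in data. The arrays `X_demo` and `y_demo` are hypothetical placeholders with 730 rows (the size of the bike table after the row drop), and `random_state=10` is an assumption used here instead of the global seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 730 rows (not the real bike data)
X_demo = np.arange(730 * 2).reshape(730, 2)
y_demo = np.arange(730)

# First split: 80% train+validation, 20% test
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, test_size=0.20, random_state=10)

# Second split: 75% of the remainder for training, 25% for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, train_size=0.75, test_size=0.25, random_state=10)

# Sizes of the (train, validation, test) sets
split_sizes = (len(X_tr), len(X_val), len(X_te))
print(split_sizes)
```

The training size should agree with the row count that `len(X_train.index)` reports for the real data.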

## 2. Specify Regression Models

Fit the linear regression, ridge, and lasso models that we explored in the last labs to the training data. This time you don't need to report the coefficients or plot the results.

In [15]:
# Linear Regression

## Create the lin_reg model and fit it to the training data


# Ridge

## Create the ridge_reg model and fit it to the training data

# LASSO

## Create the lasso_reg model and fit it to the training data
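
As a rough illustration of the fitting pattern (not the lab solution), the sketch below fits all three estimators to synthetic data. The arrays `X_demo` and `y_demo` are hypothetical, and the `alpha` values are arbitrary assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Hypothetical synthetic data: 100 observations, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Each estimator follows the same pattern: construct, then fit
demo_models = {
    "ols": LinearRegression().fit(X_demo, y_demo),
    "ridge": Ridge(alpha=1.0).fit(X_demo, y_demo),   # alpha is an assumed value
    "lasso": Lasso(alpha=0.1).fit(X_demo, y_demo),   # alpha is an assumed value
}

# In-sample R^2 for each fitted model
demo_scores = {name: model.score(X_demo, y_demo)
               for name, model in demo_models.items()}
print(demo_scores)
```

On the bike data the same calls would take `X_train` and `y_train` in place of the synthetic arrays.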


## 3. Information Criterion Approaches

We now have three candidate models! First, let's implement the AIC and BIC approaches. Write code that calculates the AIC and BIC for each of the models. The (simplified) formulas for each, assuming Gaussian errors, are:

$AIC = n \ln(sse/n) + 2k$

$BIC = n \ln(sse/n) + k \ln(n)$

Where $k$ is the number of features, $n$ is the number of observations, $\ln$ is the natural log (hint: use np.log), and $sse$ is the "sum of squared errors," i.e., the sum of the squared residuals between the predicted and actual responses. For both criteria, lower values indicate a better trade-off between fit and model complexity.
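
The sketch below shows one way to turn these criteria into small helper functions. It uses the Gaussian least-squares form $AIC = n \ln(sse/n) + 2k$, which shares the $n \ln(sse/n)$ term with the BIC so that lower is better for both. The function names and the toy arrays are hypothetical and are not part of the lab.

```python
import numpy as np

def sum_squared_errors(y_true, y_pred):
    """Sum of squared residuals between actual and predicted responses."""
    resid = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sum(resid ** 2))

def aic(sse, n, k):
    """AIC for a least-squares fit with Gaussian errors (lower is better)."""
    return n * np.log(sse / n) + 2 * k

def bic(sse, n, k):
    """BIC for a least-squares fit with Gaussian errors (lower is better)."""
    return n * np.log(sse / n) + k * np.log(n)

# Toy example: four observations, every residual is +/- 0.5
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

sse_demo = sum_squared_errors(y_true, y_pred)  # 4 * 0.25 = 1.0
aic_demo = aic(sse_demo, n=len(y_true), k=1)
bic_demo = bic(sse_demo, n=len(y_true), k=1)
print(sse_demo, aic_demo, bic_demo)
```

Because the BIC penalty $k \ln(n)$ exceeds the AIC penalty $2k$ once $n > e^2 \approx 7.4$, the BIC tends to favor smaller models on datasets of realistic size.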

In [16]:
len(X_train.columns)

13

In [17]:
len(X_train.index)

438

In [18]:
# number of features (columns)
k = len(X_train.columns)
# number of observations (rows)
n = len(X_train.index)

In [19]:
lin_reg_hat = ...
lin_reg_resid = ...
lin_reg_sse = ...

lin_AIC = ...
lin_AIC

In [20]:
ridge_reg_hat = ...
ridge_reg_resid = ...
ridge_reg_sse = ...

ridge_AIC = ...
ridge_AIC

In [21]:
lasso_reg_hat = ...
lasso_reg_resid = ...
lasso_reg_sse = ...

lasso_AIC = ...
lasso_AIC

Which model should we prefer based on the AIC? How confident are you about this choice?

Next, calculate the BIC for each model.

In [22]:
lin_BIC = ...
lin_BIC

In [23]:
ridge_BIC = ...
ridge_BIC

In [24]:
lasso_BIC = ...
lasso_BIC

Which model should you prefer based on the BIC? Again, how confident are you about this choice?

## 4. Cross-Validation

Next, let's try a cross-validation approach. The basic logic of cross-validation is as follows:

1. Randomly split the data into k folds (here k is the number of folds, not the number of features)
2. Fit the model on k-1 folds, then test it on the held-out fold
3. Record the prediction error
4. Repeat until each fold has served as the test set
5. The average of the k errors is the cross-validation (CV) error
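
The steps above can be sketched by hand with sklearn's `KFold`. This is a minimal illustration on synthetic data: the arrays, the choice of mean squared error as the prediction error, and `random_state=0` are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical synthetic data: 90 observations, 2 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(90, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=90)

# Step 1: randomly split the row indices into 3 folds
kf = KFold(n_splits=3, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X_demo):
    # Step 2: fit on the other folds, predict on the held-out fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    pred = model.predict(X_demo[test_idx])
    # Step 3: record the prediction error (mean squared error here)
    fold_errors.append(np.mean((y_demo[test_idx] - pred) ** 2))
# Step 4 is the loop itself: each fold serves as the test set exactly once

# Step 5: average the per-fold errors
cv_error = float(np.mean(fold_errors))
print(fold_errors, cv_error)
```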

Luckily, much of this functionality is already packaged up for us in sklearn's [cross-validation methods](http://scikit-learn.org/stable/modules/cross_validation.html). First, generate predictions for each of the models by using "cross_val_predict" with "cv = 3". Then compute the $r^2$ score with "r2_score" and plot the predicted vs. actual values. For this section, disregard the train-validation-test split you did earlier, and instead use the full data.
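
As a rough illustration of the call pattern, the sketch below runs `cross_val_predict` on synthetic data rather than the bike data. The arrays `X_demo` and `y_demo` are hypothetical. Each prediction comes from a model that never saw that row during fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Hypothetical synthetic data: 120 observations, 2 features
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 2))
y_demo = 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=120)

# One out-of-fold prediction per row, using 3 folds
demo_predicted = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=3)

# r2_score takes the actual values first, then the predictions
demo_r2 = r2_score(y_demo, demo_predicted)
print(demo_r2)
```

A scatter plot of `y_demo` against `demo_predicted` would then show how closely the out-of-fold predictions track the actual values.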

In [25]:
lin_predicted = cross_val_predict(...)
r2_score(...)

In [26]:
# plot the predicted values against the actual values on a scatter plot
plt.scatter(...)
plt.title('Linear Model (OLS)')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

In [27]:
ridge_predicted = cross_val_predict(...)
r2_score(...)

In [28]:
# plot the predicted values against the actual values on a scatter plot
plt.scatter(...)
plt.title('Ridge Model')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

In [29]:
lasso_predicted = cross_val_predict(...)
r2_score(...)

In [30]:
# plot the predicted values against the actual values on a scatter plot
plt.scatter(...)
plt.title('LASSO Model')
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()

How well do the cross-validated predictions fit the data? In general, computing $r^2$ on the pooled predictions from "cross_val_predict" is not a reliable estimate of out-of-sample performance, because it pools all folds together instead of scoring each held-out fold separately. The preferred method is to use "cross_val_score." Use "cross_val_score" to return the mean score from a 3-fold cross-validation for each of the models. How do the $r^2$ metrics compare?
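
To make the comparison concrete, the sketch below computes both quantities on the same synthetic data: the mean of the per-fold $r^2$ scores from `cross_val_score` and the single $r^2$ from pooled `cross_val_predict` output. The data and variable names are hypothetical. The two numbers are usually close but not identical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score

# Hypothetical synthetic data: 120 observations, 2 features
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(120, 2))
y_demo = 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=120)

# One R^2 per held-out fold (R^2 is the default score for regressors)
fold_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=3)
mean_fold_r2 = fold_r2.mean()

# A single R^2 computed over all out-of-fold predictions pooled together
pooled_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=3)
pooled_r2 = r2_score(y_demo, pooled_pred)

print(fold_r2, mean_fold_r2, pooled_r2)
```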

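Experiment with different [metrics](http://scikit-learn.org/stable/modules/model_evaluation.html), and select your preferred model. Keep in mind that sklearn scorers follow the convention that higher return values are better than lower ones, which is why error metrics are reported as negated values (for example, "neg_mean_squared_error").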
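
A minimal sketch of switching metrics through the `scoring` argument is shown below. It runs on synthetic data, the `alpha` value is arbitrary, and the variable names are hypothetical. The error-based scorers return negated values so that "higher is better" still holds.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data: 120 observations, 2 features
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(120, 2))
y_demo = 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=120)

ridge_demo = Ridge(alpha=1.0)  # alpha is an assumed value

# Negated mean squared error and negated mean absolute error, one per fold
neg_mse = cross_val_score(ridge_demo, X_demo, y_demo, cv=3,
                          scoring="neg_mean_squared_error")
neg_mae = cross_val_score(ridge_demo, X_demo, y_demo, cv=3,
                          scoring="neg_mean_absolute_error")

# Flip the sign to recover an error on the usual scale
cv_rmse = float(np.sqrt(-neg_mse.mean()))
cv_mae = float(-neg_mae.mean())
print(cv_rmse, cv_mae)
```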

If you're feeling ambitious, experiment with the full range of cross-validation options (including Leave-One-Out)!
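
As a starting point, Leave-One-Out can be passed as the `cv` argument, as in the sketch below on a small synthetic dataset (the data and names are hypothetical). $r^2$ is not well-defined on a single held-out observation, so this sketch scores with negated mean squared error instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical synthetic data, kept small because LOO fits one model per row
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(30, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.2, size=30)

# Each of the 30 splits holds out exactly one observation
loo_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")

# Average squared error over all held-out observations
loo_mse = float(-loo_scores.mean())
print(len(loo_scores), loo_mse)
```

Leave-One-Out fits as many models as there are rows, so it can be slow on larger datasets.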