<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Train-test Split and Cross-Validation Lab

_Authors: Joseph Nelson (DC), Kiefer Katovich (SF)_

---

## Review of train/test validation methods

We've discussed overfitting, underfitting, and how to validate the "generalizability" of your models by testing them on unseen data.

In this lab you'll practice two related validation methods: 
1. **train/test split**
2. **k-fold cross-validation**

Train/test split and k-fold cross-validation both serve two useful purposes:
- We detect overfitting by holding some data out of the fitting step, and
- We use that held-out data to estimate how well the model performs on unseen data.

In the case of cross-validation, the model fitting and evaluation is performed multiple times on different train/test splits of the data.

Ultimately we can use the training and testing validation framework to compare multiple models on the same dataset. This could be a comparison of two linear models, or of completely different models on the same data.
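
To make the difference between the two schemes concrete, here is a minimal sketch on a tiny made-up array (the ten-row array and the `random_state` values are illustrative assumptions, not part of the lab data). A single split yields one train/test pair, while 5-fold cross-validation yields five pairs, with every row landing in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data: 10 rows, assumed purely for illustration
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# One 70/30 split: a single train/test pair
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0
)
print(len(X_tr), len(X_te))

# 5-fold CV: five train/test pairs; count how often each row is held out
kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = np.zeros(len(X_toy), dtype=int)
for train_idx, test_idx in kf.split(X_toy):
    test_counts[test_idx] += 1
    print("train:", train_idx, "test:", test_idx)

# Every row should have been in a test fold exactly once
print(test_counts)
```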


In [1]:
from matplotlib import pyplot as plt

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

In [2]:
import pandas as pd
import numpy as np
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()

X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

### 1. Clean up any data problems

Load the Boston housing data.  Fix any problems, if applicable.

In [3]:
# The Boston data ships with scikit-learn and is already clean

### Calculate a null baseline score by comparing the observed target values to the mean target value

In [3]:
# import mean squared error
from sklearn.metrics import mean_squared_error
# build an array holding the mean of boston.target, the same length as boston.target
target_mean = np.full_like(boston.target, boston.target.mean())
# pass the observed values and the mean predictions to mean_squared_error,
# then take the square root to get the baseline RMSE
np.sqrt(mean_squared_error(boston.target, target_mean))

9.188011545278203
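
The same null baseline can be obtained from scikit-learn's `DummyRegressor`, which is convenient when the baseline should go through the same scoring code as real models. The sketch below uses a synthetic target (an assumption made so the example runs without the Boston data). It also shows that the RMSE of always predicting the mean equals the population standard deviation of the target.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic target, assumed for illustration only
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=22.0, scale=9.0, size=200)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Manual baseline: predict the mean for every observation
mean_preds = np.full_like(y_demo, y_demo.mean())
manual_rmse = np.sqrt(mean_squared_error(y_demo, mean_preds))

# Same baseline via DummyRegressor
dummy = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
dummy_rmse = np.sqrt(mean_squared_error(y_demo, dummy.predict(X_demo)))

# All three values agree (np.std uses ddof=0 by default)
print(manual_rmse, dummy_rmse, y_demo.std())
```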

### 2. Select 3-4 variables from your dataset and perform a 70/30 train/test split

- Use sklearn.
- Score and plot your predictions.

In [4]:
X.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

predictors = ['CRIM', 'RM', 'B', 'LSTAT']

X_train, X_test, y_train, y_test = train_test_split(X[predictors], y, train_size=0.7, random_state=8)


lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)


In [None]:
yhat = lr.predict(X_test)
sns.jointplot(x=y_test, y=yhat);

### 4. Interpret your coefficients using the coef_ attribute

In [8]:
# for every 1 unit increase in RM, the prediction increases by about 5.4, holding the other predictors fixed
# for every 1 unit increase in CRIM, the prediction decreases by about 0.08
pd.DataFrame(lr.coef_,predictors).T

| | CRIM | RM | B | LSTAT |
|---|---|---|---|---|
| 0 | -0.077744 | 5.399703 | 0.010551 | -0.584165 |


### 5. Standardize your training split, fit a regression on it, compute the RMSE on your testing split, and analyze your features

In [10]:
from sklearn.preprocessing import StandardScaler

# instantiate scaler
ss = StandardScaler()

# standardize numeric features in training data
X_train_scaled = ss.fit_transform(X_train)

# instantiate estimator
lr = LinearRegression()

# fit regression on training data
lr.fit(X_train_scaled, y_train)

# standardize numeric features in test data
X_test_scaled = ss.transform(X_test)

# make predictions
y_preds = lr.predict(X_test_scaled)

# score predictions
print(np.sqrt(mean_squared_error(y_test, y_preds)))

# analyze
# for every 1 standard deviation increase in RM, the prediction increases by about 3.6
# for every 1 standard deviation increase in CRIM, the prediction decreases by about 0.72
pd.DataFrame(lr.coef_,predictors).T

5.583126677986091


| | CRIM | RM | B | LSTAT |
|---|---|---|---|---|
| 0 | -0.716997 | 3.625816 | 1.002232 | -4.131158 |
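
Standardizing does not change what a linear regression predicts; it only changes the units of the coefficients. For ordinary least squares with an intercept, each standardized coefficient should equal the raw coefficient multiplied by that feature's training standard deviation. The RM values in this lab (about 5.40 raw, 3.63 standardized) would therefore imply a training-split standard deviation of roughly 0.67. The sketch below checks the relationship on synthetic data; the data and the scale factors are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (illustrative assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=5.0, random_state=0
)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 0.1])

# Fit once on raw features and once on standardized features
raw_model = LinearRegression().fit(X_demo, y_demo)
scaler = StandardScaler()
scaled_model = LinearRegression().fit(scaler.fit_transform(X_demo), y_demo)

# Standardized coefficient = raw coefficient * feature standard deviation
print(scaled_model.coef_)
print(raw_model.coef_ * scaler.scale_)
```

Because the standardized coefficients share a common unit (one standard deviation of each feature), their magnitudes can be compared across features, which is not the case for the raw coefficients.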


### 6. Try K-Folds cross-validation with a K of 5 on your training split. 

In [30]:
from sklearn.model_selection import cross_validate

def run_cv(estimator):
    
    # pass estimator, predictor matrix, target variable, scoring function, number of folds into function
    scores = cross_validate(estimator, X_train_scaled, y_train, scoring='neg_mean_squared_error',cv=5, 
                            return_train_score=False)

    # the spread of the per-fold RMSE values gives a sense of the model's variance;
    # here the scores differ noticeably from fold to fold
    print(np.sqrt(abs(scores['test_score'])))

    # the mean RMSE across folds estimates the typical error of a prediction
    print(np.mean(np.sqrt(abs(scores['test_score']))))
    
    return

In [31]:
# with linear regression
lr = LinearRegression()
run_cv(lr)

[5.28969877 5.63830346 6.28396516 5.83071032 4.20550099]
5.449635736544363
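
One caveat with `run_cv`: `X_train_scaled` was standardized using every training row, so each validation fold has already influenced the scaler. A common way to avoid this is to put the scaler and the estimator in a `Pipeline`, so the scaler is re-fit on the training portion of each fold. The sketch below uses synthetic data as a stand-in for the unscaled training split (an assumption for illustration). For plain linear regression the scores should not change, since its predictions are unaffected by feature scaling, but the pattern matters for scale-sensitive estimators such as regularized or distance-based models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training split (illustrative assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=10.0, random_state=0
)

# The scaler is re-fit inside every fold, on that fold's training rows only
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_validate(
    pipe, X_demo, y_demo, scoring="neg_mean_squared_error", cv=5
)

# Scores are negated MSE values, so flip the sign before taking the root
fold_rmse = np.sqrt(-scores["test_score"])
print(fold_rmse)
print(fold_rmse.mean())
```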


In [37]:
# use cross validation to compare output to another estimator
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
run_cv(rfr)

[4.10681781 3.35909987 4.30042699 4.56127875 3.76659114]
4.018842910848226


### What to do after training the best model

In [38]:
# after CV, refit the chosen estimator on the full training split
rfr.fit(X_train_scaled, y_train)
# analyze what's driving predictions - cross-validation alone does not return a fitted model to inspect
pd.DataFrame(rfr.feature_importances_, predictors).T

| | CRIM | RM | B | LSTAT |
|---|---|---|---|---|
| 0 | 0.098134 | 0.305222 | 0.029059 | 0.567585 |


In [39]:
# analyze your model's testing error

# make predictions
y_preds = rfr.predict(X_test_scaled)

# score predictions
print(np.sqrt(mean_squared_error(y_test, y_preds)))

4.262101013893451


In [28]:
# we could save the best model and load it again later
from joblib import dump, load


dump(rfr, 'rfr_boston_crim_rm_b_lstat.joblib')
rfr = load('rfr_boston_crim_rm_b_lstat.joblib')
rfr.predict(X_test_scaled)

array([20.175, 15.698, 48.797, 26.642, 47.627, 21.835, 14.744, 31.188,
       24.06 , 38.768, 16.295, 12.366, 20.634, 21.472, 15.944, 30.46 ,
       21.502, 20.674, 19.224,  5.819, 21.114, 24.257, 28.56 , 37.942,
       26.167, 35.528,  8.476, 16.151, 21.437, 16.271, 29.11 , 14.463,
       18.186, 20.643, 28.609, 32.103, 36.524, 23.2  , 17.462, 11.908,
       18.719, 23.235, 45.066, 31.135, 29.518, 18.626, 14.156, 20.637,
       19.631, 20.711, 20.572, 10.929, 28.723, 13.734, 17.668, 24.915,
       19.872,  9.23 , 20.571, 30.581, 23.362, 25.044, 19.264,  9.03 ,
       16.458, 23.933, 10.239, 20.761, 28.723, 43.994, 18.296, 22.534,
       25.144, 30.841, 19.156, 23.844, 27.303, 22.104, 17.165, 16.081,
       16.69 ,  8.863, 14.981, 20.893, 20.727, 27.388, 14.896, 16.953,
       24.939, 33.757, 20.414, 20.424, 30.98 , 43.482, 15.918, 35.22 ,
       20.781, 24.037, 21.819, 13.095,  9.568, 27.816,  8.733, 47.495,
       30.515, 21.952, 15.654, 21.216, 24.708, 21.129, 18.502, 20.837,
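
Note that the saved forest expects standardized inputs, so the fitted scaler has to be available at prediction time as well. One option, sketched below, is to bundle the scaler and the model in a single `Pipeline` and save that object instead. The synthetic data, the file name, and the temporary directory are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab's training data (illustrative assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

# Bundle preprocessing and model so they are saved and loaded together
model = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=25, random_state=0)
)
model.fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "scaler_plus_rfr.joblib")
dump(model, path)

# The restored pipeline accepts raw (unscaled) features directly
restored = load(path)
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```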
      

**Or we could step back in the process and see if we can improve our results by:**
* adding or deriving features
* adding more data
* trying more estimators
* tuning hyperparameters
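
As one example of the last option, hyperparameters can be tuned with cross-validation on the training split via `GridSearchCV`, leaving the test split untouched until the end. This is only a sketch: the synthetic data and the small parameter grid are illustrative assumptions, not recommended settings for the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the lab data (illustrative assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=8
)

# Small, arbitrary grid: every combination is scored with 3-fold CV
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_tr, y_tr)

# best_score_ is a negated MSE, so flip the sign before taking the root
best_cv_rmse = (-search.best_score_) ** 0.5
print(search.best_params_)
print(best_cv_rmse)
```

By default `GridSearchCV` refits the best combination on the whole training split, so `search.predict(X_te)` can then be used for a final test-set score.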
