<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Train-test Split and Cross-Validation Lab

_Authors: Joseph Nelson (DC), Kiefer Katovich (SF)_

---

## Review of train/test validation methods

We've discussed overfitting, underfitting, and how to validate the "generalizability" of your models by testing them on unseen data.

In this lab you'll practice two related validation methods: 
1. **train/test split**
2. **k-fold cross-validation**

Train/test split and k-fold cross-validation both serve two useful purposes:
- We hold out data that the model never sees during fitting, so overfitting shows up as a gap between training and test performance, and
- We use that held-out data to estimate how the model will perform on new data.

In the case of cross-validation, the model fitting and evaluation are performed multiple times on different train/test splits of the data.

Ultimately we can use the training and testing validation framework to compare multiple models on the same dataset. This could be a comparison of two linear models, or of completely different models on the same data.
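To make the "different train/test splits" idea concrete, here is a minimal pure-Python sketch of how k folds can be carved out of a dataset. The helper name `k_fold_indices` is hypothetical and written only for illustration; in the lab itself you would use scikit-learn's `KFold`.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train size:", len(train_idx), "test size:", len(test_idx))
```

Every row lands in a test fold exactly once, which is why cross-validation uses all of the data for evaluation without ever scoring a model on rows it was fitted to.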


## Instructions

For your independent practice, fit **three different models** on the Boston housing data. For example, you could pick three different subsets of variables, one or more polynomial models, or any other model that you like. 

**Start with train/test split validation:**
* Fix a testing/training split of the data
* Train each of your models on the training data
* Evaluate each of the models on the test data
* Rank the models by how well they score on the testing data set.

**Then try k-fold cross-validation:**
* Perform a k-fold cross-validation and use the cross-validation scores to compare your models. Did this change your rankings?
* Try a few different values of k for the same models.
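The workflow above might look roughly like the sketch below. It uses synthetic data from `make_regression` as a stand-in for the Boston data, and the three candidate models (two feature subsets and a degree-2 polynomial) are arbitrary choices made for illustration, not the lab's intended answer.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption: NOT the Boston housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Three hypothetical candidates: (estimator, column indices to use).
candidates = {
    "first two features": (LinearRegression(), [0, 1]),
    "all four features": (LinearRegression(), [0, 1, 2, 3]),
    "degree-2 polynomial": (
        make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
        [0, 1, 2, 3],
    ),
}

# One fixed hold-out split and one fixed set of folds, shared by all models.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, (model, cols) in candidates.items():
    holdout_r2 = model.fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)
    # cross_val_score clones the estimator, so the fit above does not leak in.
    cv_r2 = cross_val_score(model, X_demo[:, cols], y_demo, cv=kf).mean()
    results[name] = (holdout_r2, cv_r2)
    print(f"{name:20s} hold-out R2 = {holdout_r2:.3f}   5-fold mean R2 = {cv_r2:.3f}")
```

Using the same split and the same folds for every model keeps the comparison fair: differences in score then come from the models rather than from which rows happened to be held out.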

If you're interested, try a variety of response variables.  We start with **MEDV** (the `.target` attribute from the dataset load method).

In [1]:
from matplotlib import pyplot as plt

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

In [2]:
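# pandas and numpy are already imported above.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston

boston = load_boston()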

In [3]:
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=["MEDV"])
boston = pd.concat([y,X], axis=1)

In [4]:
boston.head()

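| | MEDV | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |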


### 1. Select 3-4 variables from your dataset and perform a 50/50 train/test split

- Use sklearn.
- Score and plot your predictions.

In [5]:
features = ["CRIM", "AGE", "NOX", "RAD"]

In [6]:
import patsy

In [7]:
# patsy builds y and X from the formula and adds an "Intercept" column of ones to X.
y, X = patsy.dmatrices("MEDV ~ CRIM + AGE + NOX + RAD", data=boston, return_type="dataframe")

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99, test_size=0.5)

In [11]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [12]:
y_pred = lr.predict(X_test)

In [14]:
from sklearn import metrics

In [15]:
metrics.mean_squared_error(y_test, y_pred)

63.65094145493664

In [16]:
lr.score(X_test, y_test)

0.24408320004632467
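For a regressor, `lr.score` returns the coefficient of determination, R², while `metrics.mean_squared_error` returns the average squared residual. As a rough illustration of how the two relate, here is a small pure-Python sketch; the helper names `mse` and `r_squared` and the toy numbers are made up for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical, not taken from the Boston data).
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 9.0]

print("MSE:", mse(y_true, y_hat))
print("R2:", r_squared(y_true, y_hat))
# A model that always predicts the mean of y_true scores exactly 0.
print("R2 of mean-only prediction:", r_squared(y_true, [6.0] * 4))
```

Because MSE is in squared units of the target, it is hard to judge in isolation. R² rescales it against the variance of the test targets, so an R² near 0.24 means the model removes only about a quarter of the error that a mean-only prediction would make.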


### 2. Try 70/30 and 90/10 train/test splits
- Score and plot.
- How do your metrics change?

In [38]:
# A:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99, test_size=0.3)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(metrics.mean_squared_error(y_test, y_pred))
print(lr.score(X_test, y_test))

62.86488482114697
0.2575044134047697


In [40]:
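# Plot actual vs. predicted values (recent seaborn versions require keyword arguments):
# sns.jointplot(x=y_test["MEDV"].values, y=y_pred.ravel())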

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99, test_size=0.1)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(metrics.mean_squared_error(y_test, y_pred))
print(lr.score(X_test, y_test))

41.60575423567512
0.40878509860090906
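A single split per test size makes it hard to tell whether a score changed because the model had more training data or simply because of which rows landed in the test set. One way to probe this, sketched below on synthetic `make_regression` data (an assumption; this is not the Boston data), is to repeat each split with several seeds and look at the spread of scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Boston housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=50.0, random_state=0)

spread = {}
for test_size in (0.5, 0.3, 0.1):
    scores = []
    for seed in range(30):  # 30 different random splits per test size
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed
        )
        scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
    spread[test_size] = (float(np.mean(scores)), float(np.std(scores)))
    print(f"test_size={test_size}: mean R2 = {spread[test_size][0]:.3f}, "
          f"std R2 = {spread[test_size][1]:.3f}")
```

Smaller test sets typically give noisier score estimates, so a jump in R² at a 90/10 split with one fixed `random_state` is not necessarily evidence of a better model.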


### 3. Try k-fold cross-validation with k between 5 and 10 for your regression.

- What seems optimal?
- How do your scores change?
- What is the variance of the scores like?
- Try different numbers of folds to get a sense of how this impacts your score.
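When you summarize several metrics per k, it helps if they are all computed on the same folds. A minimal sketch, assuming synthetic `make_regression` data in place of the Boston data, uses `cross_validate` with a seeded `KFold` so that MSE and R² come from identical splits:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: NOT the Boston housing data).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=50.0, random_state=0)

summary = {}
for k in range(5, 11):
    # A fixed random_state makes the shuffled folds reproducible.
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    # One call evaluates both metrics on the same k folds.
    cv = cross_validate(
        LinearRegression(), X_demo, y_demo, cv=kf,
        scoring=("neg_mean_squared_error", "r2"),
    )
    fold_mse = -cv["test_neg_mean_squared_error"]  # flip sign back to plain MSE
    summary[k] = (fold_mse.mean(), fold_mse.std(), cv["test_r2"].mean())
    print(f"k={k}: avg MSE = {summary[k][0]:.2f}, std MSE = {summary[k][1]:.2f}, "
          f"avg R2 = {summary[k][2]:.3f}")
```

With this layout the average and standard deviation describe exactly the fold scores that were computed, which makes comparisons across values of k easier to trust.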

In [36]:
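from sklearn.model_selection import cross_val_score, KFold

for i in range(5, 11):
    lr = LinearRegression()
    # No random_state is set, so kf reshuffles on every cross_val_score call:
    # each of the four statistics printed below is computed on different folds.
    # Pass random_state to KFold (or compute the scores once and reuse them)
    # to make the statistics consistent with each other.
    kf = KFold(n_splits=i, shuffle=True)
    print("="*100)
    # scikit-learn reports negated MSE, so flip the sign to get plain MSE.
    print("MSE Values:", -cross_val_score(lr, X, y, cv=kf, scoring="neg_mean_squared_error"))
    print("Avg MSE:", np.mean(-cross_val_score(lr, X, y, cv=kf, scoring="neg_mean_squared_error")))
    print("Std MSE:", np.std(-cross_val_score(lr, X, y, cv=kf, scoring="neg_mean_squared_error")))
    # With no scoring argument, a regressor is scored with R^2.
    print("Avg R2:", np.mean(cross_val_score(lr, X, y, cv=kf)))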

MSE Values: [61.45879426 48.41574558 61.92637235 82.20405679 74.12884868]
Avg MSE: 64.85833263814219
Std MSE: 16.881796836707476
Avg R2: 0.22260232229003912
MSE Values: [52.36279178 66.95809302 67.23226747 70.46260044 56.82803678 76.84623947]
Avg MSE: 64.81510460434623
Std MSE: 10.095802098989207
Avg R2: 0.22199936484438187
MSE Values: [58.98819838 53.93220713 42.06058985 82.01064863 84.40345619 69.43100896
 63.30198424]
Avg MSE: 65.49744482499459
Std MSE: 12.759741557727999
Avg R2: 0.22796637266178882
MSE Values: [116.36467561  50.18377569  51.95380356  86.33582131  48.55419369
 104.02131576  28.24990035  41.02417857]
Avg MSE: 65.248944551432
Std MSE: 17.66439848601043
Avg R2: 0.21780361493473172
MSE Values: [40.89922873 40.19643719 52.48818519 91.85649373 56.0122542  76.74964419
 59.33250728 94.98647408 73.11341793]
Avg MSE: 65.08215506168091
Std MSE: 20.186335915459697
Avg R2: 0.2163039170703873
MSE Values: [63.78124486 58.80022616 70.97991688 85.65335178 84.37340993 76.10086476
 66