<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Train-test Split and Cross-Validation Lab

_Authors: Joseph Nelson (DC), Kiefer Katovich (SF)_

---

## Review of train/test validation methods

We've discussed overfitting, underfitting, and how to validate the "generalizability" of your models by testing them on unseen data.

In this lab you'll practice two related validation methods: 
1. **train/test split**
2. **k-fold cross-validation**

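Train/test split and k-fold cross-validation both hold out part of the data, which serves two purposes:
- The model is fit on only a portion of the data, so overfitting shows up as a gap between training and held-out performance, and
- The held-out data gives us an estimate of how the model performs on data it has not seen.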

In the case of cross-validation, model fitting and evaluation are repeated on several different train/test splits of the data.

Ultimately we can use the training and testing validation framework to compare multiple models on the same dataset. This could be comparisons of two linear models, or of completely different models on the same data.
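To make the difference between the two methods concrete, here is a minimal sketch on synthetic data. The dataset from `make_regression`, the variable names, and the choice of 5 folds are assumptions for illustration only; they are not part of the lab's data or its expected answers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

# Train/test split: fit once, evaluate once on the held-out 30%
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
split_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# k-fold: across the k fits, every row is used for testing exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = []
for train_idx, test_idx in kf.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_r2.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(f"single split R^2: {split_r2:.3f}")
print(f"5-fold R^2 mean: {np.mean(fold_r2):.3f}, std: {np.std(fold_r2):.3f}")
```

The single split yields one score that depends on which rows happened to land in the test set. The k-fold loop yields k scores, so you can look at their spread as well as their mean.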


In [3]:
from matplotlib import pyplot as plt

import numpy as np
import pandas as pd
#from scipy import stats
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

In [2]:
import pandas as pd
import numpy as np
# NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston

boston = load_boston()

X = pd.DataFrame(boston.data, columns=boston.feature_names)
# the target variable is MEDV
y = boston.target

### 1. EDA. Always EDA.

In [3]:
# A:

### 2. Calculate a null baseline score by comparing the observed target values to the mean target value

In [4]:
# import mse
from sklearn.metrics import mean_squared_error
# get an array of average values of boston.target the same length as boston.target
target_mean_list = [boston.target.mean() for x in boston.target]


In [5]:
# pass the boston.target values and target_mean_list values into the mean squared error function and take 
# the square root
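As a rough illustration of what a null baseline measures, the sketch below computes the RMSE of a "model" that always predicts the mean. The five target values are hypothetical stand-ins, not values taken from `boston.target`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical target values (an assumption, standing in for boston.target)
y_small = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

# The null model predicts the mean for every observation
baseline_pred = np.full_like(y_small, y_small.mean())

# RMSE of the mean-only model: square root of the mean squared error
baseline_rmse = np.sqrt(mean_squared_error(y_small, baseline_pred))

print(baseline_rmse)
print(y_small.std())  # population standard deviation (ddof=0)
```

The two printed numbers agree, because the RMSE of a mean-only prediction is the population standard deviation of the target. Any useful regression should achieve a lower RMSE than this.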


### 3. Select 3-4 variables from your dataset and perform a 70/30 train/test split

- Use sklearn.
- Score and plot your predictions.
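One possible shape for this step is sketched below, leaving out the plot. The DataFrame, its column names (`feat_a`, `feat_b`, `feat_c`), and the target are synthetic and hypothetical; swap in the columns you choose from the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and target (assumptions for illustration only)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["feat_a", "feat_b", "feat_c"])
target = 2.0 * df["feat_a"] - 1.0 * df["feat_b"] + rng.normal(scale=0.5, size=100)

# test_size=0.3 holds out 30% of the rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    df[["feat_a", "feat_b", "feat_c"]], target, test_size=0.3, random_state=42
)

# Fit on the training rows only, then evaluate on the held-out rows
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}, test RMSE: {test_rmse:.3f}")
```

Comparing `train_r2` with `test_r2` is a quick overfitting check: a training score far above the test score suggests the model has fit noise in the training rows.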

In [6]:
# A:

### 4. Interpret your coefficients using the `coef_` attribute
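A small sketch of how `coef_` lines up with the feature columns. The data is built from a known, noise-free linear rule (an assumption for illustration), so the recovered coefficients can be checked against the rule.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features; the target follows y = 3*x1 - 2*x2 + 5 exactly
rng = np.random.default_rng(0)
features = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y_exact = 3.0 * features["x1"] - 2.0 * features["x2"] + 5.0

model = LinearRegression().fit(features, y_exact)

# coef_ is ordered like the columns passed to fit, so label it with them
coef_table = pd.Series(model.coef_, index=features.columns)

print(coef_table)
print(model.intercept_)
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature, holding the other features fixed. This means coefficients on features measured in different units are not directly comparable.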


### 5. Standardize your feature matrix and fit your regression with a 70/30 split

To standardize, you subtract the column mean from each variable and divide by the column standard deviation:

### $$ X_{std} = \frac{X - \bar{X}}{s_{X}} $$
- Do your metrics change?
- Do your coefficients change and if so why?

<a id='standard-scaler'></a>
### Using sklearn's `StandardScaler`

Sklearn comes packaged with a class, `StandardScaler`, that will perform the standardization on a matrix for you.

Load in the package like so:

```python
from sklearn.preprocessing import StandardScaler
```

Once instantiated, the standard scaler object has three primary methods:
- `.fit(X)` calculates the mean and standard deviation of each column of X.
- `.transform(X)` returns a copy of X in which each column is standardized using the mean and standard deviation learned by `.fit()` (so `.fit()` must be run first).
- `.fit_transform(X)` combines the `.fit()` method and the `.transform()` method.
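The sketch below, on a tiny hypothetical array, shows the three methods in use and checks the result against the formula above. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is NumPy's default for `.std()`, whereas pandas' `.std()` defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical arrays, standing in for a training and a test matrix
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

scaler = StandardScaler()
scaler.fit(train)                      # learn column means and standard deviations
train_std = scaler.transform(train)    # standardize the training rows
test_std = scaler.transform(test)      # reuse the training statistics on new rows

# The same calculation by hand, using the population standard deviation
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_std, manual))
print(train_std.mean(axis=0), train_std.std(axis=0))
```

Fitting the scaler on the training rows and only transforming the test rows keeps information about the test set out of the preprocessing step.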

In [7]:
from sklearn.preprocessing import StandardScaler
X.head(1)

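|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |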


In [8]:
ss = StandardScaler()
#example - single column
X['CRIM'] = ss.fit_transform(X[['CRIM']])
X.head(1)

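|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.419782 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |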


In [9]:
#example - multiple columns
X[['B','LSTAT']] = ss.fit_transform(X[['B','LSTAT']])
X.head(1)

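|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.419782 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 0.441052 | -1.075562 |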


In [10]:
# standardize numeric features

In [11]:
# create new train/test split

In [12]:
# fit regression on training data

In [13]:
# score predictions

In [14]:
# analyze coefficients

### 6. Try k-fold cross-validation with a K of 3, 5, and 10 for your regression
First, review the documentation for [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html).

- How does your average score change?  
- What is the variance of scores like?
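A minimal sketch of looping over several values of K with `cross_validate`. The data comes from `make_regression` and the `summary` dictionary is a hypothetical helper; neither is part of the lab's data or its expected answers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_cv, y_cv = make_regression(n_samples=150, n_features=4, noise=15.0, random_state=1)

summary = {}  # maps K to the array of per-fold test scores
for k in (3, 5, 10):
    # An integer cv on a regressor uses KFold without shuffling
    results = cross_validate(LinearRegression(), X_cv, y_cv, cv=k, scoring="r2")
    summary[k] = results["test_score"]
    print(f"K={k}: mean R^2 = {summary[k].mean():.3f}, std = {summary[k].std():.3f}")
```

`cross_validate` returns a dictionary; the `"test_score"` entry holds one score per fold. With larger K each test fold is smaller, so individual fold scores tend to vary more, even when the mean stays similar.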

In [15]:
# A:

### 7. Try to improve your score by deriving a feature (or features), and test your model using cross-validation with a K of your choice