<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Train/Test Splits

_Authors: Joseph Nelson (DC), Kevin Markham (DC)_

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
%matplotlib inline

In [3]:
ames_df = pd.read_csv('../assets/data/ames_train.csv')
X = ames_df.drop('SalePrice', axis='columns')
y = ames_df.loc[:, 'SalePrice']

## Simple Train/Test Splits

So far in this course we have used *simple train/test splits*:

1. Split the data set into two pieces: a **training set** and a **testing set**.
2. Fit the model on the **training set**.
3. Evaluate the model on the **testing set**.

A common rule of thumb that sklearn follows by default is to set aside 25% of your data set for testing.
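The 25% default can be overridden with the `test_size` parameter. The sketch below uses made-up arrays (`X_demo` and `y_demo` are illustrative names, not part of the lesson's data) to compare the default split with an explicit 20% hold-out.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Default behavior: 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
default_sizes = (len(X_tr), len(X_te))

# Explicit test_size: hold out 20% of the rows instead.
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
custom_sizes = (len(X_tr20), len(X_te20))

print(default_sizes)
print(custom_sizes)
```

A larger test set gives a steadier score estimate but leaves fewer rows for fitting, so the right proportion depends on how much data you have.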

### Understanding the `train_test_split` Function

In [4]:
# Do a train/test split on the Ames housing data
# /scrub/
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [5]:
# Confirm that X_train and X_test come from splitting X by row
# /scrub/
# Before splitting
print('X shape before splitting:', X.shape)

# After splitting
print('X shape after splitting:', X_train.shape, X_test.shape)

X shape before splitting: (1460, 80)
X shape after splitting: (1095, 80) (365, 80)


In [6]:
# Confirm that y_train and y_test come from splitting y by row
# /scrub/
# Recall that (1,) is a tuple.
# The trailing comma distinguishes it as being a tuple, not an integer.

# Before splitting
print(y.shape)

# After splitting
print(y_train.shape)
print(y_test.shape)

(1460,)
(1095,)
(365,)


![train_test_split](../assets/images/train_test_split.png)

### Understanding the `random_state` Parameter

`random_state` seeds the pseudo-random number generator that shuffles the rows before splitting, so setting it lets us reproduce the same split every time we run the code. However, there is no way to predict what our exact results will be if we choose a new `random_state`.

`random_state` is very useful for testing that your model was made correctly since it provides you with the same split each time. However, make sure you remove it if you are testing for model variability!

In [7]:
# WITHOUT a random_state parameter:
#  (If you run this code several times, you get different results!)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Print the first row of X_train.
print(X_train.head(1))

        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
1069  1070          45       RL         60.0     9600   Pave   NaN      Reg   

     LandContour Utilities      ...       ScreenPorch PoolArea PoolQC Fence  \
1069         Lvl    AllPub      ...                 0        0    NaN   NaN   

     MiscFeature MiscVal MoSold  YrSold  SaleType  SaleCondition  
1069         NaN       0      5    2007        WD         Normal  

[1 rows x 80 columns]


In [8]:
# WITH a random_state parameter:
#  (Same split every time! Note you can change the random state to any integer.)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Print the first row of X_train.
print(X_train.head(1))

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
6   7          20       RL         75.0    10084   Pave   NaN      Reg   

  LandContour Utilities      ...       ScreenPorch PoolArea PoolQC Fence  \
6         Lvl    AllPub      ...                 0        0    NaN   NaN   

  MiscFeature MiscVal MoSold  YrSold  SaleType  SaleCondition  
6         NaN       0      8    2007        WD         Normal  

[1 rows x 80 columns]
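The same behavior can be checked programmatically rather than by eye. This is a minimal sketch on a made-up array (`rows` is a hypothetical stand-in for a real dataset): two calls with the same seed should return the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the values double as row labels.
rows = np.arange(20)

# Two calls with the same seed produce the same partition.
train_a, test_a = train_test_split(rows, random_state=1)
train_b, test_b = train_test_split(rows, random_state=1)
same_seed_matches = np.array_equal(test_a, test_b)

# A different seed will generally produce a different partition.
train_c, test_c = train_test_split(rows, random_state=2)

print(same_seed_matches)
print(sorted(test_a.tolist()), sorted(test_c.tolist()))
```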


In [9]:
# Load the Boston housing dataset from sklearn and print its documentation.
# Note: `load_boston` was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.DESCR)

X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=['MEDV'])

boston = pd.concat([y, X], axis=1)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

**Exercise.**

- Make `X` a DataFrame with the columns "AGE" and "RM" from the Boston housing dataset, and make `y` a Series containing the "MEDV" column.

In [10]:
# /scrub/
feature_cols = ['AGE', 'RM']
X = boston.loc[:, feature_cols]
y = boston.loc[:, 'MEDV']

- Split `X` and `y` from the previous step into a training set and a test set, setting a random state for reproducibility.

In [11]:
# /scrub/
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

- Train a linear regression model on the training set.

In [12]:
# /scrub/
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

- Print mean squared error (not R^2) for both the training set and the test set. You can use the `mean_squared_error` function imported below. Consult the printed documentation if you are not sure how to use it.

In [13]:
from sklearn.metrics import mean_squared_error

In [14]:
mean_squared_error?

Signature:
mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')
Docstring:
Mean squared error regression loss

Read more in the :ref:`User Guide <mean_squared_error>`.

Parameters
----------
y_true : array-like of shape = (n_samples) or (n_samples, n_outputs)
    Ground truth (correct) target values.

y_pred : array-like of shape = (n_samples) or (n_samples, n_outputs)
    Estimated target values.

sample_weight : array-like of shape = (n_samples), optional
    Sample weights.

multioutput : string in ['raw_values', 'uniform_average']
    or array-like of shape (n_outputs)
    Defines aggregating of multiple output values.
    Array-like value defines weights used to average errors.

    'raw_values' :
      

In [15]:
# /scrub/
print(mean_squared_error(y_train, lr.predict(X_train)))
print(mean_squared_error(y_test, lr.predict(X_test)))

38.96905578768683
42.00175221486844
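For intuition about what `mean_squared_error` computes, here is a small sketch on made-up numbers (`y_true_demo` and `y_pred_demo` are hypothetical). It compares the library result with the definition written out in NumPy.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values.
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# MSE is the mean of the squared residuals.
mse_by_hand = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# Taking the square root (RMSE) puts the error back in the target's units.
rmse = np.sqrt(mse_sklearn)

print(mse_by_hand, mse_sklearn, rmse)
```

Because MSE is in squared units of the target, the root mean squared error is often easier to interpret when reporting results.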


- Write down the equation that characterizes your fitted model. (*Hint*: use the relevant methods to print the model's intercept and coefficients, and then insert those parameters into the relevant linear regression equation.)

In [16]:
# /scrub/
print(lr.intercept_)
print(lr.coef_)

-24.58202253903385
[-0.07483512  8.2721415 ]



$MEDV = -24.58 - 0.075 \cdot AGE + 8.27 \cdot RM$
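An equation of this form is exactly what `predict` evaluates for a linear regression. The sketch below illustrates this on made-up, noise-free data (the generating coefficients 1, 2, and -3 are assumptions chosen for the example). It applies the intercept and coefficients by hand and compares the result with `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 1 + 2*x1 - 3*x2 with no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# Apply the fitted equation by hand: intercept + sum of coefficient * feature.
manual_pred = model.intercept_ + X_demo @ model.coef_

print(model.intercept_, model.coef_)
print(np.allclose(manual_pred, model.predict(X_demo)))
```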

- **BONUS:** See if you can get a better score on the test set by using a different set of features.

## $K$-Fold Cross-Validation

You might have noticed that if you repeat the process of doing a train/test split and evaluating performance on the test set without setting the random state, you do not get the same score every time. If your dataset is not large or if it contains outliers, then the scores you get might vary a lot.

In [17]:
# Run this cell repeatedly to see how the scores can vary.
feature_cols = ['GarageCars',
                'GrLivArea',
                'OverallQual'
               ]
X = ames_df.loc[:, feature_cols]
y = ames_df.loc[:, 'SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y)

lr = LinearRegression()

lr.fit(X_train, y_train)

print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))

0.7365753260789031
0.7463329537130272


$K$-fold cross-validation helps address this problem by doing multiple train/test splits and averaging the results. For instance, five-fold cross-validation consists of doing five train/test splits, each time holding out a different one-fifth of the rows, and averaging the scores over those test sets.

![](../assets/images/cross_validation_diagram.png)
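To make the splitting concrete, the sketch below prints the row indices that scikit-learn's `KFold` assigns to each fold for a made-up ten-row dataset (`X_demo` is hypothetical). Each row should appear in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical dataset with 10 rows.
X_demo = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5)
held_out = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo)):
    # Each fold holds out a different fifth of the rows.
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    held_out.extend(test_idx.tolist())

# Every row is held out exactly once across the five folds.
print(sorted(held_out))
```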

You would generally use $k$-fold cross-validation to develop your model and then retrain that model on the entire dataset.

- **Pro:** $K$-fold cross-validation provides more reliable estimates of model performance than a simple train/test split, especially for small datasets.
- **Con:** $K$-fold cross-validation takes longer to run because it requires training and running the model $k$ times instead of just once.
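The evaluate-then-retrain workflow might look like the following sketch. The data are synthetic and the names `X_demo`, `y_demo`, and `final_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Step 1: estimate out-of-sample performance with five-fold cross-validation.
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(cv_scores.mean())

# Step 2: once satisfied with the estimate, refit on every available row.
final_model = LinearRegression().fit(X_demo, y_demo)
print(final_model.coef_)
```

The cross-validated score describes how the modeling procedure is expected to perform. The final model trained on all rows is the one you would actually use.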

### Varying $K$

Increasing $k$ amplifies both the advantages and disadvantages of $k$-fold cross-validation over a simple train/test split.

The extreme case of $k$-fold cross-validation is leave-one-out cross-validation, where you hold out just one data point at a time. This approach trains each model on nearly all of the data, which generally gives the least biased estimate of model performance, but it requires training the model once for every row in the dataset.

A more common approach is to set $k$ somewhere between 5 and 10.
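The relationship between leave-one-out and $k$-fold can be seen in a small sketch using scikit-learn's `LeaveOneOut` splitter on a made-up eight-row array (`X_demo` is hypothetical): it produces one split per row, the same folds as unshuffled `KFold` with $k$ equal to the number of rows.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical dataset with 8 rows.
X_demo = np.arange(8).reshape(-1, 1)

# Leave-one-out creates one split per row.
loo = LeaveOneOut()
n_loo_splits = loo.get_n_splits(X_demo)

# K-fold with k equal to the number of rows holds out the same single rows.
kf_n = KFold(n_splits=len(X_demo))
loo_tests = [test.tolist() for _, test in loo.split(X_demo)]
kf_tests = [test.tolist() for _, test in kf_n.split(X_demo)]

print(n_loo_splits)
print(loo_tests == kf_tests)
```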

### Cross-Validation With the Ames Data

Scikit-learn provides a single function `cross_val_score` that does all of the work of doing cross-validation splits, training the model on each training set, and scoring it on the corresponding test set.

In [18]:
from sklearn.model_selection import cross_val_score

cross_val_score?

Signature:
cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score='raise-deprecating')
Docstring:
Evaluate a score by cross-validation

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
estimator : estimator object implementing 'fit'
    The object to use to fit the data.

X : array-like
    The data to fit. Can be for example a list, or an array.

y : array-like, optional, default: None
    The target variable to try to predict in the case of
    supe

In [19]:
# Use `cross_val_score` to do five-fold cross-validation on the Ames dataset
# /scrub/
lr = LinearRegression()

feature_cols = ['GarageCars',
                'GrLivArea',
                'OverallQual'
               ]
X = ames_df.loc[:, feature_cols]
y = ames_df.loc[:, 'SalePrice']


scores = cross_val_score(estimator=lr, X=X, y=y, cv=5)
print(scores)
print(scores.mean())

[0.77734746 0.74587302 0.73696934 0.72418656 0.69035314]
0.7349459032805816


`cross_val_score` allows you to pass a cross-validation splitter object to the `cv` argument instead of an integer for greater control over how the splits are done. For instance, it is often a good idea to shuffle your rows before doing cross-validation in case something changes about the data as you move down the DataFrame.

In [20]:
# Use a K-fold object with `shuffle=True` to ensure that results aren't
# biased by ordering effects.
# /scrub/
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)

scores = cross_val_score(estimator=lr, X=X, y=y, cv=kf)
print(scores)
print(scores.mean())

[0.75778919 0.74106227 0.77234129 0.69648506 0.69277114]
0.7320897907358488
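Shuffled folds change from run to run unless the splitter is seeded. As a hedged sketch on synthetic data (the arrays and the seed value 42 are arbitrary choices for illustration), passing `random_state` to `KFold` makes the shuffled folds, and therefore the scores, repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(size=100)

# Seeding the splitter fixes which rows land in which fold.
kf_seeded = KFold(n_splits=5, shuffle=True, random_state=42)

scores_first = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_seeded)
scores_second = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_seeded)

print(np.allclose(scores_first, scores_second))
```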


Unfortunately, `cross_val_score` does not allow us to assess whether a model is overfitting or underfitting because it returns only test-set scores. To get training set scores as well, we need to write our own cross-validation scoring loop.

In [21]:
# Write our own cross-validation loop.
# Run this cell a few times to see how much more stable the scores are
# than they were with a simple train/test split.
# /scrub/
from sklearn import model_selection

kf = model_selection.KFold(n_splits=5, shuffle=True)

train_scores = []
test_scores = []
for train_rows, test_rows in kf.split(X, y):
    X_train = X.iloc[train_rows, :]
    X_test = X.iloc[test_rows, :]
    y_train = y.iloc[train_rows]
    y_test = y.iloc[test_rows]

    lr = LinearRegression()
    lr.fit(X_train, y_train)

    train_scores.append(lr.score(X_train, y_train))
    test_scores.append(lr.score(X_test, y_test))
    
print('Training-set R^2:', np.array(train_scores).mean())
print('Test-set R^2:', np.array(test_scores).mean())

Training-set R^2: 0.7393239937290222
Test-set R^2: 0.7379224408241367
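As an alternative to a hand-written loop, recent versions of scikit-learn also provide `cross_validate`, which can report training scores when `return_train_score=True` is passed. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up) and is meant only to show the shape of the returned dictionary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=150)

results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,  # also score each model on its own training fold
)

print('Training-set R^2:', results['train_score'].mean())
print('Test-set R^2:', results['test_score'].mean())
```

A training score far above the test score suggests overfitting, while low scores on both suggest underfitting.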


## Train/Test Splits for Time-Series Data

In time-series modeling, you are predicting future values of a variable such as a stock price from past values of that variable (perhaps along with other inputs).

The point of a train/test split is to simulate what will happen when you deploy the model. As a result, **in a useful train/test split for time-series data, all of the training data comes from before all of the test data.** After all, when the model is deployed it will be used to predict future observations based on past observations.
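A minimal sketch of such a chronological split, assuming a made-up daily series (`series` and its `value` column are hypothetical): the rows are cut at a point in time rather than sampled at random.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with 100 observations, already sorted by date.
dates = pd.date_range('2020-01-01', periods=100, freq='D')
series = pd.DataFrame({'value': np.arange(100.0)}, index=dates)

# Hold out the final 25% of rows (chronologically) instead of a random 25%.
split_point = int(len(series) * 0.75)
train = series.iloc[:split_point]
test = series.iloc[split_point:]

# The last training date falls before the first test date.
print(train.index.max(), test.index.min())
```

Passing `shuffle=False` to `train_test_split` gives the same kind of ordered split, provided the rows are already sorted by time.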

There are many decisions to make in doing train/test splits for time-series models (sometimes called *backtesting*):

- How much of a "burn-in period" will you allow before you start scoring the model?
- Will you retrain the model periodically after you start scoring it? If so, when? On all past data, or on a "rolling window" of just the most recent data?
- When you make a prediction for scoring, will you predict just one time point into the future, or multiple time points? Will you use all past data as inputs to the prediction, or just a rolling window of the most recent data?

The key principle to keep in mind in all of these decisions is that **the purpose of applying the model to test data is to simulate what will happen when you deploy it**. As a result, you should make test conditions mimic what will happen in production as much as is feasible, erring on the side of making the test more difficult so that if anything you underpromise on model performance.
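One way to experiment with these choices is scikit-learn's `TimeSeriesSplit`, sketched below on a made-up twelve-row array (`X_demo` is hypothetical). By default it retrains on all past data (an expanding window), and its `max_train_size` parameter caps the training set to approximate a rolling window. This is only one possible backtesting setup, not a recommendation for any particular problem.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series of 12 time-ordered observations.
X_demo = np.arange(12).reshape(-1, 1)

# Expanding window: each split trains on all observations before the test block.
tscv = TimeSeriesSplit(n_splits=3)
expanding = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X_demo)]

# Rolling window: cap the training set at the 4 most recent observations.
tscv_rolling = TimeSeriesSplit(n_splits=3, max_train_size=4)
rolling = [(tr.tolist(), te.tolist()) for tr, te in tscv_rolling.split(X_demo)]

for tr, te in expanding:
    print('expanding  train:', tr, 'test:', te)
for tr, te in rolling:
    print('rolling    train:', tr, 'test:', te)
```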

## Summary

- $K$-fold cross-validation consists of doing $k$ train/test splits with disjoint test sets and averaging the resulting test-set metrics. It takes longer to run than a simple train/test split but provides more reliable estimates of test-set performance.
- Doing proper train/test splits for time-series models is tricky. You need to make sure that all of the training data comes from earlier in time than all of the test-set data and in general to simulate how the model will be used in production.