# Data Splits for Predictive Modeling

When trying to build a predictive model we will have to split up our data set. In this notebook we touch on the reasons why and the standard procedures for doing so.

## What we will accomplish

In this notebook we will:
- Discuss the rationale for splitting our data set,
- Introduce train test splits,
- Define a validation set,
- Review cross-validation and
- Touch on scenarios in which you may prefer validation sets to cross-validation and vice versa.

In [1]:
## We will now start importing a common set
## of items at the onset of most notebooks
import numpy as np
import matplotlib.pyplot as plt
from seaborn import set_style
set_style("whitegrid")

## Rationale for data splits

### Goal of predictive modeling

Imagine we are solving a predictive modeling problem, as we soon will. Once we have identified the problem we go out and randomly collect some data, $(X,y)$, say $n$ observations of $m$ features and $n$ corresponding outputs. 

Our goal is to use these data to identify a model with the lowest <i>generalization error</i>. Generalization error is defined to be the error of the model on a new randomly collected set $(X^*, y^*)$. If we fix this new data set, then the fact that the data we collected originally, $(X,y)$, was randomly collected makes the generalization error of any particular model a random variable. Let's call this variable $G$.

To determine the "best" model from a set of candidate models we would want to choose the one such that the corresponding $G$ is smallest. In other words for each candidate model we want to know something about the corresponding $G$ and its distribution.

In an ideal world we would simply collect many sets, $(X,y)$, fit our models on each of them and compare the resulting distributions. However, it may not be practical, possible or ethical to continually collect data for model selection purposes. In practice, therefore, we are often constrained to a single data set for model fitting and comparisons.

### Data splits

In order to estimate $G$ we typically split our data (often more than once) so that we can use one part of the split to fit or estimate the model and the other part to estimate $G$.

These splits are typically random because we want to be able to assume that the data used to fit the model follows the same distribution as the data used to measure generalization performance.

We will now cover three splitting steps/strategies employed in data science and machine learning.

## Train test splits

The first split we will touch on is the first split you would do in a new data science project, the <i>train test split</i>.

The purpose of the train test split is to create two data sets:
1. <b>The training set</b> - This subset is used to fit models and compare model candidates. This data set is usually split further.
2. <b>The testing set</b> - This subset is used as a final check on your selected model prior to putting your model into its desired final state.

The training set usually contains the majority of the original data. Common train test split percentage divisions are $80\% - 20\%$ or $75\% - 25\%$, but it may sometimes be appropriate to use different split sizes. Train test splits are done randomly, with the form of randomness dependent upon your project.

Here is an illustration of a train test split:

<img src="train_test.png" width="40%"></img>


#### A potential point of confusion

Perhaps confusingly, the test set is not directly used to compare models. Model comparison is typically done using one or more subsets of the training set, as we will soon see. The main purpose of the test set is to serve as a final check on your chosen model. This final check is important! Checking model performance on the test set lets you look for coding/modeling errors as well as <i>overfitting</i> on the training set (we will talk about this more explicitly soon).

### Performing train test splits in `sklearn`

While you can use the `random` or `numpy.random` packages to perform the train test split by hand, the `sklearn` package has a useful `train_test_split` function that will perform the train test split. Here is a link to the documentation, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a>.
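For comparison, here is a minimal sketch of what a split "by hand" with `numpy.random` might look like. The variable names (`X_demo`, `y_demo`, `train_idx`, `test_idx`) and the toy data are assumptions made for this illustration only; they are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical toy data, used only for this illustration
rng = np.random.default_rng(614)
X_demo = rng.random((1000, 10))
y_demo = rng.standard_normal(1000)

# Randomly shuffle the row indices, then cut them at the 80% mark
n_obs = X_demo.shape[0]
shuffled = rng.permutation(n_obs)
n_train = int(0.8 * n_obs)

train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]

# Index both X and y with the same indices so rows stay paired
X_demo_train, X_demo_test = X_demo[train_idx], X_demo[test_idx]
y_demo_train, y_demo_test = y_demo[train_idx], y_demo[test_idx]

print(X_demo_train.shape, X_demo_test.shape)
```

The important detail is that the features and the outputs are indexed with the same shuffled indices, so each observation keeps its own output.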

In [2]:
## First we will make a data set
X = np.random.random((1000,10))
y = np.random.randn(1000)

In [3]:
## Now we import train_test_split
from sklearn.model_selection import train_test_split

In [4]:
## Here we make the split
## train_test_split returns 4 outputs: X_train, X_test, y_train and y_test
##
## First you input the X and y for your data
##
## then set the shuffle argument to True, this randomly shuffles the
## data before it is split
##
## The random_state ensures that the random split is the same each time
## someone runs the code chunk, it can be any non-negative integer
##
## You can specify the size of the test set with test_size,
## here I want 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    shuffle = True,
                                                    random_state = 614,
                                                    test_size = .2)

In [5]:
## check the data lengths to see that they match
## what we'd expect
print("The shape of X_train is",X_train.shape)
print("The shape of X_test is",X_test.shape)
print("The length of y_train is",len(y_train))
print("The length of y_test is",len(y_test))

The shape of X_train is (800, 10)
The shape of X_test is (200, 10)
The length of y_train is 800
The length of y_test is 200
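
As a quick check of what `random_state` does, the sketch below makes the same split twice and compares the results. The small data set and the names `X_demo`, `y_demo`, `split_a` and `split_b` are assumptions for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data, used only for this illustration
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3))
y_demo = rng.standard_normal(100)

# Two calls with identical arguments, including random_state
split_a = train_test_split(X_demo, y_demo,
                           shuffle=True, random_state=614, test_size=0.2)
split_b = train_test_split(X_demo, y_demo,
                           shuffle=True, random_state=614, test_size=0.2)

# Compare the four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Same split with the same random_state:", same_split)
```

Fixing `random_state` in this way is what lets collaborators reproduce exactly the same training and test sets.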


##### Exercise

Make a train test split for the following data, $(Z, w)$. Set aside $25\%$ of the data as a test set. <i>Do not overwrite the data stored in `X_train`, `X_test`, `y_train` and `y_test`.</i>

In [None]:
## Your data to split
Z = np.random.random((1000,10))
w = np.random.randn(1000)

In [None]:
## Split the data here



## Two split types for model comparison and selection

We will now cover two data splits you can make from the training set for model comparison purposes. Which you choose depends upon the project you are working on, but we will give some reasons to choose one over the other below.

### Validation sets

A <i>validation set</i> is a subset of the training data (the result of the train test split defined above) used solely for the purpose of comparing candidate models. This split is typically also performed randomly. Further, the validation set should be a small subset. Common sizes range from $10\%-25\%$ of the training set, depending on the training set size. An illustration of this concept is given below:

<img src="validation_set.png" width="45%"></img>

The best model in this setting would be the one that has the best performance metric on the validation set.
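
As a rough sketch of that selection step, the code below compares two candidate models on a validation set using mean squared error. The synthetic data, the two candidates (`DummyRegressor` and `LinearRegression`) and the choice of metric are all assumptions made for illustration; your own project determines the candidates and the metric.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a training set (assumed, for illustration only)
rng = np.random.default_rng(440)
X_train_demo = rng.random((500, 3))
y_train_demo = (2.0 * X_train_demo[:, 0]
                - 1.0 * X_train_demo[:, 1]
                + 0.1 * rng.standard_normal(500))

# Carve a validation set out of the training set
X_tt, X_val, y_tt, y_val = train_test_split(X_train_demo, y_train_demo,
                                            shuffle=True,
                                            random_state=440,
                                            test_size=0.2)

# Candidate models to compare
candidates = {"baseline_mean": DummyRegressor(strategy="mean"),
              "linear_regression": LinearRegression()}

# Fit each candidate on the smaller training set, score it on the validation set
val_mse = {}
for name, model in candidates.items():
    model.fit(X_tt, y_tt)
    val_mse[name] = mean_squared_error(y_val, model.predict(X_val))

# For an error metric, "best" means smallest
best_name = min(val_mse, key=val_mse.get)
print(val_mse)
print("Selected model:", best_name)
```

Note that the validation set is never used for fitting, only for scoring the candidates.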

#### In practice

In practice we can once again use `sklearn`'s `train_test_split` function to make the validation split. Note that it is good practice to not overwrite the original `X_train` or `y_train` sets when making the validation split.

In [6]:
## Here we make a validation set with 20% of the
## training data in the validation set
X_train_train, X_val, y_train_train, y_val = train_test_split(X_train, y_train,
                                                                 shuffle=True,
                                                                 random_state=440,
                                                                 test_size=.2)

##### Exercise

Make a validation set with $22\%$ of your training set for the $(Z, w)$ data set.

In [None]:
## split the data here




### $k$-Fold cross-validation

The validation set approach in essence gives us a <i>point estimate</i> of $G$. An issue with this approach for model selection is that point estimates are not always reflective of overall or even average model performance. What would be really nice is knowing something about the distribution from which $G$ is drawn. While this is difficult, we can leverage a well-known rule from probability theory called the <i>law of large numbers</i>.

#### The law of large numbers

As a quick review let us remind ourselves what the law of large numbers says.

Let $V_1, V_2, \dots, V_n$ denote a sequence of independent identically distributed random variables with true mean $\mu$. Let $\overline{V} = \frac{1}{n} \left(V_1 + V_2 + \dots + V_n \right)$. The law of large numbers says that $\overline{V}$ converges to $\mu$ as $n\rightarrow\infty$ (in probability under the weak law, and almost surely under the strong law).

What this says is that the arithmetic mean of a set of random draws will be "close" to the expected value of the distribution given enough draws.
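
A small simulation can make this concrete. The sketch below draws from an exponential distribution with a known mean and tracks the running arithmetic mean. The choice of distribution, the seed and the sample sizes are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent draws from a distribution with known mean 2.0
true_mean = 2.0
draws = rng.exponential(scale=true_mean, size=100_000)

# running_mean[n - 1] is the arithmetic mean of the first n draws
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 1_000, 100_000):
    print(f"n = {n:>7}: sample mean = {running_mean[n - 1]:.4f}")
```

With few draws the sample mean can be noticeably off, while with many draws it should settle near the true mean.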

#### Leveraging the law of large numbers

We can use this probability rule to our advantage to estimate the average (or expected) generalization error. If we can somehow generate a sequence of observations of this error, say $G_1, G_2, \dots, G_n$, then we know that $\overline{G} \approx E(G)$. How do we generate such a sequence? Enter $k$-fold cross-validation.

After conducting your train test split, you will randomly break your training set into $k$ equally sized (or roughly equal depending on how the division works out) chunks. You then generate "observations" of $G$ by cycling through each of the $k$ chunks. For each chunk you train your model on the $k-1$ other chunks and then calculate the error of that model on the chunk you held out. At the end you will have $k$ observations of $G$. Calculating the arithmetic mean of $G_1, \dots, G_k$ gives you an estimate of $E(G)$.

If those words were confusing let's look at some pictures instead.


<img src="cv1.png" width="60%"></img>

<br>
<br>
<br>

<img src="cv2.png" width="60%"></img>

Common values for $k$ are $5$ and $10$.
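
Before turning to `sklearn`, here is a minimal by-hand sketch of how the chunks described above could be built with `numpy`. The number of observations, the seed and the variable names are assumptions for illustration; `np.array_split` is used because it handles the case where the observations do not divide evenly into $k$ chunks.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical training set size and number of folds
n_obs, k = 103, 5

# Shuffle the row indices, then cut them into k roughly equal chunks
folds = np.array_split(rng.permutation(n_obs), k)

for i, holdout_idx in enumerate(folds):
    # Train on the other k-1 chunks, hold out chunk i
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Fold {i}: {len(train_idx)} training rows, "
          f"{len(holdout_idx)} holdout rows")
```

Every observation lands in exactly one holdout chunk, so each row is used for evaluation once and for training $k-1$ times.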

#### Implementing $k$-fold cross-validation in `sklearn`

You can implement cross-validation using `sklearn`'s `KFold` object. Documentation for this class can be found here <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html</a>.

In [7]:
## import KFold
from sklearn.model_selection import KFold

In [8]:
## make a KFold object
## n_splits controls the value of k
## shuffle=True randomly shuffles the data prior to splitting
## random_state is the same as for train_test_split
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=123)

In [9]:
## demonstrate .split, which returns a generator
kfold.split(X_train, y_train)

<generator object _BaseKFold.split at 0x13a8ee5f0>

In [10]:
## use for loop to demonstrate .split
for train_index, test_index in kfold.split(X_train, y_train):
    print("Train index:", train_index)
    print("Test index:", test_index)
    print()
    print()

Train index: [  0   1   2   3   6   7   8   9  10  12  14  15  16  17  18  19  20  21
  22  23  24  25  26  27  28  29  30  31  32  33  34  37  38  39  40  42
  44  45  46  47  49  51  52  53  56  58  59  60  61  62  63  64  65  66
  67  68  69  70  71  72  73  76  77  80  81  82  83  84  85  86  87  88
  89  92  93  94  95  96  98  99 101 103 104 105 106 108 109 110 111 112
 113 114 115 116 118 119 120 121 122 123 124 125 126 127 128 129 130 132
 133 135 136 137 139 140 141 142 143 144 146 148 149 151 152 153 154 155
 157 158 159 160 161 163 165 167 168 169 173 174 175 176 179 180 181 182
 183 184 186 187 189 190 191 192 193 194 195 196 197 198 199 201 203 204
 205 206 207 208 211 212 213 214 215 216 218 219 220 221 222 223 224 225
 226 227 228 231 233 234 235 238 241 242 243 244 245 247 248 249 251 253
 254 255 256 257 258 259 261 262 264 265 266 268 271 272 275 276 277 278
 279 281 282 284 285 286 287 288 290 293 295 296 297 298 299 300 301 302
 304 305 308 309 310 311 312 313 315 3

In [11]:
## When fitting a model we'd do something like the following
for train_index, test_index in kfold.split(X_train, y_train):
    ## get the kfold training data
    X_train_train = X_train[train_index,:]
    y_train_train = y_train[train_index]
    
    ## get the holdout data
    X_holdout = X_train[test_index,:]
    y_holdout = y_train[test_index]
    
    ## Then you'd fit your model
    ## Then you'd record the error on the holdout set here
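
To make the two placeholder steps concrete, here is one possible version of the full loop. The synthetic data, the `LinearRegression` model and mean squared error as the error measure are assumptions for this sketch; substitute your own model and metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical stand-in for a training set (assumed, for illustration only)
rng = np.random.default_rng(123)
X_tr = rng.random((800, 10))
y_tr = X_tr @ np.arange(1, 11) + rng.standard_normal(800)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=123)

# One error "observation" per fold
fold_mses = np.zeros(5)

for i, (train_index, test_index) in enumerate(kfold_demo.split(X_tr, y_tr)):
    # Fit a fresh model on the k-1 training chunks
    model = LinearRegression()
    model.fit(X_tr[train_index], y_tr[train_index])

    # Record the error on the held-out chunk
    preds = model.predict(X_tr[test_index])
    fold_mses[i] = mean_squared_error(y_tr[test_index], preds)

print("Fold MSEs:", fold_mses)
print("Cross-validation estimate of E(G):", fold_mses.mean())
```

Creating a new model object inside the loop keeps each fold's fit independent of the previous ones.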

##### Exercise

Using the $(Z, w)$ data from before perform $10$-fold cross-validation on the training set.

In [None]:
## Work here





In [None]:
## Work here





### Validation set or cross-validation

Cross-validation, when feasible, is preferred to a single validation set. In general it is better to have a collection of estimates than just a single point estimate.

However, it is not always feasible to perform cross-validation. Two limiting factors to consider are:
1. Data set size and
2. Model training time.

In the case of 1., if you have too few observations, cross-validation may not be practical. This is because splitting your data set into too many different sets leaves too few observations in each, which can lead to deficiencies in both model fitting and estimation of $G$.

Regarding 2., models that take prohibitively long to train limit the usefulness of cross-validation, because $k$-fold cross-validation requires you to train the model $k$ distinct times.

In either of those cases a validation set is preferred.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).