# Lesson: Train-Test

## Introduction

In this lesson, we'll set up our training and testing regimen appropriately.

### Learning Outcomes

By the end of this lesson, you will be able to:

- Explain the concept of **overfitting** and **underfitting**.
- Define **bias** and **variance** in the context of machine learning.
- Analyze the trade-off between bias and variance.
- Explain the necessity of training and testing.

## Overfitting and Underfitting

Every real-world dataset contains some statistical noise, and it comes from several places. There's randomness in physical and environmental processes: identical twins, for example, don't have exactly the same height. There's randomness in sampling: our dataset may happen to include a couple of outliers that disproportionately affect our model. And **data collection** itself is subject to measurement error, recording error, and rounding error.

Every dataset has quirks that don't reflect any real underlying patterns or associations. It's rarely obvious where to draw the line between the _signal_ (patterns and associations) and the _noise_ (random quirks of our sample). When we're training an ML model, we want it to perform well on new, unseen instances. We don't need a model that's only good at predicting things that have already happened or giving us answers we already have.

A model that has learned too much noise is described as **overfitted**. The biggest indicator of an overfitted model is great results on the training set but drastically worse results on new, unseen instances. It's like memorizing the answers to last year's exam instead of learning the material. We ace last year's test, but fail when we face new content. Overfitting is picking up too much _noise_.

**Underfitting** is also a problem. This occurs when a model is too simple and doesn't capture enough of the data's structure. An underfitted model performs poorly in both training and testing. It's like barely studying for an exam. We can't answer the practice questions, let alone the ones on the real test. Underfitting is ignoring the _signal_.

Overfitting is easy to fall into because CPUs and GPUs make it feasible to train many highly flexible models. Such a model can score almost perfectly on the training set yet do noticeably worse on the test set. Underfitting is less common in practice.
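The contrast between the two failure modes can be sketched by comparing training and test accuracy as model complexity grows. This is a minimal illustration, not part of the lesson's own code: the synthetic `make_moons` data, the noise level, and the use of a decision tree's `max_depth` as the complexity knob are all assumptions chosen for the example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with deliberate noise (illustrative choice)
X, y = make_moons(n_samples=400, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_depth controls complexity: 1 is very simple, None lets the tree grow fully
scores = {}
for label, depth in [("too simple", 1), ("moderate", 3), ("too complex", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[label] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{label:12s} train={scores[label][0]:.2f} test={scores[label][1]:.2f}")
```

The fully grown tree memorizes the training rows, so the gap between its training and test accuracy is the symptom to look for. The depth-1 tree scores lower than the fully grown tree on the training set itself, which is the mark of a model that is too simple.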

## Bias and Variance

The goal of an ML model is to produce useful, generalizable results that hold beyond the dataset it is trained on. No model is perfect, except perhaps on a toy dataset. Toy datasets such as iris, penguins, wine, and many Kaggle datasets are adequate for learning, but clients have their own private data.

Since errors can't be eliminated entirely, it's more appropriate to manage and mitigate them.

There are three sources of error:

1. Irreducible error due to randomness, imprecision, human fallibility, etc. There's not much we can do about it. We have to live with it.
2. **Bias**: failure to learn underlying patterns in the data. High bias results in an underfitted model.
3. **Variance**: sensitivity to randomness in the training set, where the model learns noise as if it were signal. High variance leads to an overfitted model.
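For squared-error loss, these three sources line up with the terms of the standard bias-variance decomposition. As a sketch, assume the data follow $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has mean zero and variance $\sigma^2$, and let $\hat{f}$ be the model fitted on a randomly drawn training set. These symbols are introduced here only for illustration; the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\sigma^2}_{\text{irreducible error}}
+ \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
$$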

We can set the irreducible error aside, since we can't do anything about it. Bias and variance are under our control, but reducing one tends to increase the other, so we have to balance them carefully. Consider the image below.

![Underfit, Optimal, Overfit](../assets/underfit-optimum-overfit.png)

If a model is too simple, it will perform poorly on the training data as well as on new, unseen instances. If a model is too complex, it may fit the training data well, but will perform poorly on new instances. We want the Goldilocks model, the one that's "just right."
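Bias and variance can also be estimated empirically by refitting the same kind of model on many noisy training sets and looking at how its predictions behave. The sketch below is an assumption-laden illustration: the sine-shaped `true_f`, the noise level, the polynomial degrees, and the helper `bias2_and_variance` are all hypothetical choices, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "signal", chosen only for illustration
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)       # fixed inputs for every training set
x_eval = np.linspace(0.05, 0.95, 50)  # points where predictions are compared
n_repeats, noise_sd = 200, 0.3

def bias2_and_variance(degree):
    preds = np.empty((n_repeats, x_eval.size))
    for i in range(n_repeats):
        # Same inputs, fresh noise: a new training set each repetition
        y_train = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        preds[i] = poly(x_eval)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_f(x_eval)) ** 2)  # systematic miss
    variance = np.mean(preds.var(axis=0))               # spread across refits
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 3, 10)}
for d, (b2, var) in results.items():
    print(f"degree {d:2d}: bias^2={b2:.4f} variance={var:.4f}")
```

Under these assumptions, the straight line (degree 1) misses the curve in the same way every time, which shows up as high bias, while the degree-10 polynomial tracks the noise and its predictions vary more from one training set to the next.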

<p style="background-color:white;">
<img src="../assets/bias_variance.png" alt="Bias and Variance">
</p>

## Training and Test Sets

It's vital to split our dataset into a training set and a test set. The test set must be completely isolated from the training process. Comparing performance on the two sets then indicates whether a model is underfitted (poor on both), overfitted (good on training, poor on testing), or well fitted (good on both).

Holding out a test set means giving up some valuable data that could have been used for training, even though we'd like to give the model the most complete picture of the true underlying distribution. Training on less data tends to increase the variance of the model. Testing on less data makes our estimate of its performance less reliable. The split is a compromise between the two.

A general rule of thumb is to use 25% of the dataset as the test set and the remaining 75% for training.
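As a quick check of what a 25/75 split looks like in practice, the sketch below splits the 150-row iris dataset and confirms that no row lands in both sets. The choice of iris and of `random_state=42` is only for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True, as_frame=True)

# test_size=0.25 holds out a quarter of the rows; the rest go to training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(len(X), len(X_train), len(X_test))  # 150 112 38

# Isolation check: no row index appears in both sets
overlap = set(X_train.index) & set(X_test.index)
print(len(overlap))  # 0
```

A quarter of 150 is 37.5, so the test set is rounded up to 38 rows and the training set gets the remaining 112.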

### Analogy

**Training Set: Study**

Imagine we're studying, say, rudimentary biology in depth. We spend weeks working through practice questions. We go over them again and again, find patterns, fix mistakes, and improve our results.

Those practice questions are our training set. They are what we learn from. We see them repeatedly, so our grasp of rudimentary biology improves.

**Test Set: Exam**

Next we take the exam. It contains questions we've never practiced before, but they cover the same concepts: cell biology, genetics, evolution, and ecology.

Those unseen questions are our test set. They measure whether we can generalize what we learned about rudimentary biology to new, unseen questions. We cannot study during the exam. We can only recall what we've already learned.

### Target Variables

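A target variable is the variable we predict, and it depends on the other variables (the features). The two are mutually exclusive: the target must not also appear among the features, and no feature should be derived from the target. Different fields use different names for the same pair: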

- **target** -> features
- **predicted** -> predictor
- **dependent** -> independent
- **explained** -> explanatory
- **measured** -> manipulated
- **response** -> react
- **answer** -> learns

During training, the model is given the answer (the target variable) and learns how the features relate to it. For new data, the answer is unknown, and the model has to predict it.

A model can have multiple targets, though that may be a bad idea. We want to keep it simple.
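Keeping the target out of the features can be sketched with a single DataFrame that holds both. The example assumes the iris dataset, whose combined frame stores the answer in a column named `target`; the variable name `target_name` is just a label for this illustration.

```python
from sklearn import datasets

# as_frame=True exposes one DataFrame with the feature columns plus "target"
df = datasets.load_iris(as_frame=True).frame

target_name = "target"
y = df[target_name]                 # what we want to predict
X = df.drop(columns=[target_name])  # everything the model may learn from

print(list(X.columns))
print(y.name)
```

Dropping the target column before training guarantees the model cannot simply read the answer off its inputs.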

```python
from sklearn.model_selection import train_test_split
from sklearn import datasets

# Load iris as pandas objects: X holds the features, y holds the target
iris = datasets.load_iris(as_frame=True)
X, y = iris['data'], iris['target']

# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.describe()
```

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 112.0 | 112.0 | 112.0 | 112.0 |
| mean | 5.830357 | 3.040179 | 3.807143 | 1.214286 |
| std | 0.819123 | 0.43712 | 1.73531 | 0.747953 |
| min | 4.3 | 2.0 | 1.1 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.3 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.7 | 4.2 | 6.7 | 2.5 |


```python
X_test.describe()
```

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 38.0 | 38.0 | 38.0 | 38.0 |
| mean | 5.881579 | 3.107895 | 3.613158 | 1.155263 |
| std | 0.863949 | 0.433952 | 1.867238 | 0.811637 |
| min | 4.4 | 2.2 | 1.0 | 0.1 |
| 25% | 5.125 | 2.8 | 1.525 | 0.3 |
| 50% | 5.9 | 3.1 | 4.45 | 1.3 |
| 75% | 6.475 | 3.375 | 5.075 | 1.875 |
| max | 7.9 | 4.4 | 6.9 | 2.3 |


```python
y_train.describe()
```

```
count    112.000000
mean       1.026786
std        0.810514
min        0.000000
25%        0.000000
50%        1.000000
75%        2.000000
max        2.000000
Name: target, dtype: float64
```

```python
y_test.describe()
```

```
count    38.000000
mean      0.921053
std       0.850487
min       0.000000
25%       0.000000
50%       1.000000
75%       2.000000
max       2.000000
Name: target, dtype: float64
```
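With the split in place, a natural next step is to fit a model on the training rows only and then score it on the held-out rows. The sketch below repeats the same split so that it runs on its own. The choice of `KNeighborsClassifier` with `n_neighbors=5` is an arbitrary assumption for illustration, not a recommendation from the lesson.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris(as_frame=True)
X, y = iris["data"], iris["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The classifier is an arbitrary choice for illustration
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)  # the model only ever sees the training rows

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)  # measured on rows unseen during fitting
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large drop from training accuracy to test accuracy would suggest overfitting, while low accuracy on both would suggest underfitting.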