# Video: Splitting Data to Train and Test

This video demonstrates the practice of splitting a dataset into training and test sets, and scikit-learn's support for this practice.

## What is a Holdout Set?

* A holdout set is a subset of our training data that we reserve for later evaluation.



## Train/Test Splits

* The most common use of a holdout set is a train/test split.
  * Split training data into "train" and "test" sets.
  * Build the model from the "train" set.
  * Final evaluation on the "test" set.
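
As a rough illustration of the steps above, here is a minimal sketch of a train/test split done by hand with NumPy. The function name `manual_train_test_split`, its parameters, and the toy arrays are hypothetical and chosen only for this example. In practice scikit-learn's `train_test_split` does this job. The sketch rounds the test count up, which agrees with the 836 test rows out of 4,177 that appear later in this notebook.

```python
import numpy as np


def manual_train_test_split(features, target, test_fraction=0.2, seed=0):
    """Shuffle the row indices, then hold out the first test_fraction of them."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(features))
    # Round the test count up so small datasets still get a test row.
    test_count = int(np.ceil(test_fraction * len(features)))
    test_idx = shuffled[:test_count]
    train_idx = shuffled[test_count:]
    return features[train_idx], features[test_idx], target[train_idx], target[test_idx]


# Hypothetical toy data: 10 rows with 2 feature columns each.
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

train_X, test_X, train_y, test_y = manual_train_test_split(features, target)
print(train_X.shape, test_X.shape)
```

Every row lands in exactly one of the two sets, so the model never sees the test rows while it is being built.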

## What If the Test Loss is Bad?

* The pedantic view:
  * Never make decisions based on the test set!!! 🔥🔥🔥
* The practical view:
  * Something went wrong.
  * Restarting the modeling process is better than shipping a known bad model.
  * Add validation to your process.


## Does Overfitting Really Happen?

![Chart of ImageNet accuracy comparing performance of original test set vs a new test set](https://github.com/user-attachments/assets/5d66cf3a-4ecf-4ebc-ac83-c3aa80767390)

Image from "Do ImageNet Classifiers Generalize to ImageNet?" by Recht et al., 2019.

## Train/Test Recommendations

* Check the test set as little as possible.
* Do let bad test scores save you from bad choices.
* Use validation holdout sets (coming up next) to reduce test checks.
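
One way to get a validation holdout set is to call `train_test_split` twice: once to set aside the test set, and once more to split the remainder into train and validation sets. The sketch below assumes synthetic data and a 60/20/20 split, neither of which comes from the video. It is one possible arrangement, not a prescribed recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
target = rng.normal(size=100)

# First carve off the test set (20% of all rows) and leave it alone.
rest_X, test_X, rest_y, test_y = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Then split the remaining 80% into train and validation sets.
# 25% of the remaining 80% is 20% of the original data.
train_X, val_X, train_y, val_y = train_test_split(
    rest_X, rest_y, test_size=0.25, random_state=0
)

print(len(train_X), len(val_X), len(test_X))
```

Modeling decisions can then be checked against the validation set, so the test set is only consulted for the final evaluation.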

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
abalone = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/abalone.tsv", sep="\t")

In [None]:
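# Predict the "Rings" column from the numeric measurement columns.
abalone_target = abalone["Rings"]
# Drop the target itself, plus "Sex", which is categorical rather than numeric.
abalone_features = abalone.drop(["Rings", "Sex"], axis=1)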

In [None]:
# Hold out 20% of the rows as the test set; random_state makes the shuffle reproducible.
train_features, test_features, train_target, test_target = train_test_split(abalone_features, abalone_target, test_size=0.2, random_state=2024)

In [None]:
train_features.shape

(3341, 7)

In [None]:
test_features.shape

(836, 7)

In [None]:
train_target.shape

(3341,)

In [None]:
test_target.shape

(836,)

In [None]:
import sklearn.linear_model

In [None]:
linear_model = sklearn.linear_model.LinearRegression()
linear_model.fit(train_features, train_target)

In [None]:
# score() returns R^2, the coefficient of determination (1.0 is a perfect fit).
linear_model.score(train_features, train_target)

0.5372237227524753

In [None]:
linear_model.score(test_features, test_target)

0.47997882018702587

In [None]:
import sklearn.tree

decision_tree_model = sklearn.tree.DecisionTreeRegressor()
decision_tree_model.fit(train_features, train_target)

In [None]:
decision_tree_model.score(train_features, train_target)

1.0

In [None]:
# A negative R^2 means the model does worse than always predicting the mean of the test targets.
decision_tree_model.score(test_features, test_target)

-0.05522725490316738
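
The gap between training and test scores seen above can be reproduced on synthetic data. The sketch below uses made-up noisy linear data (not the abalone dataset) and a hypothetical helper named `score_gap`. It fits a linear model and an unrestricted decision tree, then reports each model's train R^2, test R^2, and the difference between them. Exact numbers depend on the random seed, so none are quoted here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a linear signal plus substantial noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
target = features @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=500)

train_X, test_X, train_y, test_y = train_test_split(
    features, target, test_size=0.2, random_state=0
)


def score_gap(model):
    """Fit on the train set and return (train R^2, test R^2, train minus test)."""
    model.fit(train_X, train_y)
    train_r2 = model.score(train_X, train_y)
    test_r2 = model.score(test_X, test_y)
    return train_r2, test_r2, train_r2 - test_r2


results = {
    "linear": score_gap(LinearRegression()),
    "tree": score_gap(DecisionTreeRegressor(random_state=0)),
}

for name, (train_r2, test_r2, gap) in results.items():
    print(f"{name}: train={train_r2:.3f} test={test_r2:.3f} gap={gap:.3f}")
```

A large gap, like the one the unrestricted tree shows, suggests the model has memorized the training rows rather than learned a pattern that carries over to held-out data.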