# Video: Using Cross Validation with Scikit-Learn

This video shows how to use scikit-learn's support for cross validation.

In [None]:
import pandas as pd

In [None]:
abalone = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/abalone.tsv", sep="\t")

In [None]:
abalone_target = abalone["Rings"]
abalone_features = abalone.drop(["Rings", "Sex"], axis=1)

Script:
* Scikit-learn includes a number of classes and functions to support cross validation.
* I will show you the `cross_validate` function now.
* This function wraps the built-in classes for handling $k$ folds. When the model is a classifier with a binary or multiclass target, it automatically switches to stratified splitting so that the classes are evenly distributed across the folds.
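
To make the difference between plain and stratified splitting concrete, here is a minimal sketch using scikit-learn's `KFold` and `StratifiedKFold` classes directly. The toy arrays `features` and `labels` are made up for illustration and are not part of the abalone data. Since the abalone target is used for regression, `cross_validate` would use plain $k$-fold splitting there.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy data: 10 samples, 8 of class 0 followed by 2 of class 1.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Plain k-fold splits by position only and ignores the labels,
# so both class-1 samples end up in the same validation fold.
plain_counts = [
    int(labels[val_idx].sum())
    for _, val_idx in KFold(n_splits=2).split(features)
]

# Stratified k-fold looks at the labels and spreads each class across folds.
stratified_counts = [
    int(labels[val_idx].sum())
    for _, val_idx in StratifiedKFold(n_splits=2).split(features, labels)
]

print("class-1 samples per validation fold (plain):     ", plain_counts)
print("class-1 samples per validation fold (stratified):", stratified_counts)
```

With plain splitting, one validation fold has no class-1 samples at all, while stratified splitting puts one in each fold.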

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.model_selection import cross_validate

Script:
* Like the `train_test_split` function, `cross_validate` is in the subpackage `sklearn.model_selection`.
* I will repeat the train test split from a previous video, and demonstrate the use of the `cross_validate` function to pick between a linear model and a couple of decision tree choices.

In [None]:
train_features, test_features, train_target, test_target = train_test_split(abalone_features, abalone_target, test_size=0.2, random_state=2024)

In [None]:
import sklearn.linear_model

linear_model = sklearn.linear_model.LinearRegression()
linear_model_cv = cross_validate(linear_model, train_features, train_target)
linear_model_cv

{'fit_time': array([0.03686595, 0.04597354, 0.03516054, 0.01233387, 0.00497794]),
 'score_time': array([0.01032877, 0.00903773, 0.00328207, 0.00370979, 0.00520945]),
 'test_score': array([0.56422239, 0.5031016 , 0.54365272, 0.53685935, 0.49316824])}

Script:
* The `cross_validate` function takes in a model, features, and a target, and has many optional parameters which I will skip for now.
* The model does not need to be trained.
* In fact, you are expected to provide an untrained model which will be copied for each fold of the cross validation.
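* The output from `cross_validate` has some stats on the running time, and a `test_score` key with one validation score per fold. For a regressor, the default score is the $R^2$ from the model's `score` method, so higher is better.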
* That `test_score` key is what we are looking for.
* Let's take the average of those values.
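
As a rough illustration of two of the points above, the sketch below checks that the model passed to `cross_validate` stays untrained, and uses the optional `scoring` parameter to request an actual loss alongside $R^2$. It runs on synthetic data from `make_regression` rather than the abalone data, and that stand-in data set is an assumption of this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; the video uses the abalone data).
features, target = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

model = LinearRegression()
results = cross_validate(
    model,
    features,
    target,
    cv=5,  # the default number of folds, written out explicitly
    scoring=["r2", "neg_mean_squared_error"],
)

# With a list of scorers, each one gets its own "test_<name>" key.
print(sorted(results.keys()))
print("mean R^2:", results["test_r2"].mean())
print("mean negated MSE:", results["test_neg_mean_squared_error"].mean())

# The model was copied for each fold, so the original is still unfitted.
print("original model fitted?", hasattr(model, "coef_"))
```

Scikit-learn negates error metrics such as mean squared error so that higher is always better, which is why the second scorer reports values at or below zero.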

In [None]:
linear_model_cv['test_score'].mean()

0.5282008608429815

Script:
* Let's repeat with a decision tree next.

In [None]:
import sklearn.tree

decision_tree_model = sklearn.tree.DecisionTreeRegressor()
decision_tree_cv = cross_validate(decision_tree_model, train_features, train_target)
decision_tree_cv

{'fit_time': array([0.05203772, 0.03475904, 0.04750347, 0.03187346, 0.11232376]),
 'score_time': array([0.00450063, 0.00859714, 0.00474834, 0.00442433, 0.01538777]),
 'test_score': array([0.17903315, 0.26667689, 0.12970868, 0.02829154, 0.13862586])}

In [None]:
decision_tree_cv['test_score'].mean()

0.1484672244623154

Script:
* That's a lot worse than the linear model.
* I will make one more decision tree with a tweak to help it generalize better.
* You do not need to understand how this tweak works now.

In [None]:
decision_tree_model_2 = sklearn.tree.DecisionTreeRegressor(max_depth=2)
decision_tree_cv_2 = cross_validate(decision_tree_model_2, train_features, train_target)
decision_tree_cv_2

{'fit_time': array([0.00871205, 0.00624514, 0.0075264 , 0.00717497, 0.00568771]),
 'score_time': array([0.00295758, 0.00264716, 0.00366211, 0.00299931, 0.00235558]),
 'test_score': array([0.36439348, 0.35201534, 0.3657017 , 0.35338235, 0.36565501])}

In [None]:
decision_tree_cv_2['test_score'].mean()

0.36022957492304175

Script:
* That is better than the original decision tree, but worse than the linear regression.
* Based on these validation scores, the original decision tree is the worst, the second decision tree is better, and the linear model is best.
* Let's build the full models now.
* Before I do that, I should emphasize that the proper practice would be to go with the linear model and test only that one, but I will show you the others so you can see how cross validation helps us avoid bad choices.
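
The practice described above, where only the model with the best validation score is evaluated on the test set, could be sketched as follows. This is one possible way to organize it, not the notebook's own code. The synthetic data, the `candidates` dictionary, and its labels are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the video uses the abalone data).
features, target = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
train_X, test_X, train_y, test_y = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Hypothetical labels for the three kinds of model compared in the video.
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "shallow_tree": DecisionTreeRegressor(max_depth=2, random_state=0),
}

# Mean validation score per candidate, using only the training data.
mean_scores = {
    name: cross_validate(model, train_X, train_y)["test_score"].mean()
    for name, model in candidates.items()
}
best_name = max(mean_scores, key=mean_scores.get)

# Only the winner is fit on the full training set and scored on the test set.
best_model = candidates[best_name]
best_model.fit(train_X, train_y)
test_score = best_model.score(test_X, test_y)
print(best_name, mean_scores[best_name], test_score)
```

Keeping the test set out of the comparison means the final test score is not biased by having been used to pick the winner.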

In [None]:
decision_tree_model.fit(train_features, train_target)
decision_tree_model.score(test_features, test_target)

0.03210371681153401

Script:
* That's even worse than the previous validation score of about 0.15.

In [None]:
decision_tree_model_2.fit(train_features, train_target)
decision_tree_model_2.score(test_features, test_target)

0.3185606188329424

Script:
* In comparison to the previous 0.36 in validation, this decision tree is relatively close.
* Just 0.04 off.

In [None]:
linear_model.fit(train_features, train_target)
linear_model.score(test_features, test_target)

0.47997882018702587

Script:
* The linear model is also relatively close to the previous 0.53 in validation.
* So while the test scores all dropped compared to their validation scores, and the first decision tree dropped a huge amount, the ordering of the different model choices was preserved.
* We do not expect the ordering to always be exactly the same, but we hope that the model that is best in validation is close to the best on test.
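
The comparison above can be checked with a little arithmetic. The numbers below are the mean validation scores and test scores reported earlier, rounded to four decimal places. The dictionary labels are my own shorthand for the three models.

```python
# Scores copied from the outputs above, rounded to four decimal places.
validation = {"linear": 0.5282, "tree": 0.1485, "shallow_tree": 0.3602}
test = {"linear": 0.4800, "tree": 0.0321, "shallow_tree": 0.3186}

# How far each model's score dropped from validation to test.
gaps = {name: round(validation[name] - test[name], 4) for name in validation}

# Rank the models from best to worst under each set of scores.
validation_order = sorted(validation, key=validation.get, reverse=True)
test_order = sorted(test, key=test.get, reverse=True)

print(gaps)
print(validation_order)
print(test_order)
print("same ordering:", validation_order == test_order)
```

The unrestricted decision tree shows the largest drop, roughly 0.12, while the other two models drop by less than 0.05, and both rankings put the linear model first.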
