Up to now I have been using a single train and test split. That is fine for quick checks, but it can mislead if the split is lucky or unlucky. In this chapter I will create an explicit validation set, then switch to cross-validation. After that I will tune hyperparameters with grid search and randomized search. I will keep the model simple, a decision tree, so I can focus on the evaluation process itself.

# Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import (
    train_test_split,
    cross_validate,
    StratifiedKFold,
    GridSearchCV,
    RandomizedSearchCV,
)
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import uniform, randint

# Load data and make a validation split

In [2]:
wine = pd.read_csv('https://bit.ly/wine_csv_data')
X = wine[['alcohol', 'sugar', 'pH']]
y = wine['class']

In [3]:
wine.head()

   alcohol  sugar    pH  class
0      9.4    1.9  3.51    0.0
1      9.8    2.6  3.20    0.0
2      9.8    2.3  3.26    0.0
3      9.8    1.9  3.16    0.0
4      9.4    1.9  3.51    0.0


In [4]:
# First split: hold out test set
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size = 0.2, random_state = 42)

In [5]:
# Second split: from the training portion, hold out validation set
sub_X, val_X, sub_y, val_y = train_test_split(
    train_X, train_y, test_size = 0.2, random_state = 42)

In [6]:
sub_X.shape, val_X.shape

((4157, 3), (1040, 3))

The original training split of 5197 samples is now divided into a sub-train set of 4157 samples and a validation set of 1040 samples.
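The sizes above can be checked by hand. The sketch below assumes that a fractional test_size is rounded up to a whole number of samples and that the remaining samples go to the training side. The helper name split_sizes is hypothetical and only serves this illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest is kept for training.
    """
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


# Second split from the text: 5197 training samples, 20% held out for validation.
n_sub, n_val = split_sizes(5197, 0.2)
print(n_sub, n_val)  # 4157 1040
```

Two nested 20% splits leave roughly 64% of the full dataset for fitting, which is one reason to move to cross-validation.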

# Fit a tree on sub-train, evaluate on validation

I am not scaling since trees are scale-invariant. The point is to see the train-validation gap.

In [7]:
dt = DecisionTreeClassifier(random_state = 42)
dt.fit(sub_X, sub_y)

print(dt.score(sub_X, sub_y))
print(dt.score(val_X, val_y))

0.9971133028626413
0.864423076923077


The sub-train score is about 0.997, while the validation score is about 0.864. A near-perfect training score with a clearly lower validation score is the usual overfitting pattern for an unpruned tree.
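To see how the gap reacts to tree size, here is a minimal sketch on synthetic data. The data from make_classification, the names X_demo and y_demo, and the choice of max_depth = 3 are all assumptions for illustration, not part of the wine experiment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gaps = {}
for depth in (None, 3):  # None grows the tree fully; 3 is an arbitrary shallow limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    train_score = tree.score(X_tr, y_tr)
    val_score = tree.score(X_val, y_val)
    gaps[depth] = (train_score, val_score, train_score - val_score)
    print(f"max_depth={depth}: train={train_score:.3f} "
          f"val={val_score:.3f} gap={train_score - val_score:.3f}")
```

The fully grown tree memorizes the training rows, so its training score says little about generalization. The exact numbers depend on the synthetic data.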

# Cross-validation on the original train split

The validation split is informative, but we want a more stable estimate. Cross-validation averages over multiple folds, which reduces variance from any single split.

In [8]:
scores = cross_validate(dt, train_X, train_y)

print(scores)
print(np.mean(scores['test_score']))

{'fit_time': array([0.00510907, 0.00456619, 0.00486827, 0.00455022, 0.00434589]), 'score_time': array([0.00069785, 0.00058079, 0.00063062, 0.00056291, 0.00054502]), 'test_score': array([0.87019231, 0.84615385, 0.87680462, 0.84889317, 0.83541867])}
0.8554925223957948


cross_validate returns a dictionary with the keys fit_time, score_time, and test_score. The test_score array holds one validation score per fold. By default, cross_validate performs 5-fold cross-validation, so the array has five values.

Averaging the five test_score values gives the cross-validated score. I can also control the splitter.
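To make the mechanics concrete, the loop below reproduces by hand what cross_validate does for a classifier with default settings: split into five stratified folds, fit a fresh copy of the model on four of them, and score on the fifth. It is a sketch on synthetic data. X_demo and y_demo are hypothetical names, not the wine data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
tree = DecisionTreeClassifier(random_state=42)

manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(tree)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[valid_idx], y_demo[valid_idx]))

library_scores = cross_validate(tree, X_demo, y_demo)["test_score"]
print(np.mean(manual_scores), np.mean(library_scores))
```

Every sample is used for validation exactly once, so no single lucky or unlucky split dominates the estimate.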

In [9]:
# Pass StratifiedKFold explicitly to preserve class balance in each fold.
# For a classifier this is already the default splitter, so the mean matches In [8].

scores = cross_validate(dt, train_X, train_y, cv = StratifiedKFold())

print(np.mean(scores['test_score']))

0.8554925223957948


In [10]:
# Setting the number of splits and shuffling helps when I want more robustness

splitter = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 42)
scores = cross_validate(dt, train_X, train_y, cv = splitter)

print(np.mean(scores['test_score']))

0.8581873425226026


The mean scores are close across settings (about 0.855 with 5 unshuffled folds and about 0.858 with 10 shuffled folds), which is a good sign that my estimate is stable.
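Comparing means is one view of stability. Another is the spread of the per-fold scores. The sketch below reports the mean together with the standard deviation across folds. It runs on synthetic data (an assumption), with X_demo and y_demo as hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = cross_validate(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=splitter
)["test_score"]

# A small standard deviation relative to the mean suggests a stable estimate.
print(f"mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```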

# Hyperparameter Tuning

Now let's tune a small set of hyperparameters. The first pass will only vary min_impurity_decrease, a simple regularizer for trees: a split is made only if it reduces impurity by at least this amount. I expect very small values to help prune tiny branches.

## Grid Search

In [11]:
params = {'min_impurity_decrease' : [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}

gs = GridSearchCV(
    DecisionTreeClassifier(random_state = 42),
    params,
    n_jobs = -1
)

gs.fit(train_X, train_y)

dt = gs.best_estimator_

print(dt.score(train_X, train_y))
print(gs.best_params_)
print(gs.cv_results_['mean_test_score'])
print(gs.cv_results_['params'][gs.best_index_]) # should match best_params_

0.9615162593804117
{'min_impurity_decrease': 0.0001}
[0.86800067 0.86453617 0.86492226 0.86780891 0.86761605]
{'min_impurity_decrease': 0.0001}


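best_estimator_ is the tree refit on the whole training split with the best parameters, so the first number is its training score (about 0.962). best_params_ holds the setting with the highest mean cross-validated score, and the last line confirms that it matches the entry of cv_results_['params'] at best_index_.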
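The relationship between these attributes can be checked directly. This is a minimal sketch on synthetic data. The grid values and the names X_demo, y_demo, and demo_grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

demo_grid = {"min_impurity_decrease": [0.0001, 0.001, 0.01]}  # hypothetical values
search = GridSearchCV(DecisionTreeClassifier(random_state=42), demo_grid)
search.fit(X_demo, y_demo)

best_row = search.best_index_
# best_params_ and best_score_ are the row of cv_results_ at best_index_.
print(search.best_params_ == search.cv_results_["params"][best_row])
print(search.best_score_ == search.cv_results_["mean_test_score"][best_row])
# best_estimator_ carries the winning setting and is already fitted.
print(search.best_estimator_.get_params()["min_impurity_decrease"])
```

Reading best_score_ is a little more direct than taking np.max over mean_test_score, and it returns the same number.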

This was a narrow grid. I will expand it to include depth and split controls, keeping the ranges moderate so that the runtime stays manageable.

In [12]:
params = {'min_impurity_decrease' : np.arange(0.0001, 0.001, 0.0001),
          'max_depth' : range(5, 20, 1),
          'min_samples_split' : range(2, 100, 10)
         }

gs = GridSearchCV(
    DecisionTreeClassifier(random_state = 42),
    params,
    n_jobs = -1
)

gs.fit(train_X, train_y)

print(gs.best_params_)
print(np.max(gs.cv_results_['mean_test_score']))

{'max_depth': 14, 'min_impurity_decrease': np.float64(0.0004), 'min_samples_split': 12}
0.8683865773302731


I keep the best settings, but I still want to confirm on the test set later. For now I will only look at cross-validated means.
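The cost of the expanded grid follows from the number of combinations. The sketch below counts them for the grid from In [12], assuming the default of 5 folds. The counts are computed rather than hard-coded because np.arange with a floating-point step can include or drop the last value depending on rounding.

```python
import math

import numpy as np

grid = {
    "min_impurity_decrease": np.arange(0.0001, 0.001, 0.0001),
    "max_depth": range(5, 20, 1),
    "min_samples_split": range(2, 100, 10),
}

# Every combination of values is one candidate model.
n_candidates = math.prod(len(values) for values in grid.values())
n_folds = 5  # assumption: the GridSearchCV default
n_fits = n_candidates * n_folds  # plus one final refit on the full training split

for name, values in grid.items():
    print(name, len(values))
print(n_candidates, n_fits)
```

Adding a fourth parameter multiplies this count again, which is the practical limit of exhaustive grids.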

## Randomized Search

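Grid search is exhaustive over a grid. As the search space grows, the grid becomes either expensive or too coarse. Randomized search instead draws a fixed number of candidates from distributions, which can be continuous, so the cost is set by the number of iterations rather than by the size of the space.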

I will create integer and continuous distributions for the tree controls. Together they define the space from which candidates are sampled.
Note that impurity decrease is drawn from a range of small positive values, depth from a fairly large range, and the sample counts from modest ranges.

In [13]:
# uniform(loc, scale) samples from [loc, loc + scale]; randint(low, high) excludes high.
params = {'min_impurity_decrease' : uniform(0.0001, 0.001),
          'max_depth' : randint(20, 50),
          'min_samples_split' : randint(2, 25),
          'min_samples_leaf' : randint(1, 25)
         }
# Run the randomized search with 100 sampled candidates, which should be plenty for this small dataset.
rs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state = 42),
    params,
    n_iter = 100,
    n_jobs = -1,
    random_state = 42
)

rs.fit(train_X, train_y)

print(rs.best_params_)
print(np.max(rs.cv_results_['mean_test_score']))

{'max_depth': 39, 'min_impurity_decrease': np.float64(0.00034102546602601173), 'min_samples_leaf': 7, 'min_samples_split': 13}
0.8695428296438884
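The scipy.stats parameterization is easy to misread, so it is worth checking what the distributions in the cell above actually cover. In this sketch the variable names are hypothetical. The calls are the same uniform and randint used above, sampled directly with rvs.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
impurity_dist = uniform(0.0001, 0.001)
# randint(low, high) draws integers from low to high - 1.
depth_dist = randint(20, 50)

impurity_samples = impurity_dist.rvs(size=1000, random_state=42)
depth_samples = depth_dist.rvs(size=1000, random_state=42)

print(impurity_samples.min(), impurity_samples.max())  # stays within [0.0001, 0.0011]
print(depth_samples.min(), depth_samples.max())        # stays within [20, 49]
```

So min_impurity_decrease can be drawn up to about 0.0011 here. If the upper bound should be 0.001, the scale would need to be 0.0009.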


In [14]:
# Now evaluate the best model on the held-out test set.

dt = rs.best_estimator_
print(dt.score(test_X, test_y))

0.86


This tells me how well the chosen configuration generalizes to completely unseen data. The test score of 0.86 is close to the cross-validated mean of about 0.87, so the cross-validation estimate was not overly optimistic. I was hoping for a small but real improvement over the earlier settings, but a similar score is also fine. Stability sometimes matters more than chasing small gains.
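A simple habit is to print the cross-validated score and the test score side by side. The sketch below does this on synthetic data. The data, the search space, and the names X_demo, y_demo, and demo_space are assumptions for illustration only.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the wine dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_space = {  # hypothetical search space
    "min_impurity_decrease": uniform(0.0001, 0.001),
    "max_depth": randint(3, 20),
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42), demo_space, n_iter=10, random_state=42
)
search.fit(X_tr, y_tr)

cv_score = search.best_score_           # mean validation score of the best candidate
test_score = search.score(X_te, y_te)   # best_estimator_ scored on held-out data
print(f"cv={cv_score:.3f} test={test_score:.3f} diff={cv_score - test_score:+.3f}")
```

A test score far below the cross-validated score would suggest that the search had overfit the validation folds.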

# What I learned

A single validation split showed clear overfitting on the unpruned tree, and cross-validation gave me a steadier estimate of the same problem. A small amount of regularization through min_impurity_decrease raised the cross-validated mean from about 0.855 to about 0.868. Grid search worked well for a narrow set of choices. Randomized search let me cover continuous ranges with a fixed budget of 100 candidates and reached a slightly higher cross-validated mean (0.8695 against 0.8684). The final test score of 0.86 is in line with the cross-validated estimates rather than a dramatic jump, which is realistic on a small tabular dataset.