AutoMLSearch uses slightly different splits for each pipeline #1471

@freddyaboulton

Description

Repro (you need to check out the random-split-seeds branch, because on main we don't store the random state used for the data split):

from evalml.demos import load_breast_cancer
from evalml.automl import AutoMLSearch
from evalml.utils.gen_utils import check_random_state_equality
import numpy as np
import itertools

def make_seed_from_state(state):
    # Rebuild a RandomState generator from a state tuple captured during search.
    rs = np.random.RandomState()
    rs.set_state(state)
    return rs

def check_random_state(state_1, state_2):
    # True if the two captured state tuples describe identical generators.
    rs_1 = make_seed_from_state(state_1)
    rs_2 = make_seed_from_state(state_2)
    return check_random_state_equality(rs_1, rs_2)

X, y = load_breast_cancer()

automl = AutoMLSearch(max_batches=2, problem_type="binary")
automl.search(X, y)

# Pairwise-compare the random state captured for each of the 14 evaluated
# pipelines' data splits.
seeds_equal = []
for i, j in itertools.combinations(range(14), 2):
    are_equal = check_random_state(automl.data_split_seeds[i], automl.data_split_seeds[j])
    seeds_equal.append(are_equal)

# At least one pair of pipelines saw a different random state.
assert not all(seeds_equal)
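
As an aside, the captured state tuples can also be compared directly, without rebuilding RandomState objects or the branch-only check_random_state_equality helper. A rough sketch, relying on the fact that a RandomState state tuple is ('MT19937', key array, pos, has_gauss, cached_gaussian):

import numpy as np

def states_equal(state_1, state_2):
    # Compare all five components of the state tuple; the 624-entry key
    # array needs np.array_equal rather than ==.
    return (state_1[0] == state_2[0]
            and np.array_equal(state_1[1], state_2[1])
            and state_1[2:] == state_2[2:])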

The issue with data_split.split seeing a different random state every time it is called is that the resulting split is slightly different each time:

from sklearn.model_selection import StratifiedKFold

# Rebuild the random states captured for the first two pipelines and hand
# each one to its own splitter, as AutoMLSearch effectively does.
seed_1 = make_seed_from_state(automl.data_split_seeds[0])
seed_2 = make_seed_from_state(automl.data_split_seeds[1])
split_1 = StratifiedKFold(n_splits=3, random_state=seed_1, shuffle=True)
split_2 = StratifiedKFold(n_splits=3, random_state=seed_2, shuffle=True)

# None of the folds line up between the two pipelines.
for (train_index_1, test_index_1), (train_index_2, test_index_2) in zip(split_1.split(X, y), split_2.split(X, y)):
    assert set(train_index_1) != set(train_index_2)
    assert set(test_index_1) != set(test_index_2)
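
This behavior comes down to how sklearn resolves random_state via sklearn.utils.check_random_state: an int is converted into a brand-new RandomState on every call, while an existing RandomState instance is passed through unchanged, so its internal state keeps advancing between splits. A quick sketch of the distinction:

from sklearn.utils import check_random_state
import numpy as np

# An int is resolved to a fresh generator on every call, so the shuffle
# is reproduced from scratch each time:
assert check_random_state(10) is not check_random_state(10)

# An existing RandomState passes through as the same object, whose state
# advances with every split it performs:
rs = np.random.RandomState(10)
assert check_random_state(rs) is rs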

I think we should change this: it introduces more variability into automl results than necessary and prevents a true apples-to-apples comparison between pipelines. That said, I don't expect fixing it to substantially change the results of automl search (the pipeline rankings would probably stay the same).

One possible solution is to construct the split class with an integer random seed instead of the np.random.RandomState stored on the automl object. With an integer seed, I believe the indices are identical across repeated calls:

from sklearn.model_selection import StratifiedKFold

# With an integer seed, each call to split() re-derives the shuffle from
# scratch, so repeated calls produce identical folds.
split_1 = StratifiedKFold(n_splits=3, random_state=10, shuffle=True)

first_train_set = []
first_test_set = []
for (train_index_1, test_index_1) in split_1.split(X, y):
    first_train_set.append(set(train_index_1))
    first_test_set.append(set(test_index_1))

second_train_set = []
second_test_set = []
for (train_index_2, test_index_2) in split_1.split(X, y):
    second_train_set.append(set(train_index_2))
    second_test_set.append(set(test_index_2))

assert first_train_set == second_train_set
assert first_test_set == second_test_set
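
A rough sketch of what the fix could look like where automl constructs the data splitter (the construction site and variable names here are hypothetical; this assumes a helper like evalml's get_random_seed, which draws an integer seed from a RandomState):

from sklearn.model_selection import StratifiedKFold
from evalml.utils.gen_utils import get_random_seed
import numpy as np

# Draw one integer seed up front from the automl random state...
random_state = np.random.RandomState(0)
data_split_seed = get_random_seed(random_state)

# ...and hand the int, not the RandomState itself, to the splitter, so
# every pipeline's call to split() reproduces the same folds.
data_split = StratifiedKFold(n_splits=3, random_state=data_split_seed, shuffle=True)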

Labels: bug