Repro (you need to check out the random-split-seeds branch, because otherwise we don't store the state of the random seed used for each data split):
from evalml.demos import load_breast_cancer
from evalml.automl import AutoMLSearch
from evalml.utils.gen_utils import check_random_state_equality
import numpy as np
import itertools
def make_seed_from_state(state):
    rs = np.random.RandomState()
    rs.set_state(state)
    return rs

def check_random_state(state_1, state_2):
    rs_1 = make_seed_from_state(state_1)
    rs_2 = make_seed_from_state(state_2)
    return check_random_state_equality(rs_1, rs_2)
X, y = load_breast_cancer()
automl = AutoMLSearch(max_batches=2, problem_type="binary")
automl.search(X, y)
seeds_equal = []
for i, j in itertools.combinations(range(14), 2):
    are_equal = check_random_state(automl.data_split_seeds[i], automl.data_split_seeds[j])
    seeds_equal.append(are_equal)
assert not all(seeds_equal)

The issue with having a different random state every time data_split.split is called is that the split will be slightly different each time:
from sklearn.model_selection import StratifiedKFold
seed_1 = make_seed_from_state(automl.data_split_seeds[0])
seed_2 = make_seed_from_state(automl.data_split_seeds[1])
split_1 = StratifiedKFold(n_splits=3, random_state=seed_1, shuffle=True)
split_2 = StratifiedKFold(n_splits=3, random_state=seed_2, shuffle=True)
for (train_index_1, test_index_1), (train_index_2, test_index_2) in zip(split_1.split(X, y), split_2.split(X, y)):
    assert set(train_index_1) != set(train_index_2)
    assert set(test_index_1) != set(test_index_2)

I think we should change this because it introduces more variability into the automl results than is necessary and prevents a true apples-to-apples comparison between pipelines. That being said, I don't think fixing this would substantially change the results of automl search (the pipeline ranking would probably be the same).
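To see the drift without evalml, here is a minimal sketch on synthetic data. It assumes the splitter holds a single np.random.RandomState instance, which is my reading of how the automl state is used; the array shapes and the helper name fold_test_sets are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real dataset: 60 rows, balanced binary target.
X = np.arange(120).reshape(60, 2)
y = np.array([0, 1] * 30)


def fold_test_sets(splitter, X, y):
    """Hypothetical helper: collect the test indices of each fold."""
    return [frozenset(test_index) for _, test_index in splitter.split(X, y)]


# The same RandomState instance is consumed by every split() call,
# so each call shuffles with a different internal state.
shared_state = np.random.RandomState(0)
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=shared_state)

first_call = fold_test_sets(splitter, X, y)
second_call = fold_test_sets(splitter, X, y)

print("folds identical across calls:", first_call == second_call)
```

Each pipeline evaluated through a splitter like this would be scored on slightly different folds, which is the variability described above.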
One possible solution is to create the split class with an integer random seed as opposed to the np.random.RandomState that is stored in the automl state. I believe the indices will be the same in repeated calls:
from sklearn.model_selection import StratifiedKFold
split_1 = StratifiedKFold(n_splits=3, random_state=10, shuffle=True)
first_train_set = []
first_test_set = []
for (train_index_1, test_index_1) in split_1.split(X, y):
    first_train_set.append(set(train_index_1))
    first_test_set.append(set(test_index_1))
second_train_set = []
second_test_set = []
for (train_index_2, test_index_2) in split_1.split(X, y):
    second_train_set.append(set(train_index_2))
    second_test_set.append(set(test_index_2))
assert first_train_set == second_train_set
assert first_test_set == second_test_set
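One way this could be wired up, as a rough sketch only: draw a single integer from the existing RandomState and hand that integer to the splitter. The helper make_data_split and its parameters are hypothetical, not part of evalml, and the synthetic X and y are only there to make the snippet self-contained.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def make_data_split(random_state, n_splits=3):
    """Hypothetical helper: build a splitter backed by a fixed integer seed."""
    if isinstance(random_state, np.random.RandomState):
        # Draw the seed once; the splitter no longer shares mutable state.
        seed = int(random_state.randint(0, 2**31 - 1))
    else:
        seed = int(random_state)
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)


# Synthetic data, assumed for illustration only.
X = np.arange(120).reshape(60, 2)
y = np.array([0, 1] * 30)

data_split = make_data_split(np.random.RandomState(0))
folds_a = [frozenset(test) for _, test in data_split.split(X, y)]
folds_b = [frozenset(test) for _, test in data_split.split(X, y)]

# An integer seed is re-expanded on every split() call, so the folds repeat.
assert folds_a == folds_b
```

The search would still be reproducible from the original RandomState, since the integer is drawn from it deterministically, but every pipeline would then see the same folds.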