AutoMLSearch uses slightly different splits for each pipeline #1471

@freddyaboulton

Description

Repro (you need to check out the random-split-seeds branch, because on main we don't store the random state used for the data split):

from evalml.demos import load_breast_cancer
from evalml.automl import AutoMLSearch
from evalml.utils.gen_utils import check_random_state_equality
import numpy as np
import itertools

def make_seed_from_state(state):
    # Rebuild a RandomState generator from a state tuple captured during search.
    rs = np.random.RandomState()
    rs.set_state(state)
    return rs

def check_random_state(state_1, state_2):
    # True if the two captured state tuples describe identical generators.
    rs_1 = make_seed_from_state(state_1)
    rs_2 = make_seed_from_state(state_2)
    return check_random_state_equality(rs_1, rs_2)

X, y = load_breast_cancer()

automl = AutoMLSearch(max_batches=2, problem_type="binary")
automl.search(X, y)

# Pairwise-compare the random state captured for each of the 14 evaluated
# pipelines' data splits.
seeds_equal = []
for i, j in itertools.combinations(range(14), 2):
    are_equal = check_random_state(automl.data_split_seeds[i], automl.data_split_seeds[j])
    seeds_equal.append(are_equal)

# At least one pair of pipelines saw a different random state.
assert not all(seeds_equal)
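
As an aside, the captured state tuples can also be compared directly, without rebuilding RandomState objects or the branch-only check_random_state_equality helper. A rough sketch, relying on the fact that a RandomState state tuple is ('MT19937', key array, pos, has_gauss, cached_gaussian):

import numpy as np

def states_equal(state_1, state_2):
    # Compare all five components of the state tuple; the 624-entry key
    # array needs np.array_equal rather than ==.
    return (state_1[0] == state_2[0]
            and np.array_equal(state_1[1], state_2[1])
            and state_1[2:] == state_2[2:])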

The issue with data_split.split seeing a different random state every time it is called is that the resulting split is slightly different each time:

from sklearn.model_selection import StratifiedKFold

# Rebuild the random states captured for the first two pipelines and hand
# each one to its own splitter, as AutoMLSearch effectively does.
seed_1 = make_seed_from_state(automl.data_split_seeds[0])
seed_2 = make_seed_from_state(automl.data_split_seeds[1])
split_1 = StratifiedKFold(n_splits=3, random_state=seed_1, shuffle=True)
split_2 = StratifiedKFold(n_splits=3, random_state=seed_2, shuffle=True)

# None of the folds line up between the two pipelines.
for (train_index_1, test_index_1), (train_index_2, test_index_2) in zip(split_1.split(X, y), split_2.split(X, y)):
    assert set(train_index_1) != set(train_index_2)
    assert set(test_index_1) != set(test_index_2)
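
This behavior comes down to how sklearn resolves random_state via sklearn.utils.check_random_state: an int is converted into a brand-new RandomState on every call, while an existing RandomState instance is passed through unchanged, so its internal state keeps advancing between splits. A quick sketch of the distinction:

from sklearn.utils import check_random_state
import numpy as np

# An int is resolved to a fresh generator on every call, so the shuffle
# is reproduced from scratch each time:
assert check_random_state(10) is not check_random_state(10)

# An existing RandomState passes through as the same object, whose state
# advances with every split it performs:
rs = np.random.RandomState(10)
assert check_random_state(rs) is rs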

I think we should change this: it introduces more variability into automl results than necessary and prevents a true apples-to-apples comparison between pipelines. That said, I don't expect fixing it to substantially change the results of automl search (the pipeline rankings would probably stay the same).

One possible solution is to construct the split class with an integer random seed instead of the np.random.RandomState stored on the automl object. With an integer seed, I believe the indices are identical across repeated calls:

from sklearn.model_selection import StratifiedKFold

# With an integer seed, each call to split() re-derives the shuffle from
# scratch, so repeated calls produce identical folds.
split_1 = StratifiedKFold(n_splits=3, random_state=10, shuffle=True)

first_train_set = []
first_test_set = []
for (train_index_1, test_index_1) in split_1.split(X, y):
    first_train_set.append(set(train_index_1))
    first_test_set.append(set(test_index_1))

second_train_set = []
second_test_set = []
for (train_index_2, test_index_2) in split_1.split(X, y):
    second_train_set.append(set(train_index_2))
    second_test_set.append(set(test_index_2))

assert first_train_set == second_train_set
assert first_test_set == second_test_set
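
A rough sketch of what the fix could look like where automl constructs the data splitter (the construction site and variable names here are hypothetical; this assumes a helper like evalml's get_random_seed, which draws an integer seed from a RandomState):

from sklearn.model_selection import StratifiedKFold
from evalml.utils.gen_utils import get_random_seed
import numpy as np

# Draw one integer seed up front from the automl random state...
random_state = np.random.RandomState(0)
data_split_seed = get_random_seed(random_state)

# ...and hand the int, not the RandomState itself, to the splitter, so
# every pipeline's call to split() reproduces the same folds.
data_split = StratifiedKFold(n_splits=3, random_state=data_split_seed, shuffle=True)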

Labels: bug