ShuffleSplit makes same split for every iteration if random_state is set #380

Closed
humaneffect opened this Issue Oct 4, 2018 · 0 comments

humaneffect commented Oct 4, 2018

Hi!
ShuffleSplit produces the same data split for every iteration when random_state is set to an integer. I would expect each iteration to produce a different split.

import dask.array as da  # a bare "import dask" does not reliably expose dask.array
import sklearn.datasets
from dask_ml.model_selection import ShuffleSplit

# Synthetic 5-class dataset, wrapped in dask arrays with 100-row chunks
x, y = sklearn.datasets.make_classification(n_samples=10000, n_features=100, n_informative=50, n_clusters_per_class=5, n_classes=5)
x = da.from_array(x, chunks=(100, -1))
y = da.from_array(y, chunks=100)

cv = ShuffleSplit(n_splits=2, test_size=0.3, train_size=0.7, random_state=0)
for train_idx, test_idx in cv.split(X=x, y=y):
    print(train_idx.compute())

[  25    1   45 ... 9967 9951 9918]
[  25    1   45 ... 9967 9951 9918]

Note: it works as expected if random_state is not set:

cv = ShuffleSplit(n_splits=2, test_size=0.3, train_size=0.7)
for train_idx, test_idx in cv.split(X=x, y=y):
    print(train_idx.compute())

[  46   70   25 ... 9912 9907 9959]
[   5   93   47 ... 9984 9929 9946]

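The issue seems to lie in the _split_blockwise(self, X) method in dask-ml/dask_ml/model_selection/_split.py, where rng = check_random_state(self.random_state) is re-created for each new split. With an integer random_state, every call returns a fresh generator seeded with the same value, so each iteration draws the same permutation.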
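To illustrate the suspected mechanism without depending on dask-ml, here is a minimal pure-Python sketch. It is not the library's code: the function names are hypothetical, and random.Random stands in for the NumPy RandomState that check_random_state would return. It contrasts re-creating the generator on every split with creating it once and letting its state advance.

```python
import random


def splits_reseeded(n_samples, n_splits, seed):
    """Hypothetical: re-create the generator inside the loop (suspected bug pattern)."""
    out = []
    for _ in range(n_splits):
        rng = random.Random(seed)  # same seed, so the same stream every iteration
        idx = list(range(n_samples))
        rng.shuffle(idx)
        out.append(idx)
    return out


def splits_seeded_once(n_samples, n_splits, seed):
    """Hypothetical: create the generator once so each split advances its state."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        out.append(idx)
    return out


buggy = splits_reseeded(1000, 2, seed=0)
fixed = splits_seeded_once(1000, 2, seed=0)

print(buggy[0] == buggy[1])  # identical permutations, as in the report
print(fixed[0] == fixed[1])  # different permutations per split
```

Seeding once still keeps the whole sequence of splits reproducible for a given seed, which is presumably what a user setting random_state wants. The sketch is also consistent with the unseeded case working: a generator created without a seed draws fresh entropy each time, so re-creating it does not repeat the permutation.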