Error of random seed when using train_test_split() #230

Closed
PeterFogh opened this issue Jun 27, 2018 · 16 comments

Comments

@PeterFogh PeterFogh commented Jun 27, 2018

I get this error when executing the following code:

import dask
import dask.array as da
import dask_ml
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split

print('Local dask.__version__: {}'.format(dask.__version__))
print('Local dask_ml.__version__: {}'.format(dask_ml.__version__))
print('Client dask.__version__: {}'.format(dask.delayed(dask.__version__).compute()))
print('Client dask_ml.__version__: {}'.format(dask.delayed(dask_ml.__version__).compute()))

Local dask.__version__: 0.16.1
Local dask_ml.__version__: 0.6.0
Client dask.__version__: 0.16.1
Client dask_ml.__version__: 0.6.0

X, y = make_regression(n_samples=10000, n_features=4, random_state=0, chunks=4)
X

dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-bd652d57a653> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in train_test_split(*arrays, **options)
    276                             train_size=train_size, blockwise=blockwise,
    277                             random_state=random_state)
--> 278     train_idx, test_idx = next(splitter.split(*arrays))
    279 
    280     train_test_pairs = ((_blockwise_slice(arr, train_idx),

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in split(self, X, y, groups)
    137         for i in range(self.n_splits):
    138             if self.blockwise:
--> 139                 yield self._split_blockwise(X)
    140             else:
    141                 yield self._split(X)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in _split_blockwise(self, X)
    144         chunks = X.chunks[0]
    145         rng = check_random_state(self.random_state)
--> 146         seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
    147 
    148         train_pct, test_pct = _maybe_normalize_split_sizes(self.train_size,

mtrand.pyx in mtrand.RandomState.randint()

ValueError: high is out of bounds for int32
Member

@TomAugspurger TomAugspurger commented Jun 27, 2018

Thanks for the report. I assume your Python is 32 bit? We don't do any testing with 32-bit builds.

Anyway, the bug is that

seeds = rng.randint(0, 2**32 - 1, size=len(chunks))

should be

seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype='u8')

Any interest in making a PR to fix it? Otherwise I'll get to it later today or tomorrow.
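
A minimal sketch of the proposed fix, assuming only NumPy is available. The helper name `draw_chunk_seeds` is illustrative and not part of dask-ml; the chunk tuple mirrors the 2500 row blocks implied by `chunks=4` on 10000 samples in the report.

```python
import numpy as np


def draw_chunk_seeds(random_state, chunks):
    """Draw one seed per chunk (hypothetical helper, not part of dask-ml).

    Passing an explicit 64-bit unsigned dtype keeps the bounds check
    independent of the platform's default integer width.
    """
    return random_state.randint(0, 2**32 - 1, size=len(chunks), dtype="u8")


rng = np.random.RandomState(0)
chunks = (4,) * 2500  # row chunks of the array in the report
seeds = draw_chunk_seeds(rng, chunks)

# Every seed is below 2**32 - 1, so it is still a valid RandomState seed.
per_chunk_rngs = [np.random.RandomState(int(s)) for s in seeds[:3]]
```

Only the dtype of the draw changes; the seed values stay within the same 32-bit range as before.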

Author

@PeterFogh PeterFogh commented Jun 27, 2018

Thanks. I have tried executing the correction and it works. (y)

Hm... My Python should be 64 bit.

ipython
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import ctypes

In [2]: print(ctypes.sizeof(ctypes.c_voidp))
8

I'm using the conda installation, so making a PR is a bit of work for me :)

Member

@TomAugspurger TomAugspurger commented Jun 27, 2018

According to numpy/numpy#4085 (comment), on Windows the default integer type is int32.

In [4]: np.intp
Out[4]: numpy.int64

will probably be numpy.int32 for you.

> so making a PR is a bit of work for me :)

No worries, I'll take care of it!

Author

@PeterFogh PeterFogh commented Jun 27, 2018

Nope.

ipython
Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: np.intp
Out[2]: numpy.int64

So I guess that has changed since December 1, 2013, when that comment was written.
But it's a mystery to me why the error occurs on my machine.
Looking forward to the update, thanks.
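
One possible explanation for the mystery: on 64-bit Windows, pointers are 64 bits wide but the C `long` type is 32 bits. To my understanding, `RandomState.randint` defaults to the C `long` dtype rather than `np.intp`, which would explain why a 64-bit Python reporting `np.intp` as `numpy.int64` can still hit the int32 bound. The sketch below compares the two widths using only the standard library; the variable names are illustrative.

```python
import ctypes

pointer_bits = ctypes.sizeof(ctypes.c_void_p) * 8  # the width np.intp follows
long_bits = ctypes.sizeof(ctypes.c_long) * 8       # the width of C long

print("pointer: {} bits, C long: {} bits".format(pointer_bits, long_bits))

# With 2**32 - 1 as an exclusive upper bound, the largest value drawn is
# 2**32 - 2, which a signed 32-bit integer cannot represent.
fits_in_long = (2**32 - 2) <= 2 ** (long_bits - 1) - 1
print("bound fits in C long:", fits_in_long)
```

On a typical 64-bit Linux or macOS build both widths are 64; on 64-bit Windows the second is expected to be 32.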

@EmanuelFontelles EmanuelFontelles commented Nov 5, 2018

This error still occurs on version 0.9. It probably comes from inside dask-ml's train_test_split: the library is drawing the random seed as a 32-bit integer instead of a 64-bit one.

Member

@TomAugspurger TomAugspurger commented Nov 5, 2018

@osheari1 osheari1 commented Nov 28, 2018

I'm still getting this error on 0.11 (@master).
I am running this in a conda env on a 64-bit Windows machine.

>>> print(dask_ml.__version__)
... print(dask.__version__)
... print(np.intp)
... 
0.11.1.dev18+gdd2c616
1.0.0
<class 'numpy.int64'>

All of the snippets below throw the 'high is out of bounds for int32' error.

import numpy as np
import dask_ml.cluster

# data_arr is the reporter's input array; it is not shown in the thread.
nrs = np.random.RandomState(1)
clust = dask_ml.cluster.SpectralClustering(n_clusters=2, random_state=nrs)
clust.fit(data_arr)
clust = dask_ml.cluster.SpectralClustering(n_clusters=2, random_state=1)
clust.fit(data_arr)
clust = dask_ml.cluster.SpectralClustering(n_clusters=2, random_state=None)
clust.fit(data_arr)
Member

@TomAugspurger TomAugspurger commented Nov 29, 2018

Strange. We really need to start testing on a 32-bit setup.

It seems that newer versions of NumPy support passing a dtype to randint, which would hopefully solve this for us. We should be able to pass that through in _utils.draw_seed if anyone wants to try that out.

@SindhujaVakkalagadda SindhujaVakkalagadda commented Dec 7, 2018

I tried changing the dtype in draw_seed to uint64, float32, and float64, but I get the same error every time.

km = KMeans(
193 n_clusters=n_clusters,
--> 194 random_state=draw_seed(rng, 2 ** 32 - 1, dtype="float64"),
195 )

Member

@jsignell jsignell commented Feb 1, 2019

I just ran into this on 64-bit Windows with dask-ml 0.11.0 from conda-forge. https://ci.appveyor.com/project/jsignell/earthml/builds/22061998

@AndreCAndersen AndreCAndersen commented Mar 11, 2019

The issue here is that dtype is not set when generating the seed.

In dask-ml's draw_seed(...) function a random integer is drawn from the random number generator using randint(...), but no datatype is set:

seed = random_state.randint(low, high, **kwargs)

To solve this we need to add the dtype argument with a value such as 'uint64':

seed = random_state.randint(low, high, dtype='uint64', **kwargs)

A minimal example of this failing is:

from sklearn.utils import check_random_state
rng = check_random_state(42)
rng.randint(0, 2 ** 32 - 1, size=None)

... which gives the error ValueError: high is out of bounds for int32, while the following solves it:

from sklearn.utils import check_random_state
rng = check_random_state(42)
rng.randint(0, 2 ** 32 - 1, size=None, dtype='uint64')
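
The failing example above only raises on platforms where the default integer for `randint` is 32 bits wide. A sketch that should reproduce the same bounds error on any platform, assuming only NumPy, is to request `int32` explicitly; this mimics the Windows default rather than relying on it.

```python
import numpy as np

rng = np.random.RandomState(42)

# Force a signed 32-bit draw to mimic the Windows default integer.
try:
    rng.randint(0, 2**32 - 1, dtype="int32")
    message = None
except ValueError as exc:
    message = str(exc)

print(message)

# The same bounds succeed once the dtype is wide enough.
seed = int(rng.randint(0, 2**32 - 1, dtype="uint64"))
print(seed)
```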

My workaround is to monkeypatch dask-ml.

from dask_ml import _utils
import numpy as np
import dask.array as da

def draw_seed(random_state, low, high=None, size=None, dtype=None, chunks=None):
    kwargs = {"size": size, "dtype": "uint64"}
    if chunks is not None:
        kwargs["chunks"] = chunks

    seed = random_state.randint(low, high, **kwargs)
    if dtype is not None and isinstance(seed, (da.Array, np.ndarray)):
        seed = seed.astype(dtype)

    return seed

_utils.draw_seed = draw_seed

The same change might be suitable as a real PR.
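
One caveat with this kind of monkeypatch, sketched with stand-in objects rather than dask-ml itself: rebinding an attribute on a module only affects code that looks the function up through that module at call time. Code that already ran `from module import name` keeps its own reference to the original. Whether that matters here depends on how dask-ml's modules import `draw_seed`, which this thread does not show; the names `utils` and `imported_draw_seed` below are hypothetical.

```python
import types

# Stand-in for a utility module exposing draw_seed.
utils = types.SimpleNamespace(draw_seed=lambda: "original")

# Equivalent to `from utils import draw_seed` executed before the patch.
imported_draw_seed = utils.draw_seed

# The monkeypatch: rebind the attribute on the module object.
utils.draw_seed = lambda: "patched"

print(utils.draw_seed())      # attribute lookup sees the replacement
print(imported_draw_seed())   # the earlier import still calls the original
```

If a patch like this appears to have no effect, rebinding the name in the consuming module as well is the usual remedy.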

Member

@TomAugspurger TomAugspurger commented Mar 11, 2019

@AndreCAndersen AndreCAndersen commented Mar 11, 2019

Same as the one mentioned in the title, i.e., from dask_ml.model_selection import train_test_split.

Member

@TomAugspurger TomAugspurger commented Mar 11, 2019

@AndreCAndersen AndreCAndersen commented Mar 11, 2019

I just ran

import dask_ml
dask_ml.__version__

It reports '0.12.0'

The issue isn't that dtype isn't passed into draw_seed(...), it is that draw_seed(...) doesn't pass a dtype into random_state.randint(...):

seed = random_state.randint(low, high, **kwargs)

My suggestion of doing kwargs = {"size": size, "dtype": "uint64"} here, will probably solve it. "uint" seems to work too.
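
A quick check of that observation, assuming only NumPy: because the upper bound `2**32 - 1` is exclusive, the largest value drawn is `2**32 - 2`, which fits in any unsigned integer type of at least 32 bits. The loop below is an illustrative sketch, not dask-ml code.

```python
import numpy as np

rng = np.random.RandomState(0)
high = 2**32 - 1  # exclusive upper bound used in the thread

results = {}
for dtype in ("uint32", "uint64", "uint"):
    # Each of these dtypes can represent high - 1, so the draw succeeds.
    results[dtype] = int(rng.randint(0, high, dtype=dtype))

print(results)
```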

Member

@TomAugspurger TomAugspurger commented Mar 11, 2019
