Error of random seed when using train_test_split() #230

Closed
PeterFogh opened this Issue Jun 27, 2018 · 6 comments

PeterFogh commented Jun 27, 2018

I get this error when executing the following code:

import dask
import dask.array as da
import dask_ml
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split

print('Local dask.__version__: {}'.format(dask.__version__))
print('Local dask_ml.__version__: {}'.format(dask_ml.__version__))
print('Client dask.__version__: {}'.format(dask.delayed(dask.__version__).compute()))
print('Client dask_ml.__version__: {}'.format(dask.delayed(dask_ml.__version__).compute()))

Local dask.__version__: 0.16.1
Local dask_ml.__version__: 0.6.0
Client dask.__version__: 0.16.1
Client dask_ml.__version__: 0.6.0

X, y = make_regression(n_samples=10000, n_features=4, random_state=0, chunks=4)
X

dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-bd652d57a653> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in train_test_split(*arrays, **options)
    276                             train_size=train_size, blockwise=blockwise,
    277                             random_state=random_state)
--> 278     train_idx, test_idx = next(splitter.split(*arrays))
    279 
    280     train_test_pairs = ((_blockwise_slice(arr, train_idx),

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in split(self, X, y, groups)
    137         for i in range(self.n_splits):
    138             if self.blockwise:
--> 139                 yield self._split_blockwise(X)
    140             else:
    141                 yield self._split(X)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in _split_blockwise(self, X)
    144         chunks = X.chunks[0]
    145         rng = check_random_state(self.random_state)
--> 146         seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
    147 
    148         train_pct, test_pct = _maybe_normalize_split_sizes(self.train_size,

mtrand.pyx in mtrand.RandomState.randint()

ValueError: high is out of bounds for int32
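
The failing call can be reproduced without dask-ml. This is a minimal sketch that assumes only NumPy. The explicit `dtype="int32"` is an assumption added here to force the same 32-bit bound on any platform; the original call passed no dtype.

```python
import numpy as np

rng = np.random.RandomState(0)

# Assumption: dtype="int32" is passed explicitly only to mimic a platform
# whose default randint dtype is 32-bit; the original call had no dtype.
try:
    rng.randint(0, 2**32 - 1, size=3, dtype="int32")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```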

Member

TomAugspurger commented Jun 27, 2018

Thanks for the report. I assume your Python is 32 bit? We don't do any testing with 32-bit builds.

Anyway, the bug is that

seeds = rng.randint(0, 2**32 - 1, size=len(chunks))

should be

seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype='u8')

Any interest in making a PR to fix it? Otherwise I'll get to it later today or tomorrow.
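
A minimal sketch of the proposed change in isolation. `draw_chunk_seeds` is a hypothetical helper name, not a dask-ml function; it only illustrates that requesting an unsigned 64-bit dtype lets `randint` accept the `2**32 - 1` upper bound.

```python
import numpy as np


def draw_chunk_seeds(random_state, n_chunks):
    """Draw one seed per chunk (hypothetical helper, not part of dask-ml)."""
    rng = np.random.RandomState(random_state)
    # dtype="u8" (uint64) keeps the bound 2**32 - 1 valid regardless of the
    # platform's default integer size.
    return rng.randint(0, 2**32 - 1, size=n_chunks, dtype="u8")


seeds = draw_chunk_seeds(random_state=0, n_chunks=5)
print(seeds.dtype, len(seeds))
```

Because the generator is re-created from the same `random_state` on every call, repeated calls with the same arguments return the same seeds.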

PeterFogh commented Jun 27, 2018

Thanks. I have tried executing the correction and it works. (y)

Hm... My Python should be 64 bit.

ipython
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import ctypes

In [2]: print(ctypes.sizeof(ctypes.c_voidp))
8

I'm using the conda installation, so making a PR is a bit of work for me :)

Member

TomAugspurger commented Jun 27, 2018

According to numpy/numpy#4085 (comment), on Windows the default integer type is int32.

In [4]: np.intp
Out[4]: numpy.int64

will probably be numpy.int32 for you.

> so making a PR is a bit of work for me :)

No worries, I'll take care of it!

PeterFogh commented Jun 27, 2018

Nope.

ipython
Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: np.intp
Out[2]: numpy.int64

So I guess it has changed since Dec 1, 2013, when that comment was written.
But it's still a mystery to me why the error occurs on my machine.
Looking forward to the update, thanks.
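
One possible explanation, stated here as an assumption rather than something confirmed in this thread: `np.intp` is the pointer-sized integer, while the legacy `RandomState.randint` defaults to the C `long` type, which is 32 bits wide on 64-bit Windows. A rough check that compares the two:

```python
import numpy as np

pointer_int = np.dtype(np.intp)  # pointer-sized integer
c_long = np.dtype("l")           # C long; assumed to be randint's default dtype

high = 2**32 - 1
# randint's upper bound is exclusive, so the largest value drawn is high - 1.
fits_in_c_long = (high - 1) <= np.iinfo(c_long).max

print("np.intp bits:", pointer_int.itemsize * 8)
print("C long bits:", c_long.itemsize * 8)
print("2**32 - 1 usable as a default-dtype bound:", fits_in_c_long)
```

On a machine where the last line prints `False`, the call would be expected to fail unless a wider dtype is passed explicitly.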

EmanuelFontelles commented Nov 5, 2018

This error still occurs on version 0.9. It probably originates inside dask-ml's train_test_split: the library is using a 32-bit random seed instead of a 64-bit one.

Member

TomAugspurger commented Nov 5, 2018
