Error of random seed when using train_test_split() #230

PeterFogh · 2018-06-27T08:36:15Z

I get this error when executing the following code:

import dask
import dask.array as da
import dask_ml
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split

print('Local dask.__version__: {}'.format(dask.__version__))
print('Local dask_ml.__version__: {}'.format(dask_ml.__version__))
print('Client dask.__version__: {}'.format(dask.delayed(dask.__version__).compute()))
print('Client dask_ml.__version__: {}'.format(dask.delayed(dask_ml.__version__).compute()))

Local dask.__version__: 0.16.1
Local dask_ml.__version__: 0.6.0
Client dask.__version__: 0.16.1
Client dask_ml.__version__: 0.6.0

X, y = make_regression(n_samples=10000, n_features=4, random_state=0, chunks=4)
X

dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-bd652d57a653> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in train_test_split(*arrays, **options)
    276                             train_size=train_size, blockwise=blockwise,
    277                             random_state=random_state)
--> 278     train_idx, test_idx = next(splitter.split(*arrays))
    279 
    280     train_test_pairs = ((_blockwise_slice(arr, train_idx),

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in split(self, X, y, groups)
    137         for i in range(self.n_splits):
    138             if self.blockwise:
--> 139                 yield self._split_blockwise(X)
    140             else:
    141                 yield self._split(X)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in _split_blockwise(self, X)
    144         chunks = X.chunks[0]
    145         rng = check_random_state(self.random_state)
--> 146         seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
    147 
    148         train_pct, test_pct = _maybe_normalize_split_sizes(self.train_size,

mtrand.pyx in mtrand.RandomState.randint()

ValueError: high is out of bounds for int32

TomAugspurger · 2018-06-27T10:55:36Z

Thanks for the report. I assume your Python is 32 bit? We don't do any testing with 32-bit builds.

Anyway, the bug is that

seeds = rng.randint(0, 2**32 - 1, size=len(chunks))

should be

seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype='u8')

Any interest in making a PR to fix it? Otherwise I'll get to it later today or tomorrow.
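
For readers who want to try the change in isolation, here is a minimal sketch of the proposed fix. The helper name `draw_chunk_seeds` is hypothetical, and it uses `np.random.RandomState` directly in place of `check_random_state`, so it only approximates what `_split_blockwise` does.

```python
import numpy as np


def draw_chunk_seeds(random_state, n_chunks):
    """Draw one seed per chunk, requesting a 64-bit unsigned dtype explicitly."""
    rng = np.random.RandomState(random_state)
    # With dtype given, the bound check on `high` no longer depends on the
    # platform's default integer width.
    return rng.randint(0, 2**32 - 1, size=n_chunks, dtype="uint64")


seeds = draw_chunk_seeds(random_state=0, n_chunks=5)
print(seeds.dtype, seeds.shape)
```

The same integer seed gives the same array on every call, so reproducibility is kept.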

PeterFogh · 2018-06-27T12:53:07Z

Thanks. I have tried executing the correction and it works. (y)

Hm... My Python should be 64 bit.

ipython
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import ctypes

In [2]: print(ctypes.sizeof(ctypes.c_voidp))
8

I'm using the conda installation, so making a PR is a bit of work for me :)

TomAugspurger · 2018-06-27T12:57:23Z

According to numpy/numpy#4085 (comment), on Windows the default integer type is int32.

In [4]: np.intp
Out[4]: numpy.int64

will probably be numpy.int32 for you.

> so making a PR is a bit of work for me :)

No worries, I'll take care of it!

PeterFogh · 2018-06-27T13:22:33Z

Nope.

ipython
Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: np.intp
Out[2]: numpy.int64

So I guess it has changed since 1 Dec 2013, when that comment was posted.
But, it's a mystery to me why the error occurs on my machine.
Looking forward to the update, thanks.
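
A possible explanation, offered as an assumption rather than something confirmed in this thread: the default dtype of `RandomState.randint` follows the C `long` type rather than `np.intp`, and C `long` is 32 bits on 64-bit Windows even though pointers and `np.intp` are 64 bits there. The sketch below (the helper name `integer_widths` is made up for illustration) prints the three widths so they can be compared on a given machine.

```python
import ctypes

import numpy as np


def integer_widths():
    """Return the bit widths of a pointer, a C long, and np.intp."""
    return {
        "pointer": ctypes.sizeof(ctypes.c_void_p) * 8,
        "c_long": ctypes.sizeof(ctypes.c_long) * 8,
        "np.intp": np.dtype(np.intp).itemsize * 8,
    }


widths = integer_widths()
print(widths)
```

If `c_long` reports 32 while the other two report 64, checking `np.intp` or the pointer size would not reveal the problem.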

EmanuelFontelles · 2018-11-05T02:03:09Z

This error still occurs on version 0.9. It is probably internal to train_test_split in dask-ml: the library is using a 32-bit random seed instead of a 64-bit one.

TomAugspurger · 2018-11-05T02:22:27Z

These should all be fixed on 0.11

osheari1 · 2018-11-28T19:49:55Z

I'm still getting this error on 0.11 (@master).
I am running this in a conda env on a 64-bit Windows machine.

>>> print(dask_ml.__version__)
... print(dask.__version__)
... print(np.intp)
... 
0.11.1.dev18+gdd2c616
1.0.0
<class 'numpy.int64'>

All of the snippets below throw the 'high is out of bounds for int32' error.

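import numpy as np
import dask_ml.cluster

# data_arr is the reporter's input array; its definition was not included in the post
nrs = np.random.RandomState(1)
clust = dask_ml.cluster.SpectralClustering(n_clusters=2, random_state=nrs)
clust.fit(data_arr)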

clust = dask_ml.cluster.SpectralClustering(n_clusters=2, random_state=1)
clust.fit(data_arr)

clust = dask_ml.cluster.SpectralClustering(n_clusters=2, random_state=None)
clust.fit(data_arr)

TomAugspurger · 2018-11-29T15:12:51Z

Strange. We really need to start testing on a 32-bit setup.

It seems that newer NumPy versions support passing a dtype to randint, which would hopefully solve this for us. We should be able to pass that through in _utils.draw_seed if anyone wants to try that out.

SindhujaVakkalagadda · 2018-12-07T07:23:27Z

I tried changing the dtype in draw_seed to uint64, float32, and float64, but I get the same error every time.

km = KMeans(
    n_clusters=n_clusters,
    random_state=draw_seed(rng, 2 ** 32 - 1, dtype="float64"),  # <-- error raised here
)

jsignell · 2019-02-01T19:56:54Z

I just ran into this on 64-bit Windows with dask-ml 0.11.0 from conda-forge. https://ci.appveyor.com/project/jsignell/earthml/builds/22061998

AndreCAndersen · 2019-03-11T10:42:13Z

The issue here is that dtype is not set when generating the seed.

In dask-ml's draw_seed(...) function a random integer is drawn from the random number generator using randint(...), but no datatype is set:

seed = random_state.randint(low, high, **kwargs)

To solve this we need to add the dtype argument with a value such as 'uint64':

seed = random_state.randint(low, high, dtype='uint64', **kwargs)

A minimal example of this failing is:

from sklearn.utils import check_random_state
rng = check_random_state(42)
rng.randint(0, 2 ** 32 - 1, size=None)

... which gives the error ValueError: high is out of bounds for int32, while the following solves it:

from sklearn.utils import check_random_state
rng = check_random_state(42)
rng.randint(0, 2 ** 32 - 1, size=None, dtype='uint64')

My workaround is to monkeypatch dask-ml.

from dask_ml import _utils
import numpy as np
import dask.array as da

def draw_seed(random_state, low, high=None, size=None, dtype=None, chunks=None):
    kwargs = {"size": size, "dtype": "uint64"}
    if chunks is not None:
        kwargs["chunks"] = chunks

    seed = random_state.randint(low, high, **kwargs)
    if dtype is not None and isinstance(seed, (da.Array, np.ndarray)):
        seed = seed.astype(dtype)

    return seed

_utils.draw_seed = draw_seed

The same change might be suitable as a real PR.
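
A different workaround, not proposed in this thread and shown only as a sketch: keep the upper bound within the int32 range so that the draw succeeds whatever the default dtype is. The names `MAX_INT32` and `draw_portable_seed` are hypothetical. The trade-off is that the seed range shrinks from about 2**32 to about 2**31 values.

```python
import numpy as np

MAX_INT32 = np.iinfo(np.int32).max  # 2**31 - 1


def draw_portable_seed(rng, size=None):
    """Draw seeds in [0, 2**31 - 1), which fits a 32-bit default dtype."""
    return rng.randint(0, MAX_INT32, size=size)


rng = np.random.RandomState(42)
seeds = draw_portable_seed(rng, size=4)
print(seeds)
```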

TomAugspurger · 2019-03-11T11:26:36Z

Thanks. What method are you calling that raises the error? `draw_seed` does take a dtype argument, which we use in most places. I see a usage in PCA._fit and one in k_means.init_random where we fail to pass it.

AndreCAndersen · 2019-03-11T13:01:25Z

Same as the one mentioned in the title, i.e., from dask_ml.model_selection import train_test_split.

TomAugspurger · 2019-03-11T13:04:08Z

What version of dask-ml? On master, and I believe 0.12.0, we pass dtype in https://github.com/dask/dask-ml/blob/d68c7fa9dbfac2b037a0caf5b4feb4940efa6a2c/dask_ml/model_selection/_split.py#L156

AndreCAndersen · 2019-03-11T15:55:27Z

I just ran

import dask_ml
dask_ml.__version__

It reports '0.12.0'

The issue isn't that dtype isn't passed into draw_seed(...), it is that draw_seed(...) doesn't pass a dtype into random_state.randint(...):

dask-ml/dask_ml/_utils.py

Line 17 in d68c7fa

seed = random_state.randint(low, high, **kwargs)

My suggestion of using kwargs = {"size": size, "dtype": "uint64"} here will probably solve it. "uint" seems to work too.
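
The distinction can be reproduced on any platform with plain NumPy. This is a sketch, not dask-ml's code: `draw_then_cast` is a hypothetical stand-in for a helper that casts only after drawing, and `dtype="int32"` stands in for the default on platforms where that default is 32 bits.

```python
import numpy as np


def draw_then_cast(rng, high, randint_dtype, out_dtype):
    # The bound check on `high` happens inside randint, before the cast below
    # is ever reached, so the output dtype cannot rescue a failing draw.
    seed = rng.randint(0, high, size=3, dtype=randint_dtype)
    return seed.astype(out_dtype)


rng = np.random.RandomState(42)

try:
    draw_then_cast(rng, 2**32 - 1, randint_dtype="int32", out_dtype="uint64")
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)

# Passing a wide enough dtype to randint itself avoids the error.
ok = draw_then_cast(rng, 2**32 - 1, randint_dtype="uint64", out_dtype="uint64")
print(ok.dtype)
```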

TomAugspurger · 2019-03-11T16:02:32Z

Ah, sorry I misread your earlier comment. Dask.array's RandomState.randint doesn't support dtype yet. Opened dask/dask#4579 to track that.

TomAugspurger mentioned this issue Jun 27, 2018

BUG: Dtype for random seeds #238

Merged

TomAugspurger closed this as completed in #238 Jun 28, 2018

vecchp mentioned this issue Dec 14, 2018

Gather CV Results as Completed #433

Merged

This was referenced Feb 1, 2019

Adding appveyor config holoviz-topics/EarthML#81

Merged

dask-ml examples break on windows holoviz-topics/EarthML#89

Closed

dsherry mentioned this issue Apr 1, 2020

Support numpy.random.RandomState objects (take 2) alteryx/evalml#556

Merged

sweverett mentioned this issue Sep 23, 2020

fit function of fitter class won't run sweverett/CluStR#39

Closed

cthoyt mentioned this issue Sep 26, 2020

Issue with random number generation in TriplesFactory.split pykeen/pykeen#92

Closed
