Error of random seed when using train_test_split() #230
Comments
Thanks for the report. I assume your Python is 32 bit? We don't do any testing with 32-bit builds. Anyway, the bug is that
should be `seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype='u8')`. Any interest in making a PR to fix it? Otherwise I'll get to it later today or tomorrow. |
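A minimal sketch of the corrected call, assuming plain NumPy and a hypothetical `chunks` tuple standing in for the dask array's chunk sizes (neither the tuple nor the seed value 42 comes from the dask-ml source). Passing an explicit 64-bit unsigned dtype keeps the upper bound `2**32 - 1` representable even where NumPy's default integer is 32 bits wide:

```python
import numpy as np

# Hypothetical chunk sizes; in the real code these would come from the dask array.
chunks = (4, 4, 2)

rng = np.random.RandomState(42)

# dtype="u8" (uint64) avoids "high is out of bounds for int32" on platforms
# whose default NumPy integer is 32-bit.
seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype="u8")

print(seeds.dtype, seeds.shape)
```

One seed is drawn per chunk, so each block can get its own independent random state.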
Thanks. I have tried executing the correction and it works. (y) Hm... My Python should be 64 bit.
I'm using the conda installation, so making a PR is a bit of work for me :) |
According to numpy/numpy#4085 (comment), on Windows the default integer type is int32.
In [4]: np.intp
Out[4]: numpy.int64
will probably be
No worries, I'll take care of it! |
Nope.
So I guess it has changed since 1 Dec 2013, when that comment was posted. |
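The distinction the comments above are circling can be checked directly. This is a small sketch, assuming only NumPy: `np.intp` is pointer-sized, so it reports 64 bits on any 64-bit Python, while `np.int_` is the default integer used when no dtype is given. As far as I know, on 64-bit Windows with NumPy releases before 2.0 the default maps to a 32-bit C long, which would explain seeing the error on a 64-bit interpreter.

```python
import numpy as np

# np.int_ is the default integer type NumPy uses when no dtype is given.
default_bits = np.dtype(np.int_).itemsize * 8
# np.intp is pointer-sized, so it is 64-bit on any 64-bit Python build.
pointer_bits = np.dtype(np.intp).itemsize * 8

print("default integer bits:", default_bits)
print("pointer-sized integer bits:", pointer_bits)

# The upper bound used in this thread only fits when the default
# integer is wider than 32 bits.
high = 2**32 - 1
fits_default = high <= np.iinfo(np.int_).max
print("2**32 - 1 fits in the default integer:", fits_default)
```

If the last line prints `False`, a `randint` call without an explicit dtype cannot accept that bound on the current platform.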
This error persists in version 0.9. It is probably internal to train_test_split in dask-ml: the library uses a 32-bit random seed instead of a 64-bit one. |
These should all be fixed in 0.11. |
I'm still getting this error after upgrading to 0.11 (@master).
All of the snippets below throw the 'high is out of bounds for int32' error.
|
Strange. We really need to start testing on a 32-bit setup. It seems like newer NumPy versions support passing a |
I tried changing the dtype in draw_seed to uint64, float32 and also float64, but I get the same error every time. km = KMeans( |
I just ran into this on windows 64 with dask-ml 0.11.0 from conda-forge. https://ci.appveyor.com/project/jsignell/earthml/builds/22061998 |
The issue here is that dtype is not set when generating the seed. In dask-ml's draw_seed(...) function (https://github.com/dask/dask-ml/blob/master/dask_ml/_utils.py#L12), a random integer is drawn from the random number generator using randint(...) (https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.randint.html), but no datatype is set:
    seed = random_state.randint(low, high, **kwargs)
To solve this we need to add the dtype argument with a value such as 'uint64':
    seed = random_state.randint(low, high, dtype='uint64', **kwargs)
A minimum example of this failing is:
    from sklearn.utils import check_random_state
    rng = check_random_state(42)
    rng.randint(0, 2 ** 32 - 1, size=None)
... which gives the error ValueError: high is out of bounds for int32, while the following solves it:
    from sklearn.utils import check_random_state
    rng = check_random_state(42)
    rng.randint(0, 2 ** 32 - 1, size=None, dtype='uint64')
My workaround is to monkeypatch (https://en.wikipedia.org/wiki/Monkey_patch) dask-ml.
The same change might be suitable as a real PR. |
Thanks. What method are you calling that raises the error? `draw_seed` does take a dtype argument, which we use in most places. I see a usage in PCA._fit and one in k_means.init_random where we fail to pass it. |
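The proposed fix can be sketched as a small wrapper. Everything here is an assumption for illustration: `draw_seed_sketch` is a hypothetical stand-in, not dask-ml's actual `draw_seed`, and it uses a plain NumPy `RandomState` rather than scikit-learn's `check_random_state`. It sets the dtype through `kwargs.setdefault` instead of hard-coding `dtype='uint64'` next to `**kwargs`, so a caller who already passes `dtype` does not trigger a duplicate-keyword `TypeError`.

```python
import numpy as np


def draw_seed_sketch(random_state, low, high=None, **kwargs):
    """Hypothetical stand-in for dask-ml's draw_seed, not the library's code."""
    # Default to a 64-bit unsigned dtype, but let callers override it.
    kwargs.setdefault("dtype", "uint64")
    return random_state.randint(low, high, **kwargs)


rng = np.random.RandomState(42)

# A single seed, and a batch of three seeds via the size keyword.
seed = draw_seed_sketch(rng, 0, 2**32 - 1)
seeds = draw_seed_sketch(rng, 0, 2**32 - 1, size=3)

print(int(seed), seeds.dtype)
```

This only works for generators whose `randint` accepts a `dtype` keyword, which NumPy's `RandomState.randint` does.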
Same as the one mentioned in the title, i.e., from dask_ml.model_selection import train_test_split. |
What version of dask-ml? On master, and I believe 0.12.0, we pass dtype in https://github.com/dask/dask-ml/blob/d68c7fa9dbfac2b037a0caf5b4feb4940efa6a2c/dask_ml/model_selection/_split.py#L156 |
I just ran
    import dask_ml
    dask_ml.__version__
It reports '0.12.0'.
The issue isn't that dtype isn't passed into draw_seed(...), it is that draw_seed(...) doesn't pass a dtype into random_state.randint(...): https://github.com/dask/dask-ml/blob/d68c7fa9dbfac2b037a0caf5b4feb4940efa6a2c/dask_ml/_utils.py#L17
My suggestion of doing |
Ah, sorry, I misread your earlier comment. Dask.array's RandomState.randint doesn't support dtype yet. Opened dask/dask#4579 to track that. |
I get this error when executing the following code:
Local dask.version: 0.16.1
Local dask_ml.version: 0.6.0
Client dask.version: 0.16.1
Client dask_ml.version: 0.6.0
dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>