Add PolynomialTransformer #347
Comments
If you don't mind, I will work on that.
That'd be great!
So I am making some progress here, and fitting works so far. I just realized that typical dask-ml transformers that work on NumPy arrays also return NumPy arrays. For polynomial features, though, this might not be desired: even if the input array fits in RAM, the added columns may make the output too big. So my suggestion would be to always return a Dask Array or Dask DataFrame, with the chunk size related to the size of the input object. Does this make sense? Other suggestions?
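To make the size concern concrete, here is a small sketch of how quickly the column count grows. The helper name `n_poly_features` is mine for illustration, not part of any library API:

```python
from math import comb

def n_poly_features(n_features: int, degree: int, include_bias: bool = True) -> int:
    # The number of monomials of total degree <= `degree` in
    # `n_features` variables is C(n_features + degree, degree).
    n = comb(n_features + degree, degree)
    return n if include_bias else n - 1

# e.g. 100 input columns expand to 5151 columns at degree 2,
# so an input that fits in RAM can easily produce an output that doesn't.
print(n_poly_features(100, 2))  # 5151
```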
I think I'd prefer returning NumPy array output for NumPy array input. Generally, Dask-ML only works on arrays that are partitioned vertically. At some point, you'll need to have an entire block's worth of columns in memory.

For the user whose original ndarray fits in memory but the transformed array doesn't, I would recommend they convert the ndarray to a Dask Array with some number of blocks before passing it to PolynomialTransformer.

Does that make sense?
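A minimal sketch of that workflow, assuming scikit-learn's `PolynomialFeatures` applied block-wise via `map_blocks` (this is an illustration of the suggested user-side chunking, not the actual dask-ml implementation):

```python
import numpy as np
import dask.array as da
from sklearn.preprocessing import PolynomialFeatures

# The in-memory ndarray fits in RAM, but suppose its polynomial
# expansion would not: chunk it row-wise before transforming.
X = np.random.rand(100, 3)
dX = da.from_array(X, chunks=(25, 3))  # 4 row-wise blocks

# A degree-2 expansion of 3 features yields 1 + 3 + 6 = 10 columns,
# so each (25, 3) block maps to a (25, 10) block.
poly = PolynomialFeatures(degree=2)
out = dX.map_blocks(poly.fit_transform, dtype=X.dtype, chunks=(25, 10))

result = out.compute()  # blocks are expanded independently
```

Calling `fit_transform` per block is acceptable here because `PolynomialFeatures` learns nothing from the data beyond the number of input columns.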
Thanks for the quick response, Tom. It makes sense, and putting some burden on the user is completely fine with me.
I don't know if it's better to discuss here or in the WIP pull request. So far I have learned a lot about dask's internals, which is nice. However, for the next steps I'd need some feedback on the following issues:

Any further comment (on anything) is welcome.
http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures
This should be relatively straightforward, but there may be unexpected difficulties along the way.
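For reference, the scikit-learn transformer linked above behaves like this; the proposed dask-ml version would mirror it:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)        # 3 samples, 2 features: a, b
poly = PolynomialFeatures(degree=2)   # include_bias=True by default
X2 = poly.fit_transform(X)

# Output columns: [1, a, b, a^2, a*b, b^2] -> 2 features become 6
print(X2.shape)  # (3, 6)
```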