Add PolynomialTransformer #347

TomAugspurger · 2018-08-29T21:01:36Z

http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures

This should be relatively straightforward, but there may be unexpected difficulties along the way.

datajanko · 2018-09-12T19:16:35Z

If you don't mind, I will work on that.
The code would probably go in preprocessing/data.py, right?

TomAugspurger · 2018-09-12T21:15:31Z

That'd be great!


datajanko · 2018-09-14T20:31:59Z

So I am making some progress here and fitting works so far.

I just realized that the typical dask-ml transformers that work on NumPy arrays also output NumPy arrays. For polynomial features, however, this might not be desirable: even if the input array fits into RAM, the output might be too big because of the added columns. So my suggestion would be to always return a Dask array or Dask DataFrame, with a chunk size related to the size of the input object.

Does this make sense? Different suggestions?
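
To put a number on the "added columns" concern, here is a minimal sketch. It assumes the defaults of scikit-learn's PolynomialFeatures (all monomials up to the given degree, bias column included), in which case the output width is the binomial coefficient C(n + d, d). The helper name `n_output_features` is made up for illustration.

```python
from math import comb


def n_output_features(n_features: int, degree: int) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables.

    This includes the constant (bias) term and assumes interaction-only
    mode is off, i.e. the scikit-learn defaults.
    """
    return comb(n_features + degree, degree)


# Show how quickly the column count grows with the input width.
for n in (10, 100, 1000):
    print(n, n_output_features(n, 2), n_output_features(n, 3))
```

With 100 input columns and degree 2 this already gives 5151 output columns, so an array that comfortably fits in RAM can produce one that does not.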

TomAugspurger · 2018-09-14T20:37:00Z

I think I'd prefer returning NumPy array output for NumPy array input. Generally, Dask-ML only works on arrays that are partitioned vertically. At some point, you'll need to have an entire block's worth of columns in memory. For the user whose original ndarray fits in memory but the transformed array doesn't, I would recommend they convert the ndarray to a Dask Array with some number of blocks before passing it to PolynomialTransformer. Does that make sense?
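
A rough sketch of that recommendation follows. The transformer discussed here did not exist yet at this point in the thread, so as a stand-in this applies scikit-learn's PolynomialFeatures block by block with `map_blocks`. The array sizes, the chunk size of 2 rows, and the helper `expand_block` are assumptions chosen for illustration, not the dask-ml implementation.

```python
import numpy as np
import dask.array as da
from sklearn.preprocessing import PolynomialFeatures

# A small in-memory array standing in for the user's ndarray.
X = np.arange(12, dtype=float).reshape(6, 2)

# Partition along rows only; -1 keeps all columns together in each block.
dX = da.from_array(X, chunks=(2, -1))

# Fitting only needs the number of input columns, so one row is enough.
poly = PolynomialFeatures(degree=2).fit(X[:1])
n_out = poly.transform(X[:1]).shape[1]


def expand_block(block):
    # Each block holds complete rows, so it can be expanded independently.
    return poly.transform(block)


# Row chunks are unchanged; the single column chunk widens to n_out.
Xt = dX.map_blocks(
    expand_block,
    dtype=float,
    chunks=(dX.chunks[0], (n_out,)),
    meta=np.empty((0, 0), dtype=float),
)

print(Xt.shape, Xt.chunks)
```

Only one expanded block has to be in memory at a time per worker, which is the point of chunking the input before transforming it.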


datajanko · 2018-09-14T20:44:37Z

Thanks for the quick response, Tom. It makes sense, and putting some burden on the user is completely fine with me.

datajanko · 2018-09-16T08:01:07Z

I don't know if it's better to discuss here or in the WIP pull requests.

So far I have learned a lot about Dask's internals, which is nice.

However, for the next steps, I need some feedback on the following issues:

What should I do with chunks of unknown size/shape? This can happen if you have a Dask DataFrame and call values on it.
Concerning DataFrames: I'd tackle this by converting the frame to a Dask array/NumPy array, doing the transforms there, and constructing a new frame with columns taken from the get_feature_names method. Any comments on that?
Sparse matrices/frames: I'm not sure what is necessary here and what should be implemented.
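
The DataFrame round trip proposed above could look roughly like this. The sketch uses pandas and scikit-learn's PolynomialFeatures for brevity rather than Dask objects, and the column names and values are made up. Newer scikit-learn releases expose `get_feature_names_out` in place of `get_feature_names`, so the code checks for both.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical input frame.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Step 1: drop to a plain array and transform there.
poly = PolynomialFeatures(degree=2)
values = poly.fit_transform(df.to_numpy())

# Step 2: recover column labels from the fitted transformer.
if hasattr(poly, "get_feature_names_out"):
    names = poly.get_feature_names_out(list(df.columns))
else:  # older scikit-learn releases
    names = poly.get_feature_names(list(df.columns))

# Step 3: rebuild a frame, keeping the original index.
out = pd.DataFrame(values, columns=list(names), index=df.index)
print(out)
```

A Dask version of the same idea would additionally have to deal with the unknown chunk sizes raised in the first question, since the array obtained from a Dask DataFrame does not know its row counts up front.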

Any further comment (on anything) is welcome.

TomAugspurger added the "good first issue" and "Algorithm" (Implement a new algorithm) labels Aug 29, 2018

datajanko mentioned this issue Sep 16, 2018

[MRG] Poly trans: Issue #347 #367

Merged

TomAugspurger pushed a commit that referenced this issue Oct 14, 2018

[MRG+1] Poly trans: Issue #347 (#367)

ac2fdb7

TomAugspurger mentioned this issue Mar 11, 2019

Fix the link in dask documentation #479

Closed
