Add PolynomialTransformer #347

Open
TomAugspurger opened this issue Aug 29, 2018 · 6 comments
Labels
Algorithm Implement a new algorithm good first issue

Comments

@TomAugspurger
Member

http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures

This should be relatively straightforward, but there may be unexpected difficulties along the way.

@TomAugspurger TomAugspurger added good first issue Algorithm Implement a new algorithm labels Aug 29, 2018
@datajanko
Contributor

datajanko commented Sep 12, 2018

If you don't mind, I will work on that.
The code probably belongs in preprocessing/data.py, right?

@TomAugspurger
Member Author

TomAugspurger commented Sep 12, 2018 via email

@datajanko
Contributor

So I am making some progress here and fitting works so far.

I just realized that typical dask-ml transformers that work on NumPy arrays also return NumPy arrays. For polynomial features, however, this might not be desirable: even if the input array fits in RAM, the output might be too big because of the added columns. So my suggestion would be to always return a dask array or dask DataFrame, with a chunk size related to the size of the input object.

Does this make sense? Different suggestions?
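
To put a number on the memory concern above, here is a minimal sketch that counts the output columns of a full polynomial expansion. It assumes the behaviour documented for the linked scikit-learn PolynomialFeatures with all monomials up to the given degree (not interaction-only); the helper names `n_output_features` and `expansion_factor` are hypothetical and not part of dask-ml or scikit-learn.

```python
from math import comb


def n_output_features(n_features: int, degree: int, include_bias: bool = True) -> int:
    """Count the monomials of total degree <= `degree` in `n_features` variables.

    Assumption: a full expansion (every power and interaction term is kept).
    """
    total = comb(n_features + degree, degree)
    # The bias column is the single degree-0 monomial.
    return total if include_bias else total - 1


def expansion_factor(n_features: int, degree: int) -> float:
    """How many times wider the transformed array is than the input."""
    return n_output_features(n_features, degree) / n_features


if __name__ == "__main__":
    for n, d in [(10, 2), (10, 3), (100, 2)]:
        print(n, d, n_output_features(n, d), round(expansion_factor(n, d), 1))
```

With the row count unchanged, the expansion factor is also roughly the factor by which each chunk grows in memory, which is why an input that fits in RAM may produce an output that does not.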

@TomAugspurger
Member Author

TomAugspurger commented Sep 14, 2018 via email

@datajanko
Contributor

Thanks for the quick response, Tom. It makes sense, and putting some burden on the user is completely fine with me.

@datajanko
Contributor

I don't know if it's better to discuss here or in the WIP pull requests.

So far I have learned a lot about dask's internals, which is nice.

However, for the next steps, I need some feedback on the following issues:

  • What should I do with chunks of unknown size/shape? This can happen if you have a dask DataFrame and call values.
  • Concerning DataFrames: I'd tackle it by converting the DataFrame to a dask array/NumPy array, doing the transforms there, and then constructing a new frame with columns from the get_feature_names method. Comments on that?
  • Sparse matrices/frames: I'm not sure what is necessary here and what should be implemented.

Any further comment (on anything) is welcome.
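
For the DataFrame point above, the column labels of the expanded frame can be derived from the input column names alone, without touching the data. The following is a minimal sketch in plain Python; the helper `polynomial_feature_names` is hypothetical, and the ordering (by degree, then by combination) and the label format are assumptions for illustration that may not match what scikit-learn's get_feature_names returns exactly.

```python
from collections import Counter
from itertools import combinations_with_replacement


def polynomial_feature_names(columns, degree, include_bias=True):
    """Build one label per monomial of total degree <= `degree`.

    Hypothetical helper: the label format here is an assumption.
    """
    names = []
    start = 0 if include_bias else 1
    for d in range(start, degree + 1):
        for combo in combinations_with_replacement(columns, d):
            if not combo:
                names.append("1")  # the bias (degree-0) column
                continue
            powers = Counter(combo)  # column name -> exponent, in input order
            names.append(
                " ".join(c if p == 1 else f"{c}^{p}" for c, p in powers.items())
            )
    return names


if __name__ == "__main__":
    print(polynomial_feature_names(["a", "b"], degree=2))
```

Because the labels depend only on the column names and the degree, they can be computed up front and attached to the transformed array when the new frame is constructed.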

2 participants