Column transformer and mixed sparse arrays #394
Conversation
So far this is just a test. We have a problem because we want the following:
but currently we're not able to differentiate between dense and sparse arrays. I've raised this upstream here: dask/dask#4070
I'm looking around for other mechanisms to tell if we have sparse outputs. It looks like there is fairly common use of the
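The comment above is truncated, but a common mechanism for telling the two containers apart is `scipy.sparse.issparse` (hedged: this may or may not be the exact helper the comment refers to). Note that for a dask array this check says nothing about the chunk type, which is exactly the gap raised upstream in dask/dask#4070:

```python
# Sketch: scipy.sparse.issparse distinguishes a scipy sparse matrix
# from a dense ndarray. A dask array wrapping sparse chunks would not
# be caught by this test, which is the problem discussed here.
import numpy as np
import scipy.sparse as sp

dense = np.zeros((3, 3))
sparse = sp.csr_matrix(dense)

print(sp.issparse(sparse))  # True
print(sp.issparse(dense))   # False
```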
For things like
Do we want to do this as a stop-gap, or is there some other approach we can take? We could also take in
I'm not sure. That feels fragile, but barring something like dask/dask#2977, I'm not sure there's a perfect way to do this.
FWIW, sklearn's ColumnTransformer has a
That may be new in sklearn dev. I think setting that to 1 would be the same as accepting
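For context, sklearn's `ColumnTransformer` exposes a `sparse_threshold` parameter (possibly what the truncated comment above refers to): per its documentation, the stacked result stays sparse only when some transformer output is sparse and the overall density is below the threshold. A rough pure-Python sketch of that decision rule, with an illustrative function name that is not dask-ml's API:

```python
# Sketch of a sparse_threshold-style decision, mirroring the documented
# sklearn behavior: keep the stacked result sparse only if some block
# is sparse and the overall density stays below the threshold.
def choose_output(any_sparse, overall_density, sparse_threshold):
    if any_sparse and overall_density < sparse_threshold:
        return "sparse"
    return "dense"

print(choose_output(True, 0.05, 0.3))   # sparse
print(choose_output(True, 0.9, 0.3))    # dense
print(choose_output(False, 0.0, 0.3))   # dense
```

Under this rule, a threshold of 1 means "sparse whenever anything is sparse", and 0 forces dense output.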
Another option is that we could reverse the default and always have dataframe + array -> array. This has some obvious usability drawbacks though.
As a workaround for the criteo case study, you could manually convert your dataframes to arrays when you know you're done with them with something like #372, or by subclassing the dask-ml estimator and just doing

```python
class MyTransformer(Transformer):
    def predict(self, X):
        return super().predict(X).values
```

and using `MyTransformer` in the pipeline. Not the most elegant, but it may get you unstuck for now.
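The subclassing workaround above can be sketched end to end. `Transformer` here is a minimal stand-in for the dask-ml estimator being wrapped (whose `predict` returns a DataFrame), not the real class:

```python
import numpy as np
import pandas as pd

class Transformer:
    """Stand-in for a dask-ml estimator whose predict returns a DataFrame."""
    def predict(self, X):
        return pd.DataFrame({"y": np.zeros(len(X))})

class MyTransformer(Transformer):
    def predict(self, X):
        # Unwrap the DataFrame so downstream pipeline steps see an array.
        return super().predict(X).values

out = MyTransformer().predict([1, 2, 3])
print(type(out).__name__)  # ndarray
```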
I realize now that the previous implementation was buggy.
We could convert arrays to dataframes though. I pushed up a commit doing what you just said, but it's not clear to me that it's best. I'm playing around a bit.
I'm not sure what we should do here. I don't think that there is a clear best solution (barring dask.array improving significantly in the near future). Usually I would break this tie by the importance of different applications that we see, but we're currently short on those.
A keyword is inelegant, but maybe the best we can do here. Leave it up to the user to decide.
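A sketch of what such a user-facing keyword could look like. The name `preserve_dataframe` and the function are illustrative only, not dask-ml's actual API:

```python
import numpy as np
import pandas as pd

def transform(df, preserve_dataframe=False):
    # Stand-in for the real per-column transformations.
    result = df * 2
    # Illustrative keyword: let the user decide the output container
    # instead of the library guessing between dense and sparse.
    return result if preserve_dataframe else result.values

df = pd.DataFrame({"a": [1.0, 2.0]})
print(type(transform(df)).__name__)                           # ndarray
print(type(transform(df, preserve_dataframe=True)).__name__)  # DataFrame
```

The default shown here (`False`, i.e. always return an array) matches the preference voiced below.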
What should the default be?
I think an array, but either seems fine.
I started working on this and then noticed that we already had a
As you note
Yeah, I'm mostly concerned about the user experience. A user might reasonably expect that if they set
We would like the column transformer to work well when some of the results are scipy.sparse arrays.
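Concretely, the mixed-output situation looks like this: one transformer yields a dense ndarray (e.g. scaled numeric columns) while another yields a scipy sparse matrix (e.g. one-hot encoded columns), and the blocks must be stacked column-wise. A small sketch of the in-memory case:

```python
import numpy as np
import scipy.sparse as sp

dense_block = np.ones((4, 2))            # e.g. scaled numeric columns
sparse_block = sp.csr_matrix(np.eye(4))  # e.g. one-hot encoded columns

# scipy.sparse.hstack accepts the mix of dense and sparse blocks,
# and the combined result comes back as a sparse matrix.
combined = sp.hstack([dense_block, sparse_block])
print(combined.shape)         # (4, 6)
print(sp.issparse(combined))  # True
```

The difficulty discussed in this thread is doing the equivalent lazily with dask, where the dense/sparse distinction of the chunks is not visible up front.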