
Column transformer and mixed sparse arrays #394

Open · wants to merge 5 commits into base: main

Conversation

@mrocklin (Member) commented Oct 9, 2018

We would like for column transformer to work well when some of the results are scipy.sparse arrays.

@mrocklin (Member Author) commented Oct 9, 2018

So far this is just a test. We have a problem because we want the following:

dataframe + numpy -> dataframe
dataframe + sparse array -> array

but currently we're not able to differentiate between dense and sparse arrays. I've raised this upstream here: dask/dask#4070
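The difficulty is that a dask array does not expose its chunk type at graph-construction time. dask later grew a `._meta` attribute (the direction discussed in dask/dask#2977), which makes a heuristic like the following possible; `is_sparse_backed` here is a hypothetical helper for illustration, not part of dask or dask-ml:

```python
import dask.array as da
import scipy.sparse


def is_sparse_backed(x):
    # Hypothetical helper: on newer dask versions, ``x._meta`` holds a
    # zero-length example of the underlying chunk type, so we can check it
    # for sparseness without computing anything. Older dask has no
    # ``_meta``, in which case we conservatively assume dense.
    meta = getattr(x, "_meta", None)
    return meta is not None and scipy.sparse.issparse(meta)


dense = da.ones((4, 4), chunks=2)
print(is_sparse_backed(dense))  # chunks are plain numpy arrays -> False
```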

@mrocklin (Member Author) commented Oct 9, 2018

I'm looking around for other mechanisms to tell whether we have sparse outputs. It looks like the sparse_output property is used fairly commonly within scikit-learn. I wonder whether that is a convention and, if so, whether we might copy or extend it.

@TomAugspurger (Member)

For things like OneHotEncoder (probably others) we could inspect get_params for a sparse key.

In [6]: OneHotEncoder().get_params()['sparse']
Out[6]: True
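That heuristic could be sketched as a small helper; `returns_sparse` is a hypothetical name, and because later scikit-learn releases renamed the parameter from `sparse` to `sparse_output`, the sketch checks both keys:

```python
def returns_sparse(estimator):
    # Hypothetical heuristic: inspect get_params() for a sparse-output
    # flag. Check the newer parameter name first, then the older one;
    # default to dense when neither is present.
    if not hasattr(estimator, "get_params"):
        return False
    params = estimator.get_params()
    for key in ("sparse_output", "sparse"):
        if key in params:
            return bool(params[key])
    return False
```

This stays a guess, of course: it only reflects what the transformer was configured to do, not what it actually returned.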

@mrocklin (Member Author) commented Oct 9, 2018

Do we want to do this as a stop-gap, or is there some other approach we can take? We could also take in sparse_output as a keyword to the make_column_transformer function.

@TomAugspurger (Member)

I'm not sure. That feels fragile, but barring something like dask/dask#2977, I'm not sure there's a perfect way to do this.

@TomAugspurger (Member)

FWIW, sklearn's ColumnTransformer has a sparse_threshold keyword.

sparse_threshold : float, default = 0.3
    If the transformed output consists of a mix of sparse and dense data,
    it will be stacked as a sparse matrix if the density is lower than this
    value. Use ``sparse_threshold=0`` to always return dense.
    When the transformed output consists of all sparse or all dense data,
    the stacked result will be sparse or dense, respectively, and this
    keyword will be ignored.

That may be new in sklearn dev. I think setting that to 1 would be the same as accepting sparse_output?
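The rule that keyword describes can be sketched like this; `stack_results` is a simplified, hypothetical reimplementation for illustration, not sklearn's actual code:

```python
import numpy as np
import scipy.sparse as sp


def stack_results(blocks, sparse_threshold=0.3):
    # Sketch of the sparse_threshold stacking rule: all-dense and
    # all-sparse inputs keep their format; a mix stays sparse only if the
    # overall density is below the threshold.
    any_sparse = any(sp.issparse(b) for b in blocks)
    if not any_sparse:
        return np.hstack(blocks)
    if all(sp.issparse(b) for b in blocks):
        return sp.hstack(blocks).tocsr()
    # Mixed case: compute the combined density and compare to threshold.
    cells = sum(b.shape[0] * b.shape[1] for b in blocks)
    nnz = sum(b.nnz if sp.issparse(b) else np.count_nonzero(b)
              for b in blocks)
    if nnz / cells < sparse_threshold:
        return sp.hstack([sp.csr_matrix(b) for b in blocks]).tocsr()
    return np.hstack([b.toarray() if sp.issparse(b) else b
                      for b in blocks])
```

Under this reading, `sparse_threshold=1` keeps any mixed result sparse, and `sparse_threshold=0` forces mixed results dense.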

@mrocklin (Member Author)

Another option is that we could reverse the default and always have dataframe + array -> array. This has some obvious usability drawbacks though.

@TomAugspurger (Member)

dataframe_threshold? :) I'm not sure what's best here.

As a workaround for the Criteo case study, you could manually convert your dataframes to arrays once you know you're done with them, using something like #372, or by subclassing the dask-ml estimator and just doing

class MyTransformer(Transformer):
    def predict(self, X):
        return super().predict(X).values

and using MyTransformer in the pipeline. Not the most elegant, but it may get you unstuck for now.

@TomAugspurger (Member)

I realize now that the previous implementation was buggy. dd.concat and pd.concat don't know how to handle arrays, so a dataframe + array has to be an array.
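For example, with plain pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
arr = np.ones((2, 2))

try:
    pd.concat([df, arr], axis=1)
except TypeError as e:
    # pd.concat only accepts Series/DataFrame objects, so mixing in a raw
    # ndarray raises a TypeError rather than concatenating.
    print(e)
```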

@mrocklin (Member Author)

We could convert arrays to dataframes though.

I pushed up a commit doing what you just said, but it's not clear to me that it's best. I'm playing around a bit.

@mrocklin (Member Author)

I'm not sure what we should do here. I don't think that there is a clear best solution (barring dask.array improving significantly in the near future).

Usually I would break this tie by the relative importance of the applications we see, but we're currently short on those.

@TomAugspurger (Member)

A keyword is inelegant, but maybe the best we can do here. Leave it up to the user to decide.

@mrocklin (Member Author)

What should the default be?

@TomAugspurger (Member) commented Oct 12, 2018 via email

@mrocklin (Member Author)

> A keyword is inelegant, but maybe the best we can do here. Leave it up to the user to decide.

I started working on this and then noticed that we already had a preserve_dataframe keyword, which has a semi-overlapping purpose. It seems like we might want to combine these two in some way, although I'm not sure how. Any creative thoughts would be welcome here :)

@mrocklin changed the title from "WIP - column transformer and mixed sparse arrays" to "Column transformer and mixed sparse arrays" on Oct 15, 2018
@TomAugspurger (Member)

As you note, preserve_dataframe isn't entirely overlapping. The fact that it's used elsewhere in dask_ml led me not to suggest it as the keyword to control this behavior here.

@mrocklin
Copy link
Member Author

Yeah, I'm mostly concerned about the user experience. A user might reasonably expect that if they set preserve_dataframe=True that they'll get a dataframe out.
