
Column transformer and mixed sparse arrays #394

Open · wants to merge 5 commits into base: main

Conversation

@mrocklin (Member) commented Oct 9, 2018

We would like for column transformer to work well when some of the results are scipy.sparse arrays.

@mrocklin (Member Author) commented Oct 9, 2018

So far this is just a test. We have a problem because we want the following:

dataframe + numpy -> dataframe
dataframe + sparse array -> array

but currently we're not able to differentiate between dense and sparse arrays. I've raised this upstream here: dask/dask#4070
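The difficulty is that a dask array does not expose its chunk type at graph-construction time. dask later grew a `._meta` attribute (the direction discussed in dask/dask#2977), which makes a heuristic like the following possible; `is_sparse_backed` here is a hypothetical helper for illustration, not part of dask or dask-ml:

```python
import dask.array as da
import scipy.sparse


def is_sparse_backed(x):
    # Hypothetical helper: on newer dask versions, ``x._meta`` holds a
    # zero-length example of the underlying chunk type, so we can check it
    # for sparseness without computing anything. Older dask has no
    # ``_meta``, in which case we conservatively assume dense.
    meta = getattr(x, "_meta", None)
    return meta is not None and scipy.sparse.issparse(meta)


dense = da.ones((4, 4), chunks=2)
print(is_sparse_backed(dense))  # chunks are plain numpy arrays -> False
```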

@mrocklin (Member Author) commented Oct 9, 2018

I'm looking around for other mechanisms to tell whether we have sparse outputs. It looks like the sparse_output property is used fairly commonly within scikit-learn. I wonder whether that is a convention and, if so, whether we might copy or extend it.

@TomAugspurger (Member)

For things like OneHotEncoder (probably others) we could inspect get_params for a sparse key.

In [6]: OneHotEncoder().get_params()['sparse']
Out[6]: True
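That heuristic could be sketched as a small helper; `returns_sparse` is a hypothetical name, and because later scikit-learn releases renamed the parameter from `sparse` to `sparse_output`, the sketch checks both keys:

```python
def returns_sparse(estimator):
    # Hypothetical heuristic: inspect get_params() for a sparse-output
    # flag. Check the newer parameter name first, then the older one;
    # default to dense when neither is present.
    if not hasattr(estimator, "get_params"):
        return False
    params = estimator.get_params()
    for key in ("sparse_output", "sparse"):
        if key in params:
            return bool(params[key])
    return False
```

This stays a guess, of course: it only reflects what the transformer was configured to do, not what it actually returned.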

@mrocklin (Member Author) commented Oct 9, 2018

Do we want to do this as a stop-gap, or is there some other approach we can take? We could also take in sparse_output as a keyword to the make_column_transformer function.

@TomAugspurger (Member)

I'm not sure. That feels fragile, but barring something like dask/dask#2977, I'm not sure there's a perfect way to do this.

@TomAugspurger (Member)

FWIW, sklearn's ColumnTransformer has a sparse_threshold keyword.

sparse_threshold : float, default = 0.3
    If the transformed output consists of a mix of sparse and dense data,
    it will be stacked as a sparse matrix if the density is lower than this
    value. Use ``sparse_threshold=0`` to always return dense.
    When the transformed output consists of all sparse or all dense data,
    the stacked result will be sparse or dense, respectively, and this
    keyword will be ignored.

That may be new in sklearn dev. I think setting that to 1 would be the same as accepting sparse_output?
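The rule that keyword describes can be sketched like this; `stack_results` is a simplified, hypothetical reimplementation for illustration, not sklearn's actual code:

```python
import numpy as np
import scipy.sparse as sp


def stack_results(blocks, sparse_threshold=0.3):
    # Sketch of the sparse_threshold stacking rule: all-dense and
    # all-sparse inputs keep their format; a mix stays sparse only if the
    # overall density is below the threshold.
    any_sparse = any(sp.issparse(b) for b in blocks)
    if not any_sparse:
        return np.hstack(blocks)
    if all(sp.issparse(b) for b in blocks):
        return sp.hstack(blocks).tocsr()
    # Mixed case: compute the combined density and compare to threshold.
    cells = sum(b.shape[0] * b.shape[1] for b in blocks)
    nnz = sum(b.nnz if sp.issparse(b) else np.count_nonzero(b)
              for b in blocks)
    if nnz / cells < sparse_threshold:
        return sp.hstack([sp.csr_matrix(b) for b in blocks]).tocsr()
    return np.hstack([b.toarray() if sp.issparse(b) else b
                      for b in blocks])
```

Under this reading, `sparse_threshold=1` keeps any mixed result sparse, and `sparse_threshold=0` forces mixed results dense.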

@mrocklin (Member Author)

Another option is that we could reverse the default and always have dataframe + array -> array. This has some obvious usability drawbacks though.

@TomAugspurger (Member)

dataframe_threshold? :) I'm not sure what's best here.

As a workaround for the Criteo case study, you could manually convert your dataframes to arrays once you know you're done with them, using something like #372, or by subclassing the dask-ml estimator and just doing

class MyTransformer(Transformer):
    def predict(self, X):
        return super().predict(X).values

and using MyTransformer in the pipeline. Not the most elegant, but it may get you unstuck for now.

@TomAugspurger (Member)

I realize now that the previous implementation was buggy. dd.concat and pd.concat don't know how to handle arrays, so a dataframe + array has to be an array.
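For example, with plain pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
arr = np.ones((2, 2))

try:
    pd.concat([df, arr], axis=1)
except TypeError as e:
    # pd.concat only accepts Series/DataFrame objects, so mixing in a raw
    # ndarray raises a TypeError rather than concatenating.
    print(e)
```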

@mrocklin (Member Author)

We could convert arrays to dataframes though.

I pushed up a commit doing what you just said, but it's not clear to me that it's best. I'm playing around a bit.

@mrocklin (Member Author)

I'm not sure what we should do here. I don't think that there is a clear best solution (barring dask.array improving significantly in the near future).

Usually I would break this tie by the relative importance of the applications we see, but we're currently short on those.

@TomAugspurger (Member)

A keyword is inelegant, but maybe the best we can do here. Leave it up to the user to decide.

@mrocklin (Member Author)

What should the default be?

@TomAugspurger (Member) commented Oct 12, 2018 via email

@mrocklin (Member Author)

> A keyword is inelegant, but maybe the best we can do here. Leave it up to the user to decide.

I started working on this and then noticed that we already had a preserve_dataframe keyword, which has a semi-overlapping purpose. It seems like we might want to combine these two in some way, although I'm not sure how. Any creative thoughts would be welcome here :)

@mrocklin changed the title from "WIP - column transformer and mixed sparse arrays" to "Column transformer and mixed sparse arrays" on Oct 15, 2018
@TomAugspurger (Member)

As you note, preserve_dataframe isn't entirely overlapping. The fact that it's used elsewhere in dask_ml led me not to suggest it as the keyword to control this behavior here.

@mrocklin
Copy link
Member Author

Yeah, I'm mostly concerned about the user experience. A user might reasonably expect that if they set preserve_dataframe=True that they'll get a dataframe out.
