Candidate transformers #6

Open · 2 of 8 tasks
TomAugspurger opened this issue Sep 25, 2017 · 13 comments

@TomAugspurger (Member) commented Sep 25, 2017

I think it'd be nice to have some transformers that work on dask and numpy arrays, and on dask and pandas DataFrames. This would be good since

  1. We can depend on dask and pandas; scikit-learn can't
  2. A basic transformer is generally much less work than a full-blown estimator

API

See https://github.com/tomaugspurger/sktransformers for some inspiration (and maybe some tests)?

  • We should match scikit-learn as closely as possible where things overlap
  • All transformers should take an optional columns argument. When specified, the transformation is limited to those columns (e.g., for a standard scaling with columns=['A', 'B'], only 'A' and 'B' are scaled); by default, all columns are transformed. See the sketch after this list.
  • We should operate on np.ndarray, dask.array.Array, pandas.core.NDFrame, dask.dataframe._Frame.
  • Should our operations be nan-safe?
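
A rough sketch of the intended columns semantics (pandas-only and purely illustrative; the class body and attribute names are not a settled API):

import pandas as pd

class StandardScaler:
    def __init__(self, columns=None):
        self.columns = columns  # None means "transform every column"

    def fit(self, X, y=None):
        columns = self.columns if self.columns is not None else X.columns
        self.mean_ = X[columns].mean()
        self.std_ = X[columns].std()
        return self

    def transform(self, X):
        columns = self.columns if self.columns is not None else X.columns
        X = X.copy()
        X[columns] = (X[columns] - self.mean_) / self.std_
        return X

df = pd.DataFrame({'A': [1., 2.], 'B': [3., 4.], 'C': [5., 6.]})
StandardScaler(columns=['A', 'B']).fit(df).transform(df)  # 'C' passes through untouched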

The big question right now is: should fitting be eager, and should fitted values be concrete? e.g.

scaler = StandardScaler()
scaler.fit(X)  # X is a dask.array

So, has scaler.mean_ been computed, and is it a dask.array or a numpy.array? This is a big decision.
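
To make the two possibilities concrete (a small sketch, assuming a dask.array input):

import dask.array as da

X = da.random.random((1000, 3), chunks=(100, 3))

# Lazy: mean_ stays a dask.array; nothing runs until someone asks for it.
lazy_mean = X.mean(axis=0)              # a dask.array

# Eager: fit() calls compute() itself, so mean_ is a concrete numpy array.
eager_mean = X.mean(axis=0).compute()   # a numpy.ndarray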

Candidates

@mrocklin (Member)

What are the cases where you don't want to compute immediately after calling fit?

  1. Fitting multiple estimators at once?
  2. Parameter searches?
  3. ...?

cc @jcrist who I think thought about this for dask-searchcv

@TomAugspurger (Member Author)

What are the cases where you don't want to compute immediately after calling fit?

I was trying to think of cases where multiple stages of a Pipeline could be fused together by dask into a single .compute. Something like

  • scale columns [0, 1] by mean and variance
  • categorical encode columns [2, 3]

In a pipeline that's

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

As a programmer, I know that the DummyEncoder operation doesn't depend on the scaling step in this specific case. Ideally, I could share the common tasks, like reading X off disk, across the two operations. I haven't thought about this very deeply yet :)
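
For example, if both fit steps stayed lazy, a single compute could merge their graphs and share the work of reading X (a rough sketch; the file name and column names are made up, and all pipeline machinery is skipped):

import dask
import dask.dataframe as dd

# Reading the data is the expensive, shareable task.
df = dd.read_csv('data-*.csv')

# The "fit" work of the two stages, kept lazy:
scale_stats = df[['x0', 'x1']].mean(), df[['x0', 'x1']].var()   # StandardScaler statistics
categories = df['x2'].unique(), df['x3'].unique()               # DummyEncoder categories

# One compute call evaluates both graphs together, so the read_csv tasks run once.
dask.compute(scale_stats, categories)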

@jcrist (Member) commented Sep 25, 2017

The big question right now is should fitting be eager, and fitted values concrete? e.g.

Off the top of my head I can think of a few different solutions:

  • A compute keyword to the transformer __init__ (e.g. compute=False), defaulting to True. The programmer is responsible for this. Easy, intuitive; this would be my recommendation (sketched at the end of this comment).

  • Always return dask.array objects from dask-based transformers, and let non-compatible scikit-learn transformers convert via np.asarray. This is potentially fragile if transforms are not robust to array-like objects.

  • A Pipeline subclass that always runs transforms collectively as a graph. Non-dask inputs to fit/fit_transform are coerced into dask inputs, then fed lazily through the pipeline. Non-dask transformers are run as a single task each, similar to if they were wrapped with dask.delayed, while dask-based transformers are free to use the dask array/dataframe API. Compute is called at the end of the pipeline by default to match scikit-learn's eager evaluation, but a delayed option is available to support dask pipelines containing dask pipelines.

    You might implement this with either a base-class check to see if a lazy kwarg is supported, or a method check to see if a transformer supports lazy fit/transform. Something like:

    # Using a base-class and a supported kwarg to `fit`/`fit_transform`
    if isinstance(transformer, DaskBaseEstimator):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform(Xt, y, compute=False)

    # Using duck-typing and custom method names
    if hasattr(transformer, 'fit_transform_dask'):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform_dask(Xt, y)

Of these I'd probably go with either 1 (easy, clear, intuitive) or 3 (harder to implement, may not play nice with all of scikit-learn, but probably more robust than 2).

So, has scaler.mean_ been computed, and is it a dask.array or a numpy.array? This is a big decision.

I think that after calling fit in immediate mode (whether this is the default or the only option depends on the solution picked above), all attributes (e.g. .mean_) should be concrete. This will mesh better with scikit-learn, and matches their eager evaluation model.
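
A minimal sketch of option 1 with concrete attributes after an eager fit (the class body is illustrative only, not a proposed implementation):

import dask
import dask.array as da

class StandardScaler:
    def __init__(self, compute=True):
        self.compute = compute

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        self.var_ = X.var(axis=0)
        if self.compute:
            # Eager (the default): attributes become concrete numpy arrays.
            self.mean_, self.var_ = dask.compute(self.mean_, self.var_)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.var_ ** 0.5

X = da.random.random((100, 4), chunks=(25, 4))
StandardScaler().fit(X).mean_               # numpy.ndarray
StandardScaler(compute=False).fit(X).mean_  # still a dask.array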

@TomAugspurger (Member Author)

Thanks @jcrist. Your proposal 1 seems the best. I think option 3, of running all the transforms as a single graph, will be interesting to experiment with at some point.

@dsevero (Contributor) commented Sep 26, 2017

Cool!

@TomAugspurger does it make sense to implement the MinMaxScaler? If so, I'll do it. It looks like a good entry point into the dask-ml philosophy, given that I'm familiar with sklearn.

@TomAugspurger (Member Author)

@daniel-severo yep, that'd be great to have. I'll add it to the list.

@dsevero (Contributor) commented Sep 26, 2017

With respect to being nan-safe: I think the user should handle this by applying the Imputer transformer to the inputs.

If operations within daskml end up producing np.nan results, it is probably due to some misuse by the user.
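
For illustration, the usual pattern would be to put an imputation step ahead of the scaler in the pipeline (a sketch using scikit-learn's own Imputer and StandardScaler, just to show the idea):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler

# Fill NaNs with the column mean before scaling, so the scaler never sees them.
pipe = make_pipeline(Imputer(strategy='mean'), StandardScaler())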

@dsevero (Contributor) commented Sep 27, 2017

MinMaxScaler: #9

@dsevero (Contributor) commented Oct 3, 2017

Imputer: #11

@jorisvandenbossche (Member)

CategoricalEncoder (TODO: check on Joris' recent work in sklearn here)

I picked up this work again last week, and I think the design should be rather fixed now (scikit-learn/scikit-learn#9151). For now, we opted to provide only two ways to encode: replacing with integers (categorical 'codes') and one-hot / dummy encoding (as there are many ways one could do 'categorical encoding'). Feedback is always welcome there!

Note that you might not want to follow the design exactly, as it does not use anything dataframe-specific (it can handle dataframes, but does not take advantage of them). E.g., the categories are specified as a positional list for the different columns, not as a dict of column name -> categories.
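
For reference, the two encodings being discussed correspond roughly to these pandas operations (a sketch to ground the terminology, not the proposed API):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')

codes = s.cat.codes          # integer 'codes' encoding: 0, 1, 0, 2
dummies = pd.get_dummies(s)  # one-hot / dummy encoding: columns 'a', 'b', 'c'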

All transformers should take an optional columns argument. When specified, the transformation will be limited to just columns (e.g. if doing a standard scaling, and columns=['A', 'B'], only 'A' and 'B' are scaled). By default, all columns are scaled

In scikit-learn, we are currently taking another route: instead of adding such a columns keyword to all transformers, I am working on a ColumnTransformer to perform such column-specific transformations (scikit-learn/scikit-learn#9012).

The example from above:

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

would then become

make_pipeline(
    make_column_transformer(
        ([0, 1], StandardScaler()),
        ([2, 3], DummyEncoder())
    ),
    ...  # regression/classifier
)

Maybe a bit less easy from the user's point of view, but it means that the transformers themselves don't need to be updated to handle column selection (and since it is a single object, it would naturally share the reading of X).
But of course, that doesn't mean you need to adopt the same pattern here.

@TomAugspurger (Member Author)

Thanks Joris, I'll try to take another look at the CategoricalEncoder soon.

I don't have much to add on the ColumnTransformer PR. It seems like make_column_transformer will be awkward to use, but perhaps not. Ideally, any transformer / estimator implemented in dask_ml could also be wrapped in make_column_transformer. I'll look through to see if that's the case.

@paolof89 commented Feb 5, 2019

I was looking for an implementation of a WOE (Weight of Evidence) scaler. I found one in a scikit-learn-contrib project: https://github.com/scikit-learn-contrib/categorical-encoding.
Do you have any thoughts about it? Do you think it makes sense to propose it as a candidate transformer?

@TomAugspurger (Member Author) commented Feb 5, 2019 via email

TomAugspurger pushed a commit to TomAugspurger/dask-ml that referenced this issue Oct 17, 2019
Add Regularizer classes; also closes issue dask#6