Candidate transformers #6

Open · 2 of 8 tasks
TomAugspurger opened this issue Sep 25, 2017 · 13 comments

@TomAugspurger (Member) commented Sep 25, 2017

I think it'd be nice to have some transformers that work on dask and numpy arrays, and on dask and pandas DataFrames. This would be good since

  1. We can depend on dask and pandas; scikit-learn can't
  2. A basic transformer is generally much less work than a full-blown estimator

API

See https://github.com/tomaugspurger/sktransformers for some inspiration (and maybe some tests)?

  • We should match scikit-learn as closely as possible where things overlap
  • All transformers should take an optional columns argument. When specified, the transformation is limited to those columns (e.g., for a standard scaling with columns=['A', 'B'], only 'A' and 'B' are scaled); by default, all columns are transformed. See the sketch after this list.
  • We should operate on np.ndarray, dask.array.Array, pandas.core.NDFrame, dask.dataframe._Frame.
  • Should our operations be nan-safe?
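
A rough sketch of the intended columns semantics (pandas-only and purely illustrative; the class body and attribute names are not a settled API):

import pandas as pd

class StandardScaler:
    def __init__(self, columns=None):
        self.columns = columns  # None means "transform every column"

    def fit(self, X, y=None):
        columns = self.columns if self.columns is not None else X.columns
        self.mean_ = X[columns].mean()
        self.std_ = X[columns].std()
        return self

    def transform(self, X):
        columns = self.columns if self.columns is not None else X.columns
        X = X.copy()
        X[columns] = (X[columns] - self.mean_) / self.std_
        return X

df = pd.DataFrame({'A': [1., 2.], 'B': [3., 4.], 'C': [5., 6.]})
StandardScaler(columns=['A', 'B']).fit(df).transform(df)  # 'C' passes through untouched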

The big question right now is: should fitting be eager, and should fitted values be concrete? e.g.

scaler = StandardScaler()
scaler.fit(X)  # X is a dask.array

So, has scaler.mean_ been computed, and is it a dask.array or a numpy.array? This is a big decision.
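
To make the two possibilities concrete (a small sketch, assuming a dask.array input):

import dask.array as da

X = da.random.random((1000, 3), chunks=(100, 3))

# Lazy: mean_ stays a dask.array; nothing runs until someone asks for it.
lazy_mean = X.mean(axis=0)              # a dask.array

# Eager: fit() calls compute() itself, so mean_ is a concrete numpy array.
eager_mean = X.mean(axis=0).compute()   # a numpy.ndarray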

Candidates

@mrocklin (Member)

What are the cases where you don't want to compute immediately after calling fit?

  1. Fitting multiple estimators at once?
  2. Parameter searches?
  3. ...?

cc @jcrist who I think thought about this for dask-searchcv

@TomAugspurger (Member Author)

What are the cases where you don't want to compute immediately after calling fit?

I was trying to think of cases where multiple stages of a Pipeline could be fused together by dask into a single .compute. Something like

  • scale columns [0, 1] by mean and variance
  • categorical encode columns [2, 3]

In a pipeline that's

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

As a programmer, I know that the DummyEncoder operation doesn't depend on the scaling step in this specific case. Ideally, I could share the common tasks, like reading X off disk, across the two operations. I haven't thought about this very deeply yet :)
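
For example, if both fit steps stayed lazy, a single compute could merge their graphs and share the work of reading X (a rough sketch; the file name and column names are made up, and all pipeline machinery is skipped):

import dask
import dask.dataframe as dd

# Reading the data is the expensive, shareable task.
df = dd.read_csv('data-*.csv')

# The "fit" work of the two stages, kept lazy:
scale_stats = df[['x0', 'x1']].mean(), df[['x0', 'x1']].var()   # StandardScaler statistics
categories = df['x2'].unique(), df['x3'].unique()               # DummyEncoder categories

# One compute call evaluates both graphs together, so the read_csv tasks run once.
dask.compute(scale_stats, categories)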

@jcrist (Member) commented Sep 25, 2017

The big question right now is should fitting be eager, and fitted values concrete? e.g.

Off the top of my head I can think of a few different solutions:

  • A compute keyword to the transformer __init__ (e.g. compute=False), defaulting to True. The programmer is responsible for this. Easy, intuitive; this would be my recommendation (sketched at the end of this comment).

  • Always return dask.array objects from dask-based transformers, and let non-compatible scikit-learn transformers convert via np.asarray. This is potentially fragile if transforms are not robust to array-like objects.

  • A Pipeline subclass that always runs transforms collectively as a graph. Non-dask inputs to fit/fit_transform are coerced into dask inputs, then fed lazily through the pipeline. Non-dask transformers are run as a single task each, similar to if they were wrapped with dask.delayed, while dask-based transformers are free to use the dask array/dataframe API. Compute is called at the end of the pipeline by default to match scikit-learn's eager evaluation, but a delayed option is available to support dask pipelines containing dask pipelines.

    You might implement this with either a base-class check to see if a lazy kwarg is supported, or a method check to see if a transformer supports lazy fit/transform. Something like:

    # Using a base-class and a supported kwarg to `fit`/`fit_transform`
    if isinstance(transformer, DaskBaseEstimator):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform(Xt, y, compute=False)

    # Using duck-typing and custom method names
    if hasattr(transformer, 'fit_transform_dask'):
        # lazily fit the next stage in the pipeline
        est, Xt = transformer.fit_transform_dask(Xt, y)

Of these I'd probably go with either 1 (easy, clear, intuitive) or 3 (harder to implement, may not play nice with all of scikit-learn, but probably more robust than 2).

So, has scaler.mean_ been computed, and is it a dask.array or a numpy.array? This is a big decision.

I think that after calling fit in immediate mode (whether this is the default or the only option depends on the solution picked above), all attributes (e.g. .mean_) should be concrete. This will mesh better with scikit-learn, and matches their eager evaluation model.
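
A minimal sketch of option 1 with concrete attributes after an eager fit (the class body is illustrative only, not a proposed implementation):

import dask
import dask.array as da

class StandardScaler:
    def __init__(self, compute=True):
        self.compute = compute

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        self.var_ = X.var(axis=0)
        if self.compute:
            # Eager (the default): attributes become concrete numpy arrays.
            self.mean_, self.var_ = dask.compute(self.mean_, self.var_)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.var_ ** 0.5

X = da.random.random((100, 4), chunks=(25, 4))
StandardScaler().fit(X).mean_               # numpy.ndarray
StandardScaler(compute=False).fit(X).mean_  # still a dask.array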

@TomAugspurger (Member Author)

Thanks @jcrist. Your proposal 1 seems the best. I think option 3, of running all the transforms as a single graph, will be interesting to experiment with at some point.

@dsevero (Contributor) commented Sep 26, 2017

Cool!

@TomAugspurger does it make sense to implement the MinMaxScaler? If so, I'll do it. It looks like a good entry point into the dask-ml philosophy, given that I'm familiar with sklearn.

@TomAugspurger (Member Author)

@daniel-severo yep, that'd be great to have. I'll add it to the list.

@dsevero (Contributor) commented Sep 26, 2017

With respect to being nan-safe: I think the user should handle this by applying the Imputer transformer to the inputs.

If operations within daskml end up producing np.nan results, it is probably due to some misuse by the user.
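
For illustration, the usual pattern would be to put an imputation step ahead of the scaler in the pipeline (a sketch using scikit-learn's own Imputer and StandardScaler, just to show the idea):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler

# Fill NaNs with the column mean before scaling, so the scaler never sees them.
pipe = make_pipeline(Imputer(strategy='mean'), StandardScaler())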

@dsevero (Contributor) commented Sep 27, 2017

MinMaxScaler: #9

@dsevero (Contributor) commented Oct 3, 2017

Imputer: #11

@jorisvandenbossche (Member)

CategoricalEncoder (TODO: check on Joris' recent work in sklearn here)

I picked up this work again last week, and I think the design should be rather fixed now (scikit-learn/scikit-learn#9151). For now, we opted to provide only two ways to encode: replacing with integers (categorical 'codes') and one-hot / dummy encoding (as there are many ways one could do 'categorical encoding'). Feedback is always welcome there!

Note that you might not want to follow the design exactly, as it does not use anything dataframe-specific (it can handle dataframes, but does not take advantage of them). E.g., the categories are specified as a positional list for the different columns, not as a dict of column name -> categories.
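
For reference, the two encodings being discussed correspond roughly to these pandas operations (a sketch to ground the terminology, not the proposed API):

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')

codes = s.cat.codes          # integer 'codes' encoding: 0, 1, 0, 2
dummies = pd.get_dummies(s)  # one-hot / dummy encoding: columns 'a', 'b', 'c'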

All transformers should take an optional columns argument. When specified, the transformation will be limited to just columns (e.g. if doing a standard scaling, and columns=['A', 'B'], only 'A' and 'B' are scaled). By default, all columns are scaled

In scikit-learn, we are currently taking another route: instead of adding such a columns keyword to all transformers, I am working on a ColumnTransformer to perform such column-specific transformations (scikit-learn/scikit-learn#9012).

The example from above:

make_pipeline(
    StandardScaler(columns=[0, 1]),
    DummyEncoder(columns=[2, 3])
)

would then become

make_pipeline(
    make_column_transformer(
        ([0, 1], StandardScaler()),
        ([2, 3], DummyEncoder())
    ),
    ...  # regression/classifier
)

Maybe a bit less easy from the user's point of view, but it means that the transformers themselves don't need to be updated to handle column selection (and since it is a single object, it would naturally share the reading of X).
But of course, that doesn't mean you need to adopt the same pattern here.

@TomAugspurger (Member Author)

Thanks Joris, I'll try to take another look at the CategoricalEncoder soon.

I don't have much to add on the ColumnTransformer PR. It seems like make_column_transformer will be awkward to use, but perhaps not. Ideally, any transformer / estimator implemented in dask_ml could also be wrapped in make_column_transformer. I'll look through to see if that's the case.

@paolof89 commented Feb 5, 2019

I was looking for an implementation of a WOE (Weight of Evidence) scaler. I found one in a scikit-learn-contrib project: https://github.com/scikit-learn-contrib/categorical-encoding.
Do you have any thoughts about it? Do you think it makes sense to propose it as a candidate transformer?

@TomAugspurger (Member Author) commented Feb 5, 2019 via email

TomAugspurger pushed a commit to TomAugspurger/dask-ml that referenced this issue Oct 17, 2019
Add Regularizer classes; also closes issue dask#6