Candidate transformers #6
Comments
What are the cases where you don't want to compute immediately after calling `fit`?

cc @jcrist, who I think has thought about this for dask-searchcv.
I was trying to think of cases where multiple stages of a pipeline could share computation. In a pipeline that scales some columns and dummy-encodes others, the stages are independent: as a programmer, I know that the dummy-encoding operation doesn't depend on the scaling step in this specific case. Ideally, I could share the common tasks, like reading the data, between the two stages (see the sketch below).
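A hypothetical sketch of the kind of pipeline being discussed; the `columns=` keyword and the `DummyEncoder`/`StandardScaler` imports follow the proposal in this issue rather than an existing API:

```python
import dask.dataframe as dd
from sklearn.pipeline import make_pipeline

# Hypothetical transformers accepting the proposed `columns=` keyword.
from dask_ml.preprocessing import DummyEncoder, StandardScaler

df = dd.read_csv('data-*.csv')          # shared I/O both stages need

pipe = make_pipeline(
    StandardScaler(columns=['A']),      # touches only column 'A'
    DummyEncoder(columns=['B']),        # touches only column 'B'
)

# The two fits are independent of each other, so ideally the read of
# the underlying data would be shared between them.
pipe.fit(df)
```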
Off the top of my head I can think of a few different solutions. Of these I'd probably go with either 1 (computing eagerly at the end of `fit`: easy, clear, intuitive) or 3 (running all the transforms as a single graph: harder to implement, may not play nice with all of scikit-learn, but probably more robust than 2).
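A minimal sketch of what option 1 might look like, assuming a hand-rolled dask-backed scaler; `dask.compute` turns the lazy statistics into concrete values at the end of `fit`:

```python
import dask
import dask.array as da

class EagerStandardScaler:
    """Sketch of option 1: compute fitted attributes eagerly in fit."""

    def fit(self, X):
        # Build both statistics lazily, then compute them in a single
        # call so the input is only traversed once.
        mean, var = dask.compute(X.mean(axis=0), X.var(axis=0))
        self.mean_ = mean            # concrete numpy arrays after fit
        self.scale_ = var ** 0.5
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

X = da.random.random((1000, 3), chunks=(100, 3))
scaler = EagerStandardScaler().fit(X)
print(type(scaler.mean_))  # numpy.ndarray, not a dask array
```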
I think that after calling `fit` …
Thanks @jcrist. Your proposal 1 seems the best. I think option 3, running all the transforms as a single graph, will be interesting to experiment with at some point.

Cool! @TomAugspurger, does it make sense to implement the `MinMaxScaler`? If so, I'll do it. It looks like a good entry point to the dask-ml philosophy, given that I'm familiar with sklearn.

@daniel-severo yep, that'd be great to have. I'll add it to the list.
With respect to being NaN-safe: I think the user should handle this using the `Imputer`. If operations within daskml end up throwing up …
MinMaxScaler: #9

Imputer: #11
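A sketch of that division of labor, assuming the proposed `MinMaxScaler` (#9) and `Imputer` (#11) end up with scikit-learn's `fit`/`transform` interface:

```python
import numpy as np
import dask.array as da

# Hypothetical imports: both classes were still proposals at this point.
from dask_ml.preprocessing import Imputer, MinMaxScaler

X = da.from_array(np.array([[1.0, np.nan],
                            [3.0, 4.0]]), chunks=2)

# The user removes NaNs explicitly up front, so downstream transformers
# can assume NaN-free input instead of special-casing missing values.
X_clean = Imputer(strategy='mean').fit_transform(X)
X_scaled = MinMaxScaler().fit_transform(X_clean)
```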
I picked this work up again last week, and I think the design should be rather fixed now (scikit-learn/scikit-learn#9151). For now, we opted to provide only two ways to encode: just replacing with integers (categorical 'codes') and one-hot / dummy encoding (as there are many ways one could do 'categorical encoding'). Feedback is always welcome there! Note that you might not want to follow the design exactly, as it does not use anything dataframe-specific (it can handle dataframes, but does not take advantage of them). E.g., the categories are specified as a positional list for the different columns, not as a dict of column name -> categories.
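For illustration, the two encodings with the `CategoricalEncoder` from that PR (a snapshot of the proposed API; in released scikit-learn the class was later split into `OneHotEncoder` and `OrdinalEncoder`):

```python
import pandas as pd

# CategoricalEncoder as proposed in scikit-learn/scikit-learn#9151.
from sklearn.preprocessing import CategoricalEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red']})

# Integer 'codes': note the categories are a positional list per
# column, not a dict of column name -> categories.
ordinal = CategoricalEncoder(encoding='ordinal',
                             categories=[['blue', 'red']])
print(ordinal.fit_transform(df))   # [[1.], [0.], [1.]]

# One-hot / dummy encoding.
onehot = CategoricalEncoder(encoding='onehot-dense')
print(onehot.fit_transform(df))    # [[0., 1.], [1., 0.], [0., 1.]]
```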
In scikit-learn, we are currently taking another route: instead of adding such a `columns` keyword to each transformer, there is a single `ColumnTransformer` that maps columns to transformers. The example from above would then become something like the sketch below. Maybe a bit less easy from the user's point of view, but it means that the transformers themselves don't need to be updated to handle column selection (and since it is a single object, it would naturally share the reading of X).
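A sketch of the `ColumnTransformer` route (following the scikit-learn PR; `OneHotEncoder` stands in here for the dummy-encoding step):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One object routes column groups to column-agnostic transformers,
# so X only needs to be read once for both steps.
ct = ColumnTransformer(
    [('scale', StandardScaler(), ['A']),    # scale only column 'A'
     ('encode', OneHotEncoder(), ['B'])],   # one-hot only column 'B'
    remainder='passthrough',                # leave other columns as-is
)
# ct.fit_transform(df) replaces the per-transformer columns= keyword.
```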
Thanks Joris, I'll try to take another look at the `CategoricalEncoder` soon. I don't have much to add on the `ColumnTransformer` PR. It seems like …
I was looking for an implementation of a WoE scaler (Weight of Evidence). I found one in a scikit-learn-contrib project: https://github.com/scikit-learn-contrib/categorical-encoding. Do you have thoughts about it? Do you think it makes sense to propose it as a candidate transformer?
Skimming the implementation, things seem doable. The bulk of the work seems to be in `OrdinalEncoder` (https://github.com/scikit-learn-contrib/categorical-encoding/blob/e3ce76f711f923e762722aa8d6cb44cb9a17742c/category_encoders/woe.py#L150), which is implemented in dask-ml.
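For reference, a minimal sketch of the statistic such a scaler computes, per category `WoE = ln(P(category | y=1) / P(category | y=0))`; this illustrates the idea and is not the category_encoders implementation (which adds regularization on top):

```python
import numpy as np
import pandas as pd

def weight_of_evidence(x: pd.Series, y: pd.Series) -> pd.Series:
    """Per-category WoE: ln(P(category | y=1) / P(category | y=0))."""
    events = x[y == 1].value_counts(normalize=True)
    nonevents = x[y == 0].value_counts(normalize=True)
    return np.log(events / nonevents)

x = pd.Series(['a', 'a', 'b', 'b', 'b'])
y = pd.Series([1, 0, 1, 1, 0])
print(weight_of_evidence(x, y))   # a ≈ -0.405, b ≈ 0.288
```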
Original issue

I think it'd be nice to have some transformers that work on dask and numpy arrays, and on dask and pandas DataFrames. This would be good since …

API

See https://github.com/tomaugspurger/sktransformers for some inspiration (and maybe some tests)?

- A `columns` argument. When specified, the transformation is limited to just `columns` (e.g. if doing a standard scaling and `columns=['A', 'B']`, only `'A'` and `'B'` are scaled). By default, all columns are scaled.
- Support for `np.ndarray`, `dask.array.Array`, `pandas.core.NDFrame`, and `dask.dataframe._Frame` inputs.

The big question right now is: should fitting be eager, and fitted values concrete?
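E.g., a sketch (the `StandardScaler` here stands in for any of the proposed transformers):

```python
import dask.dataframe as dd

# Hypothetical at the time this issue was written.
from dask_ml.preprocessing import StandardScaler

df = dd.read_csv('data-*.csv')

scaler = StandardScaler()
scaler.fit(df)    # does this trigger computation immediately?

scaler.mean_      # already-computed numpy/pandas values, or a lazy
                  # dask object that still needs .compute()?
```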
So, has `scaler.mean_` been computed, and is it a `dask.array` or a `numpy.array`? This is a big decision.

Candidates