Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds LabelEncoder #226

Merged
merged 9 commits into from Jun 26, 2018
Merged

Adds LabelEncoder #226

merged 9 commits into from Jun 26, 2018

Conversation

@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Jun 25, 2018

This PR adds LabelEncoder as a drop-in replacement for sklearn.preprocessing.LabelEncoder.

Copy link
Member

@TomAugspurger TomAugspurger left a comment

LGTM overall. One very minor comment.

__doc__ = sklabel.LabelEncoder.__doc__

def fit(self, y):
if isinstance(y, dd.Series):

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 25, 2018
Member

Slight preference for moving this to a LabelEnoder._check_array. Then it can be re-used with LabelEncodertransform.

This comment has been minimized.

@jrbourbeau

jrbourbeau Jun 25, 2018
Author Member

Can do @TomAugspurger, that seems reasonable to me


if isinstance(y, da.Array):
return y.map_blocks(lambda b: np.searchsorted(self.classes_, b),
dtype=self.classes_.dtype)

This comment has been minimized.

@mrocklin

mrocklin Jun 25, 2018
Member

Can I suggest placing self.classes_ as a keyword argument and dropping the lambda? Maybe something like the following:

return y.map_blocks(lambda block, classes=None: np.searchsorted(classes, block), dtype=self.classes_.dtype, classes=self.classes_)

This avoids placing self into the lambda, which might cause issues when serializing and sending to other workers.

This comment has been minimized.

@mrocklin

mrocklin Jun 25, 2018
Member

Or, something like the following might work as well and be a bit cleaner actually:

da.map_blocks(np.searchsorted, self.classes_, y)

This is probably cheaper to serialize and also keeps the nice searchsorted function name for the diagnostics.

This comment has been minimized.

@jrbourbeau

jrbourbeau Jun 25, 2018
Author Member

return da.map_blocks(np.searchsorted, self.classes_, y, dtype=self.classes_.dtype)

Seems to work just fine and is cleaner in my opinion, thanks for the recommendation @mrocklin! Not sure why I used the lambda in the first place...


a = dpp.LabelEncoder()
assert_eq_ar(a.inverse_transform(a.fit_transform(array)).compute(),
array.compute())

This comment has been minimized.

@mrocklin

mrocklin Jun 25, 2018
Member

Hopefully you don't need to call .compute here. The assert_eq function should handle this for you. It will also perform a number of other checks if it is provided dask objects directly.

This comment has been minimized.

@jrbourbeau

jrbourbeau Jun 26, 2018
Author Member

Currently the .tranform method is implemented such that LabelEncoder.transform(<dask-series>) returns a dask array. It looks like neither the assert_eq in dask.array.utils nor dask.dataframe.utils can compare a dask array and a dask series (although there could be a utility function I'm unaware of that will do this).

One way around this could be to remove the .computes and wrap dask series with da.asarray. This way assert_eq will get two arrays instead of an array and a series. Something like

if isinstance(array, da.Array):
    assert_eq_ar(a.inverse_transform(a.fit_transform(array)), array)
if isinstance(array, dd.Series):
    assert_eq_ar(a.inverse_transform(a.fit_transform(array)), da.asarray(array))

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 26, 2018
Member

Does dask_ml.utils._assert_eq work for you?

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 26, 2018
Member

Sorry, I should have read more closely. That won't quite work.

You could also make a scikit-learn LabelEncoder, and then check that the transformed & inverse_transformed versions match that.

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 26, 2018
Member

Actually, how about assert_eq_ar(a.inverse_transform(a.fit_transform(array)), da.asarray(array))

@TomAugspurger TomAugspurger merged commit 7c247a3 into dask:master Jun 26, 2018
3 checks passed
3 checks passed
ci/circleci: py27 Your tests passed on CircleCI!
Details
ci/circleci: py36 Your tests passed on CircleCI!
Details
ci/circleci: sklearn_dev Your tests passed on CircleCI!
Details
@TomAugspurger
Copy link
Member

@TomAugspurger TomAugspurger commented Jun 26, 2018

Thanks @jrbourbeau!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants