New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds LabelEncoder #226

Merged
merged 9 commits into from Jun 26, 2018

Conversation

Projects
None yet
3 participants
@jrbourbeau
Member

jrbourbeau commented Jun 25, 2018

This PR adds LabelEncoder as a drop-in replacement for sklearn.preprocessing.LabelEncoder.

@TomAugspurger

LGTM overall. One very minor comment.

__doc__ = sklabel.LabelEncoder.__doc__
def fit(self, y):
if isinstance(y, dd.Series):

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 25, 2018

Member

Slight preference for moving this to a LabelEnoder._check_array. Then it can be re-used with LabelEncodertransform.

This comment has been minimized.

@jrbourbeau

jrbourbeau Jun 25, 2018

Member

Can do @TomAugspurger, that seems reasonable to me

if isinstance(y, da.Array):
return y.map_blocks(lambda b: np.searchsorted(self.classes_, b),
dtype=self.classes_.dtype)

This comment has been minimized.

@mrocklin

mrocklin Jun 25, 2018

Member

Can I suggest placing self.classes_ as a keyword argument and dropping the lambda? Maybe something like the following:

return y.map_blocks(lambda block, classes=None: np.searchsorted(classes, block), dtype=self.classes_.dtype, classes=self.classes_)

This avoids placing self into the lambda, which might cause issues when serializing and sending to other workers.

This comment has been minimized.

@mrocklin

mrocklin Jun 25, 2018

Member

Or, something like the following might work as well and be a bit cleaner actually:

da.map_blocks(np.searchsorted, self.classes_, y)

This is probably cheaper to serialize and also keeps the nice searchsorted function name for the diagnostics.

This comment has been minimized.

@jrbourbeau

jrbourbeau Jun 25, 2018

Member
return da.map_blocks(np.searchsorted, self.classes_, y, dtype=self.classes_.dtype)

Seems to work just fine and is cleaner in my opinion, thanks for the recommendation @mrocklin! Not sure why I used the lambda in the first place...

a = dpp.LabelEncoder()
assert_eq_ar(a.inverse_transform(a.fit_transform(array)).compute(),
array.compute())

This comment has been minimized.

@mrocklin

mrocklin Jun 25, 2018

Member

Hopefully you don't need to call .compute here. The assert_eq function should handle this for you. It will also perform a number of other checks if it is provided dask objects directly.

This comment has been minimized.

@jrbourbeau

jrbourbeau Jun 26, 2018

Member

Currently the .tranform method is implemented such that LabelEncoder.transform(<dask-series>) returns a dask array. It looks like neither the assert_eq in dask.array.utils nor dask.dataframe.utils can compare a dask array and a dask series (although there could be a utility function I'm unaware of that will do this).

One way around this could be to remove the .computes and wrap dask series with da.asarray. This way assert_eq will get two arrays instead of an array and a series. Something like

if isinstance(array, da.Array):
    assert_eq_ar(a.inverse_transform(a.fit_transform(array)), array)
if isinstance(array, dd.Series):
    assert_eq_ar(a.inverse_transform(a.fit_transform(array)), da.asarray(array))

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 26, 2018

Member

Does dask_ml.utils._assert_eq work for you?

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 26, 2018

Member

Sorry, I should have read more closely. That won't quite work.

You could also make a scikit-learn LabelEncoder, and then check that the transformed & inverse_transformed versions match that.

This comment has been minimized.

@TomAugspurger

TomAugspurger Jun 26, 2018

Member

Actually, how about assert_eq_ar(a.inverse_transform(a.fit_transform(array)), da.asarray(array))

@TomAugspurger TomAugspurger merged commit 7c247a3 into dask:master Jun 26, 2018

3 checks passed

ci/circleci: py27 Your tests passed on CircleCI!
Details
ci/circleci: py36 Your tests passed on CircleCI!
Details
ci/circleci: sklearn_dev Your tests passed on CircleCI!
Details
@TomAugspurger

This comment has been minimized.

Member

TomAugspurger commented Jun 26, 2018

Thanks @jrbourbeau!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment