ENH: LabelEncoder supports pandas Categorical #310

Merged
merged 8 commits into from Jul 20, 2018

@TomAugspurger
Member

TomAugspurger commented Jul 19, 2018

Enhances LabelEncoder to use CategoricalDtype for pandas and dask series.

This improves performance and will help in implementing OneHotEncoder efficiently for dask dataframes.
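To illustrate the idea, here is a minimal pandas-only sketch (not the code in this PR) of why the dtype helps: a categorical series already carries its classes and integer codes, so an encoder can read them instead of scanning the values. The sample data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a known CategoricalDtype.
dtype = pd.api.types.CategoricalDtype(["a", "b", "c"])
s = pd.Series(["b", "a", "c", "b"], dtype=dtype)

# "fit" with dtype information: the classes are already recorded on the dtype.
classes_from_dtype = np.asarray(s.dtype.categories)

# "transform" with dtype information: the integer codes are already stored.
codes_from_dtype = s.cat.codes.to_numpy()

# Without dtype information, the values must be scanned to find the uniques.
classes_scanned, codes_scanned = np.unique(
    s.astype(object).to_numpy(), return_inverse=True
)

# The two encodings agree here only because the categories are in sorted order;
# np.unique always sorts, while a CategoricalDtype keeps its declared order.
print(classes_from_dtype, codes_from_dtype)
print(classes_scanned, codes_scanned)
```

Reading `dtype.categories` touches no data at all, which is one plausible reason the `fit` timings below differ so much between the two modes.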

import string
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask_ml.preprocessing

dtype = pd.api.types.CategoricalDtype(list(string.ascii_letters[:12]))
n = 1_000_000

data = dd.from_pandas(
    pd.Series(np.random.choice(dtype.categories, size=n), dtype=dtype),
    npartitions=n // 100_000
)

Setup

le = dask_ml.preprocessing.LabelEncoder()  # timed once with use_categorical=True and once with False
codes = le.fit_transform(data)

print('fit')
%timeit le.fit(data)
print('transform')
%timeit le.transform(data)

print('transform-compute')
%timeit le.transform(data).compute()

print('inverse_transform')
%timeit le.inverse_transform(codes)

print('inverse_transform-compute')
%timeit le.inverse_transform(codes).compute()

Results

| method | categorical | no categorical |
| --- | --- | --- |
| fit | 43.5 µs | 625 ms |
| transform | 678 µs | 33.1 ms |
| transform-compute | 4.98 ms | 128 ms |
| inverse_transform | 3.66 ms | 240 µs |
| inverse_transform-compute | 37 ms | 160 ms |
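For readers unfamiliar with the round trip being timed, the following is a rough pandas-only sketch of what `transform` followed by `inverse_transform` amounts to for categorical data. It assumes the encoder keeps the original `CategoricalDtype`; it is an illustration, not the dask-ml implementation, and the sample values are made up.

```python
import pandas as pd

# Hypothetical sample data with a known CategoricalDtype.
dtype = pd.api.types.CategoricalDtype(["a", "b", "c"])
original = pd.Series(["c", "a", "a", "b"], dtype=dtype)

# Forward direction: the integer codes backing the categorical values.
codes = original.cat.codes.to_numpy()

# Reverse direction: rebuild the categorical values from codes plus the dtype.
restored = pd.Series(
    pd.Categorical.from_codes(codes, dtype=dtype),
    index=original.index,
)

print(codes)
print(restored)
```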

cc @jrbourbeau


Member

TomAugspurger commented Jul 19, 2018

cc also @jorisvandenbossche if you're interested in what a future scikit-learn LabelEncoder that could use a pandas-like CategoricalDtype might look like.

TomAugspurger added some commits Jul 19, 2018

Contributor

jorisvandenbossche commented Jul 20, 2018

Cool! Note that in the meantime, we don't use LabelEncoder anymore directly in OneHotEncoder: scikit-learn/scikit-learn#10209 (both now use a common _encode function).
Although I suppose this doesn't matter too much for dask-ml (I don't think it would be possible to subclass OneHotEncoder to re-use the code that called LabelEncoder?)


Member

TomAugspurger commented Jul 20, 2018

> Note that in the meantime, we don't use LabelEncoder anymore directly in OneHotEncoder: scikit-learn/scikit-learn#10209 (both now use a common _encode function).

Thanks, I was working off an older branch. This will still be a nice standalone addition though, and I think it's worthwhile diverging from (getting ahead of?) scikit-learn here as using the dtype information is more important for large datasets.

Contributor

jorisvandenbossche commented Jul 20, 2018

Yeah, this is certainly a worthwhile addition!

TomAugspurger added some commits Jul 20, 2018

@TomAugspurger TomAugspurger merged commit 8f49b9d into dask:master Jul 20, 2018

5 checks passed

ci/circleci: py27 Your tests passed on CircleCI!
ci/circleci: py36 Your tests passed on CircleCI!
ci/circleci: sklearn_dev Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 95.25%)
codecov/project 95.45% (+0.19%) compared to f7e3058

@TomAugspurger TomAugspurger deleted the TomAugspurger:categorical-label-encoder branch Jul 20, 2018
