
ENH: RobustScaler #62

Merged 9 commits into dask:master on Jan 22, 2018

@jorisvandenbossche
Contributor

jorisvandenbossche commented Oct 24, 2017

We are having a mini coding workshop with Matthew, and @ogrisel and I made a very minimal implementation of a RobustScaler. It works on a toy dataset, but hasn't seen more extensive testing yet.

class RobustScaler(skdata.RobustScaler):
    def _check_array(self, X, copy):
        return X


@TomAugspurger

TomAugspurger Oct 24, 2017

Member

FYI: adding a utils.validation.check_array that works on dask arrays is near the top of my TODO list.
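The reason for the `_check_array` override in the diff can be sketched as follows. `_BaseScaler` is a hypothetical stand-in for the scikit-learn base class, which funnels input through `check_array` and would materialize a lazy dask array as a numpy array; the no-op override passes the input through untouched.

```python
import numpy as np

class _BaseScaler:
    # Hypothetical stand-in for the sklearn scaler base class, whose
    # check_array call coerces any array-like into a numpy array.
    def _check_array(self, X, copy):
        return np.asarray(X)

class RobustScaler(_BaseScaler):
    def _check_array(self, X, copy):
        # No-op override: return the input unchanged, so a dask array
        # stays lazy instead of being converted to a numpy array.
        return X

X = [[1.0, 2.0], [3.0, 4.0]]  # any array-like; a dask array in practice
assert RobustScaler()._check_array(X, copy=False) is X
```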

        raise ValueError("Invalid quantile range: %s" %
                         str(self.quantile_range))
    quantiles = [da.percentile(col, [q_min, 50., q_max]) for col in X.T]


@ogrisel

ogrisel Oct 25, 2017

It would be great to add an axis kwarg to da.percentile to mirror what is available in np.percentile.

BTW, I think the RobustScaler code in scikit-learn should also use np.percentile to compute both the median and the interquartile range in a single call, as done here.
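To illustrate the single-call approach being suggested: since `da.percentile` only accepts one-dimensional arrays, the diff above loops over columns, but with numpy one `np.percentile` call with `axis=0` yields the lower quantile, the median, and the upper quantile for every feature at once. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

q_min, q_max = 25.0, 75.0
# One call computes all three quantiles per feature; the result has
# shape (3, n_features): rows are the q_min, 50th, and q_max quantiles.
quantiles = np.percentile(X, [q_min, 50.0, q_max], axis=0)
center_ = quantiles[1]                # per-feature median
scale_ = quantiles[2] - quantiles[0]  # per-feature interquartile range
```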

@jorisvandenbossche


Contributor

jorisvandenbossche commented Oct 27, 2017

One question that came up is how to handle computed attributes during fit. In this PR I compute them, but in the other estimators already included in preprocessing/data.py the approach is not uniform: both compute and persist are used.

@TomAugspurger


Member

TomAugspurger commented Oct 27, 2017

I think for now we want to always compute the attributes (https://github.com/dask/dask-ml/pull/71/files#diff-b3693cf8f80e35f1ef5ec62caeebce81R64). In the future, we may add a compute keyword to each estimator.

The fitted attributes here are small, correct?

@ogrisel


ogrisel commented Oct 27, 2017

Yes, 2 float arrays of size n_features.

@jorisvandenbossche


Contributor

jorisvandenbossche commented Oct 27, 2017

Yes, but the StandardScaler has similarly small fitted attributes, yet there persist is used (though that could of course be changed to compute for consistency if we choose that option).

@TomAugspurger


Member

TomAugspurger commented Oct 27, 2017

Yes, we should change StandardScaler (and MinMaxScaler) to be concrete. Apologies for the confusion.
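The "always compute during fit" choice agreed on above could be sketched as follows. `fit_attributes` is a hypothetical numpy-only helper (not the PR's code); the comment marks where a dask input would be resolved eagerly.

```python
import numpy as np

def fit_attributes(X, quantile_range=(25.0, 75.0)):
    # Hypothetical helper mirroring the agreed behaviour: the fitted
    # attributes are small (size n_features), so they are returned as
    # concrete arrays rather than as lazy graphs.
    q_min, q_max = quantile_range
    q = np.percentile(X, [q_min, 50.0, q_max], axis=0)
    center_ = q[1]         # per-feature median
    scale_ = q[2] - q[0]   # per-feature interquartile range
    # With dask inputs this is where one would call
    #     center_, scale_ = dask.compute(center_, scale_)
    # rather than .persist(), per the discussion above.
    return center_, scale_

X = np.arange(20.0).reshape(10, 2)
center_, scale_ = fit_attributes(X)  # two arrays of shape (n_features,)
```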

@jorisvandenbossche jorisvandenbossche changed the title from [WIP] RobustScaler to ENH: RobustScaler Jan 22, 2018

# bigger data to make percentile more reliable
# and not centered around 0 to make rtol work
X, y = make_classification(n_samples=1000, chunks=200)


@ogrisel

ogrisel Jan 22, 2018

Maybe fix the seed: random_state=0.
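The point of fixing the seed can be shown with a small stand-in generator (`make_data` is hypothetical, not the project's make_classification): a fixed random_state makes the test data reproducible across runs, and shifting values away from 0 keeps relative-tolerance comparisons meaningful.

```python
import numpy as np

def make_data(n_samples=1000, n_features=5, random_state=None):
    # Hypothetical stand-in for make_classification.
    rng = np.random.default_rng(random_state)
    # Centered at 10, not 0, so rtol-based comparisons behave well.
    X = rng.normal(loc=10.0, scale=2.0, size=(n_samples, n_features))
    y = (X.sum(axis=1) > X.sum(axis=1).mean()).astype(int)
    return X, y

X1, _ = make_data(random_state=0)
X2, _ = make_data(random_state=0)
assert np.array_equal(X1, X2)  # same seed, identical data
```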

        assert_eq_ar(a.transform(X).compute(), b.transform(X.compute()))

    def test_inverse_transform(self):
        a = dpp.RobustScaler()


@ogrisel

ogrisel Jan 22, 2018

Better to rename a to scaler or rs.

@TomAugspurger TomAugspurger merged commit 7f0663f into dask:master Jan 22, 2018

2 checks passed

ci/circleci: py27 Your tests passed on CircleCI!
ci/circleci: py36 Your tests passed on CircleCI!