StandardScaler does not support a dataframe? #157

Closed
bnmnetp opened this Issue Apr 6, 2018 · 12 comments

@bnmnetp

bnmnetp commented Apr 6, 2018

Given the following code:

import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import StandardScaler

df = pd.DataFrame({'A':[1,2,3,4,2,2,3,4], 'B':[4,4,4,3,3,3,7,7], 'C':[2,2,4,3,2,3,2,7]})
xd = dd.from_pandas(df, npartitions=2)
scl = StandardScaler()
scl.fit(xd)

I get the error

TypeError: 'Series' object does not support item assignment

Does the dask_ml.preprocessing.StandardScaler not support dataframes? The documentation says it supports any of the Dask collections, so I assumed it would. I also tried to convert it to a Dask array but that gives me a different error: TypeError: 'float' object cannot be interpreted as an integer

Maybe I'm just doing something wrong?

@TomAugspurger

Member

TomAugspurger commented Apr 9, 2018

Not currently, though it shouldn't be hard to support. I think we should have StandardScaler define a _check_array that transforms dask dataframes to dask arrays (with .values).
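
Roughly something like this (just a sketch of the idea, not actual dask-ml code; the standalone helper name here only mirrors the suggestion above):

import dask.dataframe as dd

def _check_array(X):
    # Sketch: hand the estimator a dask array when a dask DataFrame/Series
    # comes in; .values returns the dask array backing the dataframe
    # (note its row chunk sizes are unknown).
    if isinstance(X, (dd.DataFrame, dd.Series)):
        X = X.values
    return X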

@mrocklin

Member

mrocklin commented Apr 12, 2018

Is this something that you would like to help contribute, @bnmnetp? It sounds like it might be a good first issue.

@bnmnetp

bnmnetp commented Apr 12, 2018

@mrocklin I would love to help out, but my hacking time is very limited during the semester, and responding to issues on my own projects keeps me pretty busy. I ran across this problem while trying to prepare an example for a class I'm teaching. If it's still lingering come summer, I may be able to help out.

@mrocklin

Member

mrocklin commented Apr 12, 2018

@cr458

Contributor

cr458 commented Apr 19, 2018

@mrocklin I could have a look at this if needed?

@mrocklin

Member

mrocklin commented Apr 19, 2018

Glad to hear it, @cr458! This project would benefit from your help.

For future reference, you don't need my permission to take on issues like this. You're right that it's probably good to announce your intent to ensure that no one else has unpublished work that might overlap, but in general anyone can work on anything. Even if there were some more formal process, I probably wouldn't be the one to ping; @TomAugspurger tends to lead development on dask-ml.

@cr458

Contributor

cr458 commented Apr 19, 2018

@TomAugspurger, @mrocklin it seems the error described by @bnmnetp when passing a dask.array to the StandardScaler.fit() method arises from this line: attributes['n_samples_seen_'] = len(X).

Since the chunks are not defined for the dask.array, the array in his example has shape (nan, 3), and this raises TypeError: 'float' object cannot be interpreted as an integer because of the np.nan.
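
To illustrate (a sketch using the dataframe from the original report; I'm assuming the conversion to an array was done with .values):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A':[1,2,3,4,2,2,3,4], 'B':[4,4,4,3,3,3,7,7], 'C':[2,2,4,3,2,3,2,7]})
arr = dd.from_pandas(df, npartitions=2).values   # dask array with unknown row chunks

print(arr.shape)   # (nan, 3)
len(arr)           # TypeError: 'float' object cannot be interpreted as an integer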

I noticed that we explicitly set attributes['n_samples_seen_'] = np.nan for the MinMaxScaler; would this be an acceptable solution for the StandardScaler as well? I'm not sure if this attribute gets called anywhere downstream.

Another solution would be something along the lines of:

if isinstance(X, da.Array):
    attributes['n_samples_seen_'] = len(X.compute())
else:
    attributes['n_samples_seen_'] = len(X)

@jakirkham

Member

jakirkham commented Apr 19, 2018

FWIW, the desire to know the chunk sizes of arrays with unknown chunks is a common enough problem that it's worth solving correctly and generally, IMHO. Issue dask/dask#3293 has some discussion along these lines. The open question is what API we should provide. Feedback in that issue would be welcome.

@mrocklin

Member

mrocklin commented Apr 19, 2018

Stepping back a moment, do we actually need to compute the length here? Is that strictly necessary for the StandardScaler transformation on large datasets? This might require some domain expertise to ensure that we make the right choice here.

@jakirkham

Member

jakirkham commented Apr 19, 2018

Looking at the code it seems like this length could also be computed lazily, which would solve the issue.

ref: https://github.com/dask/dask-ml/blob/v0.4.1/dask_ml/preprocessing/data.py#L43-L44
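
For example (just a sketch, not what dask-ml does today), the row count can stay lazy by counting per partition and summing, which gives a dask scalar instead of a concrete int:

import pandas as pd
import dask.dataframe as dd

xd = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3, 4, 2, 2, 3, 4]}), npartitions=2)

n_samples = xd.map_partitions(len).sum()   # lazy dask scalar; nothing computed yet
print(n_samples.compute())                 # 8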

@TomAugspurger

Member

TomAugspurger commented Apr 19, 2018

Stepping back a moment, do we actually need to compute the length here?

I think this is the right approach. n_samples_seen_ is really only useful for partial_fit to adjust statistics for incremental training. But we don't implement partial_fit.

Setting n_samples_seen_ to NaN seems fine. I just merged #162 with a 0.5.0 changelog. Changing this would go under API breaking changes.

That said, I don't think the issue (at least not the original issue) is with unknown lengths. The traceback I get is

TypeError                                 Traceback (most recent call last)
<ipython-input-34-66a7ea4404ff> in <module>()
----> 1 scl.fit(xd)

~/sandbox/dask-ml/dask_ml/preprocessing/data.py in fit(self, X, y)
     36             var_ = X.var(0)
     37             scale_ = var_.copy()
---> 38             scale_[scale_ == 0] = 1
     39             scale_ = da.sqrt(scale_)
     40             attributes['scale_'] = scale_

TypeError: 'Series' object does not support item assignment

I think it's best to just convert dataframes to arrays here, matching scikit-learn's behavior for now. You can do that directly in the method, or expand check_array with a convert_dataframe argument that gets the .values attribute from the dataframe if True.
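
For reference, scikit-learn's own check_array already converts a pandas DataFrame to a plain ndarray, which is the behavior to match here (small example using the public sklearn helper):

import pandas as pd
from sklearn.utils import check_array

out = check_array(pd.DataFrame({'A': [1, 2], 'B': [3, 4]}))
print(type(out))   # <class 'numpy.ndarray'>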

@cr458

Contributor

cr458 commented Apr 20, 2018

@TomAugspurger the issue with unknown lengths arises when attempting to pass a dask.Array (or converting from a dataframe) to the method.
Since we don't implement partial_fit, I'll just set n_samples_seen_ to NaN in any case. Thanks for the information!

@jakirkham out of curiosity, how would you compute that lazily? (I'm still fairly unfamiliar with dask, so any snippets of information are very welcome!)
