KMeans.fit on uncertain lengths #390

Closed
mrocklin opened this Issue Oct 7, 2018 · 4 comments

mrocklin commented Oct 7, 2018

This seems like a reasonable thing that ought to work. Any ideas on how we could make this happen?

https://stackoverflow.com/questions/52583316/how-to-pass-dask-dataframe-as-input-to-dask-ml-models

import dask.dataframe as dd
import pandas as pd
from dask_ml.cluster import KMeans

df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3, 4, 5], 
                                  'B': [6, 7, 8, 9, 10]}),
                    npartitions=2)

kmeans = KMeans()
kmeans.fit(df)

  1. Do we compute lengths explicitly but warn that we're doing so?
  2. Do we actually need explicit lengths? (I was somewhat surprised by this.)
TomAugspurger commented Oct 8, 2018

Do we compute lengths explicitly but warn that we're doing so?

With DataFrame.to_dask_array that's possible now. I'm not sure what's preferable.

Do we actually need explicit lengths (I was somewhat surprised by this)

  1. We could probably be smarter about this. init='random', for example, does idx = sorted(random_state.randint(0, len(X), size=n_clusters)). That doesn't work for unknown lengths as written, but we could rewrite it without too much effort.

  2. The default init='k-means||' may take a bit more effort, but at a glance it looks doable.

I think that once we get past the initialization and into Lloyd's algorithm, things will be fine.
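One way to rewrite the random initialization without ever calling len(X) is reservoir sampling, which draws a uniform sample from a stream of chunks while only tracking a running count. This is a rough sketch, not dask-ml's actual implementation: the function name is hypothetical, and a real version would run inside the task graph rather than as a Python loop over materialized chunks.

```python
import numpy as np

def reservoir_sample_rows(chunks, n_clusters, seed=0):
    """Hypothetical sketch: pick n_clusters rows uniformly at random
    from a stream of 2-D numpy chunks, without knowing the total
    number of rows up front (assumes at least n_clusters rows total).
    """
    rng = np.random.RandomState(seed)
    reservoir = None
    seen = 0
    for chunk in chunks:
        for row in chunk:
            seen += 1
            if reservoir is None:
                reservoir = np.empty((n_clusters,) + row.shape)
            if seen <= n_clusters:
                # fill the reservoir with the first n_clusters rows
                reservoir[seen - 1] = row
            else:
                # replace an existing entry with decreasing probability
                j = rng.randint(0, seen)
                if j < n_clusters:
                    reservoir[j] = row
    return reservoir
```

Each row ends up in the sample with probability n_clusters / total_rows, which matches what the uniform randint-over-len(X) version produces.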

@mrocklin

This comment has been minimized.

Member

mrocklin commented Oct 8, 2018

I would be inclined to select the initial points in a way that is not perfectly uniform. We might take a few points from each chunk. If we wanted to be careful we might take several points from each chunk along with the length of that chunk, and then down-select again based on those lengths.
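That two-stage idea might look something like the sketch below, written against plain NumPy chunks for clarity. The function name and the candidates-per-chunk choice are mine; in dask-ml the per-chunk pass would be one task per block, with only the small candidate arrays and lengths gathered back.

```python
import numpy as np

def sample_centers_from_chunks(chunks, n_clusters, seed=0):
    """Hypothetical sketch: draw a few candidate rows from each chunk
    (learning each chunk's length as a side effect), then down-select
    to n_clusters, weighting candidates by chunk length so larger
    chunks contribute proportionally more."""
    rng = np.random.RandomState(seed)
    candidates, weights = [], []
    for chunk in chunks:
        n = len(chunk)
        k = min(n, n_clusters)  # a few candidates per chunk
        idx = rng.choice(n, size=k, replace=False)
        candidates.append(chunk[idx])
        # each candidate stands in for n / k rows of its chunk
        weights.extend([n / k] * k)
    pool = np.concatenate(candidates)
    p = np.asarray(weights) / np.sum(weights)
    keep = rng.choice(len(pool), size=n_clusters, replace=False, p=p)
    return pool[keep]
```

The second rng.choice is the "down-select again based on those lengths" step: without the weighting, rows in small chunks would be over-represented relative to rows in large ones.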

ianalis commented Oct 8, 2018

The workaround I found is to call DataFrame.to_dask_array(lengths=True) for models that require known lengths, as @TomAugspurger mentioned above.

mrocklin commented Oct 8, 2018

I'll plan to take a stab at this this afternoon if no one else has started.

mrocklin added a commit to mrocklin/dask-ml that referenced this issue Oct 8, 2018

TomAugspurger added a commit that referenced this issue Oct 8, 2018

Support dataframes for k-means (#393)
* Support dataframes for k-means

Fixes #390