KMeans.fit on uncertain lengths #390

Closed
mrocklin opened this issue Oct 7, 2018 · 4 comments

Member

@mrocklin mrocklin commented Oct 7, 2018

This seems like a reasonable thing that ought to work. Any ideas on how we could make this happen?

https://stackoverflow.com/questions/52583316/how-to-pass-dask-dataframe-as-input-to-dask-ml-models

import dask.dataframe as dd
import pandas as pd
from dask_ml.cluster import KMeans

df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3, 4, 5], 
                                  'B': [6, 7, 8, 9, 10]}),
                    npartitions=2)

kmeans = KMeans()
kmeans.fit(df)

  1. Do we compute lengths explicitly but warn that we're doing so?
  2. Do we actually need explicit lengths? (I was somewhat surprised by this.)

Member

@TomAugspurger TomAugspurger commented Oct 8, 2018

> Do we compute lengths explicitly but warn that we're doing so?

With DataFrame.to_dask_array that's possible now. I'm not sure what's preferable.

> Do we actually need explicit lengths (I was somewhat surprised by this)?

We could probably be smarter about this. init='random', for example, does idx = sorted(random_state.randint(0, len(X), size=n_clusters)). That doesn't work for unknown lengths as written, but we could rewrite it without too much effort.
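One way the init='random' step could avoid len(X) is reservoir sampling over the chunks. The sketch below is only an illustration of that idea, not what dask-ml does: `reservoir_sample_rows` is a hypothetical helper, and plain NumPy arrays stand in for the blocks of a dask array.

```python
import numpy as np


def reservoir_sample_rows(chunks, k, seed=0):
    """Pick k rows uniformly at random from a stream of 2-D chunks
    without knowing the total number of rows up front (Algorithm R)."""
    rng = np.random.default_rng(seed)
    reservoir = []
    seen = 0
    for chunk in chunks:
        for row in np.asarray(chunk):
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k rows.
                reservoir.append(row.copy())
            else:
                # Keep the new row with probability k / seen.
                j = rng.integers(0, seen)  # uniform over [0, seen)
                if j < k:
                    reservoir[j] = row.copy()
    if len(reservoir) < k:
        raise ValueError("fewer rows than requested clusters")
    return np.stack(reservoir)


# Two chunks of different lengths; the total length is never queried.
chunks = [np.arange(6).reshape(3, 2), np.arange(6, 10).reshape(2, 2)]
centers = reservoir_sample_rows(chunks, k=2)
```

This walks the data sequentially, so it trades parallelism for not needing lengths; a per-chunk variant would parallelize better.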

The default init='k-means||' may take a bit more effort, but at a glance it may be doable.

I think that once we get past the initialization and into Lloyd's algorithm, things will be fine.
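To illustrate why Lloyd's algorithm should not need lengths, here is a rough sketch of a single iteration that only accumulates per-chunk sums and counts. `lloyd_step` is a hypothetical name, NumPy arrays stand in for dask blocks, and this is not dask-ml's actual implementation.

```python
import numpy as np


def lloyd_step(chunks, centers):
    """One Lloyd iteration: assign each row to its nearest center, then
    recompute centers from per-cluster sums and counts. Only per-chunk
    reductions are used, so the total number of rows is never needed."""
    centers = np.asarray(centers, dtype=float)
    k, d = centers.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=np.int64)
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        # Squared Euclidean distance from every row to every center.
        dist = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        np.add.at(sums, labels, chunk)
        counts += np.bincount(labels, minlength=k)
    new_centers = centers.copy()
    nonempty = counts > 0  # leave centers of empty clusters unchanged
    new_centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centers


chunks = [
    np.array([[0.0, 0.0], [0.0, 2.0]]),
    np.array([[10.0, 10.0], [10.0, 12.0]]),
]
new_centers = lloyd_step(chunks, [[1.0, 1.0], [9.0, 9.0]])
```

The sums and counts are associative reductions, which is why they map naturally onto chunked arrays whose sizes are unknown.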

Member Author

@mrocklin mrocklin commented Oct 8, 2018

I would be inclined to select random points in a less uniformly random way. We might take a few points from each chunk. If we wanted to be careful we might take several points from each chunk and the length of that chunk, and then down-select again based on those lengths.
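The two-stage idea above could look roughly like this. It is a sketch under assumptions: `sample_chunk` and `downselect` are hypothetical helpers, NumPy arrays stand in for dask chunks, and weighting each candidate by chunk length divided by the number of points taken is one possible choice, not something settled in this thread.

```python
import numpy as np


def sample_chunk(chunk, m, rng):
    """Stage 1: take up to m rows from one chunk and report its length."""
    chunk = np.asarray(chunk)
    take = min(m, len(chunk))
    idx = rng.choice(len(chunk), size=take, replace=False)
    return chunk[idx], len(chunk)


def downselect(samples_and_lengths, k, rng):
    """Stage 2: pick k candidates, weighting each by how many rows of
    its chunk it stands for (chunk length / number of rows taken)."""
    candidates = np.concatenate([s for s, _ in samples_and_lengths])
    weights = np.concatenate(
        [np.full(len(s), n / len(s)) for s, n in samples_and_lengths]
    )
    if k > len(candidates):
        raise ValueError("not enough candidate rows for k centers")
    idx = rng.choice(
        len(candidates), size=k, replace=False, p=weights / weights.sum()
    )
    return candidates[idx]


rng = np.random.default_rng(0)
chunks = [
    np.arange(20).reshape(10, 2),
    np.arange(20, 26).reshape(3, 2),
    np.arange(26, 66).reshape(20, 2),
]
# Empty chunks are skipped so every sample has at least one row.
stage1 = [sample_chunk(c, m=4, rng=rng) for c in chunks if len(c) > 0]
centers = downselect(stage1, k=3, rng=rng)
```

Stage 1 is independent per chunk, so with dask it could presumably run as one task per block, with only the small samples and lengths brought back for stage 2.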


@ianalis ianalis commented Oct 8, 2018

The workaround I found is to call DataFrame.to_dask_array(lengths=True) for models that require known lengths, as @TomAugspurger mentioned above.

Member Author

@mrocklin mrocklin commented Oct 8, 2018

I'll plan to take a stab at this this afternoon if no one else has started.

mrocklin added a commit to mrocklin/dask-ml that referenced this issue Oct 8, 2018
TomAugspurger added a commit that referenced this issue Oct 8, 2018
* Support dataframes for k-means

Fixes #390