WIP: scalable k-means #1
Conversation
Force-pushed from 66ace53 to ab9c095
Force-pushed from 30dfa90 to d249822
benchmarks/k_means.py (Outdated)

```python
except ImportError:
    pass
else:
    coloredlogs.install()
```
```
mrocklin@carbon:~/workspace/dask-ml$ python benchmarks/k_means_kdd.py
Traceback (most recent call last):
  File "benchmarks/k_means_kdd.py", line 12, in <module>
    import coloredlogs
ModuleNotFoundError: No module named 'coloredlogs'
```
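Presumably the hunk in benchmarks/k_means.py above guards against exactly this failure. A minimal sketch of the complete guarded-import pattern; the `try` block is my reconstruction, since the hunk only shows the `except`/`else` branches:

```python
try:
    # coloredlogs is optional: fall back to plain logging without it.
    import coloredlogs
except ImportError:
    pass
else:
    # Only install colored log formatting when the package is available.
    coloredlogs.install()
```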
benchmarks/k_means_kdd.py (Outdated)

```python
        else:
            logger.info("Uploading to cloud storage")
            upload(local, fs)
        path = "dask-data/kddcup/kdd.parq/"
```
I tried running this with a dask cluster on localhost and ran into errors here
Whoops, missing the s3://.

Some context: the KDD Cup dataset is the largest dataset used in the k-means|| paper, but it isn't that big... I can cluster the entire dataset just fine on my laptop.

I just set up a cluster to benchmark k-means on the airlines dataset.
We shouldn't necessarily need to engage S3 here though, no? I wanted to run this locally just to see the diagnostic dashboard (things look great, by the way) and ran into issues here.
Oh I remember now. I assumed that using the distributed scheduler implied a remote cluster. I'll have to refactor it a bit more then.
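A hypothetical sketch of that refactor, separating "use the distributed scheduler" from "store the data on S3"; the `use_s3` flag and the `get_data_path` helper are my inventions, while `upload`, `logger`, and the S3 path come from the diff quoted below:

```python
import logging

logger = logging.getLogger(__name__)


def get_data_path(args, local, fs=None):
    """Pick the dataset path without assuming a remote cluster.

    Hypothetical helper: ``args.use_s3`` is an assumed flag, and
    ``upload`` is the benchmark script's existing helper.
    """
    if args.use_s3:
        if fs.exists("dask-data"):
            logger.info("Using cached dataset")
        else:
            logger.info("Uploading to cloud storage")
            upload(local, fs)
        # Note the explicit s3:// prefix that the PR was missing.
        return "s3://dask-data/kddcup/kdd.parq/"
    # Local run (even with a localhost distributed scheduler):
    # just read the local file, i.e. path = local.
    return local
```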
Oh, sorry, mixing up my examples. Yes... that should be `path = local`, I think. I'll push an update.
Quoted from Matthew Rocklin's comment above, in benchmarks/k_means_kdd.py:
```diff
+def main(args=None):
+    args = parse_args(args)
+    logger.info("Checking local data")
+    local = split(download())
+
+    if args.scheduler_address:
+        logger.info("Using distributed mode")
+        client = Client(args.scheduler_address)
+        logger.info(client.scheduler_info())
+        fs = s3fs.S3FileSystem()
+        if fs.exists("dask-data"):
+            logger.info("Using cached dataset")
+        else:
+            logger.info("Uploading to cloud storage")
+            upload(local, fs)
+        path = "dask-data/kddcup/kdd.parq/"
```
Implements k-means||, a scalable initialization scheme for k-means and an alternative to k-means++.
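For context, a rough sketch of the k-means|| idea from Bahmani et al. (2012): rather than k-means++'s k strictly sequential sampling passes, oversample roughly `l` candidate centers per round for a few rounds, then reduce the weighted candidates down to k centers. This illustration is mine, not the code in this PR, and the final reduction uses weighted sampling as a stand-in for the paper's recluster-with-k-means++ step:

```python
import numpy as np


def kmeans_parallel_init(X, k, l=None, n_rounds=5, rng=None):
    """Sketch of k-means|| initialization (Bahmani et al., 2012)."""
    rng = np.random.default_rng(rng)
    l = l if l is not None else 2 * k  # oversampling factor; ~2k is common

    # Start from one uniformly random point.
    centers = X[rng.integers(len(X))][None, :]

    for _ in range(n_rounds):
        # Squared distance from every point to its nearest current center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        # Sample each point independently with probability ~ l * d2 / cost.
        probs = np.minimum(1.0, l * d2 / d2.sum())
        new = X[rng.random(len(X)) < probs]
        if len(new):
            centers = np.concatenate([centers, new])

    # Weight each candidate by the number of points closest to it, then
    # draw k candidates (weighted sampling as a stand-in for the paper's
    # recluster-the-candidates-with-k-means++ step).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    weights = np.bincount(d2.argmin(1), minlength=len(centers)).astype(float)
    idx = rng.choice(len(centers), size=k, replace=False,
                     p=weights / weights.sum())
    return centers[idx]
```

Something like `init = kmeans_parallel_init(X, k=8)` would then seed an ordinary Lloyd's iteration; the per-round sampling is embarrassingly parallel, which is what makes the scheme attractive for Dask.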