WIP: scalable k-means #1
Conversation
Force-pushed from 66ace53 to ab9c095
Force-pushed from 30dfa90 to d249822
benchmarks/k_means.py (Outdated)

```python
except ImportError:
    pass
else:
    coloredlogs.install()
```
```
mrocklin@carbon:~/workspace/dask-ml$ python benchmarks/k_means_kdd.py
Traceback (most recent call last):
  File "benchmarks/k_means_kdd.py", line 12, in <module>
    import coloredlogs
ModuleNotFoundError: No module named 'coloredlogs'
```
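Presumably the hunk in benchmarks/k_means.py above guards against exactly this failure. A minimal sketch of the complete guarded-import pattern; the `try` block is my reconstruction, since the hunk only shows the `except`/`else` branches:

```python
try:
    # coloredlogs is optional: fall back to plain logging without it.
    import coloredlogs
except ImportError:
    pass
else:
    # Only install colored log formatting when the package is available.
    coloredlogs.install()
```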
benchmarks/k_means_kdd.py (Outdated)

```python
        else:
            logger.info("Uploading to cloud storage")
            upload(local, fs)
        path = "dask-data/kddcup/kdd.parq/"
```
I tried running this with a dask cluster on localhost and ran into errors here
Whoops, missing the s3://.

Some context: the KDD Cup dataset is the largest dataset used in the k-means|| paper, but it isn't that big... I can cluster the entire dataset just fine on my laptop.

I just set up a cluster to benchmark k-means on the airlines dataset.
We shouldn't necessarily need to engage S3 here though, no? I wanted to run this locally just to see the diagnostic dashboard (things look great, by the way) and ran into issues here.
Oh I remember now. I assumed that using the distributed scheduler implied a remote cluster. I'll have to refactor it a bit more then.
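A hypothetical sketch of that refactor, separating "use the distributed scheduler" from "store the data on S3"; the `use_s3` flag and the `get_data_path` helper are my inventions, while `upload`, `logger`, and the S3 path come from the diff quoted below:

```python
import logging

logger = logging.getLogger(__name__)


def get_data_path(args, local, fs=None):
    """Pick the dataset path without assuming a remote cluster.

    Hypothetical helper: ``args.use_s3`` is an assumed flag, and
    ``upload`` is the benchmark script's existing helper.
    """
    if args.use_s3:
        if fs.exists("dask-data"):
            logger.info("Using cached dataset")
        else:
            logger.info("Uploading to cloud storage")
            upload(local, fs)
        # Note the explicit s3:// prefix that the PR was missing.
        return "s3://dask-data/kddcup/kdd.parq/"
    # Local run (even with a localhost distributed scheduler):
    # just read the local file, i.e. path = local.
    return local
```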
Oh, sorry, mixing up my examples. Yes... that should be `path = local`, I think. I'll push an update.
Quoted from Matthew Rocklin's comment above, in benchmarks/k_means_kdd.py:
```diff
+def main(args=None):
+    args = parse_args(args)
+    logger.info("Checking local data")
+    local = split(download())
+
+    if args.scheduler_address:
+        logger.info("Using distributed mode")
+        client = Client(args.scheduler_address)
+        logger.info(client.scheduler_info())
+        fs = s3fs.S3FileSystem()
+        if fs.exists("dask-data"):
+            logger.info("Using cached dataset")
+        else:
+            logger.info("Uploading to cloud storage")
+            upload(local, fs)
+        path = "dask-data/kddcup/kdd.parq/"
```
Implements k-means||, a scalable initialization scheme for k-means and an alternative to k-means++.
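For context, a rough sketch of the k-means|| idea from Bahmani et al. (2012): rather than k-means++'s k strictly sequential sampling passes, oversample roughly `l` candidate centers per round for a few rounds, then reduce the weighted candidates down to k centers. This illustration is mine, not the code in this PR, and the final reduction uses weighted sampling as a stand-in for the paper's recluster-with-k-means++ step:

```python
import numpy as np


def kmeans_parallel_init(X, k, l=None, n_rounds=5, rng=None):
    """Sketch of k-means|| initialization (Bahmani et al., 2012)."""
    rng = np.random.default_rng(rng)
    l = l if l is not None else 2 * k  # oversampling factor; ~2k is common

    # Start from one uniformly random point.
    centers = X[rng.integers(len(X))][None, :]

    for _ in range(n_rounds):
        # Squared distance from every point to its nearest current center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        # Sample each point independently with probability ~ l * d2 / cost.
        probs = np.minimum(1.0, l * d2 / d2.sum())
        new = X[rng.random(len(X)) < probs]
        if len(new):
            centers = np.concatenate([centers, new])

    # Weight each candidate by the number of points closest to it, then
    # draw k candidates (weighted sampling as a stand-in for the paper's
    # recluster-the-candidates-with-k-means++ step).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    weights = np.bincount(d2.argmin(1), minlength=len(centers)).astype(float)
    idx = rng.choice(len(centers), size=k, replace=False,
                     p=weights / weights.sum())
    return centers[idx]
```

Something like `init = kmeans_parallel_init(X, k=8)` would then seed an ordinary Lloyd's iteration; the per-round sampling is embarrassingly parallel, which is what makes the scheme attractive for Dask.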