multithreading center init #4

Open
ghamerly opened this issue May 10, 2018 · 2 comments

Comments

@ghamerly
Owner

Migrated from BaylorCS/baylorml#2 (@BenjaminHorn)

I have a fairly large dataset (100M × 10), and by my estimate it would take around 8 hours to initialize the centers with init_centers_kmeanspp_v2. After some testing I realized that:

- only one core does the work
- most of the time is spent in this loop: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187

I have to admit I don't know much about multithreaded programming, but I think the loop could be split across a number of threads so that it runs in parallel, e.g. with a helper such as:

double sumDistribution(int from, int to, Dataset const &x, pair<double, int> *dist2)
{
    double sum_distribution = 0.0;
    // here comes the loop over points in [from, to)
    return sum_distribution;
}

But those parallel functions would all have to read from the same dist2 array and from x. Maybe that is why one iteration of the center loop takes 5-6 s and cannot simply be run in parallel to speed it up.
Before I start digging into the topic, I just wanted to ask for your opinion.

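To illustrate the idea (only a rough sketch with placeholder types, not the actual baylorml code): the point range could be split across threads that each compute a partial sum over the read-only dist2 array and write it into their own slot, so no locking is needed.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Placeholder for the per-point data: dist2[i].first is assumed to hold the
// squared distance of point i to its closest already-chosen center.
static double sum_range(const std::vector<std::pair<double, int>> &dist2,
                        std::size_t from, std::size_t to)
{
    double sum = 0.0;
    for (std::size_t i = from; i < to; ++i)
        sum += dist2[i].first;
    return sum;
}

double parallel_sum_distribution(const std::vector<std::pair<double, int>> &dist2)
{
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = (dist2.size() + nthreads - 1) / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t from = t * chunk;
        std::size_t to = std::min(dist2.size(), from + chunk);
        if (from >= to)
            break;
        // every thread only reads dist2 and writes its own partial[t]
        workers.emplace_back([&dist2, &partial, t, from, to] {
            partial[t] = sum_range(dist2, from, to);
        });
    }
    for (auto &w : workers)
        w.join();

    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

Reading the same x and dist2 from several threads at once is fine as long as nothing writes to them during that phase; only the writes (updating dist2 with the new minimum) require each thread to stick to its own index range.
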
One other thing:
why is https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L198 necessary?

            if (dist2[i].first > max_dist) {
                max_dist = dist2[i].first;
            }

As far as I can see, max_dist isn't used anywhere.

@ghamerly
Owner Author

The k-means++ initialization is inherently serial, which is why you see only one core doing the work. At the very least, each major stage of the algorithm requires the previous stages to complete (each iteration of https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L183).

However, the code inside that outer loop can be parallelized, especially the loop you mention at https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L187.
It would not take too much work to do this. I probably won't have time in the next few weeks to do it, but if you want to send a pull request I'd be happy to look at it and incorporate it.

One thing I have tried before that works well, but is not in this code, is to apply the triangle inequality to the initialization. It avoids distance computations on subsequent iterations and makes it much faster for large k. I may have that code lying around somewhere.

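For reference, a rough sketch of the idea (not the code I mentioned, just a reconstruction with placeholder point/center types): keep, for every point, its closest chosen center and the exact distance to it. When a new center is drawn, first compute its distance to each existing center; if the new center is at least twice as far from a point's current closest center as the point itself is, the triangle inequality guarantees the new center cannot be closer, so the point-to-center distance never needs to be computed.

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative placeholders: a point/center is a plain vector<double>.
// best[i] is the index of point i's closest chosen center, and
// ubound[i] its exact (Euclidean, not squared) distance to that center.
static double dist(const std::vector<double> &a, const std::vector<double> &b)
{
    double s = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        double diff = a[d] - b[d];
        s += diff * diff;
    }
    return std::sqrt(s);
}

void add_center(const std::vector<std::vector<double>> &points,
                std::vector<std::vector<double>> &centers,   // centers chosen so far
                const std::vector<double> &c_new,
                std::vector<std::size_t> &best,
                std::vector<double> &ubound)
{
    // distances from the new center to all existing centers: only k of them
    std::vector<double> center_dist(centers.size());
    for (std::size_t j = 0; j < centers.size(); ++j)
        center_dist[j] = dist(centers[j], c_new);

    for (std::size_t i = 0; i < points.size(); ++i) {
        // d(x, c_new) >= d(best, c_new) - d(x, best), so if
        // d(best, c_new) >= 2 * d(x, best) the new center cannot be closer.
        if (!centers.empty() && center_dist[best[i]] >= 2.0 * ubound[i])
            continue;
        double d = dist(points[i], c_new);
        if (centers.empty() || d < ubound[i]) {
            ubound[i] = d;
            best[i] = centers.size();  // index the new center will get
        }
    }
    centers.push_back(c_new);
}

(The D² sampling weights are squared distances, so ubound[i] would still be squared before drawing the next center; the skip test itself has to use plain Euclidean distances for the triangle inequality to hold.)
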
Alternatively, there are other methods based on k-means++ that supposedly parallelize better; see http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

As for the max_dist variable, I think you're right that it's unused. It's probably left over from some older profiling code.

@ghamerly
Owner Author

From BaylorCS/baylorml#2 (comment) (@kno10)

It may well be feasible to use random initialization.
In my experiments, k-means++ only marginally improved runtime; it may, however, yield better clusters.
This depends a lot on the data, but 8 hours of initialization time may not be worth the improvement.

A simpler, yet effective, strategy for huge data is to initialize using k-means on a sample (if I'm not mistaken, k-means|| is similar to running k-means++ on a distributed sample?).
That is: randomly draw 1 million points (or even just 100k), run k-means++, and do just 3–5 iterations. Take the resulting centers as initial centers for your complete data set. Finding these centers should take maybe 10 minutes and will often yield better initial centers than plain k-means++? These centers may already be very good estimates.

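A very rough sketch of the sampling step (only the sampling is concrete here; gather_rows, kmeanspp_init and run_lloyd in the usage comment are hypothetical stand-ins for whatever routines the implementation already provides, not actual baylorml functions):

#include <cstddef>
#include <random>
#include <vector>

// Draw a roughly fraction-sized random subset of row indices. Bernoulli sampling
// keeps memory proportional to the sample, which matters for 100M rows.
std::vector<std::size_t> draw_sample(std::size_t n_rows, double fraction, unsigned seed)
{
    std::mt19937_64 rng(seed);
    std::bernoulli_distribution keep(fraction);
    std::vector<std::size_t> sample;
    for (std::size_t i = 0; i < n_rows; ++i)
        if (keep(rng))
            sample.push_back(i);
    return sample;
}

// Usage outline (the helpers below are hypothetical, shown only for the flow):
//   auto idx     = draw_sample(100000000, 0.01, 0);   // ~1% of the data
//   auto subset  = gather_rows(data, idx);            // hypothetical
//   auto centers = kmeanspp_init(subset, k);          // hypothetical
//   run_lloyd(subset, centers, /*max_iter=*/5);       // hypothetical
//   // then use `centers` as the initial centers for the full data set
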
Here are some numbers from ELKI (not baylorml), on a data set of 38 million 2d coordinates with k=100, run on a single core and not on a dedicated benchmarking host. For more reliable conclusions this would need to be run on many data sets, with different values of k and many more seeds, on dedicated systems.

SortMeans with random initialization, seed 0: 420 s, 41 iterations, varsum 4.471356921340852E8
SortMeans with random initialization, seed 1: 1000 s, 102 iterations, varsum 3.494018158732119E8
SortMeans with random initialization, seed 2: 570 s, 47 iterations, varsum 3.690071545123777E8
SortMeans with k-means++ initialization, seed 0: 810 s, 77 iterations, varsum 3.7151307775374913E8
SortMeans with k-means++ initialization, seed 1: 774 s, 69 iterations, varsum 3.8855969067854965E8
SortMeans with k-means++ initialization, seed 2: 492 s, 39 iterations, varsum 3.374139162253591E8

So the main benefit of k-means++ is that it is less likely to get stuck in a bad local minimum.
If we now initialize via k-means++ on a sample, we should get good results, hopefully with fewer iterations.

SortMeans initialized with 5it,km++,1%, seed 0: 624 s, 55 iter, varsum 3.538176164640555E8
SortMeans initialized with 5it,km++,1%, seed 1: 850 s, 85 iter, varsum 4.0328181944224614E8
SortMeans initialized with 5it,km++,1%, seed 2: 593 s, 56 iter, varsum 3.590163352246089E8

And, unfortunately, reality is much more complex... The performance is not bad, but also not substantially better. We would need much larger experiments to even be able to tell what is "usually" better. Overall, a lucky or unlucky random seed seems to be the largest factor.

The parameters used for the "sample k-means" combination in ELKI:

-time
-algorithm clustering.kmeans.KMeansSort -kmeans.k 100
-kmeans.initialization SampleKMeansInitialization -kmeans.seed 0
-kmeans.algorithm KMeansSort
-kmeans.initialization KMeansPlusPlusInitialMeans -kmeans.seed 0
-kmeans.maxiter 5 -kmeans.samplesize 0.01
-evaluator clustering.LogClusterSizes -resulthandler DiscardResultHandler

If you want to experiment with different seeds, I have contributed a seed command in 27d970511f46321012674498d99538c37f6db651.
To reproduce this sampling-based initialization in this C++ implementation, you would need to further extend the driver, e.g. with a function that uses precomputed centers for initialization.
