
[SPARK-2308][MLLIB] Add Mini-Batch KMeans Clustering method #1248

Closed · wants to merge 14 commits into from

Conversation

rnowling
Contributor

Mini-batch is a variant of KMeans that uses a randomly sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and, in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib.

This PR adds the KMeansMiniBatch clustering algorithm, tests, and updates docs.

Discussed in SPARK-2308
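For background, the core mini-batch update can be sketched as follows. This is a hypothetical, standalone illustration of the general technique (Sculley's mini-batch k-means update rule), not the code in this PR:

```scala
// Illustrative sketch of one mini-batch k-means step (not the PR's code).
// Each sampled point pulls its nearest center toward itself; the
// per-center learning rate decays as the center's assignment count
// grows, so centers stabilize over time.
object MiniBatchSketch {
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def miniBatchStep(
      batch: Array[Array[Double]],
      centers: Array[Array[Double]],
      counts: Array[Long]): Unit = {
    for (point <- batch) {
      // Assign the point to its nearest center.
      val j = centers.indices.minBy(i => squaredDistance(centers(i), point))
      counts(j) += 1
      // Decaying per-center learning rate.
      val eta = 1.0 / counts(j)
      // Move the center a fraction eta of the way toward the point.
      for (d <- point.indices) {
        centers(j)(d) = (1.0 - eta) * centers(j)(d) + eta * point(d)
      }
    }
  }
}
```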

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Jun 27, 2014

Broad question -- this seems to duplicate a lot of KMeans.scala. Can it not be a variant rather than a separate implementation? Or at least refactor the substantial commonality?

@rnowling
Contributor Author

The main function (runBreeze) of KMeans is not compatible, since KMeans optimizes multiple runs by striping iterations across the runs. With MiniBatch, each run's iteration uses a different randomly sampled subset of points, so the runs have to be executed independently.

I can pull out other shared functions, though.
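Schematically, the structural difference is the following (a hypothetical sketch of the control flow, not actual MLlib code):

```scala
// Schematic only: why mini-batch runs can't share striped iterations.
object RunStructureSketch extends App {
  val runs = 4
  val maxIterations = 20

  // Standard KMeans: iterations are striped across runs, so a single
  // pass over the full dataset advances every run's centers at once.
  for (iteration <- 0 until maxIterations) {
    // one pass over the data; updates all `runs` sets of centers together
  }

  // Mini-batch: each run draws a fresh random batch every iteration, so
  // passes can't be shared and the runs execute independently.
  for (run <- 0 until runs; iteration <- 0 until maxIterations) {
    // sample a new batch, then update only this run's centers
  }
}
```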

@rnowling
Contributor Author

rnowling commented Jul 2, 2014

Sean,

I updated the code to factor the common bits out into a KMeansCommons file, using traits for both the objects and the classes. I updated the KMeansMiniBatch tests so that they are customized for KMeansMiniBatch, don't duplicate testing of the common code, and account for the algorithm's stochastic nature by comparing errors within an epsilon instead of testing floats for exact equality. I also realized that I had failed to implement a key part of the MiniBatch algorithm; that is now included.

Please review again.
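An epsilon-based comparison of this kind might look like the following sketch (illustrative values only, not the PR's actual test code):

```scala
// Compare floating-point costs within a tolerance rather than exactly,
// since mini-batch output varies with the random sample (illustrative).
object EpsilonAssertSketch extends App {
  val epsilon = 1e-4
  val expectedCost = 2.0
  val actualCost = 2.00002 // e.g. the cost reported by a fitted model
  assert(math.abs(actualCost - expectedCost) < epsilon,
    s"cost $actualCost not within $epsilon of expected $expectedCost")
}
```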

```scala
// Execute iterations of Lloyd's algorithm until all runs have converged
while (iteration < maxIterations) {

  val sampledPoints = data.sample(false, batchSize)
```
@dorx
Contributor

sample actually takes a Double as its second argument: the sampling rate, between 0.0 and 1.0. I think you want takeSample here in order to get the exact batch size (but be warned that takeSample collects the sample to the driver). Alternatively, you could compute the sampling rate and use sample, accepting an approximate sample size (this is only free if the size of the RDD is already known; otherwise you need a count, which requires a pass over the entire RDD).
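The two options described above can be sketched like this (illustrative helper; `data` and `batchSize` are assumed from the PR's context):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object BatchSamplingSketch {
  def drawBatch(data: RDD[Vector], batchSize: Int): Unit = {
    // Option 1: exact batch size, but the sample is collected to the driver.
    val exact: Array[Vector] =
      data.takeSample(withReplacement = false, num = batchSize)

    // Option 2: approximate batch size, stays distributed as an RDD.
    // Note: count() is a full pass over the RDD unless its size is
    // already known (e.g. the RDD is cached and previously counted).
    val fraction = math.min(1.0, batchSize.toDouble / data.count())
    val approx: RDD[Vector] =
      data.sample(withReplacement = false, fraction = fraction)
  }
}
```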

@rnowling
Contributor Author

Thanks, dorx! I'm surprised this didn't result in a compilation or run-time error. I'll update the code.

@rnowling
Contributor Author

MiniBatch KMeans needs a sampling method that runs in O(k) time, where k is the number of data points to sample. Spark's current sampling methods run in O(n) time, where n is the number of data points in the RDD. I'm closing this PR until a better sampling method is available.
