SPARK-4156 [MLLIB] EM algorithm for GMMs #3022

tgaloppo · 2014-10-30T19:00:49Z

Implementation of Expectation-Maximization for Gaussian Mixture Models.

This is my maiden contribution to Apache Spark, so I apologize now if I have done anything incorrectly; having said that, this work is my own, and I offer it to the project under the project's open source license.

AmplabJenkins · 2014-10-30T19:02:11Z

Can one of the admins verify this patch?

rxin · 2014-11-01T08:15:26Z

Jenkins, test this please.

manishamde · 2014-11-01T08:25:02Z

@tgaloppo Thanks for the PR and congratulations on the first contribution. Apologies for the lack of feedback thus far -- I guess everyone is busy with the 1.2 release deadline on Nov 1. I will take a look at the PR in the next few days.

Please make sure you get the JIRA assigned to yourself next time before working. It's the only way to avoid duplicate work.

cc: @jkbradley, @mengxr

AmplabJenkins · 2014-11-01T08:32:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22688/

tgaloppo · 2014-11-08T12:53:55Z

This test appeared to fail due to some form of timeout during the pull; is there any action I need to take?

SparkQA · 2014-11-09T18:28:22Z

Test build #514 has started for PR 3022 at commit c15405c.

This patch does not merge cleanly.

SparkQA · 2014-11-09T20:01:26Z

Test build #514 has finished for PR 3022 at commit c15405c.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- class GaussianMixtureModel(val w: Array[Double], val mu: Array[Vector], val sigma: Array[Matrix])

squito · 2014-11-10T01:24:29Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala

+  /** Sum the values in array of doubles */
+  private def sum(x : Array[Double]) : Double = {
+    var s : Double = 0.0
+    x.foreach(u => s += u)


You might not care about this at all, but calling foreach on an Array is notably slower than using a while loop over the indices. foreach over a Range is pretty close to a while loop (i.e., (0 until x.length).foreach{idx => s += x(idx)}). Or, if you don't care about runtimes, you can always just call array.sum (it actually comes from an implicit conversion to WrappedArray):

scala> ((0 to 100).map{_ / 100.0}.toArray).sum
res2: Double = 50.5
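
The three alternatives mentioned in this comment can be put side by side. This is a minimal sketch: the object and method names are illustrative and not taken from the patch, and it makes no timing claims of its own.

```scala
object SumVariants {
  /** Index-based while loop, the baseline the comment compares against. */
  def whileSum(x: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < x.length) {
      s += x(i)
      i += 1
    }
    s
  }

  /** foreach over a Range of indices rather than over the Array itself. */
  def rangeSum(x: Array[Double]): Double = {
    var s = 0.0
    (0 until x.length).foreach { idx => s += x(idx) }
    s
  }

  /** Built-in sum, available on Array through an implicit conversion. */
  def builtinSum(x: Array[Double]): Double = x.sum
}
```

All three add the elements in index order, so they return the same value for the same input; they differ only in how the iteration is expressed.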

tgaloppo · 2014-11-10T13:51:55Z

Please advise how to resolve merge issues.

tgaloppo · 2014-11-13T17:15:14Z

Thanks, @squito ... while I expect the array to only have a few elements, I have made changes according to your advice.

tgaloppo · 2014-11-18T00:57:11Z

Merged with the latest master branch to hopefully fix any merge issues.
Updated scala test suite to use new MLlibSparkTestContext
Improved cluster initialization strategy to average several samples per cluster.
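
One way to read "average several samples per cluster" is sketched below. It is an assumption about the strategy, not the code in the patch: the names mean and initialMeans, the nSamples parameter, and the use of plain arrays instead of Breeze vectors are all hypothetical.

```scala
object InitSketch {
  /** Element-wise mean of a non-empty group of equal-length points. */
  def mean(points: Seq[Array[Double]]): Array[Double] = {
    val d = points.head.length
    val acc = new Array[Double](d)
    for (p <- points; j <- 0 until d) acc(j) += p(j)
    acc.map(_ / points.length)
  }

  /**
   * Split the first k * nSamples points into k consecutive groups and
   * use each group's mean as the initial mean of one cluster.
   */
  def initialMeans(
      samples: Seq[Array[Double]],
      k: Int,
      nSamples: Int): Seq[Array[Double]] = {
    require(samples.length >= k * nSamples, "not enough samples")
    samples.take(k * nSamples).grouped(nSamples).map(mean).toVector
  }
}
```

Averaging a handful of points makes the starting mean less sensitive to a single outlying sample than picking one point per cluster would be.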

jkbradley · 2014-12-11T03:09:43Z

examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala

+package org.apache.spark.examples.mllib
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.mllib.clustering.GaussianMixtureModel


no need for this import

mengxr · 2014-12-19T07:38:11Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala

+    val mu = vectorMean(x)
+    val ss = BreezeVector.zeros[Double](x(0).length)
+    val cov = BreezeMatrix.eye[Double](ss.length)
+    x.map(xi => (xi - mu) :^ 2.0).foreach(u => ss += u)


breeze has squaredDistance.

tgaloppo · 2014-12-19

squaredDistance returns a scalar... I want the squared entry values.
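
The distinction in this exchange can be illustrated without Breeze. In the sketch below (plain arrays; both helper names are hypothetical and only mirror the idea, not the Breeze API), squaredDistance collapses the difference to a single number, while squaredEntries keeps one squared deviation per coordinate, which is what the ss accumulator in the quoted snippet collects.

```scala
object SquaredDeviations {
  /** Scalar: sum over coordinates j of (a(j) - b(j))^2. */
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length)
    var s = 0.0
    var j = 0
    while (j < a.length) {
      val d = a(j) - b(j)
      s += d * d
      j += 1
    }
    s
  }

  /** Vector: (a(j) - b(j))^2 for each coordinate j, kept separate. */
  def squaredEntries(a: Array[Double], b: Array[Double]): Array[Double] = {
    require(a.length == b.length)
    Array.tabulate(a.length) { j =>
      val d = a(j) - b(j)
      d * d
    }
  }
}
```

The scalar is the sum of the per-coordinate values, so the per-coordinate form is the one needed when each feature gets its own variance estimate.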

tgaloppo · 2014-12-20T00:41:02Z

I've performed most of the requested changes. I do not see the BLAS function mentioned (dsyr), so I left this as a TODO. Also, I could not find EPSILON in MLUtils.

I left predictMembership public and changed predict to predictLabels, providing soft and hard label assignments, respectively. I know there are some other thoughts around improving these, but I am not clear on what I should do.

cc: @mengxr @jkbradley

FlytxtRnD · 2014-12-22T09:16:02Z

Sorry for the late reply. predictLabels() and predictMembership() look fine. But what about moving computeSoftAssignments() to the GaussianMixtureModelEM class? (In KMeans, findClosest() is defined in KMeans rather than in KMeansModel.)

It would be good if the class GaussianMixtureModelEM were renamed as @mengxr suggested.

FlytxtRnD · 2014-12-22T11:51:42Z

examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala

+    }
+  }
+
+  private def run(inputFile: String, k: Int, convergenceTol: Double) {


Can we take maxIterations as an optional input parameter?
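
A minimal sketch of how an optional trailing argument could be handled. The argument order follows the run signature quoted above; the Params case class, the parse helper, and the fallback of 100 iterations are assumptions for illustration, not the code that was committed.

```scala
object DenseGmmArgs {
  case class Params(inputFile: String, k: Int, convergenceTol: Double, maxIterations: Int)

  // Assumed fallback used when the fourth argument is omitted.
  val DefaultMaxIterations = 100

  /** Returns None when the argument count is wrong, so the caller can print usage. */
  def parse(args: Array[String]): Option[Params] = args match {
    case Array(input, k, tol) =>
      Some(Params(input, k.toInt, tol.toDouble, DefaultMaxIterations))
    case Array(input, k, tol, maxIter) =>
      Some(Params(input, k.toInt, tol.toDouble, maxIter.toInt))
    case _ => None
  }
}
```

Matching on the array length keeps the three-argument invocation working unchanged while accepting a fourth value when it is supplied.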

jkbradley · 2014-12-22T19:43:03Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala

+
+  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
+  def this() = this(2, 0.01, 100)
+


Remove extra newlines

jkbradley · 2014-12-22T19:43:36Z

@tgaloppo MLUtils.EPSILON is actually private[util]. I think it would be fine to change it to be private[mllib]. CC: @mengxr

@tgaloppo I strongly recommend predict() instead of predictLabels() to be consistent with KMeansModel.

@FlytxtRnD computeSoftAssignments() is a function of the model, not the learning algorithm, so I think it belongs in the model. IMO, findClosest() should be in KMeansModel instead of KMeans, but that should be fixed in another PR. (It is not too important though since it is a private[mllib] API.)
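
To make the soft versus hard distinction concrete, here is a sketch of both steps on plain arrays. It assumes the per-component density values for a single point have already been computed; the signatures are simplified and hypothetical, and only the names computeSoftAssignments and predict come from the discussion.

```scala
object AssignmentSketch {
  /**
   * Soft assignment: responsibility of each component for one point,
   * i.e. weights(i) * pdfs(i) normalised so the entries sum to 1.
   */
  def computeSoftAssignments(weights: Array[Double], pdfs: Array[Double]): Array[Double] = {
    require(weights.length == pdfs.length)
    val p = Array.tabulate(weights.length)(i => weights(i) * pdfs(i))
    val total = p.sum
    p.map(_ / total)
  }

  /** Hard assignment: index of the component with the largest responsibility. */
  def predict(weights: Array[Double], pdfs: Array[Double]): Int = {
    val r = computeSoftAssignments(weights, pdfs)
    r.indexOf(r.max)
  }
}
```

Both functions need only the model's parameters and the point, which is the sense in which the computation is a function of the model rather than of the learning algorithm. The sketch divides by total without a guard, so it yields NaN if every density is zero.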

tgaloppo · 2014-12-22T20:26:14Z

Ok. I changed the privacy of EPSILON and am now using it in this code.
I changed the name from GaussianMixtureModelEM to GaussianMixtureEM.
I've changed predictLabels() back to predict().

SparkQA · 2014-12-29T20:30:58Z

Test build #555 has started for PR 3022 at commit aaa8f25.

This patch merges cleanly.

SparkQA · 2014-12-29T21:56:05Z

Test build #555 has finished for PR 3022 at commit aaa8f25.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class GaussianMixtureModel(

jkbradley · 2014-12-29T22:53:56Z

@tgaloppo Thanks for the updates, and thanks for all of your work in getting this ready!

LGTM

CC: @mengxr

After this is merged, I'll make some JIRAs for the various items we've discussed along the way + a few more. Let me know if I've missed anything here:

Add parameters: seed, maxIterations
Use sparse vectors more efficiently
If numFeatures or k are large, distribute matrix inverses for Gaussian initialization.
Breeze pinv fails when the matrix is singular (scalanlp/breeze#304, "MatrixSingularException when column is 0 all at pinv"). Do SVD instead.
Make MultivariateGaussian public, and update GMM API
Check for NaNs:
- in computeSoftAssignments (if all pdfs = 0)
- in values when constructing a GMM
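
The first NaN case can be reproduced directly: when every pdf value is 0, normalising divides 0.0 by 0.0. The sketch below shows the failure and one possible guard, padding each term with a machine-epsilon value. The guard is an assumption about how such a fix might look, not what was merged, and the epsilon loop is a common way to compute machine epsilon rather than a quotation of MLUtils.EPSILON.

```scala
object NaNGuardSketch {
  /** Smallest power of two eps such that 1.0 + eps != 1.0 in double arithmetic. */
  val Epsilon: Double = {
    var eps = 1.0
    while ((1.0 + (eps / 2.0)) != 1.0) eps /= 2.0
    eps
  }

  /** Unguarded normalisation: yields NaN entries when all inputs are 0. */
  def normalize(p: Array[Double]): Array[Double] = {
    val total = p.sum
    p.map(_ / total)
  }

  /** Guarded variant: pad every term with Epsilon so the total is never 0. */
  def normalizeGuarded(p: Array[Double]): Array[Double] = {
    val padded = p.map(_ + Epsilon)
    val total = padded.sum
    padded.map(_ / total)
  }
}
```

With the padding, a point that is far from every component falls back to a uniform assignment instead of propagating NaN into later iterations.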

tgaloppo · 2014-12-29T23:10:11Z

@jkbradley Thank you for your help and feedback along the way. Please assign some (or all) of those tickets to me and I will continue to improve the implementation. In particular, you mentioned that there are a number of PRs with code for common distributions... I would be happy to help formalize a common interface and make these a public part of the library.

mengxr · 2014-12-29T23:30:36Z

@tgaloppo I've merged this into master. Thanks for contributing GMM!

FlytxtRnD · 2014-12-30T07:06:55Z

@tgaloppo Good Work
@mengxr Thanks for giving us a chance to be a part of this contribution

jkbradley · 2014-12-30T21:04:31Z

@tgaloppo @FlytxtRnD I made some JIRAs for the to-do items above.

I'd say the most important are:

Change predictMembership() to take an RDD, not the GMM.
- I did not notice that it took all of the GMM parameters. It should be renamed and made internal, and a wrapper method predictMembership() should take an RDD only.
Make MultivariateGaussian public
Update GMM API to use MultivariateGaussian instead of means, covariances
(The Python API and user guide JIRAs from @mengxr should also be in this list.)

It would be great to do:

SVD for Gaussian initialization

Some less critical ones are:

I removed the NAN JIRAs, but we should investigate numerical stability at some point.

Please let me know if you'd like any assigned to you, and thanks in advance for your work on this! If I'm able to work on one of the JIRAs, I'll make a note on the JIRA page.

tgaloppo · 2014-12-30T21:44:59Z

@jkbradley Please assign 5017, 5018, 5019, and 5020 to me. Regarding 5018, can you refer me to other PRs that are bringing in common distributions? I can work toward formalizing an API to make all of them public.

I also indicated that I would be happy to provide the Python wrappers for the algorithm (ticket 5012); @FlytxtRnD had provided an initial Python implementation of the algorithm... if they would like to provide the wrappers instead, that would be cool (but I am still definitely happy to do it if not).

CC: @mengxr

jkbradley · 2014-12-31T00:39:00Z

@tgaloppo It's ideal if we assign & fix one JIRA at a time (as separate PRs). Can I start by assigning one of your choosing?

For 5018, there is only one other such PR I know of, and it uses a Dirichlet distribution. But for API examples, I would recommend checking out popular libraries, such as R, Matlab, numpy, etc.

tgaloppo · 2014-12-31T01:01:19Z

@jkbradley No problem. Let's start with 5020, and I'll move on from there.

tgaloppo · 2014-12-31T23:52:03Z

@jkbradley Please assign me SPARK-5017, and I will take care of this in preparation for 5018 and 5019.

mengxr · 2015-01-01T00:18:37Z

Done :)

SPARK-4156

c15405c

squito reviewed Nov 10, 2014

Travis Galoppo and others added 3 commits November 11, 2014 18:30

Merge remote-tracking branch 'upstream/master'

5c96c57

Made GaussianMixtureModel class serializable

c1a8e16

Modified sum function for better performance

Added scala test suite with basic test

719d8cc

tgaloppo added 2 commits November 17, 2014 17:46

Merge remote-tracking branch 'upstream/master'

86fb382

Merged with master branch; update test suite with latest context changes. Improved cluster initialization strategy.

e6ea805

tgaloppo added 2 commits December 3, 2014 10:14

Fixed to no longer ignore delta value provided on command line

676e523

Added additional train() method to companion object for cluster count and tolerance parameters. Modified cluster initialization strategy to use an initial covariance matrix derived from the sample points used to initialize the mean.

8aaa17d

jkbradley reviewed Dec 11, 2014

mengxr reviewed Dec 19, 2014

Style improvements

9b2fc2a

Changed ExpectationSum to a private class

FlytxtRnD reviewed Dec 22, 2014

tgaloppo added 2 commits December 22, 2014 09:26

Fixed parameter comment in GaussianMixtureModel

acf1fba

Made maximum iterations an optional parameter to DenseGmmEM

fixed usage line to include optional maxIterations parameter

709e4bf

jkbradley reviewed Dec 22, 2014

MLUtils: changed privacy of EPSILON from [util] to [mllib]

aaa8f25

GaussianMixtureEM: Renamed from GaussianMixtureModelEM; corrected formatting issues GaussianMixtureModel: Renamed predictLabels() to predict() Others: Modifications based on rename of GaussianMixtureEM

asfgit closed this in 6cf6fdf Dec 29, 2014
