[SPARK-5986][MLLib] Add save/load for k-means #4951

yinxusen · 2015-03-09T14:16:37Z

This PR adds save/load for k-means, as described in SPARK-5986. The Python version will be added in a separate PR.

SparkQA · 2015-03-09T14:17:44Z

Test build #28393 has started for PR 4951 at commit dce7055.

This patch merges cleanly.

SparkQA · 2015-03-09T14:23:03Z

Test build #28394 has started for PR 4951 at commit b144216.

This patch merges cleanly.

SparkQA · 2015-03-09T15:37:30Z

Test build #28393 has finished for PR 4951 at commit dce7055.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable

AmplabJenkins · 2015-03-09T15:37:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28393/

SparkQA · 2015-03-09T15:45:01Z

Test build #28394 has finished for PR 4951 at commit b144216.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable

AmplabJenkins · 2015-03-09T15:45:05Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28394/

mengxr · 2015-03-09T18:40:12Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

+      val metadata = compact(render(
+        ("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("k" -> model.k)))
+      sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
+      val dataRDD = sc.parallelize(model.clusterCenters).map(wrapper.serialize)


We don't need wrapper.serialize for vectors.

We need to store cluster indices. If the centers are saved to more than one partition, we cannot easily load them back in the original order.
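
To illustrate the ordering concern, here is a minimal sketch in plain Scala, with ordinary collections standing in for RDD partitions. It is not the code from this patch: `IndexedCenter`, `tag`, and `restore` are hypothetical names, and `Vector[Double]` from the Scala standard library stands in for the MLlib vector type. The idea is only that tagging each center with its position lets the loader sort by that index, whatever order the rows come back in.

```scala
// Illustrative only: plain Scala collections stand in for an RDD's partitions.
// IndexedCenter is a hypothetical name, not the class used in the patch.
object OrderedCentersSketch {
  final case class IndexedCenter(id: Int, center: Vector[Double])

  // Tag each center with its position before writing it out.
  def tag(centers: Seq[Vector[Double]]): Seq[IndexedCenter] =
    centers.zipWithIndex.map { case (c, i) => IndexedCenter(i, c) }

  // Rebuild the original order regardless of the order in which rows are read.
  def restore(rows: Seq[IndexedCenter]): Seq[Vector[Double]] =
    rows.sortBy(_.id).map(_.center)

  def main(args: Array[String]): Unit = {
    val centers = Seq(Vector(0.0, 0.0), Vector(1.0, 1.0), Vector(2.0, 2.0), Vector(3.0, 3.0))
    // Simulate two partitions being read back in the opposite order.
    val (p0, p1) = tag(centers).splitAt(2)
    val readBack = p1 ++ p0
    assert(readBack.map(_.center) != centers) // without ids, the order is lost
    assert(restore(readBack) == centers)      // sorting by id recovers it
  }
}
```

Without the stored index, the mapping from cluster id to center (and therefore the result of predicting on new points) could change between save and load.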

SparkQA · 2015-03-09T22:32:45Z

Test build #28408 has started for PR 4951 at commit cd390fd.

This patch merges cleanly.

yinxusen · 2015-03-09T22:34:23Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

+    KMeansModel.SaveLoadV1_0.load(sc, path)
+  }
+
+  case class IndexedPoint(id: Int, point: Vector)


How about using Int here to represent indexes? I think there is no need to use Long.

Yes, Int is sufficient. This class should be private. Please also check the scope of other classes.

Make this private. The class could be renamed to case class Cluster(id: Int, center: Vector); IndexedPoint is too general.

SparkQA · 2015-03-09T23:54:44Z

Test build #28408 has finished for PR 4951 at commit cd390fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable
- case class IndexedPoint(id: Int, point: Vector)

AmplabJenkins · 2015-03-09T23:54:48Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28408/

mengxr · 2015-03-10T22:28:11Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala

 import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg._


Be specific about the imports. You only need Vector.

SparkQA · 2015-03-11T01:22:47Z

Test build #28453 has started for PR 4951 at commit 6dd74a0.

This patch merges cleanly.

SparkQA · 2015-03-11T02:48:02Z

Test build #28453 has finished for PR 4951 at commit 6dd74a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable

AmplabJenkins · 2015-03-11T02:48:06Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28453/

mengxr · 2015-03-11T07:30:23Z

LGTM. Merged into master. Thanks!

add save/load for k-means for SPARK-5986

dce7055

remove invalid comments

b144216

mengxr reviewed Mar 9, 2015

add indexed point

cd390fd

yinxusen reviewed Mar 9, 2015

mengxr reviewed Mar 10, 2015

rewrite some functions and classes

6dd74a0

asfgit closed this in 2d4e00e Mar 11, 2015
