Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5986][MLLib] Add save/load for k-means #4951

Closed
wants to merge 4 commits into from

Conversation

yinxusen
Copy link
Contributor

@yinxusen yinxusen commented Mar 9, 2015

This PR adds save/load for K-means as described in SPARK-5986. Python version will be added in another PR.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28393 has started for PR 4951 at commit dce7055.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28394 has started for PR 4951 at commit b144216.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28393 has finished for PR 4951 at commit dce7055.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28393/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28394 has finished for PR 4951 at commit b144216.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28394/
Test PASSed.

val metadata = compact(render(
("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("k" -> model.k)))
sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
val dataRDD = sc.parallelize(model.clusterCenters).map(wrapper.serialize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We don't need wrapper.serialize for vectors.
  • We need to store cluster indices. If the centers are saved to more than one partitions, we cannot easily load them back in the original order.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28408 has started for PR 4951 at commit cd390fd.

  • This patch merges cleanly.

KMeansModel.SaveLoadV1_0.load(sc, path)
}

case class IndexedPoint(id: Int, point: Vector)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using Int here to represent indexes? I think there is no need to use Long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Int is sufficient. This class should be private. Please also check the scope of other classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this private. The class name could be changed to case class Cluster(id: Int, center: Vector). IndexedPoint is too general.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28408 has finished for PR 4951 at commit cd390fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable
    • case class IndexedPoint(id: Int, point: Vector)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28408/
Test PASSed.

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be specific about the imports. You only need Vector.

@SparkQA
Copy link

SparkQA commented Mar 11, 2015

Test build #28453 has started for PR 4951 at commit 6dd74a0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 11, 2015

Test build #28453 has finished for PR 4951 at commit 6dd74a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28453/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Mar 11, 2015

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 2d4e00e Mar 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants