[SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed #3610

Closed
wants to merge 8 commits into from


nxwhite-str

This implements the functionality for SPARK-4749 and provides unit tests in Scala and PySpark.
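
To illustrate why a caller-supplied seed matters for cluster initialization, here is a minimal sketch in plain Scala. It is not the code from this patch: `SeededInitSketch` and `initRandom` are hypothetical names, and the "random" initialization is approximated by shuffling the points with a seeded `scala.util.Random` and taking the first `k`.

```scala
import scala.util.Random

object SeededInitSketch {
  // Hypothetical helper (not Spark's implementation): pick k points as
  // initial centers using a generator seeded with `seed`.
  def initRandom(points: Vector[Array[Double]], k: Int, seed: Long): Vector[Array[Double]] = {
    require(k <= points.length, "k must not exceed the number of points")
    new Random(seed).shuffle(points).take(k)
  }

  def main(args: Array[String]): Unit = {
    val points = Vector.tabulate(10)(i => Array(i.toDouble, (i * i).toDouble))
    val a = initRandom(points, k = 3, seed = 42L)
    val b = initRandom(points, k = 3, seed = 42L)
    // Same seed and same input: the chosen centers are identical.
    assert(a.map(_.toSeq) == b.map(_.toSeq))
    println(a.map(_.mkString("[", ", ", "]")).mkString(" "))
  }
}
```

With a fixed seed, repeated runs start from the same centers, which is what makes a unit test on the resulting clusters deterministic.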

@AmplabJenkins

Can one of the admins verify this patch?

@@ -43,7 +43,8 @@ class KMeans private (
 private var runs: Int,
 private var initializationMode: String,
 private var initializationSteps: Int,
-private var epsilon: Double) extends Serializable with Logging {
+private var epsilon: Double,
+private var seed: Long = System.nanoTime()) extends Serializable with Logging {
Member

Could you set the default in the one public constructor instead since that's where other defaults are set?
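
The suggestion above can be sketched roughly as follows. This is a simplified stand-in, not the real Spark class: `KMeansSketch` is a hypothetical name, the parameter list is shortened to three fields, and the accessor names are chosen only for illustration.

```scala
// Simplified stand-in for the class in the diff; not the real Spark class.
class KMeansSketch private (
    private var k: Int,
    private var epsilon: Double,
    private var seed: Long) {

  // The single public constructor is where every default is set,
  // so the private constructor needs no default arguments.
  def this() = this(2, 1e-4, System.nanoTime())

  def getSeed: Long = seed

  // Builder-style setter so callers can override the default seed.
  def setSeed(seed: Long): this.type = {
    this.seed = seed
    this
  }
}

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val km = new KMeansSketch().setSeed(42L)
    assert(km.getSeed == 42L)
  }
}
```

Keeping all defaults in one place means the documented default values and the actual ones cannot drift apart between two constructors.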

@jkbradley
Member

@nxwhite-str Thanks for the PR! Could you please update the title to start with "[SPARK-4749] [mllib]" to help with automated tagging?

nxwhite-str changed the title from "SPARK-4749: Allow initializing KMeans clusters using a seed" to "[SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed" on Dec 10, 2014

* @param initializationMode initialization model, either "random" or "k-means||" (default).
* @param seed random seed value for cluster initialization
*/
def train(
Contributor

Could you move this to the beginning and make the one without seed call this?
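
One way to read this request is the usual overload-delegation pattern, sketched below in plain Scala. `TrainSketch` and `Model` are hypothetical stand-ins, the parameter lists are shortened, and no clustering is performed; the sketch only shows the seeded overload placed first and the seedless one forwarding to it.

```scala
object TrainSketch {
  // Hypothetical stand-in for a trained model; only records what was passed in.
  final case class Model(k: Int, seed: Long)

  // The most specific overload comes first and does the actual work.
  def train(data: Seq[Array[Double]], k: Int, seed: Long): Model = {
    require(data.nonEmpty, "data must not be empty")
    Model(k, seed)
  }

  // The seedless overload only supplies a default seed and delegates.
  def train(data: Seq[Array[Double]], k: Int): Model =
    train(data, k, System.nanoTime())

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(0.0, 0.0), Array(1.0, 1.0))
    assert(train(data, 2, 7L) == train(data, 2, 7L))
    assert(train(data, 2, 7L).seed == 7L)
  }
}
```

With this layout there is a single code path to maintain, and the seedless call differs from the seeded one only in where the seed comes from.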

@mengxr
Contributor

mengxr commented Dec 19, 2014

LGTM except minor inline comments.

@mengxr
Contributor

mengxr commented Dec 19, 2014

ok to test

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24654 has started for PR 3610 at commit f8d5928.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24654 has finished for PR 3610 at commit f8d5928.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkContext(config: SparkConf) extends Logging
    • class RandomModuleHook(object):
    • class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Boolean)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24654/

@jkbradley
Member

failure in a streaming test...retesting

@SparkQA

SparkQA commented Dec 22, 2014

Test build #551 has started for PR 3610 at commit f8d5928.

  • This patch merges cleanly.


 /**
 * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1,
-* initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4}.
+* initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, System.nanoTime()}.
Member

Add parameter name: seed: System.nanoTime()

@SparkQA

SparkQA commented Dec 22, 2014

Test build #551 has finished for PR 3610 at commit f8d5928.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Jan 10, 2015

@nxwhite-str There are a few minor comments left. Do you have time to update the PR?

@SparkQA

SparkQA commented Jan 21, 2015

Test build #25891 has started for PR 3610 at commit a2ebbd3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 21, 2015

Test build #25891 has finished for PR 3610 at commit a2ebbd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25891/

asfgit closed this in 7450a99 on Jan 21, 2015

@mengxr
Contributor

mengxr commented Jan 21, 2015

Merged into master. Thanks!

bomeng pushed a commit to Huawei-Spark/spark that referenced this pull request Jan 22, 2015
This implements the functionality for SPARK-4749 and provides unit tests in Scala and PySpark

Author: nate.crosswhite <nate.crosswhite@stresearch.com>
Author: nxwhite-str <nxwhite-str@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes apache#3610 from nxwhite-str/master and squashes the following commits:

a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed
7668124 [Xiangrui Meng] minor updates
f8d5928 [nate.crosswhite] Addressing PR issues
277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test
616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API