[SPARK-19142][SparkR]:spark.kmeans should take seed, initSteps, and tol as parameters #16523

Closed
wants to merge 6 commits into from


wangmiao1981
Contributor

What changes were proposed in this pull request?

spark.kmeans has no interface for setting initSteps, seed, and tol. Because the Spark KMeans algorithm does not take the same set of parameters as R's kmeans, spark.kmeans should maintain its own interface.

Add the missing parameters and the corresponding documentation.

Modified existing unit tests to take additional parameters.

@SparkQA

SparkQA commented Jan 9, 2017

Test build #71101 has finished for PR 16523 at commit 8379018.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71103 has finished for PR 16523 at commit f44c5b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

cc @yanboliang

formula <- paste(deparse(formula), collapse = "")
initMode <- match.arg(initMode)
if (!is.null(seed)) {
  seed <- as.character(as.integer(seed))
Contributor

Why do you convert seed to an integer first and then to a character? AFAIK, the type of seed in MLlib is Long, whose maximum value is 9223372036854775807, and as.integer returns NA for any value outside the integer range. Should we support a consistent range for seed across languages? R does not seem to support Long well; if we only support integers, do we still need to convert the integer to a character?
cc @felixcheung

Contributor Author

I followed an existing example in the file. Let me investigate this issue further.

Member

As you call out, R does not natively support 64-bit integers. I think we are pretty much stuck here, since the user won't be able to pass in a 64-bit integer. We could explore making this a string, but I really think that would be hard to use.

The reason this is a string on the JVM side is that we want to support the default seed value when it is unset (which is passed as NULL).

Contributor

Thanks for the clarification. That's reasonable; let's leave it as is.
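The behaviour discussed in this thread can be reproduced in base R alone. The sketch below is only an illustration: `normalize_seed` is a hypothetical helper that mirrors the conversion in the diff and is not part of SparkR. It shows that a seed within the 32-bit integer range round-trips to a string, that a value beyond that range becomes NA, and that NULL can be passed through untouched so that a backend could fall back to its default.

```r
# Hypothetical helper mirroring the conversion under review; not SparkR API.
normalize_seed <- function(seed) {
  if (is.null(seed)) {
    return(NULL)  # unset: left as NULL so the backend can pick its default
  }
  # as.integer() yields NA (with a warning) outside the 32-bit integer range
  as.character(suppressWarnings(as.integer(seed)))
}

small <- normalize_seed(1)     # an ordinary seed
large <- normalize_seed(2^31)  # one past .Machine$integer.max
unset <- normalize_seed(NULL)  # no seed supplied

print(small)
print(is.na(large))
print(is.null(unset))
```

The warning is suppressed here only to keep the output short; real code would probably want to surface it, or reject an out-of-range seed explicitly.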

@@ -99,7 +99,8 @@ test_that("spark.kmeans", {

take(training, 1)

-  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, initMode = "random")
+  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, initMode = "random", seed = 1,
+                        initSteps = 3, tol = 1E-5)
Contributor

It looks like the test case is insensitive to seed. Could you add a test case that is sensitive to seed, or one whose termination is controlled by tol? We need to make sure these arguments actually take effect.

Contributor Author

Sounds good. I will compose a test that is really controlled by the parameters.

Contributor Author

@wangmiao1981 wangmiao1981 Jan 11, 2017

I tested the existing unit tests in KMeansSuite.scala in both ML and MLlib. They are all insensitive to seed and tol, so I think I should write tests that use random as the initMode.
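To make the two properties under discussion concrete, here is a minimal base-R sketch. It swaps in a toy Lloyd's k-means (`toy_kmeans`, a hypothetical helper written for this illustration, not Spark's implementation) and assumes that tol is compared against how far the centers move between iterations. It shows that random initialization is reproducible for a fixed seed and that tol can decide when the loop terminates.

```r
# Toy Lloyd's k-means in base R. Hypothetical helper for illustration only;
# this is NOT Spark's implementation.
# Note: set.seed() changes the global RNG state as a side effect.
toy_kmeans <- function(x, k, max_iter = 10, seed = NULL, tol = 1e-4) {
  if (!is.null(seed)) set.seed(seed)
  # "random" initialization: use k distinct rows as the starting centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  iterations <- 0
  for (i in seq_len(max_iter)) {
    iterations <- i
    # n x k matrix of squared distances from each point to each center
    d <- vapply(seq_len(k),
                function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                numeric(nrow(x)))
    cluster <- max.col(-d, ties.method = "first")  # index of nearest center
    new_centers <- centers
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centers[j, ] <- colMeans(members)
    }
    # largest distance any center moved in this iteration
    shift <- max(sqrt(rowSums((new_centers - centers)^2)))
    centers <- new_centers
    if (shift <= tol) break  # converged: every center moved at most tol
  }
  list(centers = centers, cluster = cluster, iterations = iterations)
}

# Made-up data with two well-separated groups
x <- cbind(a = c(0, 1, 2, 10, 11, 12), b = c(0, 1, 0, 10, 11, 10))

# Same seed twice: the random initialization, and so the fit, is reproduced.
fit_a <- toy_kmeans(x, k = 2, seed = 1)
fit_b <- toy_kmeans(x, k = 2, seed = 1)
same_seed_same_fit <- identical(fit_a, fit_b)

# With a huge tolerance the loop stops after the first iteration.
loose <- toy_kmeans(x, k = 2, seed = 1, tol = 1e6)

print(same_seed_same_fit)
print(loose$iterations)
```

A test that is sensitive to seed would additionally need data on which two different seeds lead to different clusterings; whether that happens depends on the data set, so this sketch does not assert it.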

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71229 has finished for PR 16523 at commit 44a0c73.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71230 has finished for PR 16523 at commit c840c4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


model1 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,
                       initMode = "random", seed = 1, tol = 1E-5)
model2 <- model <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,
Contributor

typo? model2 <- model <- spark.kmeans
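For context on why the flagged line still runs, a minimal base-R sketch follows. Placeholder strings stand in for fitted models; this is not the PR's code. Chained assignment is valid R, so the line raises no error, but it also rebinds `model`, which could silently affect later assertions that use that name.

```r
# `<-` is right-associative, so the chain assigns the same value to both names.
model <- "fit from an earlier step"  # placeholder standing in for a fitted model
model2 <- model <- "second fit"      # the pattern flagged in the review

print(model2)
print(model)  # the earlier value of `model` has been overwritten
```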

@SparkQA

SparkQA commented Jan 12, 2017

Test build #71273 has finished for PR 16523 at commit 1c27df4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

LGTM. It would probably be a good idea to review:

  • vignettes
  • programming guide
  • R examples

to see if there is anything to add (there might not be; we don't want to overload people with every parameter).

@wangmiao1981
Contributor Author

@felixcheung I will review these items after wrapping up my current work. I am currently working on two items: bug 18011 and bisecting kmeans.

Bisecting kmeans should be ready soon; bug 18011 needs more debugging. Thanks!

@felixcheung
Member

Sounds good! @yanboliang, any more comments before we merge?

@asfgit asfgit closed this in 7f24a0b Jan 13, 2017
@yanboliang
Contributor

LGTM, merged into master. Thanks. We cannot update JIRA because it is currently down for maintenance; we will do so later.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ol as parameters

## What changes were proposed in this pull request?
spark.kmeans has no interface for setting initSteps, seed, and tol. Because the Spark KMeans algorithm does not take the same set of parameters as R's kmeans, spark.kmeans should maintain its own interface.

Add the missing parameters and the corresponding documentation.

Modified existing unit tests to take additional parameters.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#16523 from wangmiao1981/kmeans.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017