[SPARK-19142][SparkR]:spark.kmeans should take seed, initSteps, and tol as parameters #16523

wangmiao1981 · 2017-01-09T23:52:46Z

What changes were proposed in this pull request?

spark.kmeans has no interface for setting initSteps, seed, and tol. Because Spark's KMeans algorithm does not take the same set of parameters as R's kmeans, spark.kmeans should maintain its own interface.

Add the missing parameters and the corresponding documentation.

Modify the existing unit tests to pass the additional parameters.

SparkQA · 2017-01-09T23:59:28Z

Test build #71101 has finished for PR 16523 at commit 8379018.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-10T01:30:00Z

Test build #71103 has finished for PR 16523 at commit f44c5b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2017-01-10T06:41:36Z

cc @yanboliang

yanboliang · 2017-01-10T14:07:53Z

R/pkg/R/mllib_clustering.R

            formula <- paste(deparse(formula), collapse = "")
            initMode <- match.arg(initMode)
+            if (!is.null(seed)) {
+              seed <- as.character(as.integer(seed))


Why do you convert seed to an integer first and then to a character string? AFAIK, the seed type in MLlib is Long, whose maximum value is 9223372036854775807, and as.integer returns NA for values outside the integer range. Should we support a consistent seed range across languages? R's support for Long does not seem very good; if we only support integers, do we still need to convert the integer to a character string?
cc @felixcheung

wangmiao1981 · 2017-01-10

I followed an existing example in the file. Let me investigate this issue further.

felixcheung · 2017-01-10

As you call out, R does not natively support 64-bit integers. I think we are pretty much stuck here, since the user won't be able to pass in a 64-bit integer. We could explore making this a string, but I really think that would be hard to use.

The reason this is a string on the JVM side is that we want to support the default seed value when it is unset (which is passed as NULL).

yanboliang · 2017-01-11

Thanks for the clarification. That's reasonable; let's leave it as is.
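
To make the range limit discussed in this thread concrete, here is a minimal base-R sketch (not taken from the patch) of what as.character(as.integer(seed)) does with an in-range seed and with the Long maximum quoted above. The variable names are illustrative only, and no Spark session is involved.

```r
# Base R only; no Spark session is needed for this illustration.
# R's native integer type is 32-bit, so this is the largest seed it can hold.
int_max <- .Machine$integer.max

# An in-range seed survives the round trip to a character string.
seed_str <- as.character(as.integer(1))

# The Long maximum from the thread is parsed by R as a double, and
# as.integer() turns it into NA (with a coercion warning, silenced here).
too_big <- suppressWarnings(as.integer(9223372036854775807))

print(int_max)
print(seed_str)
print(is.na(too_big))
```

In practice this means a seed passed from R is limited to the 32-bit integer range, which is the trade-off the reviewers agreed to accept.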

yanboliang · 2017-01-10T14:23:13Z

R/pkg/inst/tests/testthat/test_mllib_clustering.R

@@ -99,7 +99,8 @@ test_that("spark.kmeans", {

  take(training, 1)

-  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, initMode = "random")
+  model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, initMode = "random", seed = 1,
+                        initSteps = 3, tol = 1E-5)


It looks like the test case is insensitive to seed. Could you add a test case that is sensitive to seed, or one whose termination is controlled by tol? We need to make sure these arguments actually take effect.

wangmiao1981 · 2017-01-10

Sounds good. I will compose a test that is actually controlled by the parameters.

wangmiao1981 · 2017-01-11

I ran the existing unit tests in KmeansSuite.scala in both ML and MLlib. They are all insensitive to seed and tol. I think I should compose tests that use random as the initMode.
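
As a rough illustration of what a seed-sensitive check looks like, the sketch below uses base R's stats::kmeans as a stand-in, not spark.kmeans. It assumes random initial centers selected via set.seed(), since stats::kmeans has no seed argument; the helper fit_with_seed and the toy data are hypothetical and not part of the patch.

```r
# Stand-in using base R's stats::kmeans, NOT spark.kmeans.
# stats::kmeans has no seed argument, so set.seed() controls the random start.
set.seed(19142)                    # fixes the toy data only
x <- matrix(runif(200), ncol = 2)  # 100 points with no strong cluster structure

# Hypothetical helper: fit k-means from randomly chosen initial centers.
fit_with_seed <- function(seed) {
  set.seed(seed)
  suppressWarnings(kmeans(x, centers = 5, iter.max = 10, nstart = 1))
}

a <- fit_with_seed(1)
b <- fit_with_seed(1)
d <- fit_with_seed(2)

# Same seed: the fit is reproducible.
stopifnot(identical(a$cluster, b$cluster))

# Different seed: the partition may or may not change. A seed-sensitive test
# needs data for which it does, so inspect the result rather than assume it.
print(table(a$cluster, d$cluster))
```

The same idea carries over to a Spark test: fit twice with random initialization and different seeds on data without an obvious clustering, then assert on whatever property actually differs.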

SparkQA · 2017-01-11T20:28:40Z

Test build #71229 has finished for PR 16523 at commit 44a0c73.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-11T21:46:13Z

Test build #71230 has finished for PR 16523 at commit c840c4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-01-12T13:30:26Z

R/pkg/inst/tests/testthat/test_mllib_clustering.R

+
+  model1 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,
+                         initMode = "random", seed = 1, tol = 1E-5)
+  model2 <- model <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,


typo? model2 <- model <- spark.kmeans

SparkQA · 2017-01-12T19:38:57Z

Test build #71273 has finished for PR 16523 at commit 1c27df4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-12T19:56:54Z

LGTM. It would probably be a good idea to review:

vignettes
programming guide
R examples

to see if there is anything to add (there might not be; we don't want to overload people with every parameter)

wangmiao1981 · 2017-01-12T20:08:18Z

@felixcheung I will review these items after wrapping up my current work. I am currently working on two items: bug 18011 and bisecting kmeans.

Bisecting kmeans should be ready soon. Bug 18011 needs more debugging. Thanks!

felixcheung · 2017-01-13T04:15:36Z

Sounds good! @yanboliang, any more comments before we merge?

yanboliang · 2017-01-13T06:30:50Z

LGTM, merged into master. Thanks. We cannot update JIRA since it's currently down for maintenance; will do so later.

…ol as parameters

## What changes were proposed in this pull request?

spark.kmeans doesn't have interface to set initSteps, seed and tol. As Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans.

Add missing parameters and corresponding document.

Modified existing unit tests to take additional parameters.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#16523 from wangmiao1981/kmeans.

yanboliang reviewed Jan 10, 2017


wangmiao1981 added 3 commits January 11, 2017 09:41

take additional parameters for spark.kmeans

32ce873

fix R style

961f601

add a test that is sensitive to seed value

73f8f2e

wangmiao1981 force-pushed the kmeans branch from f44c5b9 to 73f8f2e Compare January 11, 2017 20:17

modify comment

44a0c73

fix style

c840c4d

yanboliang reviewed Jan 12, 2017


fix typo

1c27df4

asfgit closed this in 7f24a0b Jan 13, 2017
