-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19142][SparkR]:spark.kmeans should take seed, initSteps, and tol as parameters #16523
Conversation
Test build #71101 has finished for PR 16523 at commit
|
Test build #71103 has finished for PR 16523 at commit
|
cc @yanboliang |
formula <- paste(deparse(formula), collapse = "") | ||
initMode <- match.arg(initMode) | ||
if (!is.null(seed)) { | ||
seed <- as.character(as.integer(seed)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to know why you convert seed
to integer first and then convert to character? AFAIK, the type of seed
in MLlib is Long
whose max value is 9223372036854775807
. as.integer
will return NA
if it's beyond the scope of integer. Should we support consistent range for seed
across languages? It looks like R support for Long
is not very well, if we only support integer, do we need to convert the integer to character?
cc @felixcheung
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed one example in the file. Let me investigate more about this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as you call out, R does not natively support 64bit integer. I think we are pretty much stuck here since the user won't be able to pass in a 64bit integer. We could explore making this a string but I really think that is hard to use.
the reason this is a string on the JVM side is because we want to support default seed value when it is unset (which is passed as NULL)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification, it's reasonable, let's leave it as is.
@@ -99,7 +99,8 @@ test_that("spark.kmeans", { | |||
|
|||
take(training, 1) | |||
|
|||
model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, initMode = "random") | |||
model <- spark.kmeans(data = training, ~ ., k = 2, maxIter = 10, initMode = "random", seed = 1, | |||
initSteps = 3, tol = 1E-5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the test case is insensitive to seed
, could you add a test case which is sensitive to seed
? Or add a test case whose termination controlled by tol
? We need to make sure these arguments will take effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I will compose a test that is really controlled by the parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested existing unit tests in KmeansSuite.scala in both ML and MLLIB. They are all insensitive to seed
and tol
. I think I should compose tests use random
as initMode
.
Test build #71229 has finished for PR 16523 at commit
|
Test build #71230 has finished for PR 16523 at commit
|
|
||
model1 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, | ||
initMode = "random", seed = 1, tol = 1E-5) | ||
model2 <- model <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo? model2 <- model <- spark.kmeans
Test build #71273 has finished for PR 16523 at commit
|
LGTM. Probably will be a good idea to review:
to see if there is anything to add (there might not - we don't want to overload people with every parameters) |
@felixcheung I will review these items after wrapping up my current work. Now I am working on two items: The bug 18011; and bisecting kmeans. bisecting kmeans should be ready soon. Bug 18011 needs more debugging. Thanks! |
sounds good! @yanboliang any more comment before we merge? |
LGTM, merged into master. Thanks. We can not update JIRA since it's currently down for maintenance, will do later. |
…ol as parameters ## What changes were proposed in this pull request? spark.kmeans doesn't have interface to set initSteps, seed and tol. As Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans. Add missing parameters and corresponding document. Modified existing unit tests to take additional parameters. Author: wm624@hotmail.com <wm624@hotmail.com> Closes apache#16523 from wangmiao1981/kmeans.
…ol as parameters ## What changes were proposed in this pull request? spark.kmeans doesn't have interface to set initSteps, seed and tol. As Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans. Add missing parameters and corresponding document. Modified existing unit tests to take additional parameters. Author: wm624@hotmail.com <wm624@hotmail.com> Closes apache#16523 from wangmiao1981/kmeans.
What changes were proposed in this pull request?
spark.kmeans doesn't have interface to set initSteps, seed and tol. As Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans.
Add missing parameters and corresponding document.
Modified existing unit tests to take additional parameters.