[SPARK-19827][R] spark.ml R API for PIC #23072
Conversation
Test build #98971 has finished for PR 23072 at commit
thx, a few comments
R/pkg/R/mllib_clustering.R
Outdated
#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
setMethod("spark.assignClusters",
          signature(data = "SparkDataFrame"),
          function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
set valid values for initMode and check for them - e.g. https://github.com/apache/spark/pull/23072/files#diff-d9f92e07db6424e2527a7f9d7caa9013R355 - and use match.arg(initMode)
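The suggestion above can be sketched in base R, independent of Spark. The function name below is hypothetical, not the actual SparkR method; it only illustrates the match.arg() pattern being requested:

```r
# Minimal sketch (assumption): validate initMode against a fixed set of
# values using base R's match.arg(). The stub name and signature are
# hypothetical, not the real spark.assignClusters method.
assignClustersInitMode <- function(initMode = c("random", "degree")) {
  # match.arg() returns the first choice when the argument is missing,
  # and stops with an informative error for any value outside the set.
  match.arg(initMode)
}
```

With this pattern, `assignClustersInitMode()` defaults to "random", `assignClustersInitMode("degree")` is accepted, and anything else fails fast with a clear error instead of reaching the Scala side.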
R/pkg/vignettes/sparkr-vignettes.Rmd
Outdated
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
head(spark.assignClusters(df, initMode="degree", weightCol="weight"))
spacing: initMode = "degree", weightCol = "weight"
R/pkg/R/mllib_clustering.R
Outdated
setMethod("spark.assignClusters",
          signature(data = "SparkDataFrame"),
          function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
                   dstCol = "dst", weightCol = NULL) {
I think we try to avoid srcCol/dstCol naming in R (I think there are other R ML APIs like that)
Test build #99000 has finished for PR 23072 at commit
Test build #98999 has finished for PR 23072 at commit
Test build #99017 has finished for PR 23072 at commit
retest this please
Test build #99028 has finished for PR 23072 at commit
## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
could you open a separate PR with just this file (minus R) and FPGrowthExample.scala on branch-2.4?
sure
Pardon, I'm catching up -- why just commit this doc to 2.4 and not master?
The doc change will be in both 2.4 and master, but the R related code will be in master only. I think that's why @felixcheung asked me to open a separate PR to merge in the doc change for 2.4.
OK sounds good. Let's merge this one first just as a matter of process.
Test build #99470 has finished for PR 23072 at commit
Test build #99478 has finished for PR 23072 at commit
@@ -64,4 +64,3 @@ object FPGrowthExample {
    spark.stop()
  }
}
// scalastyle:on println
Hi, @huaxingao . Let's not remove this. I understand the intention, but we had better keep this because this is the indicator of the end of the scope of line 20.
@dongjoon-hyun sorry, I missed the // scalastyle:off println. Is it OK with you if I also remove // scalastyle:off println, since println is not used in the example?
Of course, sure!
yes, println is not used
    dstCol: String,
    weightCol: String): PowerIterationClustering = {
  val pic = new PowerIterationClustering()
    .setK(k)
Indentation with two spaces?
python/pyspark/ml/clustering.py
Outdated
PIC finds a very low-dimensional embedding of a dataset using truncated power
iteration on a normalized pair-wise similarity matrix of the data.
abstract: iteration on a normalized pair-wise similarity matrix of the data.
Could you check this again? It seems to break the original sentence accidentally. Maybe "From the abstract: PIC finds ~~~"?
Test build #99528 has finished for PR 23072 at commit
@felixcheung @dongjoon-hyun OK with you?
R/pkg/R/mllib_clustering.R
Outdated
#' @param initMode Param for the initialization algorithm.
#' @param maxIter Param for maximum number of iterations.
#' @param sourceCol Param for the name of the input column for source vertex IDs.
#' @param destinationCol Name of the input column for destination vertex IDs.
nit. Here, Name -> Param for the name for consistency with the other param descriptions? Or, is it better to remove the Param for prefix in the other descriptions?
                           list(list(list(1L), list(3L)), 2L)),
              schema = c("sequence", "freq"))

expect_equivalent(expected_result, result1)
The spark.prefixSpan test case is irrelevant to the scope of this PR. If we want to reindent and add this missing line expect_equivalent(expected_result, result1), let's add it in another PR.
@@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", {
  expect_equal(length(local.posterior), sum(unlist(local.posterior)))
})

test_that("spark.assignClusters", {
  df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
indentation for this whole block (line 323~334)?
}
if (!is.integer(maxIter) || maxIter <= 0) {
  stop("maxIter should be a number with value > 0.")
}
Can we make sure that the src and dst columns are int or bigint, too? Otherwise, we may hit IllegalArgumentException from the Scala side in case of an R double.
@dongjoon-hyun src and dst are character columns. I have the check for character type:
as.character(sourceCol),
as.character(destinationCol)
I mean the data SparkDataFrame's column types, if possible. If you remove 'L' from '0L' in your example dataset, you can see the failure.
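The failure mode described above can be seen in plain R, with no Spark involved: the L suffix is what makes a literal an integer rather than a double, and a column built from doubles would not satisfy a Scala-side int/bigint requirement.

```r
# Plain base-R illustration of the reviewer's point: 0L is an integer
# literal, while a bare 0 is a double. Dropping the 'L' silently changes
# the column type that Spark infers from the local data.
stopifnot(is.integer(0L))   # 0L is an integer
stopifnot(is.double(0))     # 0 is a double
stopifnot(!is.integer(0))   # so 0 is NOT an integer

typeof(0L)  # "integer"
typeof(0)   # "double"
```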
Seems to me that R is a thin wrapper; we only need to create a PIC object and call the corresponding Scala method. SparkDataFrame's column types are only checked in Scala, not in R.
docs/ml-clustering.md
Outdated
## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
developed by <a href=http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and Cohen</a>.
It seems that the <a> tag doesn't work here. Could you check the generated document and try [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf) instead?
I normally check the md file on the github. The link works OK. Is there a better way to check? @dongjoon-hyun @felixcheung
https://github.com/apache/spark/blob/9158da8cb76cc13f3011deaa7ac2c290eef62389/docs/ml-clustering.md
I guess I will still remove the a href= since no other place in the doc uses <a>.
You need to build from the Spark repository because Jekyll handles it differently from GitHub. Please try to build in the docs directory. There is a README.md for that.
Actually, I built this PR on my Mac, and found that the hyperlink is not generated.
Thanks. I will change the hyperlink.
{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
</div>

<div data-lang="r" markdown="1">
It seems that Python is missing here. Do we need to file a JIRA issue to add a Python example?
cc @HyukjinKwon
@dongjoon-hyun
#22996
I will add the python example in the doc once the above PR is merged in.
Thanks. Got it.
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
#assign clusters
nit. #assign -> # assign.
@dongjoon-hyun Thank you very much for your review. I will make the changes soon.
Test build #99794 has finished for PR 23072 at commit
R/pkg/vignettes/sparkr-vignettes.Rmd
Outdated
```{r}
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
Do we have an indentation rule for this? This PR is using two types of indentations for the same statements.
- For docs (sparkr-vignettes.Rmd, mllib_clustering.R), this line is aligned with the first list.
- For real code (test_mllib_clustering.R, powerIterationClustering.R), this line is aligned with the second list.

Can we use the same indentation rule?
Yea, we do have an indentation rule. "Code Style Guide" at https://spark.apache.org/contributing.html -> https://google.github.io/styleguide/Rguide.xml. I know the code style is not perfectly documented but at least there are some examples. I think the correct indentation is:

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))
There are two separate styles already mixed in R code IIRC:

df <- createDataFrame(
  list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
       list(1L, 2L, 1.0), list(3L, 4L, 1.0),
       list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))

or

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))

Let's avoid mixed styles, and let's go for the latter one when possible because at least that looks more compliant with the code style.
BTW, when I added that into https://spark.apache.org/contributing.html, we also agreed upon following the committer's judgement based upon the guide, because the guide mentions:

The coding conventions described above should be followed, unless there is good reason to do otherwise. Exceptions include legacy code and modifying third-party code.

since we do have legacy reasons, and there is a good reason: consistency and the committer's judgement.
                           list(1L, 0L),
                           list(3L, 1L),
                           list(2L, 0L)),
                      schema = c("id", "cluster"))
ditto for style
Test build #99837 has finished for PR 23072 at commit
Test build #99839 has finished for PR 23072 at commit
@dongjoon-hyun @felixcheung how about now?
It looks enough to me, @srowen.
Merged to master |
sorry, late to the party.
#'
#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
#' return a cluster assignment for each input vertex.
#'
remove empty line - an empty line is significant in roxygen2
I'll be more careful about this roxygen2 stuff.
# Run the PIC algorithm and returns a cluster assignment for each input vertex.
#' @param data a SparkDataFrame.
#' @param k the number of clusters to create.
#' @param initMode the initialization algorithm.
add One of "random", "degree"?
#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
#'         The schema of it will be:
#'         \code{id: Long}
#'         \code{cluster: Int}
mm, this won't format correctly - roxygen strips all the whitespace.
Also, Long and Int are not proper types in R.
#' \code{id: Long}
#' \code{cluster: Int}
#' @rdname spark.powerIterationClustering
#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
wait, this aliases doesn't make sense. Could you test whether ?assignClusters works in an R shell?
This should be @aliases spark.assignClusters,SparkDataFrame-method
#'                            list(1L, 2L, 1.0), list(3L, 4L, 1.0),
#'                            list(4L, 0L, 0.1)),
#'                       schema = c("src", "dst", "weight"))
#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
space around = as style
if (!is.numeric(k) || k < 1) {
  stop("k should be a number with value >= 1.")
}
if (!is.integer(maxIter) || maxIter <= 0) {
if maxIter should be an integer, should we check that k is also an integer? It's fixed when it is passed, so this is just a minor consistency note on the value check.
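The consistency point could be addressed with one shared validator in base R. The helper name below is hypothetical, not SparkR code; it just shows how k and maxIter would get the same check:

```r
# Sketch (assumption): a single validator so k and maxIter receive an
# identical integer check. The helper name checkPositiveInt is
# hypothetical, not part of the actual SparkR implementation.
checkPositiveInt <- function(x, name) {
  # is.integer() rejects doubles like 2.0; length check rejects vectors.
  if (!is.integer(x) || length(x) != 1L || x < 1L) {
    stop(name, " should be an integer with value >= 1.")
  }
  invisible(x)
}
```

With this, both checkPositiveInt(k, "k") and checkPositiveInt(maxIter, "maxIter") apply the same rule, instead of k accepting any numeric while maxIter requires an integer.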
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))
# assign clusters
clusters <- spark.assignClusters(df, k=2L, maxIter=20L, initMode="degree", weightCol="weight")
tiny nit: here as well, spacing around =.
Oops, my bad. I opened #23292
## What changes were proposed in this pull request?
Follow up style fixes to PIC in R; see #23072
## How was this patch tested?
Existing tests.
Closes #23292 from srowen/SPARK-19827.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

## What changes were proposed in this pull request?
Add PowerIterationClustering (PIC) in R
## How was this patch tested?
Add test case
Closes apache#23072 from huaxingao/spark-19827.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
Add PowerIterationClustering (PIC) in R
How was this patch tested?
Add test case