
[SPARK-19827][R]spark.ml R API for PIC #23072

Status: Closed (wants to merge 10 commits)

Conversation

huaxingao (Contributor):

What changes were proposed in this pull request?

Add PowerIterationClustering (PIC) in R

How was this patch tested?

Add test case


SparkQA commented Nov 17, 2018

Test build #98971 has finished for PR 23072 at commit 9e2b0f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixcheung (Member) left a comment:

thx, a few comments

#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
setMethod("spark.assignClusters",
          signature(data = "SparkDataFrame"),
          function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
Member:

set valid values for initMode and check for it - eg. https://github.com/apache/spark/pull/23072/files#diff-d9f92e07db6424e2527a7f9d7caa9013R355

and match.arg(initMode)
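R's match.arg validates an argument against a fixed set of candidate values and fails fast otherwise. As an illustration only (not SparkR code, and the function name is hypothetical), the same pattern in Python:

```python
def validate_init_mode(init_mode="random"):
    """Accept only the allowed initialization modes, mimicking R's match.arg."""
    allowed = ("random", "degree")
    if init_mode not in allowed:
        raise ValueError(f"initMode should be one of {allowed}, got {init_mode!r}")
    return init_mode
```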

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
head(spark.assignClusters(df, initMode="degree", weightCol="weight"))
Member:

spacing: initMode = "degree", weightCol = "weight"

setMethod("spark.assignClusters",
          signature(data = "SparkDataFrame"),
          function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
                   dstCol = "dst", weightCol = NULL) {
Member:

I think we try to avoid srcCol dstCol in R (I think there are other R ml APIs like that)


SparkQA commented Nov 19, 2018

Test build #99000 has finished for PR 23072 at commit 2ebfe5a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 19, 2018

Test build #98999 has finished for PR 23072 at commit f9cb330.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 19, 2018

Test build #99017 has finished for PR 23072 at commit ea45a51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

huaxingao (Contributor Author):

retest this please


SparkQA commented Nov 20, 2018

Test build #99028 has finished for PR 23072 at commit ea45a51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
Member:

could you open a separate PR with just this file (minus R) and FPGrowthExample.scala on branch-2.4?

Contributor Author:

sure

Member:

Pardon, I'm catching up -- why just commit this doc to 2.4 and not master?

Contributor Author:

The doc change will be in both 2.4 and master, but the R related code will be in master only. I think that's why @felixcheung asked me to open a separate PR to merge in the doc change for 2.4.

Member:

OK sounds good. Let's merge this one first just as a matter of process.


SparkQA commented Nov 29, 2018

Test build #99470 has finished for PR 23072 at commit 719d9d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 29, 2018

Test build #99478 has finished for PR 23072 at commit 15cf7f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -64,4 +64,3 @@ object FPGrowthExample {
spark.stop()
}
}
// scalastyle:on println
Member:

Hi, @huaxingao . Let's not remove this. I understand the intention, but we had better keep this because this is the indicator of the end of the scope of line 20.

Contributor Author:

@dongjoon-hyun sorry, I missed the // scalastyle:off println.
Is it OK with you if I remove // scalastyle:off println too, since println is not used in the example?

Member:

Of course, sure!

Member:

yes, println is not used

dstCol: String,
weightCol: String): PowerIterationClustering = {
val pic = new PowerIterationClustering()
.setK(k)
Member:

Indentation with two spaces?

PIC finds a very low-dimensional embedding of a dataset using truncated power
iteration on a normalized pair-wise similarity matrix of the data.
abstract: iteration on a normalized pair-wise similarity matrix of the data.
Member:

Could you check this again? It seems to break the original sentence accidentally. Maybe, From the abstract: PIC finds ~~~?
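For intuition, the abstract's description can be sketched in a few lines of pure Python. This is a hedged illustration of the idea only, not Spark's implementation: it runs truncated power iteration on the row-normalized affinity matrix of the five-node graph from this PR's test case, and a crude midpoint split of the 1-D embedding stands in for the k-means step of Lin and Cohen's paper.

```python
def pic_sketch(affinity, max_iter=20):
    """Truncated power iteration on the row-normalized affinity matrix W = D^-1 A."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    # Row-normalize the affinity matrix.
    w = [[a / d for a in row] for row, d in zip(affinity, degrees)]
    # "degree" initialization, as in initMode = "degree".
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(max_iter):
        v = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(abs(x) for x in v)
        v = [x / s for x in v]
    # Split the 1-D embedding at its midpoint (stand-in for k-means with k = 2).
    threshold = (min(v) + max(v)) / 2
    return [1 if x > threshold else 0 for x in v]

# Undirected affinity for the test-case edges:
# (0,1), (0,2), (1,2), (3,4) with weight 1.0 and (4,0) with weight 0.1.
affinity = [
    [0.0, 1.0, 1.0, 0.0, 0.1],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.1, 0.0, 0.0, 1.0, 0.0],
]
clusters = pic_sketch(affinity)
```

The weakly linked groups {0, 1, 2} and {3, 4} end up in different clusters, matching the assignment the SparkR test expects.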


SparkQA commented Dec 1, 2018

Test build #99528 has finished for PR 23072 at commit 9158da8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment:

#' @param initMode Param for the initialization algorithm.
#' @param maxIter Param for maximum number of iterations.
#' @param sourceCol Param for the name of the input column for source vertex IDs.
#' @param destinationCol Name of the input column for destination vertex IDs.
Member:

nit. Here, Name -> Param for the name for consistency with the other param descriptions?

Or, is it better to remove the Param for prefix in the other descriptions?

list(list(list(1L), list(3L)), 2L)),
schema = c("sequence", "freq"))

expect_equivalent(expected_result, result1)
Member:

The spark.prefixSpan test case is irrelevant to the scope of this PR.
If we want to reindent and add this missing line expect_equivalent(expected_result, result1), let's add it in another PR.

@@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", {
expect_equal(length(local.posterior), sum(unlist(local.posterior)))
})

test_that("spark.assignClusters", {
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
Member:

indentation for this whole block (line 323~334)?

}
if (!is.integer(maxIter) || maxIter <= 0) {
stop("maxIter should be a number with value > 0.")
}
Member:

Can we make sure that the src and dst columns are int or bigint, too? Otherwise, we may hit an IllegalArgumentException from the Scala side in the case of an R double.

Contributor Author:

@dongjoon-hyun src and dst are character columns. I have the check for character type.

as.character(sourceCol),
as.character(destinationCol)

Member:

I mean the data SparkDataFrame's column types, if possible. If you remove 'L' from '0L' in your example dataset, you can see the failure.

Contributor Author:

Seems to me that R is a thin wrapper; we only need to create a PIC object and call the corresponding Scala method. The SparkDataFrame's column types are only checked in Scala, not in R.

## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
developed by <a href=http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and Cohen</a>.
Member:

It seems that <a> tag doesn't work here. Could you check the generated document and try [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf) instead?

Contributor Author:

I normally check the md file on GitHub. The link works OK. Is there a better way to check? @dongjoon-hyun @felixcheung
https://github.com/apache/spark/blob/9158da8cb76cc13f3011deaa7ac2c290eef62389/docs/ml-clustering.md
I guess I will still remove the a href= since no other place in the doc uses <a>

Member:

You need to build from the Spark repository because Jekyll handles it differently from GitHub. Please try to build in the docs directory. There is a README.md for that.

Member:

Actually, I built this PR on my Mac, and found that the hyperlink is not generated.

Contributor Author:

Thanks. I will change the hyperlink.

{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
</div>

<div data-lang="r" markdown="1">
Member:

It seems that Python is missing here. Do we need to file a JIRA issue to add a Python example?
cc @HyukjinKwon

Contributor Author:

@dongjoon-hyun
#22996
I will add the python example in the doc once the above PR is merged in.

Member:

Thanks. Got it.

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
list(1L, 2L, 1.0), list(3L, 4L, 1.0),
list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
#assign clusters
Member:

nit. #assign -> # assign.

huaxingao (Contributor Author):

@dongjoon-hyun Thank you very much for your review. I will make the changes soon.


SparkQA commented Dec 6, 2018

Test build #99794 has finished for PR 23072 at commit ca19b00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```{r}
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
list(1L, 2L, 1.0), list(3L, 4L, 1.0),
Member:

Do we have an indentation rule for this? This PR is using two types of indentations for the same statements.

  • For docs (sparkr-vignettes.Rmd, mllib_clustering.R), this line is aligned with the first list.
  • For real code (test_mllib_clustering.R, powerIterationClustering.R), this line is aligned with the second list.

Can we use the same indentation rule?

Member:

Yea, we do have an indentation rule. "Code Style Guide" at https://spark.apache.org/contributing.html -> https://google.github.io/styleguide/Rguide.xml. I know the code style is not perfectly documented but at least there are some examples. I think the correct indentation is:

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))

Member:

There are two separate styles already mixed in the R code IIRC:

df <- createDataFrame(
  list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
    list(1L, 2L, 1.0), list(3L, 4L, 1.0),
    list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))

or

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))

Let's avoid mixed styles, and let's go for the latter one when possible, because at least it complies better with the code style.

Member:

BTW, when I added that into https://spark.apache.org/contributing.html, we also agreed to follow the committer's judgement based on the guide, because the guide mentions:

The coding conventions described above should be followed, unless there is good reason to do otherwise. Exceptions include legacy code and modifying third-party code.

since we do have legacy reasons, and there is a good reason: consistency and the committer's judgement.

list(1L, 0L),
list(3L, 1L),
list(2L, 0L)),
schema = c("id", "cluster"))
Member:

ditto for style


SparkQA commented Dec 7, 2018

Test build #99837 has finished for PR 23072 at commit 184560c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 7, 2018

Test build #99839 has finished for PR 23072 at commit cd07083.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Dec 9, 2018

@dongjoon-hyun @felixcheung how about now?

dongjoon-hyun (Member):

It looks good enough to me, @srowen.


srowen commented Dec 11, 2018

Merged to master

@srowen srowen closed this in 05cf81e Dec 11, 2018
felixcheung (Member) left a comment:

sorry, late to the party.

#'
#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
#' return a cluster assignment for each input vertex.
#'
Member:

remove the empty line - empty lines are significant in roxygen2

Member:

I'll be more careful about this roxygen2 stuff.

# Run the PIC algorithm and returns a cluster assignment for each input vertex.
#' @param data a SparkDataFrame.
#' @param k the number of clusters to create.
#' @param initMode the initialization algorithm.
Member:

add One of "random", "degree"?

#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
#' The schema of it will be:
#' \code{id: Long}
#' \code{cluster: Int}
Member:

mm, this won't format correctly - roxygen strips all the whitespace
also Long and Int are not proper types in R

#' \code{id: Long}
#' \code{cluster: Int}
#' @rdname spark.powerIterationClustering
#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
Member:

wait, this aliases doesn't make sense. Could you test ?assignClusters in an R shell to see if this works?

this should be @aliases spark.assignClusters,SparkDataFrame-method

#' list(1L, 2L, 1.0), list(3L, 4L, 1.0),
#' list(4L, 0L, 0.1)),
#' schema = c("src", "dst", "weight"))
#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
Member:

space around = as style

if (!is.numeric(k) || k < 1) {
stop("k should be a number with value >= 1.")
}
if (!is.integer(maxIter) || maxIter <= 0) {
Member:

if maxIter should be an integer, should we check that k is also an integer? It's fixed when it is passed, so just a minor consistency point on the value check
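The consistency point is that both parameters get the same strictness. As an illustration only (the actual SparkR checks use is.numeric / is.integer in R, and this helper name is hypothetical), the pattern in Python:

```python
def check_params(k, max_iter):
    """Require both k and max_iter to be positive integers, for consistent validation."""
    if not isinstance(k, int) or k < 1:
        raise ValueError("k should be an integer with value >= 1.")
    if not isinstance(max_iter, int) or max_iter <= 0:
        raise ValueError("maxIter should be an integer with value > 0.")
    return k, max_iter
```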

list(4L, 0L, 0.1)),
schema = c("src", "dst", "weight"))
# assign clusters
clusters <- spark.assignClusters(df, k=2L, maxIter=20L, initMode="degree", weightCol="weight")
Member:

tiny nit: here as well spacing around =.


srowen commented Dec 11, 2018

Oops, my bad. I opened #23292

srowen added a commit that referenced this pull request Dec 12, 2018
## What changes were proposed in this pull request?

Follow up style fixes to PIC in R; see #23072

## How was this patch tested?

Existing tests.

Closes #23292 from srowen/SPARK-19827.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23072 from huaxingao/spark-19827.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019