
[SPARK-19827][R]spark.ml R API for PIC #23072

Status: Closed (wants to merge 10 commits)

Conversation

huaxingao (Contributor):

What changes were proposed in this pull request?

Add PowerIterationClustering (PIC) in R

How was this patch tested?

Add test case


SparkQA commented Nov 17, 2018

Test build #98971 has finished for PR 23072 at commit 9e2b0f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixcheung (Member) left a comment:

thx, a few comments

#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
setMethod("spark.assignClusters",
          signature(data = "SparkDataFrame"),
          function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
Member:

set valid values for initMode and check for it - eg. https://github.com/apache/spark/pull/23072/files#diff-d9f92e07db6424e2527a7f9d7caa9013R355

and match.arg(initMode)
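R's match.arg validates an argument against a fixed set of candidate values and fails fast otherwise. As an illustration only (not SparkR code, and the function name is hypothetical), the same pattern in Python:

```python
def validate_init_mode(init_mode="random"):
    """Accept only the allowed initialization modes, mimicking R's match.arg."""
    allowed = ("random", "degree")
    if init_mode not in allowed:
        raise ValueError(f"initMode should be one of {allowed}, got {init_mode!r}")
    return init_mode
```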

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
head(spark.assignClusters(df, initMode="degree", weightCol="weight"))
Member:

spacing: initMode = "degree", weightCol = "weight"

setMethod("spark.assignClusters",
          signature(data = "SparkDataFrame"),
          function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
                   dstCol = "dst", weightCol = NULL) {
Member:

I think we try to avoid srcCol dstCol in R (I think there are other R ml APIs like that)


SparkQA commented Nov 19, 2018

Test build #99000 has finished for PR 23072 at commit 2ebfe5a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 19, 2018

Test build #98999 has finished for PR 23072 at commit f9cb330.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 19, 2018

Test build #99017 has finished for PR 23072 at commit ea45a51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

huaxingao (Contributor Author):

retest this please


SparkQA commented Nov 20, 2018

Test build #99028 has finished for PR 23072 at commit ea45a51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
Member:

could you open a separate PR with just this file (minus R) and FPGrowthExample.scala on branch-2.4?

Contributor Author:

sure

Member:

Pardon, I'm catching up -- why just commit this doc to 2.4 and not master?

Contributor Author:

The doc change will be in both 2.4 and master, but the R related code will be in master only. I think that's why @felixcheung asked me to open a separate PR to merge in the doc change for 2.4.

Member:

OK sounds good. Let's merge this one first just as a matter of process.


SparkQA commented Nov 29, 2018

Test build #99470 has finished for PR 23072 at commit 719d9d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 29, 2018

Test build #99478 has finished for PR 23072 at commit 15cf7f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -64,4 +64,3 @@ object FPGrowthExample {
spark.stop()
}
}
// scalastyle:on println
Member:

Hi, @huaxingao . Let's not remove this. I understand the intention, but we had better keep this because this is the indicator of the end of the scope of line 20.

Contributor Author:

@dongjoon-hyun sorry, I missed the // scalastyle:off println.
Is it OK with you if I remove // scalastyle:off println too, since println is not used in the example?

Member:

Of course, sure!

Member:

yes, println is not used

dstCol: String,
weightCol: String): PowerIterationClustering = {
val pic = new PowerIterationClustering()
.setK(k)
Member:

Indentation with two spaces?

PIC finds a very low-dimensional embedding of a dataset using truncated power
iteration on a normalized pair-wise similarity matrix of the data.
abstract: iteration on a normalized pair-wise similarity matrix of the data.
Member:

Could you check this again? It seems to break the original sentence accidentally. Maybe, From the abstract: PIC finds ~~~?
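For intuition, the abstract's description can be sketched in a few lines of pure Python. This is a hedged illustration of the idea only, not Spark's implementation: it runs truncated power iteration on the row-normalized affinity matrix of the five-node graph from this PR's test case, and a crude midpoint split of the 1-D embedding stands in for the k-means step of Lin and Cohen's paper.

```python
def pic_sketch(affinity, max_iter=20):
    """Truncated power iteration on the row-normalized affinity matrix W = D^-1 A."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    # Row-normalize the affinity matrix.
    w = [[a / d for a in row] for row, d in zip(affinity, degrees)]
    # "degree" initialization, as in initMode = "degree".
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(max_iter):
        v = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(abs(x) for x in v)
        v = [x / s for x in v]
    # Split the 1-D embedding at its midpoint (stand-in for k-means with k = 2).
    threshold = (min(v) + max(v)) / 2
    return [1 if x > threshold else 0 for x in v]

# Undirected affinity for the test-case edges:
# (0,1), (0,2), (1,2), (3,4) with weight 1.0 and (4,0) with weight 0.1.
affinity = [
    [0.0, 1.0, 1.0, 0.0, 0.1],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.1, 0.0, 0.0, 1.0, 0.0],
]
clusters = pic_sketch(affinity)
```

The weakly linked groups {0, 1, 2} and {3, 4} end up in different clusters, matching the assignment the SparkR test expects.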


SparkQA commented Dec 1, 2018

Test build #99528 has finished for PR 23072 at commit 9158da8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment:

#' @param initMode Param for the initialization algorithm.
#' @param maxIter Param for maximum number of iterations.
#' @param sourceCol Param for the name of the input column for source vertex IDs.
#' @param destinationCol Name of the input column for destination vertex IDs.
Member:

nit. Here, Name -> Param for the name for consistency with the other param descriptions?

Or, is it better to remove the Param for prefix in the other descriptions?

list(list(list(1L), list(3L)), 2L)),
schema = c("sequence", "freq"))

expect_equivalent(expected_result, result1)
Member:

The spark.prefixSpan test case is irrelevant to the scope of this PR.
If we want to reindent and add this missing line expect_equivalent(expected_result, result1), let's add it in another PR.

@@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", {
expect_equal(length(local.posterior), sum(unlist(local.posterior)))
})

test_that("spark.assignClusters", {
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
Member:

indentation for this whole block (line 323~334)?

}
if (!is.integer(maxIter) || maxIter <= 0) {
stop("maxIter should be a number with value > 0.")
}
Member:

Can we make sure that the src and dst columns are int or bigint, too? Otherwise, we may hit an IllegalArgumentException from the Scala side in the case of an R double.

Contributor Author:

@dongjoon-hyun src and dst are character columns. I have the check for character type.

as.character(sourceCol),
as.character(destinationCol)

Member:

I mean the data SparkDataFrame's column types, if possible. If you remove 'L' from '0L' in your example dataset, you can see the failure.

Contributor Author:

Seems to me that R is a thin wrapper; we only need to create a PIC object and call the corresponding Scala method. The SparkDataFrame's column types are only checked in Scala, not in R.

## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
developed by <a href=http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and Cohen</a>.
Member:

It seems that <a> tag doesn't work here. Could you check the generated document and try [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf) instead?

Contributor Author:

I normally check the md file on GitHub. The link works OK. Is there a better way to check? @dongjoon-hyun @felixcheung
https://github.com/apache/spark/blob/9158da8cb76cc13f3011deaa7ac2c290eef62389/docs/ml-clustering.md
I guess I will still remove the a href= since no other place in the doc uses <a>

Member:

You need to build from the Spark repository because Jekyll handles it differently from GitHub. Please try to build in the docs directory. There is a README.md for that.

Member:

Actually, I built this PR on my Mac, and found that the hyperlink is not generated.

Contributor Author:

Thanks. I will change the hyperlink.

{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
</div>

<div data-lang="r" markdown="1">
Member:

It seems that Python is missing here. Do we need to file a JIRA issue to add a Python example?
cc @HyukjinKwon

Contributor Author:

@dongjoon-hyun
#22996
I will add the python example in the doc once the above PR is merged in.

Member:

Thanks. Got it.

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
list(1L, 2L, 1.0), list(3L, 4L, 1.0),
list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
#assign clusters
Member:

nit. #assign -> # assign.

huaxingao (Contributor Author):

@dongjoon-hyun Thank you very much for your review. I will make the changes soon.


SparkQA commented Dec 6, 2018

Test build #99794 has finished for PR 23072 at commit ca19b00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```{r}
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
list(1L, 2L, 1.0), list(3L, 4L, 1.0),
Member:

Do we have an indentation rule for this? This PR is using two types of indentations for the same statements.

  • For docs (sparkr-vignettes.Rmd, mllib_clustering.R), this line is aligned with the first list.
  • For real code (test_mllib_clustering.R, powerIterationClustering.R), this line is aligned with the second list.

Can we use the same indentation rule?

Member:

Yea, we do have an indentation rule. "Code Style Guide" at https://spark.apache.org/contributing.html -> https://google.github.io/styleguide/Rguide.xml. I know the code style is not perfectly documented but at least there are some examples. I think the correct indentation is:

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))

Member:

There are two separate styles already mixed in the R code IIRC:

df <- createDataFrame(
  list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
    list(1L, 2L, 1.0), list(3L, 4L, 1.0),
    list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))

or

df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))

Let's avoid mixed styles, and let's go for the latter one when possible, because at least it complies better with the code style.

Member:

BTW, when I added that into https://spark.apache.org/contributing.html, we also agreed to follow the committer's judgement based on the guide, because the guide mentions:

The coding conventions described above should be followed, unless there is good reason to do otherwise. Exceptions include legacy code and modifying third-party code.

since we do have legacy reasons, and there is a good reason: consistency and the committer's judgement.

list(1L, 0L),
list(3L, 1L),
list(2L, 0L)),
schema = c("id", "cluster"))
Member:

ditto for style


SparkQA commented Dec 7, 2018

Test build #99837 has finished for PR 23072 at commit 184560c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 7, 2018

Test build #99839 has finished for PR 23072 at commit cd07083.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Dec 9, 2018

@dongjoon-hyun @felixcheung how about now?

dongjoon-hyun (Member):

It looks good enough to me, @srowen.


srowen commented Dec 11, 2018

Merged to master

@srowen srowen closed this in 05cf81e Dec 11, 2018
felixcheung (Member) left a comment:

sorry, late to the party.

#'
#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
#' return a cluster assignment for each input vertex.
#'
Member:

remove the empty line - empty lines are significant in roxygen2

Member:

I'll be more careful about this roxygen2 stuff.

# Run the PIC algorithm and returns a cluster assignment for each input vertex.
#' @param data a SparkDataFrame.
#' @param k the number of clusters to create.
#' @param initMode the initialization algorithm.
Member:

add One of "random", "degree"?

#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
#' The schema of it will be:
#' \code{id: Long}
#' \code{cluster: Int}
Member:

mm, this won't format correctly - roxygen strips all the whitespace
also Long and Int are not proper types in R

#' \code{id: Long}
#' \code{cluster: Int}
#' @rdname spark.powerIterationClustering
#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
Member:

wait, this aliases doesn't make sense. Could you test ?assignClusters in an R shell to see if this works?

this should be @aliases spark.assignClusters,SparkDataFrame-method

#' list(1L, 2L, 1.0), list(3L, 4L, 1.0),
#' list(4L, 0L, 0.1)),
#' schema = c("src", "dst", "weight"))
#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
Member:

space around = as style

if (!is.numeric(k) || k < 1) {
stop("k should be a number with value >= 1.")
}
if (!is.integer(maxIter) || maxIter <= 0) {
Member:

if maxIter should be an integer, should we check that k is also an integer? It's fixed when it is passed, so just a minor consistency point on the value check
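The consistency point is that both parameters get the same strictness. As an illustration only (the actual SparkR checks use is.numeric / is.integer in R, and this helper name is hypothetical), the pattern in Python:

```python
def check_params(k, max_iter):
    """Require both k and max_iter to be positive integers, for consistent validation."""
    if not isinstance(k, int) or k < 1:
        raise ValueError("k should be an integer with value >= 1.")
    if not isinstance(max_iter, int) or max_iter <= 0:
        raise ValueError("maxIter should be an integer with value > 0.")
    return k, max_iter
```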

list(4L, 0L, 0.1)),
schema = c("src", "dst", "weight"))
# assign clusters
clusters <- spark.assignClusters(df, k=2L, maxIter=20L, initMode="degree", weightCol="weight")
Member:

tiny nit: here as well spacing around =.


srowen commented Dec 11, 2018

Oops, my bad. I opened #23292

srowen added a commit that referenced this pull request Dec 12, 2018
## What changes were proposed in this pull request?

Follow up style fixes to PIC in R; see #23072

## How was this patch tested?

Existing tests.

Closes #23292 from srowen/SPARK-19827.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23072 from huaxingao/spark-19827.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019