Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11715][SPARKR] Add R support corr for Column Aggregration #9680

Closed
wants to merge 4 commits into from

Conversation

felixcheung
Copy link
Member

Need to match existing method signature

@SparkQA
Copy link

SparkQA commented Nov 13, 2015

Test build #45789 has finished for PR 9680 at commit 940164c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

cc @NarineK

I think #9366 is a more general form of this ? If so we could probably just wait for #9366 ?

@felixcheung
Copy link
Member Author

I think 9366 is about computing corr or cov matrix whereas this is computing corr between two columns. They seem to be useful in their own ways. Also this is already supported in Scala and Python.

@NarineK
Copy link
Contributor

NarineK commented Nov 13, 2015

in R the general formula for correlation is the following:
cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))
in #9366 is the special case when x = y, but in general x and y are 2 different dataframes with the same number of rows.
I did the pull request specifically for x = y, but the jira is more general.

@felixcheung
Copy link
Member Author

So #9366 is for all columns in DataFrame x=y or different x, y DataFrames
And this #9680 is for 2 columns in one DataFrame.

@sun-rui
Copy link
Contributor

sun-rui commented Nov 16, 2015

these are two different issues.

#' @family math_funcs
#' @export
#' @examples \dontrun{corr(df$c, df$d)}
setMethod("corr", signature(x = "Column", col1 = "Column", col2 = "missing", method = "missing"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this signature looks confusing. Maybe change the generic function definition of "corr" is better:

setGeneric("corr", function(x, ...) {standardGeneric("corr") })

setMethod("corr",
          signature(x = "DataFrame"),
          function(x, col1, col2, method = "pearson") {
          ...
}

setMethod("corr", signature(x = "characterOrColumn"),
                  function(x, col2) {
                  ...
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern is how documentation for these "corr" methods are generated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I like the approach of changing the existing generic definition.

perhaps we should align the method signature with the stats::cor

cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

do you know why we decide to name it corr (vs. cor) in other places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as for doc, the DataFrame corr in stats.R has @rdname statfunctions
this one has @rdname corr

so they go to different HTML page generated by roxygen2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add cor() as alias for corr(), as you did in #9489

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this, I think what's being added in #9366 matches https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html better. When we are adding that support in R we could add it as cor matching stats::cor.

Meanwhile I'll change corr to what you suggested with function(x, ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#9366 only supports inter-column cov and cor of a DataFrame, not between columns of two DataFrames. I think actually it is better to add alias in this PR. corr() operating on two columns is similar to R cor() on two vectors.

@felixcheung
Copy link
Member Author

@sun-rui I updated it. I think it's a bit not as strongly typed as I'd like but if I add col2 = "Column" to signature I get this error:

Error in match.call(definition, call, expand.dots, envir) :
  unused argument (col2 = c("Column", ""))

@SparkQA
Copy link

SparkQA commented Nov 17, 2015

Test build #46055 has finished for PR 9680 at commit c5f5524.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member Author

any more comment?

#' @family math_funcs
#' @export
#' @examples \dontrun{corr(df$c, df$d)}
setMethod("corr", signature(x = "Column"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two versions of corr():

def corr(column1: Column, column2: Column): Column
def corr(columnName1: String, columnName2: String): Column

We'd better support both. Something like:

setMethod("corr", signature(x = "characterOrColumn"),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the same for count, max, mean and so on, so change we would need to change every function here - should we do that?

@felixcheung
Copy link
Member Author

@shivaram Can we go ahead with this? I think we could consider adding all character overload for DataFrame functions in a different JIRA.

@sun-rui
Copy link
Contributor

sun-rui commented Nov 19, 2015

adding all character overload for DataFrame functions in a different JIRA is OK.

But for alias of corr(), #9366 only supports inter-column cov and cor of a DataFrame, not between columns of two DataFrames. I think actually it is better to add alias in this PR. corr() operating on two columns is similar to R cor() on two vectors.

@felixcheung
Copy link
Member Author

As per this https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cor.html
It support "vector, matrix or data frame", and it doesn't say a subset of a data frame.
I'm not sure a column in a data frame is like a vector?
Instead, I think #9366 should support 2 DataFrames. And we should come up with a signature similar to stats::cor

@felixcheung
Copy link
Member Author

Also, since here we are working with 2 columns, by adding a alias cor we will need to create a generic with a different signature that again masks stats::cor, like what's happening with stats::cov, and that seems problematic.

@felixcheung
Copy link
Member Author

@sun-rui ? I"m fine with adding cor - are we ok with masking stats::cor?

@sun-rui
Copy link
Contributor

sun-rui commented Nov 30, 2015

@felixcheung, sorry for late response.

Since there is no agreement now, I am fine that we don't add "cor" alias in this PR. Let's get this PR merged.

Could you submit a new JIRA addressing the issue of adding alias of "cor" and also the issue of existing "cov" which masks stats::cov?

@sun-rui
Copy link
Contributor

sun-rui commented Nov 30, 2015

LGTM

@felixcheung
Copy link
Member Author

stats::cov name conflict: https://issues.apache.org/jira/browse/SPARK-11886
will open a new JIRA on cor alias once this is merged

@felixcheung
Copy link
Member Author

thanks, rebased.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46911 has finished for PR 9680 at commit b8090f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member Author

second is a git error (seems like having a lot these days?)
hudson.plugins.git.GitException: Failed to fetch from https://github.com/apache/spark.git at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:763)

@shivaram
Copy link
Contributor

shivaram commented Dec 1, 2015

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46922 has finished for PR 9680 at commit 659f805.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Copy link
Contributor

shivaram commented Dec 5, 2015

LGTM. @felixcheung I think the current resolution of not adding cor until we can support it is fine by me. Its better to not mask existing functions unless we can match existing functionality.

Merging this to master, branch-1.6

asfgit pushed a commit that referenced this pull request Dec 6, 2015
Need to match existing method signature

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9680 from felixcheung/rcorr.

(cherry picked from commit 895b6c4)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
@asfgit asfgit closed this in 895b6c4 Dec 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants