Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7241] Pearson correlation for DataFrames #5858

Closed
wants to merge 5 commits into from

Conversation

brkyvz
Copy link
Contributor

@brkyvz brkyvz commented May 2, 2015

submitting this PR from a phone, excuse the brevity.
adds Pearson correlation to Dataframes, reusing the covariance calculation code

cc @mengxr @rxin

}

/**
* Java Friendly implementation to calculate the Pearson correlation coefficient of two columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's not java friendly?

@SparkQA
Copy link

SparkQA commented May 2, 2015

Test build #31646 has finished for PR 5858 at commit a682d06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @return The Pearson Correlation Coefficient as a Double.
*/
def corr(col1: String, col2: String, method: String): Double = {
assert(method == "pearson", "Currently only the calculation of the Pearson Correlation " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.

assert can be turned off ... and assert should only be used to check internal invariants.

@rxin
Copy link
Contributor

rxin commented May 2, 2015

I will let @mengxr comment on the math part.

@SparkQA
Copy link

SparkQA commented May 2, 2015

Test build #31652 has finished for PR 5858 at commit 4fe693b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 2, 2015

Test build #31662 has finished for PR 5858 at commit d10babb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/** Calculate the Pearson Correlation Coefficient for the given columns */
private[sql] def pearsonCorrelation(df: DataFrame, cols: Seq[String]): Double = {
val counts = collectStatisticalData(df, cols)
counts.Ck / math.sqrt(counts.MkX * counts.MkY)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be the sample correlation as well. In the unit tests, please provide R commands that compute the correlation and the result, and verify that we output the same value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, isn't it? The n - 1's cancel. I tested with sciPy Pearsonr method.
That's why I have the non trivial (i, sqrt(i)) test.
On May 2, 2015 10:32 AM, "Xiangrui Meng" notifications@github.com wrote:

In
sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
#5858 (comment):

@@ -23,29 +23,43 @@ import org.apache.spark.sql.types.{DoubleType, NumericType}

private[sql] object StatFunctions {

  • /** Calculate the Pearson Correlation Coefficient for the given columns */
  • private[sql] def pearsonCorrelation(df: DataFrame, cols: Seq[String]): Double = {
  • val counts = collectStatisticalData(df, cols)
  • counts.Ck / math.sqrt(counts.MkX * counts.MkY)

Should be the sample correlation as well. In the unit tests, please
provide R commands that compute the correlation and the result, and verify
that we output the same value.


Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/5858/files#r29549089.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, they canceled each other. Could you add a non-trivial test to Scala? Now it only has 0.0, 1.0, and -1.0. In your test, please provide Python/R commands in comments to reproduce the result. This makes easier for others to verify.

@SparkQA
Copy link

SparkQA commented May 3, 2015

Test build #31698 has finished for PR 5858 at commit 285b838.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// > a <- 0:19
// > b <- mapply(function(x) x * x - 2 * x + 3.5, a)
// > cor(a, b)
// [1] 0.957233913947585835
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this test!

@mengxr
Copy link
Contributor

mengxr commented May 4, 2015

LGTM.

@rxin
Copy link
Contributor

rxin commented May 4, 2015

Thanks. Merging in master.

@asfgit asfgit closed this in 9646018 May 4, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
submitting this PR from a phone, excuse the brevity.
adds Pearson correlation to Dataframes, reusing the covariance calculation code

cc mengxr rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#5858 from brkyvz/df-corr and squashes the following commits:

285b838 [Burak Yavuz] addressed comments v2.0
d10babb [Burak Yavuz] addressed comments v0.2
4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
4fe693b [Burak Yavuz] addressed comments v0.1
a682d06 [Burak Yavuz] ready for PR
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
submitting this PR from a phone, excuse the brevity.
adds Pearson correlation to Dataframes, reusing the covariance calculation code

cc mengxr rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#5858 from brkyvz/df-corr and squashes the following commits:

285b838 [Burak Yavuz] addressed comments v2.0
d10babb [Burak Yavuz] addressed comments v0.2
4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
4fe693b [Burak Yavuz] addressed comments v0.1
a682d06 [Burak Yavuz] ready for PR
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
submitting this PR from a phone, excuse the brevity.
adds Pearson correlation to Dataframes, reusing the covariance calculation code

cc mengxr rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#5858 from brkyvz/df-corr and squashes the following commits:

285b838 [Burak Yavuz] addressed comments v2.0
d10babb [Burak Yavuz] addressed comments v0.2
4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
4fe693b [Burak Yavuz] addressed comments v0.1
a682d06 [Burak Yavuz] ready for PR
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants