
[SPARK-11057] [SQL] Add correlation and covariance matrices #9366

Closed
wants to merge 1 commit

Conversation

@NarineK (Contributor) commented Oct 30, 2015

Hi there,

As we know, R can compute the correlation and covariance for all the columns of a data frame, or between the columns of two data frames.

The Apache Commons Math package offers this as well:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29

In case we have only one DataFrame as input:

for correlation:
cor[i,j] = cor[j,i]
and the main diagonal is all 1s;

for covariance:
cov[i,j] = cov[j,i]
and each main-diagonal entry is the variance of the corresponding column. See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
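Under those rules only the lower triangle needs to be computed and then mirrored. A minimal, self-contained sketch in plain Scala (no Spark; the object and helper names are illustrative, not part of any proposed API):

```scala
object CorrMatrixSketch {
  private def mean(xs: Array[Double]): Double = xs.sum / xs.length

  // sample covariance of two equal-length columns
  def cov(x: Array[Double], y: Array[Double]): Double = {
    val (mx, my) = (mean(x), mean(y))
    x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum / (x.length - 1)
  }

  def corr(x: Array[Double], y: Array[Double]): Double =
    cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

  // columns(k) is the k-th numeric column; each pair is computed once
  // (j <= i) and mirrored, with 1.0 on the main diagonal.
  def corrMatrix(columns: Array[Array[Double]]): Array[Array[Double]] = {
    val n = columns.length
    val m = Array.ofDim[Double](n, n)
    for (i <- 0 until n) {
      m(i)(i) = 1.0                     // main diagonal: cor[i,i] = 1
      for (j <- 0 until i) {
        val c = corr(columns(i), columns(j))
        m(i)(j) = c                     // lower triangle
        m(j)(i) = c                     // mirror: cor[i,j] = cor[j,i]
      }
    }
    m
  }
}
```

A covariance matrix would be filled the same way, except the diagonal holds cov(columns(i), columns(i)), i.e. the variance of that column.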

Thanks,
Narine

@NarineK (Contributor, Author) commented Oct 30, 2015

@shivaram, @rxin, could you please take a look at this?
Thanks!

@shivaram (Contributor):

cc @mengxr

@SparkQA commented Oct 30, 2015

Test build #44651 has finished for PR 9366 at commit 74bdf54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NarineK (Contributor, Author) commented Nov 5, 2015

Hi, would you share your thoughts on this?
Thanks!

@NarineK (Contributor, Author) commented Nov 9, 2015

In general, I think there are currently some issues in StatFunctions.scala:

All the computations for both covariance and correlation are done in one place, which makes the code a little confusing and harder to extend in the future.

collectStatisticalData is called for both correlation and covariance, so even if I call something like:
df.stat.corr("numeric_colname", "string_colname")
I get an error like:
java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.

Here is an example: these two variables are computed every time we compute the covariance, yet they are used only for correlation:
var MkX = 0.0 // sum of squares of differences from the (current) mean for col1
var MkY = 0.0 // sum of squares of differences from the (current) mean for col2

I think we can actually separate the computations. Is there a reason why they are done in one place? @rxin, @mengxr
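For illustration, a single-pass (Welford-style) aggregator for one column pair can be sketched in plain Scala as follows. The MkX/MkY names mirror the snippet above; the Ck co-moment field and the class itself are an illustrative reconstruction, not the actual StatFunctions code. It shows the point being made: covariance needs only count and Ck, while MkX and MkY are needed only to normalize the result into a correlation.

```scala
// One-pass accumulator for a pair of numeric columns.
class CovCorrAgg {
  private var count = 0L
  private var xAvg, yAvg = 0.0
  private var Ck  = 0.0  // co-moment: running sum of (x - xAvg)*(y - yAvg)
  private var MkX = 0.0  // sum of squared deviations of x (correlation only)
  private var MkY = 0.0  // sum of squared deviations of y (correlation only)

  def add(x: Double, y: Double): Unit = {
    val deltaX = x - xAvg
    val deltaY = y - yAvg
    count += 1
    xAvg += deltaX / count
    yAvg += deltaY / count
    Ck  += deltaX * (y - yAvg)   // uses the updated yAvg
    MkX += deltaX * (x - xAvg)   // uses the updated xAvg
    MkY += deltaY * (y - yAvg)
  }

  def cov: Double  = Ck / (count - 1)          // MkX/MkY never used
  def corr: Double = Ck / math.sqrt(MkX * MkY) // needs all three moments
}
```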

// fills the covariance matrix by computing column-by-column covariances
for (i <- 0 to fieldNames.length - 1) {
  for (j <- 0 to i) {
    val cov = calculateCov(df, Seq(fieldNames(i), fieldNames(j)))
Inline review comment (Contributor, on the snippet above):

You can't assume all columns are of numeric type. Catch the exception here and use null as the value if an exception happens?

@NarineK (Contributor, Author) commented Nov 16, 2015

Hi @sun-rui,
Thank you for your comment. In general, I think it might be better to verify all the column types up front and make sure we are dealing with numeric fields. If any field isn't numeric, we can show an error message similar to R's:
cor(iris)
Error in cor(iris) : 'x' must be numeric
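A minimal sketch of such an up-front check, in plain Scala with the schema modeled as a simple name-to-type-name map (the object and method names, and the type-name strings, are hypothetical, not Spark API):

```scala
object NumericCheck {
  private val numericTypes = Set("int", "bigint", "float", "double")

  // Fail fast, R-style, if any requested column is non-numeric.
  def requireNumeric(schema: Map[String, String], cols: Seq[String]): Unit = {
    val bad = cols.filterNot(c => numericTypes.contains(schema(c)))
    if (bad.nonEmpty)
      throw new IllegalArgumentException(
        s"'x' must be numeric; non-numeric columns: ${bad.mkString(", ")}")
  }
}
```

In Spark itself the same check could inspect `df.schema` for fields whose `dataType` is not numeric before any computation starts.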

@NarineK (Contributor, Author) commented Nov 16, 2015

What do you think?

@sun-rui (Contributor) commented Nov 17, 2015

Yes, since R throws an error in this case, we can leave the exception unhandled; there is no need to verify all the column types. The user will get the exception message at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L81

@NarineK (Contributor, Author) commented Nov 17, 2015

Yes, there is even a test case covering that.

@NarineK (Contributor, Author) commented Nov 17, 2015

Can someone from the Spark SQL committers or experts also take a look at this?

@SparkQA commented Mar 16, 2016

Test build #53308 has finished for PR 9366 at commit 74bdf54.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 18, 2016

Test build #56142 has finished for PR 9366 at commit 74bdf54.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@shivaram (Contributor):

cc @mengxr

@sjjpo2002 commented Apr 22, 2016

I have been trying to use correlation on a matrix with many columns. @NarineK mentioned R-like correlation. I wish we had something like what pandas offers: it handles missing data automatically. Even the corr() function from MLlib cannot handle missing data. These features are really missing from Spark SQL:

  • Apply correlation on all columns and return a matrix
  • Handle missing data automatically like how pandas does
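The pandas behavior asked for in the second point is "pairwise complete" handling: for each pair of columns, rows where either value is missing are dropped before the statistic is computed. A minimal plain-Scala sketch of that filtering step, with missing values modeled as Option (the object name is illustrative):

```scala
object PairwiseComplete {
  // Keep only the rows where both columns have a value, then split the
  // surviving pairs back into two aligned arrays ready for corr/cov.
  def complete(x: Seq[Option[Double]],
               y: Seq[Option[Double]]): (Array[Double], Array[Double]) = {
    val pairs = x.zip(y).collect { case (Some(a), Some(b)) => (a, b) }
    (pairs.map(_._1).toArray, pairs.map(_._2).toArray)
  }
}
```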

@gatorsmile (Member):
@NarineK Are you still working on this? cc @yanboliang

@gatorsmile (Member):

We are closing this due to inactivity. Please reopen it if you want to push it forward. Thanks!

@asfgit closed this in b32bd00 on Jun 27, 2017