
[SPARK-11057] [SQL] Add correlation and covariance matrices #9366

Closed
wants to merge 1 commit

Conversation

@NarineK (Contributor) commented Oct 30, 2015

Hi there,

As we know, R can compute the correlation and covariance for all the columns of a data frame, or between the columns of two data frames.

The Apache Commons Math package offers this as well:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29

In case we have only one DataFrame as input:

for correlation:
cor[i,j] = cor[j,i]
and the main diagonal is all 1s;

for covariance:
cov[i,j] = cov[j,i]
and each main-diagonal entry is the variance of the corresponding column. See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
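Under those rules only the lower triangle needs to be computed and then mirrored. A minimal, self-contained sketch in plain Scala (no Spark; the object and helper names are illustrative, not part of any proposed API):

```scala
object CorrMatrixSketch {
  private def mean(xs: Array[Double]): Double = xs.sum / xs.length

  // sample covariance of two equal-length columns
  def cov(x: Array[Double], y: Array[Double]): Double = {
    val (mx, my) = (mean(x), mean(y))
    x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum / (x.length - 1)
  }

  def corr(x: Array[Double], y: Array[Double]): Double =
    cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

  // columns(k) is the k-th numeric column; each pair is computed once
  // (j <= i) and mirrored, with 1.0 on the main diagonal.
  def corrMatrix(columns: Array[Array[Double]]): Array[Array[Double]] = {
    val n = columns.length
    val m = Array.ofDim[Double](n, n)
    for (i <- 0 until n) {
      m(i)(i) = 1.0                     // main diagonal: cor[i,i] = 1
      for (j <- 0 until i) {
        val c = corr(columns(i), columns(j))
        m(i)(j) = c                     // lower triangle
        m(j)(i) = c                     // mirror: cor[i,j] = cor[j,i]
      }
    }
    m
  }
}
```

A covariance matrix would be filled the same way, except the diagonal holds cov(columns(i), columns(i)), i.e. the variance of that column.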

Thanks,
Narine

@NarineK (Contributor, Author) commented Oct 30, 2015

@shivaram, @rxin, could you please take a look at this?
Thanks!

@shivaram (Contributor):

cc @mengxr

@SparkQA commented Oct 30, 2015

Test build #44651 has finished for PR 9366 at commit 74bdf54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NarineK (Contributor, Author) commented Nov 5, 2015

Hi, would you share your thoughts on this?
Thanks!

@NarineK (Contributor, Author) commented Nov 9, 2015

In general, I think there are currently some issues in StatFunctions.scala:

All the computations for both covariance and correlation are done in one place, which makes the code a little confusing and harder to extend in the future.

collectStatisticalData is called for both correlation and covariance, so even if I call something like:
df.stat.corr("numeric_colname", "string_colname")
I get an error like:
java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported.

Here is an example: these two variables are computed every time we compute the covariance, yet they are used only for correlation:
var MkX = 0.0 // sum of squares of differences from the (current) mean for col1
var MkY = 0.0 // sum of squares of differences from the (current) mean for col2

I think we can actually separate the computations. Is there a reason why they are done in one place? @rxin, @mengxr
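For illustration, a single-pass (Welford-style) aggregator for one column pair can be sketched in plain Scala as follows. The MkX/MkY names mirror the snippet above; the Ck co-moment field and the class itself are an illustrative reconstruction, not the actual StatFunctions code. It shows the point being made: covariance needs only count and Ck, while MkX and MkY are needed only to normalize the result into a correlation.

```scala
// One-pass accumulator for a pair of numeric columns.
class CovCorrAgg {
  private var count = 0L
  private var xAvg, yAvg = 0.0
  private var Ck  = 0.0  // co-moment: running sum of (x - xAvg)*(y - yAvg)
  private var MkX = 0.0  // sum of squared deviations of x (correlation only)
  private var MkY = 0.0  // sum of squared deviations of y (correlation only)

  def add(x: Double, y: Double): Unit = {
    val deltaX = x - xAvg
    val deltaY = y - yAvg
    count += 1
    xAvg += deltaX / count
    yAvg += deltaY / count
    Ck  += deltaX * (y - yAvg)   // uses the updated yAvg
    MkX += deltaX * (x - xAvg)   // uses the updated xAvg
    MkY += deltaY * (y - yAvg)
  }

  def cov: Double  = Ck / (count - 1)          // MkX/MkY never used
  def corr: Double = Ck / math.sqrt(MkX * MkY) // needs all three moments
}
```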

// fills the covariance matrix by computing column-by-column covariances
for (i <- 0 to fieldNames.length - 1) {
  for (j <- 0 to i) {
    val cov = calculateCov(df, Seq(fieldNames(i), fieldNames(j)))
Inline review comment (Contributor, on the snippet above):

You can't assume all columns are of numeric type. Catch the exception here and use null as the value if an exception happens?

@NarineK (Contributor, Author) commented Nov 16, 2015

Hi @sun-rui,
Thank you for your comment. In general, I think it might be better to verify all the column types up front and make sure we are dealing with numeric fields. If any field isn't numeric, we can show an error message similar to R's:
cor(iris)
Error in cor(iris) : 'x' must be numeric
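A minimal sketch of such an up-front check, in plain Scala with the schema modeled as a simple name-to-type-name map (the object and method names, and the type-name strings, are hypothetical, not Spark API):

```scala
object NumericCheck {
  private val numericTypes = Set("int", "bigint", "float", "double")

  // Fail fast, R-style, if any requested column is non-numeric.
  def requireNumeric(schema: Map[String, String], cols: Seq[String]): Unit = {
    val bad = cols.filterNot(c => numericTypes.contains(schema(c)))
    if (bad.nonEmpty)
      throw new IllegalArgumentException(
        s"'x' must be numeric; non-numeric columns: ${bad.mkString(", ")}")
  }
}
```

In Spark itself the same check could inspect `df.schema` for fields whose `dataType` is not numeric before any computation starts.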

@NarineK (Contributor, Author) commented Nov 16, 2015

What do you think?

@sun-rui (Contributor) commented Nov 17, 2015

Yes, since R throws an error in this case, we can leave the exception unhandled; there is no need to verify all the column types. The user will get the exception message at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L81

@NarineK (Contributor, Author) commented Nov 17, 2015

Yes, there is even a test case covering that.

@NarineK (Contributor, Author) commented Nov 17, 2015

Can someone from the Spark SQL committers or experts also take a look at this?

@SparkQA commented Mar 16, 2016

Test build #53308 has finished for PR 9366 at commit 74bdf54.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 18, 2016

Test build #56142 has finished for PR 9366 at commit 74bdf54.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@shivaram (Contributor):

cc @mengxr

@sjjpo2002 commented Apr 22, 2016

I have been trying to use correlation on a matrix with many columns. @NarineK mentioned R-like correlation. I wish we had something like what pandas offers: it handles missing data automatically. Even the corr() function from MLlib cannot handle missing data. These features are really missing from Spark SQL:

  • Apply correlation on all columns and return a matrix
  • Handle missing data automatically like how pandas does
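The pandas behavior asked for in the second point is "pairwise complete" handling: for each pair of columns, rows where either value is missing are dropped before the statistic is computed. A minimal plain-Scala sketch of that filtering step, with missing values modeled as Option (the object name is illustrative):

```scala
object PairwiseComplete {
  // Keep only the rows where both columns have a value, then split the
  // surviving pairs back into two aligned arrays ready for corr/cov.
  def complete(x: Seq[Option[Double]],
               y: Seq[Option[Double]]): (Array[Double], Array[Double]) = {
    val pairs = x.zip(y).collect { case (Some(a), Some(b)) => (a, b) }
    (pairs.map(_._1).toArray, pairs.map(_._2).toArray)
  }
}
```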

@gatorsmile (Member):
@NarineK Are you still working on this? cc @yanboliang

@gatorsmile (Member):

We are closing this due to inactivity. Please reopen it if you want to push it forward. Thanks!

@asfgit closed this in b32bd00 on Jun 27, 2017