Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7240][SQL] Single pass covariance calculation for dataframes #5825

Closed
wants to merge 11 commits into from

Conversation

brkyvz
Copy link
Contributor

@brkyvz brkyvz commented May 1, 2015

Added the calculation of covariance between two columns to DataFrames.

cc @mengxr @rxin

make progress

make progress2

finished cov implementation. Waiting for freqItems to be merged

trying to debug

added cov
@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31514 has finished for PR 5825 at commit 7dc6dbc.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31515 has finished for PR 5825 at commit a7115f1.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):
  • This patch does not change any dependencies.

@brkyvz
Copy link
Contributor Author

brkyvz commented May 1, 2015

retest this please


:param col1: The name of the first column
:param col2: The name of the second column
:return: the covariance of the columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does :return: work in python doc? Can you try compiling it?

Also say something about the return type, maybe "covariance of the two columns as a double value"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not. IntelliJ generated them automatically... Thought I would give it a shot.

@rxin
Copy link
Contributor

rxin commented May 1, 2015

Can you also update __init__.py in sql to add them to the python doc?

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31521 has finished for PR 5825 at commit 4e97a50.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

var xAvg = 0.0
var yAvg = 0.0
var Ck = 0.0
var count = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be Long.

@mengxr
Copy link
Contributor

mengxr commented May 1, 2015

The math part looks good to me except two inline comments.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31516 has finished for PR 5825 at commit a7115f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):
  • This patch does not change any dependencies.

raise ValueError("col2 should be a string.")
return self.df._jdf.stat().cov(col1, col2)

cov.__doc__ = DataFrame.cov.__doc__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you do the doc assignment here, then you should move the implementation to dataframe also ... or just put the doc assignment there. don't put the doc in one place and implementation in another

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31533 has finished for PR 5825 at commit 0c6a759.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31536 has finished for PR 5825 at commit 51e39b8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):

this
}
// return the covariance for the observed examples
def cov: Double = Ck / count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remember that this should be sample covariance (dividing by (n-1) and 0 if n is 1). Try R and make sure we output the same, and then update the doc. See:

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L144

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31559 has finished for PR 5825 at commit f2e862b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31569 has finished for PR 5825 at commit cb18046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):

@@ -25,10 +25,11 @@ import org.apache.spark.sql.test.TestSQLContext.implicits._

class DataFrameStatSuite extends FunSuite {

import TestData._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we prefer having data defined inline now since it is easier to read. those were there historically because it was hard to create schemardds.

@rxin
Copy link
Contributor

rxin commented May 1, 2015

LGTM. You can fix the test case later. Thanks.

@asfgit asfgit closed this in 4dc8d74 May 1, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
Added the calculation of covariance between two columns to DataFrames.

cc mengxr rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#5825 from brkyvz/df-cov and squashes the following commits:

cb18046 [Burak Yavuz] changed to sample covariance
f2e862b [Burak Yavuz] fixed failed test
51e39b8 [Burak Yavuz] moved implementation
0c6a759 [Burak Yavuz] addressed math comments
8456eca [Burak Yavuz] fix pyStyle3
aa2ad29 [Burak Yavuz] fix pyStyle2
4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov
e3b0b85 [Burak Yavuz] addressed comments v0.1
a7115f1 [Burak Yavuz] fix python style
7dc6dbc [Burak Yavuz] reorder imports
408cb77 [Burak Yavuz] initial commit
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
Added the calculation of covariance between two columns to DataFrames.

cc mengxr rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#5825 from brkyvz/df-cov and squashes the following commits:

cb18046 [Burak Yavuz] changed to sample covariance
f2e862b [Burak Yavuz] fixed failed test
51e39b8 [Burak Yavuz] moved implementation
0c6a759 [Burak Yavuz] addressed math comments
8456eca [Burak Yavuz] fix pyStyle3
aa2ad29 [Burak Yavuz] fix pyStyle2
4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov
e3b0b85 [Burak Yavuz] addressed comments v0.1
a7115f1 [Burak Yavuz] fix python style
7dc6dbc [Burak Yavuz] reorder imports
408cb77 [Burak Yavuz] initial commit
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
Added the calculation of covariance between two columns to DataFrames.

cc mengxr rxin

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#5825 from brkyvz/df-cov and squashes the following commits:

cb18046 [Burak Yavuz] changed to sample covariance
f2e862b [Burak Yavuz] fixed failed test
51e39b8 [Burak Yavuz] moved implementation
0c6a759 [Burak Yavuz] addressed math comments
8456eca [Burak Yavuz] fix pyStyle3
aa2ad29 [Burak Yavuz] fix pyStyle2
4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov
e3b0b85 [Burak Yavuz] addressed comments v0.1
a7115f1 [Burak Yavuz] fix python style
7dc6dbc [Burak Yavuz] reorder imports
408cb77 [Burak Yavuz] initial commit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants