New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-7240][SQL] Single pass covariance calculation for dataframes #5825
Conversation
make progress make progress2 finished cov implementation. Waiting for freqItems to be merged trying to debug added cov
Test build #31514 has finished for PR 5825 at commit
|
Test build #31515 has finished for PR 5825 at commit
|
retest this please |
|
||
:param col1: The name of the first column | ||
:param col2: The name of the second column | ||
:return: the covariance of the columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does :return:
work in python doc? Can you try compiling it?
Also say something about the return type, maybe "covariance of the two columns as a double value"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not. IntelliJ generated them automatically... Thought I would give it a shot.
Can you also update |
Test build #31521 has finished for PR 5825 at commit
|
var xAvg = 0.0 | ||
var yAvg = 0.0 | ||
var Ck = 0.0 | ||
var count = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be Long
.
The math part looks good to me except two inline comments. |
Test build #31516 has finished for PR 5825 at commit
|
raise ValueError("col2 should be a string.") | ||
return self.df._jdf.stat().cov(col1, col2) | ||
|
||
cov.__doc__ = DataFrame.cov.__doc__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you do the doc assignment here, then you should move the implementation to dataframe also ... or just put the doc assignment there. don't put the doc in one place and implementation in another
Test build #31533 has finished for PR 5825 at commit
|
Test build #31536 has finished for PR 5825 at commit
|
this | ||
} | ||
// return the covariance for the observed examples | ||
def cov: Double = Ck / count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just remember that this should be sample covariance (dividing by (n-1) and 0 if n is 1). Try R and make sure we output the same, and then update the doc. See:
Test build #31559 has finished for PR 5825 at commit
|
Test build #31569 has finished for PR 5825 at commit
|
@@ -25,10 +25,11 @@ import org.apache.spark.sql.test.TestSQLContext.implicits._ | |||
|
|||
class DataFrameStatSuite extends FunSuite { | |||
|
|||
import TestData._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we prefer having data defined inline now since it is easier to read. those were there historically because it was hard to create schemardds.
LGTM. You can fix the test case later. Thanks. |
Added the calculation of covariance between two columns to DataFrames. cc mengxr rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#5825 from brkyvz/df-cov and squashes the following commits: cb18046 [Burak Yavuz] changed to sample covariance f2e862b [Burak Yavuz] fixed failed test 51e39b8 [Burak Yavuz] moved implementation 0c6a759 [Burak Yavuz] addressed math comments 8456eca [Burak Yavuz] fix pyStyle3 aa2ad29 [Burak Yavuz] fix pyStyle2 4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov e3b0b85 [Burak Yavuz] addressed comments v0.1 a7115f1 [Burak Yavuz] fix python style 7dc6dbc [Burak Yavuz] reorder imports 408cb77 [Burak Yavuz] initial commit
Added the calculation of covariance between two columns to DataFrames. cc mengxr rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#5825 from brkyvz/df-cov and squashes the following commits: cb18046 [Burak Yavuz] changed to sample covariance f2e862b [Burak Yavuz] fixed failed test 51e39b8 [Burak Yavuz] moved implementation 0c6a759 [Burak Yavuz] addressed math comments 8456eca [Burak Yavuz] fix pyStyle3 aa2ad29 [Burak Yavuz] fix pyStyle2 4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov e3b0b85 [Burak Yavuz] addressed comments v0.1 a7115f1 [Burak Yavuz] fix python style 7dc6dbc [Burak Yavuz] reorder imports 408cb77 [Burak Yavuz] initial commit
Added the calculation of covariance between two columns to DataFrames. cc mengxr rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#5825 from brkyvz/df-cov and squashes the following commits: cb18046 [Burak Yavuz] changed to sample covariance f2e862b [Burak Yavuz] fixed failed test 51e39b8 [Burak Yavuz] moved implementation 0c6a759 [Burak Yavuz] addressed math comments 8456eca [Burak Yavuz] fix pyStyle3 aa2ad29 [Burak Yavuz] fix pyStyle2 4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov e3b0b85 [Burak Yavuz] addressed comments v0.1 a7115f1 [Burak Yavuz] fix python style 7dc6dbc [Burak Yavuz] reorder imports 408cb77 [Burak Yavuz] initial commit
Added the calculation of covariance between two columns to DataFrames.
cc @mengxr @rxin