New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-2786][mllib] Python correlations #1713
Conversation
QA tests have started for PR 1713. This patch merges cleanly. |
QA results for PR 1713: |
QA tests have started for PR 1713. This patch merges cleanly. |
* After new correlation algorithms are added, please update the documentation here and in | ||
* Statistics.scala for the correlation APIs. | ||
*/ | ||
object CorrelationNames { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it supposed to be a public API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally planned on using it inside of pyspark but private[mllib]
is sufficient scope now.
QA results for PR 1713: |
* Returns the correlation matrix serialized into a byte array understood by deserializers in | ||
* pyspark. | ||
*/ | ||
def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can ignore this if you like, but if I'm being picky, why not spell out "correlations"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was designed to match R's method name. R was a good candidate here since it also allows you to pass in the method name as a string after the input arrays/matrix.
QA tests have started for PR 1713. This patch merges cleanly. |
QA results for PR 1713: |
* pyspark. | ||
*/ | ||
def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = { | ||
val inputMatrix = X.rdd.map(deserializeDoubleVector(_)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: X.rdd.map(deserializeDoubleVector)
((_)
is not necessary)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it is in this case, since deserializeDoubleVector
has 2 arguments (with offset
being optional).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
QA tests have started for PR 1713. This patch merges cleanly. |
QA results for PR 1713: |
LGTM. Merged into master. Thanks!! |
Author: Doris Xin <doris.s.xin@gmail.com> Closes apache#1713 from dorx/pythonCorrelation and squashes the following commits: 5f1e60c [Doris Xin] reviewer comments. 46ff6eb [Doris Xin] reviewer comments. ad44085 [Doris Xin] style fix e69d446 [Doris Xin] fixed missed conflicts. eb5bf56 [Doris Xin] merge master cc9f725 [Doris Xin] units passed. 9141a63 [Doris Xin] WIP2 d199f1f [Doris Xin] Moved correlation names into a public object cd163d6 [Doris Xin] WIP
No description provided.