[SPARK-9298][SQL] Add pearson correlation aggregation function #8587
Test build #41976 has finished for PR 8587 at commit
|
Is it possible to do this with AlgebraicAggregate so it can be codegened? |
Seems it is feasible. I will update this later. |
Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
@rxin I've updated this patch to use |
Test build #42076 has finished for PR 8587 at commit
|
@viirya We compared the performance of declarative approach vs imperative in https://issues.apache.org/jira/browse/SPARK-10953. The declarative approach is much slower because of extra code and lack of common sub-expression elimination. If you still keep the original code, could you revert this PR to your first commit? Also cc @rxin for this performance issue. |
@mengxr I have the original code. I will revert to it as soon as I come back in the next two days. |
Test build #44103 has finished for PR 8587 at commit
|
Test build #44106 has finished for PR 8587 at commit
|
Test build #44108 has finished for PR 8587 at commit
|
retest this please. |
Test build #44154 has finished for PR 8587 at commit
|
ping @mengxr |
ping @mengxr any other comments? |
@@ -524,6 +525,116 @@ case class Sum(child: Expression) extends DeclarativeAggregate { | |||
override val evaluateExpression = Cast(currentSum, resultType) | |||
} | |||
|
|||
case class Corr( |
- Please add ScalaDoc and document the behavior for null and NaN values.
- Provide a link to the Wikipedia page that contains the update formula.
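For readers unfamiliar with the update formula this comment asks about, here is a minimal sketch of a one-pass (online) Pearson correlation update based on running means and co-moments. It is not the code from this PR: `CorrState`, `CorrSketch`, and all field names are hypothetical, and null handling is deliberately left out. In this sketch a NaN input propagates to a NaN result, and an empty input yields NaN because the result is 0.0 / 0.0.

```scala
// Hypothetical running state for a one-pass Pearson correlation.
// Illustration only; not the implementation merged in this PR.
final case class CorrState(
    n: Long = 0L,
    xAvg: Double = 0.0, // running mean of x
    yAvg: Double = 0.0, // running mean of y
    ck: Double = 0.0,   // running sum of (x - xAvg) * (y - yAvg)
    xMk: Double = 0.0,  // running sum of (x - xAvg)^2
    yMk: Double = 0.0   // running sum of (y - yAvg)^2
) {
  // Fold one (x, y) pair into the state.
  def update(x: Double, y: Double): CorrState = {
    val n1 = n + 1
    val dx = x - xAvg
    val dy = y - yAvg
    val newXAvg = xAvg + dx / n1
    val newYAvg = yAvg + dy / n1
    CorrState(
      n1,
      newXAvg,
      newYAvg,
      ck + dx * (y - newYAvg),
      xMk + dx * (x - newXAvg),
      yMk + dy * (y - newYAvg)
    )
  }

  // Pearson correlation; NaN when undefined (no rows or zero variance).
  def result: Double = ck / math.sqrt(xMk * yMk)
}

object CorrSketch {
  // Convenience wrapper: fold a sequence of pairs through the state.
  def corr(pairs: Seq[(Double, Double)]): Double =
    pairs.foldLeft(CorrState()) { case (s, (x, y)) => s.update(x, y) }.result
}
```

Calling `CorrSketch.corr(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)))` folds three perfectly correlated pairs through the state. A real aggregate would also need a merge step for combining partial states across partitions, which this sketch omits.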
Test build #44418 has finished for PR 8587 at commit
|
Can you add this function to FunctionRegistry? Otherwise, we cannot use it in SQL. |
override def nullable: Boolean = false | ||
override def dataType: DoubleType.type = DoubleType | ||
override def toString: String = s"CORRELATION($left, $right)" | ||
} |
What will be the error message if we call this function when spark.sql.useAggregate2=false? It would be good to provide a meaningful error message.
val df3 = Seq.tabulate(0)(i => (1.0 * i, 2.0 * i)).toDF("a", "b") | ||
val corr4 = df3.groupBy().agg(corr("a", "b")).collect()(0).getDouble(0) | ||
assert(corr4.isNaN) | ||
} |
What will happen if the data types of the input parameters are not double?
I will add ImplicitCastInputTypes to case class Corr, so that the other NumericType inputs are automatically cast to double.
Test build #44616 has finished for PR 8587 at commit
|
* | ||
*/ | ||
case class Corr(left: Expression, right: Expression) | ||
extends BinaryExpression with AggregateExpression with ImplicitCastInputTypes { |
This is just a placeholder, right? Can we change it to "with AggregateExpression1" and then throw an exception (UnsupportedOperatorException) in the newInstance method?
But how do we check spark.sql.useAggregate2=false in this expression? Catalyst expressions seem to be independent of SQLConf. In the newInstance method, we can't refer to a conf object.
Sorry. I think I know what you meant.
Test build #44645 has finished for PR 8587 at commit
|
Test build #44659 has finished for PR 8587 at commit
|
Test build #44671 has finished for PR 8587 at commit
|
Failure caused by: 0.6633880657639323 != 0.6633880657639322 |
@yhuai What do you think? Should we modify |
If that is the answer generated by Hive, we should not change that (at least for now). What we can do is this: if we believe our answer is valid (please double check it), we can put that test in the blacklist (with comments on why we do not run it with HiveComparisonTest). Then, we add a query test using queries from that file. |
…lacklist and add these tests to AggregationQuerySuite.
@yhuai OK. I've put the test in the blacklist and also added the corresponding tests to AggregationQuerySuite. |
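The mismatch reported above (0.6633880657639323 vs. 0.6633880657639322) differs only in the last digit, which is typical when two implementations sum floating-point values in a different order. The thread resolves it by blacklisting the Hive comparison and adding tests to AggregationQuerySuite; how those tests compare values is not shown here. As a generic illustration only, a tolerance-based comparison could look like the sketch below, where `ApproxCompare`, `approxEqual`, and the default `relTol` are hypothetical names and values rather than anything from Spark's test utilities.

```scala
object ApproxCompare {
  // Hypothetical helper: true when a and b agree within a relative tolerance.
  // Two NaN values are treated as equal, since an empty input yields NaN here.
  def approxEqual(a: Double, b: Double, relTol: Double = 1e-12): Boolean =
    if (a.isNaN || b.isNaN) a.isNaN && b.isNaN
    else if (a == b) true // also covers equal infinities
    else math.abs(a - b) <= relTol * math.max(math.abs(a), math.abs(b))
}
```

With this helper, `ApproxCompare.approxEqual(0.6633880657639323, 0.6633880657639322)` accepts the two results as equal, while a string-based comparison of query output would still report a difference.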
Test build #44690 has finished for PR 8587 at commit
|
retest this please. |
Oh, that failure is going to be fixed in #9387. |
Test build #44719 has finished for PR 8587 at commit
|
retest this please. |
Test build #44726 has finished for PR 8587 at commit
|
ping @yhuai any more comments? |
Thanks! Merging to master. |
@yhuai Thank you! |
JIRA: https://issues.apache.org/jira/browse/SPARK-9298
This patch adds a Pearson correlation aggregation function based on AggregateExpression2.