[SPARK-9298][SQL] Add pearson correlation aggregation function #8587

viirya · 2015-09-03T16:50:56Z

JIRA: https://issues.apache.org/jira/browse/SPARK-9298

This patch adds a Pearson correlation aggregation function based on AggregateExpression2.

SparkQA · 2015-09-03T19:08:48Z

Test build #41976 has finished for PR 8587 at commit cb34a95.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(left: Expression, right: Expression) extends AggregateFunction2
- case class Corr(

rxin · 2015-09-05T09:21:04Z

Is it possible to do this with AlgebraicAggregate so it can be codegened?

viirya · 2015-09-05T12:44:30Z

Seems it is feasible. I will update this later.

viirya · 2015-09-06T14:59:21Z

@rxin I've updated this patch to use AlgebraicAggregate as you suggested. Please take a look when you are available. Thanks.

SparkQA · 2015-09-06T20:44:43Z

Test build #42076 has finished for PR 8587 at commit 0dd6320.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(left: Expression, right: Expression) extends AlgebraicAggregate
- case class Corr(

mengxr · 2015-10-20T18:07:19Z

@viirya We compared the performance of declarative approach vs imperative in https://issues.apache.org/jira/browse/SPARK-10953. The declarative approach is much slower because of extra code and lack of common sub-expression elimination. If you still keep the original code, could you revert this PR to your first commit? Also cc @rxin for this performance issue.

viirya · 2015-10-21T01:34:45Z

@mengxr I have the original code. I will revert to it as soon as I come back in the next two days.

SparkQA · 2015-10-22T00:03:34Z

Test build #44103 has finished for PR 8587 at commit 1505cd2.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(

SparkQA · 2015-10-22T00:22:38Z

Test build #44106 has finished for PR 8587 at commit d10afbe.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class First(child: Expression, ignoreNullsExpr: Expression) extends DeclarativeAggregate
- case class Last(child: Expression, ignoreNullsExpr: Expression) extends DeclarativeAggregate
- case class Corr(
- case class First(
- case class FirstFunction(
- case class Last(
- case class LastFunction(
- case class Corr(

SparkQA · 2015-10-22T01:22:36Z

Test build #44108 has finished for PR 8587 at commit cc1657b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(

viirya · 2015-10-22T15:33:16Z

retest this please.

SparkQA · 2015-10-22T17:28:02Z

Test build #44154 has finished for PR 8587 at commit cc1657b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClassEncoder[T](
- case class First(child: Expression, ignoreNullsExpr: Expression) extends DeclarativeAggregate
- case class Last(child: Expression, ignoreNullsExpr: Expression) extends DeclarativeAggregate
- case class Corr(
- case class First(
- case class FirstFunction(
- case class Last(
- case class LastFunction(
- case class Corr(
- case class CreateRow(children: Seq[Expression]) extends Expression

viirya · 2015-10-22T18:05:37Z

ping @mengxr

viirya · 2015-10-26T13:28:44Z

ping @mengxr any other comments?

mengxr · 2015-10-27T07:24:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala

@@ -524,6 +525,116 @@ case class Sum(child: Expression) extends DeclarativeAggregate {
  override val evaluateExpression = Cast(currentSum, resultType)
 }

+case class Corr(


Please add ScalaDoc and document the behavior for null and NaN values.

Provide a link to the Wikipedia page that contains the update formula.
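For readers following the thread, the kind of single-pass update formula being referred to can be sketched as follows. This is a minimal sketch under assumptions, not the code in this patch: `CorrState` and all of its field names are hypothetical, and returning NaN for an empty input mirrors the behavior the PR later adopts ("Return NaN when count is zero").

```scala
// Hypothetical sketch, not the implementation merged in this PR.
// Running state: count, means of x and y, co-moment ck,
// and the sums of squared deviations mkX and mkY.
final case class CorrState(
    n: Long = 0L,
    xAvg: Double = 0.0,
    yAvg: Double = 0.0,
    ck: Double = 0.0,
    mkX: Double = 0.0,
    mkY: Double = 0.0) {

  // Fold one (x, y) pair into the state.
  def update(x: Double, y: Double): CorrState = {
    val n1 = n + 1
    val dx = x - xAvg            // deviation from the old mean of x
    val dy = y - yAvg            // deviation from the old mean of y
    val newXAvg = xAvg + dx / n1
    val newYAvg = yAvg + dy / n1
    CorrState(
      n1, newXAvg, newYAvg,
      ck + dx * (y - newYAvg),   // old-mean deviation times new-mean deviation
      mkX + dx * (x - newXAvg),
      mkY + dy * (y - newYAvg))
  }

  // Combine two partial states, e.g. from two partitions.
  def merge(o: CorrState): CorrState = {
    if (o.n == 0) this
    else if (n == 0) o
    else {
      val total = n + o.n
      val dx = o.xAvg - xAvg
      val dy = o.yAvg - yAvg
      val w = n.toDouble * o.n / total
      CorrState(
        total,
        xAvg + dx * o.n / total,
        yAvg + dy * o.n / total,
        ck + o.ck + dx * dy * w,
        mkX + o.mkX + dx * dx * w,
        mkY + o.mkY + dy * dy * w)
    }
  }

  // Pearson correlation; NaN when no rows were seen.
  def result: Double =
    if (n == 0) Double.NaN else ck / math.sqrt(mkX * mkY)
}
```

Folding rows with `update` and combining partial states with `merge` should agree up to floating-point rounding, which is why results may differ in the last digit depending on how the data is partitioned.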

mengxr · 2015-10-27T07:28:37Z

LGTM except inline comments. Ping @yhuai @rxin for another pass on SQL side.

SparkQA · 2015-10-27T11:28:07Z

Test build #44418 has finished for PR 8587 at commit 02562f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(

yhuai · 2015-10-27T18:06:55Z

Can you add this function to FunctionRegistry? Otherwise, we cannot use it in SQL.

yhuai · 2015-10-27T18:09:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

+  override def nullable: Boolean = false
+  override def dataType: DoubleType.type = DoubleType
+  override def toString: String = s"CORRELATION($left, $right)"
+}


What will be the error message if we call this function when spark.sql.useAggregate2=false? It would be good to provide a meaningful error message.

yhuai · 2015-10-27T18:11:07Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala

+    val df3 = Seq.tabulate(0)(i => (1.0 * i, 2.0 * i)).toDF("a", "b")
+    val corr4 = df3.groupBy().agg(corr("a", "b")).collect()(0).getDouble(0)
+    assert(corr4.isNaN)
+  }


What will happen if the data types of the input parameters are not double?

viirya · 2015-10-29

I will add ImplicitCastInputTypes to case class Corr, so the other NumericTypes can be automatically cast to double.

SparkQA · 2015-10-29T19:44:42Z

Test build #44616 has finished for PR 8587 at commit 2f7b864.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(left: Expression, right: Expression)

yhuai · 2015-10-29T19:47:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

+ *
+ */
+case class Corr(left: Expression, right: Expression)
+    extends BinaryExpression with AggregateExpression with ImplicitCastInputTypes {


This is just a placeholder, right? Can we change it to extend AggregateExpression1 and then throw an exception (UnsupportedOperatorException) in the newInstance method?

viirya · 2015-10-29

~~But how do we check spark.sql.useAggregate2=false in this expression? Catalyst expressions seem to be independent of SQLConf. In the newInstance method, we can't refer to a conf object.~~

Sorry. I think I know what you meant.

SparkQA · 2015-10-30T01:12:10Z

Test build #44645 has finished for PR 8587 at commit 3b731e2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(left: Expression, right: Expression)

SparkQA · 2015-10-30T06:57:55Z

Test build #44659 has finished for PR 8587 at commit 4f8c381.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(left: Expression, right: Expression)

SparkQA · 2015-10-30T09:35:28Z

Test build #44671 has finished for PR 8587 at commit 7dcf689.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(left: Expression, right: Expression)

viirya · 2015-10-30T09:41:21Z

Failure caused by: 0.6633880657639323 != 0.6633880657639322
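The two values differ only in the last printed digit. One common way to make such a comparison robust is to use a relative tolerance instead of exact equality. A minimal sketch, assuming a hypothetical helper `approxEqual` (it is not part of HiveComparisonTest or any Spark test utility) and an arbitrarily chosen default tolerance:

```scala
// Hypothetical helper for illustration; not a Spark API.
object ApproxDouble {
  // True when a and b agree to within a relative tolerance.
  // NaN is treated as equal only to NaN; infinities must match exactly.
  def approxEqual(a: Double, b: Double, relTol: Double = 1e-12): Boolean = {
    if (a.isNaN || b.isNaN) a.isNaN && b.isNaN
    else if (a == b) true
    else if (a.isInfinite || b.isInfinite) false
    else math.abs(a - b) <= relTol * math.max(math.abs(a), math.abs(b))
  }
}
```

With this helper, `ApproxDouble.approxEqual(0.6633880657639323, 0.6633880657639322)` holds, while values that differ in the second decimal place do not compare as equal.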

viirya · 2015-10-30T10:00:39Z

@yhuai What do you think? Should we modify HiveComparisonTest to allow this kind of error?

yhuai · 2015-10-30T16:19:23Z

If that is the answer generated by Hive, we should not change that (at least for now). What we can do is this: if we believe our answer is valid (please double-check it), we can put that test in the blacklist (with comments on why we do not run it with HiveComparisonTest). Then we add a query test using the queries from that file.

viirya · 2015-10-30T17:34:41Z

@yhuai OK. I've put the test in the blacklist and also added a corresponding test to AggregationQuerySuite.

SparkQA · 2015-10-30T18:07:57Z

Test build #44690 has finished for PR 8587 at commit 2de76b4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(left: Expression, right: Expression)

viirya · 2015-10-31T00:32:26Z

retest this please.

viirya · 2015-10-31T00:42:09Z

Oh, that failure is going to be fixed in #9387.

SparkQA · 2015-10-31T01:01:33Z

Test build #44719 has finished for PR 8587 at commit 2de76b4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(left: Expression, right: Expression)

viirya · 2015-10-31T03:07:28Z

retest this please.

SparkQA · 2015-10-31T05:25:14Z

Test build #44726 has finished for PR 8587 at commit 2de76b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Corr(
- case class Corr(left: Expression, right: Expression)

viirya · 2015-11-02T02:27:24Z

ping @yhuai any more comments?

yhuai · 2015-11-02T02:36:45Z

Thanks! Merging to master.

viirya · 2015-11-02T03:00:24Z

@yhuai Thank you!

Add corr aggregate function.

cb34a95

Merge remote-tracking branch 'upstream/master' into corr_aggregation

0dd6320

Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala

Merge remote-tracking branch 'upstream/master' into corr_aggregation

1505cd2

viirya added 2 commits October 22, 2015 08:13

Fix merging error.

d3e4414

Don't modify WindowSpec.

d10afbe

Fix scala style.

cc1657b

mengxr reviewed Oct 27, 2015

viirya added 2 commits October 27, 2015 17:06

Merge remote-tracking branch 'upstream/master' into corr_aggregation

e1fb438

Add document. Return NaN when count is zero.

02562f3

yhuai reviewed Oct 27, 2015

viirya added 2 commits October 30, 2015 01:37

For comments.

5fbcf91

Merge remote-tracking branch 'upstream/master' into corr_aggregation

2f7b864

yhuai reviewed Oct 29, 2015

Make Corr extends AggregateExpression1.

3b731e2

Fix null case.

4f8c381

Fix udaf_corr test.

7dcf689

Due to numerical errors, put udaf_corr in HiveCompatibilitySuite to blacklist and add these tests to AggregationQuerySuite.

2de76b4

asfgit closed this in 3e770a6 Nov 2, 2015

viirya deleted the corr_aggregation branch December 27, 2023 18:18
