
[SPARK-19851][SQL] Add support for EVERY and ANY (SOME) aggregates #22809

Closed

Conversation

dilipbiswal (Contributor)

What changes were proposed in this pull request?

Implements Every, Some, Any aggregates in SQL. These new aggregate expressions are analyzed in the normal way and rewritten to equivalent existing aggregate expressions in the optimizer.

Every(x) => Min(x) where x is boolean.
Some(x) => Max(x) where x is boolean.

Any is a synonym for Some.
SQL:

explain extended select every(v) from test_agg group by k;

Plan:

== Parsed Logical Plan ==
'Aggregate ['k], [unresolvedalias('every('v), None)]
+- 'UnresolvedRelation `test_agg`

== Analyzed Logical Plan ==
every(v): boolean
Aggregate [k#0], [every(v#1) AS every(v)#5]
+- SubqueryAlias `test_agg`
   +- Project [k#0, v#1]
      +- SubqueryAlias `test_agg`
         +- LocalRelation [k#0, v#1]

== Optimized Logical Plan ==
Aggregate [k#0], [min(v#1) AS every(v)#5]
+- LocalRelation [k#0, v#1]

== Physical Plan ==
*(2) HashAggregate(keys=[k#0], functions=[min(v#1)], output=[every(v)#5])
+- Exchange hashpartitioning(k#0, 200)
   +- *(1) HashAggregate(keys=[k#0], functions=[partial_min(v#1)], output=[k#0, min#7])
      +- LocalTableScan [k#0, v#1]
Time taken: 0.512 seconds, Fetched 1 row(s)
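
The rewrite relies on the fact that for booleans false < true, so min(x) is true only when every value is true, and max(x) is true when at least one value is true. Below is a minimal, self-contained sketch of the substitution; the simplified types are illustrative stand-ins, not Spark's actual Catalyst classes.

```
object EveryAnyRewriteSketch {
  // Simplified stand-ins for Catalyst expressions (illustration only).
  trait Expr
  case class BoolCol(name: String) extends Expr
  case class Min(child: Expr) extends Expr
  case class Max(child: Expr) extends Expr
  case class EveryAgg(child: Expr) extends Expr
  case class AnyAgg(child: Expr) extends Expr
  case class SomeAgg(child: Expr) extends Expr

  // The optimizer-time rewrite: EVERY -> MIN, ANY/SOME -> MAX (argument is boolean).
  def rewrite(e: Expr): Expr = e match {
    case EveryAgg(child) => Min(child)
    case AnyAgg(child)   => Max(child)
    case SomeAgg(child)  => Max(child)
    case other           => other
  }

  def main(args: Array[String]): Unit = {
    // every(v) becomes min(v), matching the optimized plan above.
    println(rewrite(EveryAgg(BoolCol("v")))) // Min(BoolCol(v))
  }
}
```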

How was this patch tested?

Added tests in SQLQueryTestSuite and DataFrameAggregateSuite.

SparkQA commented Oct 23, 2018

Test build #97930 has finished for PR 22809 at commit b793d06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait UnevaluableAggrgate extends DeclarativeAggregate
  • abstract class AnyAggBase(arg: Expression)
  • case class AnyAgg(arg: Expression) extends AnyAggBase(arg)
  • case class SomeAgg(arg: Expression) extends AnyAggBase(arg)
  • case class EveryAgg(arg: Expression)

@dilipbiswal (Contributor, Author)

cc @cloud-fan @gatorsmile

@@ -38,6 +39,18 @@ object ReplaceExpressions extends Rule[LogicalPlan] {
}
}

/**
dilipbiswal (Author):

@cloud-fan We could also add these transformations in ReplaceExpressions rule and not require a dedicated rule (fyi).

Contributor:

ah this sounds better!

@cloud-fan (Contributor)

Can we use these functions in window with this approach?

@dilipbiswal (Contributor, Author)

@cloud-fan Yeah.. I have some tests in group-by.sql. Please take a look.

SELECT every("true");

-- every/some/any aggregates are not supported as windows expression.
SELECT k, v, every(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg;
dilipbiswal (Author) commented Oct 24, 2018:

@cloud-fan here are a few window tests. (fyi)

SparkQA commented Oct 24, 2018

Test build #97986 has finished for PR 22809 at commit 9e194b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -727,4 +728,67 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
"grouping expressions: [current_date(None)], value: [key: int, value: string], " +
"type: GroupBy]"))
}

def getEveryAggColumn(columnName: String): Column = {
Column(new EveryAgg(Column(columnName).expr).toAggregateExpression(false))
Contributor:

Since we don't have APIs for them in functions, it's not likely users will use them with DataFrames. Thus I think we don't need these tests.

dilipbiswal (Author):

@cloud-fan Ok, let me remove these.

@cloud-fan (Contributor)

LGTM

SparkQA commented Oct 25, 2018

Test build #97999 has finished for PR 22809 at commit 6abf844.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

expression[EveryAgg]("every"),
expression[AnyAgg]("any"),
expression[SomeAgg]("some"),

Contributor:

nit: unneeded newline
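
For context, once these registry entries are in place the new functions become callable from SQL. A hypothetical usage snippet (assumes a SparkSession named `spark` on a build that includes this PR):

```
// Hypothetical usage; requires a Spark build that includes this PR.
spark.sql("""
  SELECT k, every(v) AS all_true, some(v) AS any_true
  FROM VALUES (1, true), (1, false), (2, true) AS t(k, v)
  GROUP BY k
  ORDER BY k
""").show()
// Expected: k=1 -> all_true=false, any_true=true; k=2 -> all_true=true, any_true=true.
```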

}

@ExpressionDescription(
usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.")
Contributor:

add since = "3.0.0"

dilipbiswal (Author):

I will add. One observation:

I see that we only specify @Since in the APIs; for example, for the aggregate Max we have it in functions.scala. These aggregates are not exposed in the Dataset APIs, and none of the other aggregates seem to have it.

Contributor:

I think the point is that it was not available until 2.3, so earlier methods don't have it. Am I missing something?

dilipbiswal (Author):

@mgaido91 Yeah... If we look at the definitions of other aggregate functions like Max, Min, etc., they don't seem to have @Since. However, max and min are defined in functions.scala with @Since 1.3. So basically I was not sure what the rule is when a function is not exposed in the Dataset API but only from SQL.

Contributor:

Just to be clear, here I am not talking about @Since; I am talking about since as a parameter of @ExpressionDescription.

Contributor:

Ideally we need since here. Some functions don't have it because at that time the since field was not there. We should add the missing since to them as well, if other people have time to do it.

dilipbiswal (Author):

@mgaido91 @cloud-fan Thanks .. will add.

Member:

(btw, let's add since at ExpressionDescription wherever possible .. )

dilipbiswal (Author):

@HyukjinKwon Sure.. I will open a pr shortly.
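
To make the distinction above concrete: `@Since` is the annotation used on public APIs such as functions.scala, whereas `since` here is a field of `@ExpressionDescription` that feeds the SQL function documentation. A rough sketch of the requested change, using the final AnyAgg signature from this PR (it only compiles inside the Spark source tree, where Expression and UnevaluableBooleanAggBase are defined):

```
// Sketch only: the class signature is the one from this PR; the since value is
// the one suggested by the reviewer above.
@ExpressionDescription(
  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.",
  since = "3.0.0")
case class AnyAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)
```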

}

@ExpressionDescription(
usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.")
Contributor:

ditto

@@ -57,3 +57,27 @@ case class Min(child: Expression) extends DeclarativeAggregate {

override lazy val evaluateExpression: AttributeReference = min
}

@ExpressionDescription(
usage = "_FUNC_(expr) - Returns true if all values of `expr` are true.")
Contributor:

ditto

* with other databases. For example, we use this to support every, any/some aggregates by rewriting
* them with Min and Max respectively.
*/
trait UnevaluableAggrgate extends DeclarativeAggregate {
Contributor:

typo: UnevaluableAggrgate -> UnevaluableAggregate

@@ -57,3 +57,34 @@ case class Max(child: Expression) extends DeclarativeAggregate {

override lazy val evaluateExpression: AttributeReference = max
}

abstract class AnyAggBase(arg: Expression)
Contributor:

can we change this to something like UnevaluableBooleanAggBase and make also EveryAgg extend this, in order to avoid code duplication?

dilipbiswal (Author) commented Oct 25, 2018:

@mgaido91 I was not sure where to house this class; that's why I kept it separate :-). Is it okay if I just rename it and keep it in Max.scala?

Contributor:

maybe we can move it close to UnevaluableAggrgate. @cloud-fan @dilipbiswal WDYT?

Contributor:

If it's hard to decide where to put it, I think putting it in a new file can be considered.

* be evaluated. This is mainly used to provide compatibility with other databases.
* For example, we use this to support "nvl" by replacing it with "coalesce".
* Finds all the expressions that are unevaluable and replace/rewrite them with semantically
* equivalent expressions that can be evaluated. Currently we replace two kinds of expressions :
Contributor:

nit: extra space before :

}
}


Contributor:

nit: unneeded change

@@ -144,6 +144,8 @@ class ExpressionTypeCheckingSuite extends SparkFunSuite {
assertSuccess(Sum('stringField))
assertSuccess(Average('stringField))
assertSuccess(Min('arrayField))
assertSuccess(new EveryAgg('booleanField))
assertSuccess(new AnyAgg('booleanField))
Contributor:

shall we add also SomeAgg?

SparkQA commented Oct 25, 2018

Test build #98033 has finished for PR 22809 at commit 08999f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait UnevaluableAggregate extends DeclarativeAggregate
  • abstract class UnevaluableBooleanAggBase(arg: Expression)
  • case class EveryAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)
  • case class AnyAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)
  • case class SomeAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)

SparkQA commented Oct 25, 2018

Test build #98035 has finished for PR 22809 at commit 2bc9965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor, Author)

@cloud-fan @mgaido91 I have incorporated the comments. Could you please check if things look okay now?

*/
trait UnevaluableAggregate extends DeclarativeAggregate {

override def nullable: Boolean = true
Contributor:

why do we set them always as nullable?

dilipbiswal (Author):

@mgaido91 Most of the aggregates are nullable, no? Did you have a suggestion here?

Contributor:

shouldn't this be nullable only if the incoming expression is?

dilipbiswal (Author) commented Oct 26, 2018:

@mgaido91 I think for aggregates it's different? Please see Max and Min; they all define it to be nullable. I think they work on a group of rows and can return null on empty input.

Contributor:

Ah, right, I was missing that case, sorry, thanks.
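
For context, a small illustration of the point above (hypothetical snippet, assuming a SparkSession named `spark`): an aggregate over zero rows produces NULL, so the aggregate's result must be nullable regardless of the input column's nullability.

```
// min over zero rows yields NULL, which is why aggregates declare nullable = true.
val df = spark.sql("SELECT min(v) AS m FROM VALUES (true), (false) AS t(v) WHERE 1 = 0")
df.show()          // a single row containing null
println(df.schema) // the field `m` reports nullable = true
```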

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

abstract class UnevaluableBooleanAggBase(arg: Expression)
Contributor:

why don't we reuse RuntimeReplaceable?

dilipbiswal (Author):

@mgaido91 RuntimeReplaceable works for scalar expressions but not for aggregate expressions.

Contributor:

Why doesn't it work? Sorry if the question is dumb, but I can't see the problem in using it here; probably I am missing something.

dilipbiswal (Author) commented Oct 26, 2018:

@mgaido91 Actually I tried a few different ideas. They are in a comment on the original PR. I had prepared two branches with two approaches: #22047 (comment). Could you please take a look?

Contributor:

OK, I see you mentioned that there were issues using RuntimeReplaceable. Can we have a similar approach to that, though? I mean introducing a new method here which returns the replaced value, so that in ReplaceExpressions we can simply match on UnevaluableAggregate and keep the replacement logic in the individual expressions?

dilipbiswal (Author):

@mgaido91 I tried it here. I had trouble getting it to work for window expressions. That's why @cloud-fan suggested trying the current approach in this comment.

dilipbiswal (Author) commented Oct 26, 2018:

@mgaido91 I re-read your comments again and here is what I think you are implying:

  1. In UnevaluableAggregate, define a method, say replacedExpression.
  2. The subclass overrides it and returns the actual rewritten expression.
  3. In the optimizer, match UnevaluableAggregate and call replacedExpression.

Did I understand it correctly? If so, I actually wouldn't prefer that way. The reason is that replacedExpression is hidden from the analyzer and so it may not be safe. The way the ReplaceExpression framework is nicely designed, the analyzer resolves the rewritten expression normally (as it's the child expression). That's the reason I opted for specific/targeted rewrites.

If Wenchen and you think otherwise, then I can change.
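
A rough, self-contained sketch of the alternative described in points 1–3 above (simplified stand-in types; the method name replacedExpression comes from the comment itself; this is not the approach that was merged):

```
object ReplacedExpressionSketch {
  // Simplified stand-ins, not Spark's Catalyst classes.
  trait Expr
  case class Col(name: String) extends Expr
  case class Min(child: Expr) extends Expr
  case class Max(child: Expr) extends Expr

  // 1 + 2: each unevaluable aggregate carries its own rewrite target.
  trait UnevaluableAgg extends Expr { def replacedExpression: Expr }
  case class EveryAgg(child: Expr) extends UnevaluableAgg { def replacedExpression = Min(child) }
  case class SomeAgg(child: Expr) extends UnevaluableAgg { def replacedExpression = Max(child) }

  // 3: the optimizer matches the trait once instead of listing every function.
  def replace(e: Expr): Expr = e match {
    case u: UnevaluableAgg => u.replacedExpression
    case other             => other
  }
}
```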

Contributor:

Yes, but in the analyzer you analyze the children of the current expression, right? So we would just have something like def replacedExpression = Min(arg), which means doing exactly the same as is done now; the only difference is where the conversion logic is put. And IMHO having the conversion logic for all the expressions in ReplaceExpressions is harder to maintain than having all the logic related to an expression contained in it.

Anyway, since you have a different opinion, let's see what @cloud-fan thinks about this. Thanks.

Contributor:

We can leave a TODO saying that we should create a framework to replace aggregate functions, but I think the current patch is good enough for these 3 functions, and I'm not aware of more functions like them that we need to deal with.

dilipbiswal (Author):

@cloud-fan @mgaido91 Thank you. I have added a TODO for now.

@mgaido91 (Contributor)

LGTM

SparkQA commented Oct 28, 2018

Test build #98139 has finished for PR 22809 at commit 07205de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit asfgit closed this in e545811 Oct 28, 2018
@dilipbiswal (Contributor, Author)

Thanks a lot @cloud-fan @mgaido91

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#22809 from dilipbiswal/SPARK-19851-specific-rewrite.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>