
[SPARK-31620][SQL] Fix reference binding failure in case of an final agg contains subquery #28496

Closed
wants to merge 11 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented May 11, 2020

What changes were proposed in this pull request?

Instead of using child.output directly, we should use inputAggBufferAttributes from the current agg expression for Final and PartialMerge aggregates to bind references for their mergeExpression.

Why are the changes needed?

When planning aggregates, the partial aggregate uses agg funcs' inputAggBufferAttributes as its output; see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105

For final HashAggregateExec, we need to bind the DeclarativeAggregate.mergeExpressions with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348

This is usually fine. However, if we copy the agg func somehow after agg planning, like PlanSubqueries, the DeclarativeAggregate will be replaced by a new instance with new inputAggBufferAttributes and mergeExpressions. Then we can't bind the mergeExpressions with the output of the partial aggregate operator, as it uses the inputAggBufferAttributes of the original DeclarativeAggregate before copy.

Note that, ImperativeAggregate doesn't have this problem, as we don't need to bind its mergeExpressions. It has a different mechanism to access buffer values, via mutableAggBufferOffset and inputAggBufferOffset.
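The failure mode above can be illustrated with a small self-contained sketch. This is a toy model, not Spark code: `ExprId`, `Attr`, and `MySum` are hypothetical stand-ins whose names merely mirror Spark's for readability.

```scala
object ExprId { private var c = 0L; def next(): Long = { c += 1; c } }

final case class Attr(name: String, exprId: Long)

// each instance mints its buffer attributes lazily, like DeclarativeAggregate
final case class MySum(child: String) {
  lazy val inputAggBufferAttributes: Seq[Attr] = Seq(Attr("sum", ExprId.next()))
}

object BindDemo extends App {
  val original = MySum("a")
  // the partial aggregate's output is the original function's buffer attributes
  val partialOutput = original.inputAggBufferAttributes

  // a rule such as PlanSubqueries rebuilds the plan, copying the function
  val copied = original.copy()

  // binding the copied function's merge references against the old output fails:
  val boundAgainstOldOutput = copied.inputAggBufferAttributes
    .forall(a => partialOutput.exists(_.exprId == a.exprId))
  println(boundAgainstOldOutput) // false: the copy minted a new exprId

  // the fix: bind against the *current* function's inputAggBufferAttributes
  val boundAgainstCurrent = copied.inputAggBufferAttributes
    .forall(a => copied.inputAggBufferAttributes.exists(_.exprId == a.exprId))
  println(boundAgainstCurrent) // true
}
```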

Does this PR introduce any user-facing change?

Yes. Previously, the query failed with an error; after this change, it runs successfully.

How was this patch tested?

Added a regression test.

@AngersZhuuuu
Contributor

AngersZhuuuu commented May 11, 2020

Is there any material about makeCopy, tags, etc., to better understand TreeNode?

@Ngone51
Member Author

Ngone51 commented May 11, 2020

Is there any material about makeCopy, tags, etc., to better understand TreeNode?

Just check the source code?

fyi, tags could be copied along with makeCopy @AngersZhuuuu

@Ngone51
Member Author

Ngone51 commented May 11, 2020

cc @AngersZhuuuu @cloud-fan @wangyum

@AngersZhuuuu
Contributor

fyi, tags could be copied along with makeCopy @AngersZhuuuu

Thanks a lot. It's not clear to me which parts will be retained and which will be reset in makeCopy; I'll check the source code details.

And for

private lazy val sum = AttributeReference("sum", sumDataType)()

In Sum, it will be changed by PlanSubqueries too, resulting in a different ExprId.
Do we need to handle this?

@Ngone51
Member Author

Ngone51 commented May 11, 2020

Do we need to handle this ?

I think not, because those attributes are not referenced by others.

@SparkQA

SparkQA commented May 11, 2020

Test build #122492 has finished for PR 28496 at commit bc1b7e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 11, 2020

Test build #122506 has finished for PR 28496 at commit cbd891d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx
Contributor

I'm wondering if there could be a solution that doesn't involve tree tags, but instead uses a secondary constructor, like the way AggregateExpression passes resultIds around, and makes concrete derived types of DeclarativeAggregate use the explicitly set Id if defined.
So e.g. for Sum, instead of

private lazy val sum = AttributeReference("sum", sumDataType)() // implicitly makes a new exprId every time

maybe have some supporting utils in DeclarativeAggregate like:

+  val explicitAggBufferAttributeExprIds: Option[Seq[ExprId]] = None
+
+  // TODO... maybe put these in a mixin instead
+  override def otherCopyArgs(): Seq[AnyRef] = explicitAggBufferAttributeExprIds :: Nil
+
+  protected def makeAggBufferAttribute(
+      ordinal: Int,
+      name: String,
+      dataType: DataType,
+      nullable: Boolean = true): AttributeReference = {
+    explicitAggBufferAttributeExprIds match {
+      case Some(exprIds) => AttributeReference(name, dataType, nullable)(exprId = exprIds(ordinal))
+      case _ => AttributeReference(name, dataType, nullable)()
+    }
+  }

and make Sum agg buffer declaration as:

private lazy val sum = makeAggBufferAttribute(0, "sum", sumDataType)

I tried this and unfortunately it requires all the places that create a Sum to go from Sum( ... args ... ) to Sum( ... args ... )() for the secondary ctor. That's quite a few places touched :-(

@Ngone51
Member Author

Ngone51 commented May 12, 2020

Those attributes are mostly lazy. So does that mean we need to generate exprIds eagerly before creating agg functions in this way? @rednaxelafx

@Ngone51 Ngone51 changed the title [SPARK-31620][SQL] Use TreeTagNode to preserve inputAggBufferAttributes for agg function [SPARK-31620][SQL] Fix reference binding failure in case of an final agg contains subquery May 13, 2020
val aggAttrs = aggregateExpressions.map(_.aggregateFunction)
  .flatMap(_.inputAggBufferAttributes)
val distinctAttrs = child.output.filterNot(
  a => (groupingAttributes ++ aggAttrs).exists(_.name == a.name))
Contributor

name matching is fragile, how about

child.output.dropRight(aggAttrs.length) ++ aggAttrs

Member Author

oh yea, this looks better!
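The positional rewrite suggested above can be sketched as follows. This is a hypothetical standalone example (attribute names like `sum#7` are illustrative strings, not real Spark attributes): the partial aggregate's output is grouping attributes followed by agg buffer attributes, so the stale trailing attributes can be swapped out by position rather than by name.

```scala
object InputAttrsDemo extends App {
  val groupingAttributes = Seq("g1#1", "g2#2")
  val staleAggAttrs      = Seq("sum#3")          // minted before the copy
  val childOutput        = groupingAttributes ++ staleAggAttrs

  val aggAttrs = Seq("sum#7")                    // minted after the copy
  val inputAttributes = childOutput.dropRight(aggAttrs.length) ++ aggAttrs
  println(inputAttributes)                       // List(g1#1, g2#2, sum#7)
}
```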

val fastRowKeys = ctx.generateExpressions(
-  bindReferences[Expression](groupingExpressions, child.output))
+  bindReferences[Expression](groupingExpressions, inputAttributes))
Contributor

it's for resolving grouping keys, child.output is fine here.

@@ -973,4 +973,21 @@ class DataFrameAggregateSuite extends QueryTest
assert(error.message.contains("function count_if requires boolean type"))
}
}

Seq(true, false).foreach { value =>
test(s"SPARK-31620: agg with subquery (codegen = $value)") {
Contributor

codegen -> "whole-stage-codegen"

@SparkQA

SparkQA commented May 13, 2020

Test build #122582 has finished for PR 28496 at commit 882467b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 13, 2020

Test build #122583 has finished for PR 28496 at commit 9a0b788.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 13, 2020

Test build #122585 has finished for PR 28496 at commit 3fd39c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy. Instead,
// we shall use `inputAggBufferAttributes` after copy to match the new `mergeExpressions`.
val aggAttrs = aggregateExpressions
.filter(a => a.mode == Final || !a.isDistinct).map(_.aggregateFunction)
Contributor

can you add a few comments to explain the isDistinct check?

Member Author

After thinking about it more, I changed it to filter(a => a.mode == Final || a.mode == PartialMerge).

Contributor

How about adding a UT with a distinct aggregate expression?

Member Author

sounds good.

@rednaxelafx
Contributor

Those attributes are mostly lazy. So does that mean we need to generate exprIds eagerly before creating agg functions in this way? @rednaxelafx

I'd say "yes": generating exprIds eagerly shouldn't be a big overhead, and it'll help keep the AttributeReferences stable later on.
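The eager-vs-lazy trade-off discussed above can be sketched with another toy model (again hypothetical stand-ins, not Spark code): eagerly materializing the buffer attribute pins its exprId at construction time, so later copies of the function can be handed the same ids explicitly.

```scala
object EagerIdDemo extends App {
  object ExprId { private var c = 0L; def next(): Long = { c += 1; c } }
  final case class Attr(name: String, exprId: Long = ExprId.next())

  // lazy: the attribute (and its id) is minted per instance, on first access
  class LazySum { lazy val sum: Attr = Attr("sum") }

  // eager: the id is generated up front and can be threaded through copies
  class EagerSum(explicitId: Option[Long] = None) {
    val sum: Attr = explicitId.fold(Attr("sum"))(id => Attr("sum", id))
  }

  val original = new EagerSum()
  val copied = new EagerSum(Some(original.sum.exprId)) // stable across the copy
  println(copied.sum.exprId == original.sum.exprId)    // true
}
```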

@Ngone51
Member Author

Ngone51 commented May 14, 2020

Please note that we've switched to another way to fix the issue in order to avoid using TreeTagNode. @AngersZhuuuu @rednaxelafx

@@ -129,7 +129,7 @@ case class HashAggregateExec(
resultExpressions,
(expressions, inputSchema) =>
MutableProjection.create(expressions, inputSchema),
-      child.output,
+      inputAttributes,
Contributor

I tried to fix it like this before, but I forgot to change the child's output here, so the output columns didn't work well.

LGTM

Member Author

Thanks for confirming.

Contributor

@rednaxelafx rednaxelafx left a comment

LGTM. This looks like it'll work well.

(That said, this is more black magic... it sort of lets the current brittle pairing between partial/final aggs creep into more places. If we ever add another physical Aggregate operator that supports whole-stage codegen, we'll probably painfully rediscover this design issue again.

I'm glad your test case included w/o grouping keys + w/ grouping keys + distinct, that's very good!)

@SparkQA

SparkQA commented May 14, 2020

Test build #122605 has finished for PR 28496 at commit 89ae4bf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 14, 2020

Test build #122604 has finished for PR 28496 at commit 64d13fc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -123,7 +123,7 @@ case class ObjectHashAggregateExec(
resultExpressions,
(expressions, inputSchema) =>
MutableProjection.create(expressions, inputSchema),
-      child.output,
+      inputAttributes,
Member Author

@Ngone51 Ngone51 May 14, 2020

There's actually another child.output at line 118, but it's dead code indeed. Not sure if we should touch it.

Contributor

You can open a new PR to remove dead code

Member Author

ok..

" d, 0))) as csum from t2 group by c"), Row(4) :: Nil)

// test SortAggregateExec
checkAnswer(sql("select max(if(c > (select a from t1), 'str1', 'str2')) as csum from t2"),
Contributor

it's better to check the physical plan and make sure it's sort agg

@SparkQA

SparkQA commented May 14, 2020

Test build #122619 has started for PR 28496 at commit 493157a.

@SparkQA

SparkQA commented May 14, 2020

Test build #122609 has finished for PR 28496 at commit 8aecd57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented May 15, 2020

retest this please

@SparkQA

SparkQA commented May 15, 2020

Test build #122642 has finished for PR 28496 at commit 493157a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented May 15, 2020

Test build #122662 has finished for PR 28496 at commit 493157a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in d8b001f May 15, 2020
cloud-fan pushed a commit that referenced this pull request May 15, 2020
…agg contains subquery

### What changes were proposed in this pull request?

Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression  for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpression`.

### Why are the changes needed?

When planning aggregates, the partial aggregate uses agg funcs' `inputAggBufferAttributes` as its output; see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105

For final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348

This is usually fine. However, if we copy the agg func somehow after agg planning, like `PlanSubqueries`, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy.

Note that, `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, the query failed with an error; after this change, it runs successfully.

### How was this patch tested?

Added a regression test.

Closes #28496 from Ngone51/spark-31620.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d8b001f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@Ngone51
Member Author

Ngone51 commented May 15, 2020

thanks all!

Nnicolini pushed a commit to palantir/spark that referenced this pull request Jun 11, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Dec 8, 2020
rshkv pushed a commit to palantir/spark that referenced this pull request Dec 8, 2020