Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44846][SQL] Pull out complex grouping expressions after remove redundant aggregates #42531

Closed
wants to merge 4 commits into from

Conversation

zml1206
Copy link
Contributor

@zml1206 zml1206 commented Aug 17, 2023

What changes were proposed in this pull request?

This PR makes the aggregate do not contains complex grouping expressions after RemoveRedundantAggregates. In rule RemoveRedundantAggregates, if trigger remove redundant aggregates, then do PullOutGroupingExpressions to pull out complex grouping expressions to a Project node under an Aggregate.

Why are the changes needed?

The aggregate contains complex grouping expressions after RemoveRedundantAggregates, if aggregateExpressions has (if / case) branches, it is possible that groupingExpressions is no longer a subexpression of aggregateExpressions after execute PushFoldableIntoBranches rule, Then cause boundReference error.
For example

SELECT c * 2 AS d
FROM (
         SELECT if(b > 1, 1, b) AS c
         FROM (
                  SELECT if(a < 0, 0, a) AS b
                  FROM VALUES (-1), (1), (2) AS t1(a)
              ) t2
         GROUP BY b
     ) t3
GROUP BY c

Before pr

== Optimized Logical Plan ==
Aggregate [if ((b#0 > 1)) 1 else b#0], [if ((b#0 > 1)) 2 else (b#0 * 2) AS d#2]
+- Project [if ((a#3 < 0)) 0 else a#3 AS b#0]
   +- LocalRelation [a#3]
== Error ==
Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7]
java.lang.IllegalStateException: Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7]
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466)
	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241)
	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240)
	at org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653)
        ......

After pr

== Optimized Logical Plan ==
Aggregate [_groupingexpression#5], [(_groupingexpression#5 * 2) AS d#2]
+- Project [if ((b#0 > 1)) 1 else b#0 AS _groupingexpression#5]
   +- Project [if ((a#3 < 0)) 0 else a#3 AS b#0]
      +- LocalRelation [a#3]

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

…ndReference error

PushFoldableIntoBranches in complex grouping expressions may cause bindReference error
@github-actions github-actions bot added the SQL label Aug 17, 2023
@@ -3674,6 +3674,21 @@ class DataFrameSuite extends QueryTest
parameters = Map("viewName" -> "AUTHORIZATION"))
}
}

test("SPARK-44846: PushFoldableIntoBranches in complex grouping expressions " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 647 to 648
case a @ Aggregate(groupingExpressions, _, _)
if !groupingExpressions.forall(_.isInstanceOf[NamedExpression]) => a
Copy link
Member

@wangyum wangyum Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the aggregate contains complex grouping expressions after RemoveRedundantAggregates. So could we fix it there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. In PhysicalAggregation,if the expression is not a NamedExpressions, it add an alias. Also I'm not sure if complex expressions are generated anywhere else other than RemoveRedundantAggregates. So is it safer to fix PushFoldableIntoBranches, and the impact is smaller?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a similar issue before: https://issues.apache.org/jira/browse/SPARK-34581

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, later I will do something like PullOutGroupingExpressions after RemoveRedundantAggregates triggers Remove.

update

update

update

update
@@ -96,7 +96,7 @@ class RemoveRedundantAggregatesSuite extends PlanTest {
.groupBy($"a" + $"b")(($"a" + $"b") as "c")
.analyze
val optimized = Optimize.execute(query)
comparePlans(optimized, expected)
comparePlans(optimized, PullOutGroupingExpressions.apply(expected))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

      val expected = relation
        .select($"a", $"b", ($"a" + $"b") as "_groupingexpression")
        .groupBy($"_groupingexpression")($"_groupingexpression" as "c")
        .analyze

@wangyum
Copy link
Member

wangyum commented Aug 19, 2023

Could you update the PR title and PR description?

@zml1206 zml1206 changed the title [SPARK-44846][SQL] PushFoldableIntoBranches in complex grouping expressions may cause bi… [SPARK-44846][SQL] Pull out complex grouping expressions after remove redundant aggregates Aug 19, 2023
@zml1206
Copy link
Contributor Author

zml1206 commented Aug 19, 2023

Could you update the PR title and PR description?

Done.

@wangyum
Copy link
Member

wangyum commented Aug 19, 2023

cc @cloud-fan @peter-toth

@@ -42,11 +42,12 @@ object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
)

// We might have introduces non-deterministic grouping expression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a different idea. I think it's risky to put the new grouping expressions back to the Aggregate, as the grouping expression contains things in the SELECT list. This is a long-standing issue and I feel it's better to just create a Project below the Aggregate to calculate grouping expressions, and other optimizer rules can merge/eliminate this extra Project if it only contains Attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Thanks. I'm sorry I didn't understand. Before this PR, it will create a Project below the Aggregate to Pulls out nondeterministic grouping expressions. In this PR, I keep this logic, and create a Project below the Aggregate to Pulls out complex grouping expressions by rule PullOutGroupingExpressions. I do not put the new grouping expressions back to the Aggregate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me put it this way: if we can safely remove a distinct-like Aggregate, it means we can just turn it into a Project (e.g. case a: Aggregate => Project(a.aggregateExpressions, a.child)).

The code today is merging the distinct-like Aggregate to the upper Aggregate, which is problematic and we have to pull something out to a Project case by case. This is fragile and this PR just finds another case.

My proposal is to not merge. Just convert the distinct-like Aggregate to Project, and let other optimizer rules decide if we can merge it or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I understand, I will resubmit a PR later.

@zml1206
Copy link
Contributor Author

zml1206 commented Aug 23, 2023

New PR: #42633

@zml1206 zml1206 closed this Sep 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants