[SPARK-38533][SQL] DS V2 aggregate push-down supports project with alias #35823
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master   #35823   +/-   ##
=======================================
  Coverage   91.19%   91.19%
=======================================
  Files         297      297
  Lines       64696    64724      +28
  Branches     9919     9921       +2
=======================================
+ Hits        58999    59025      +26
- Misses       4330     4332       +2
  Partials     1367     1367
@@ -234,7 +257,7 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
     // Aggregate [c2#10], [min(min(c1)#21) AS min(c1)#17, max(max(c1)#22) AS max(c1)#18]
     // +- RelationV2[c2#10, min(c1)#21, max(c1)#22] ...
     // scalastyle:on
-    plan.transformExpressions {
+    val agg = plan.transformExpressions {
nit: unnecessary change?
I actually have an alias over aggregate test in FileSource too. Could you please change that one as well?
Thank you for the reminder.
@@ -779,15 +779,19 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
     checkAnswer(df, Seq(Row(1d), Row(1d), Row(null)))
   }

-  test("scan with aggregate push-down: aggregate over alias NOT push down") {
+  test("scan with aggregate push-down: aggregate over alias push down") {
Hi. Is it better to specify SPARK-38533?
It doesn't matter.
@@ -1032,4 +1036,76 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
       |ON h2.test.view1.`|col1` = h2.test.view2.`|col1`""".stripMargin)
     checkAnswer(df, Seq.empty[Row])
   }

+  test("scan with aggregate push-down: complete push-down aggregate with alias") {
ditto
     checkAnswer(df2, Seq(Row(1, 19000.00), Row(2, 22000.00), Row(6, 12000.00)))
   }

+  test("scan with aggregate push-down: partial push-down aggregate with alias") {
ditto
    checkAggregateRemoved(df)
    df.queryExecution.optimizedPlan.collect {
      case _: DataSourceV2ScanRelation =>
        val expected_plan_fragment =
Why not use camel case naming? https://docs.scala-lang.org/style/naming-conventions.html
    checkAggregateRemoved(df, false)
    df.queryExecution.optimizedPlan.collect {
      case _: DataSourceV2ScanRelation =>
        val expected_plan_fragment =
ditto
    checkAggregateRemoved(df2)
    df2.queryExecution.optimizedPlan.collect {
      case _: DataSourceV2ScanRelation =>
        val expected_plan_fragment =
ditto
    checkAggregateRemoved(df2, false)
    df2.queryExecution.optimizedPlan.collect {
      case _: DataSourceV2ScanRelation =>
        val expected_plan_fragment =
ditto
    val newGroupingExpressions = groupingExpressions.map { expr =>
      expr.transform {
        case r: AttributeReference if aliasAttrToOriginAttr.contains(r.canonicalized) =>
          aliasAttrToOriginAttr(r.canonicalized)
These two lambda expressions behave the same; we could extract them into a single function and reuse it.
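The reviewer's suggestion, sketched with stand-in types (`Attr` and the alias map here are illustrative, not Catalyst classes): pull the duplicated transform lambda into one helper that both call sites reuse.

```scala
// Stand-in for AttributeReference; not the Catalyst class.
case class Attr(name: String)

// Hypothetical shared helper: replace an aliased attribute with its origin,
// leaving anything unmapped untouched.
def dealias(attr: Attr, aliasToOrigin: Map[Attr, Attr]): Attr =
  aliasToOrigin.getOrElse(attr, attr)

val aliasToOrigin = Map(Attr("myDept") -> Attr("DEPT"))

// Both former lambda call sites now reuse the same function.
val newGrouping = Seq(Attr("myDept")).map(dealias(_, aliasToOrigin))
val newResult = Seq(Attr("SALARY")).map(dealias(_, aliasToOrigin))
```

In the real rule the map would be keyed on canonicalized attributes; the shape of the refactoring is the same.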
ping @huaxingao cc @cloud-fan
-      case ScanOperation(project, filters, sHolder: ScanBuilderHolder)
-          if filters.isEmpty && project.forall(_.isInstanceOf[AttributeReference]) =>
+      case ScanOperation(project, filters, sHolder: ScanBuilderHolder) if filters.isEmpty &&
+          project.forall(p => p.isInstanceOf[AttributeReference] || p.isInstanceOf[Alias]) =>
Please follow the predicate pushdown optimizer rule and leverage AliasHelper to do it.
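A minimal model of the AliasHelper flow the reviewer points to. The `Expr`/`Attr`/`Alias` types below are simplified stand-ins, not Catalyst's: collect an alias map from the project list (as `getAliasMap` does), then substitute aliases away (as `replaceAlias` does).

```scala
// Toy expression tree standing in for Catalyst expressions.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// Analogue of AliasHelper.getAliasMap: alias output name -> aliased child.
def getAliasMap(project: Seq[Expr]): Map[String, Expr] =
  project.collect { case Alias(child, name) => name -> child }.toMap

// Analogue of AliasHelper.replaceAlias: rewrite references to alias names
// into the expressions they stand for.
def replaceAlias(e: Expr, aliasMap: Map[String, Expr]): Expr = e match {
  case Attr(n) if aliasMap.contains(n) => aliasMap(n)
  case Add(l, r)   => Add(replaceAlias(l, aliasMap), replaceAlias(r, aliasMap))
  case Alias(c, n) => Alias(replaceAlias(c, aliasMap), n)
  case other       => other
}

// Project [salary + bonus AS pay]; an aggregate above it refers to `pay`.
val aliasMap = getAliasMap(Seq(Alias(Add(Attr("salary"), Attr("bonus")), "pay")))
val rewritten = replaceAlias(Attr("pay"), aliasMap)
```

After this substitution the aggregate no longer mentions the project's aliases, which is what makes the Aggregate-over-Project collapsible for push-down.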
Thank you for the reminder.
If we follow it, then the condition here should be project.forall(_.deterministic).
      expr: NamedExpression,
      aliasMap: AttributeMap[Alias]): NamedExpression = {
    replaceAliasButKeepName(expr, aliasMap).transform {
      case Alias(attr: Attribute, _) => attr
why is this line needed?
An Attribute in groupingExpressions may be an alias. Before pushing the SQL down to JDBC, I want to replace the aliased Attribute with the origin Attribute, not with the Alias.
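The author's intent, sketched with toy types (not Catalyst's): after alias substitution a grouping column like `dept AS myDept` becomes an Alias over a plain Attribute, and for the SQL pushed to JDBC we want the bare origin column.

```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Strip an Alias whose child is already a plain attribute, mirroring the
// `case Alias(attr: Attribute, _) => attr` transform in the diff above.
def stripAliasOverAttr(e: Expr): Expr = e match {
  case Alias(attr: Attr, _) => attr
  case other                => other
}

// `dept AS myDept` in a group-by becomes the bare origin column.
val grouping = stripAliasOverAttr(Alias(Attr("dept"), "myDept"))
// An Alias over a non-attribute child is left alone.
val kept = stripAliasOverAttr(Alias(Alias(Attr("x"), "a"), "b"))
```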
        sHolder.builder match {
          case r: SupportsPushDownAggregates =>
            val aliasMap = getAliasMap(project)
            val newResultExpressions = resultExpressions.map(replaceAliasWithAttr(_, aliasMap))
            val newGroupingExpressions = groupingExpressions.asInstanceOf[Seq[NamedExpression]]
groupingExpressions may not be NamedExpression.
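The reviewer's point, sketched with stand-in types: casting groupingExpressions with asInstanceOf assumes every grouping expression is named, which is not guaranteed (e.g. GROUP BY a + b). A safer shape pattern-matches and leaves unnamed expressions untouched.

```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// Only named (attribute-like) expressions get de-aliased; an unnamed
// grouping expression such as `a + b` passes through instead of failing
// a blind asInstanceOf cast.
def dealiasIfNamed(e: Expr, aliasToOrigin: Map[String, Attr]): Expr = e match {
  case Attr(n) if aliasToOrigin.contains(n) => aliasToOrigin(n)
  case other                                => other
}

val aliasToOrigin = Map("myDept" -> Attr("DEPT"))
val newGrouping =
  Seq(Attr("myDept"), Add(Attr("a"), Attr("b"))).map(dealiasIfNamed(_, aliasToOrigin))
```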
@@ -92,23 +91,27 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper {
     // update the scan builder with agg pushdown and return a new plan with agg pushed
     case aggNode @ Aggregate(groupingExpressions, resultExpressions, child) =>
       child match {
-        case ScanOperation(project, filters, sHolder: ScanBuilderHolder)
-            if filters.isEmpty && project.forall(_.isInstanceOf[AttributeReference]) =>
+        case ScanOperation(project, filters, sHolder: ScanBuilderHolder) if filters.isEmpty &&
We need to clearly describe the final plan. This is more complicated now, as the project may contain arbitrary expressions. For example:

Aggregate(sum(a + b) + max(a - c) + x, group by x,
  Project(x, x + 1 as a, x * 2 as b, x + y as c,
    Table(x, y, z)))

What does the final plan look like if the aggregate can be fully pushed, partially pushed, or not pushed at all?
Aggregate [myDept#0], [((cast(sum(CheckOverflow((promote_precision(cast(mySalary#1 as decimal(23,2))) + promote_precision(cast(yourSalary#2 as decimal(23,2)))), DecimalType(23,2))) as double) + max((cast(mySalary#1 as double) - bonus#6))) + cast(myDept#0 as double)) AS ((sum((mySalary + yourSalary)) + max((mySalary - bonus))) + myDept)#9]
+- Project [dept#3 AS myDept#0, CheckOverflow((promote_precision(cast(salary#5 as decimal(21,2))) + 1.00), DecimalType(21,2)) AS mySalary#1, CheckOverflow((promote_precision(salary#5) * 2.00), DecimalType(22,2)) AS yourSalary#2, bonus#6]
+- ScanBuilderHolder [DEPT#3, NAME#4, SALARY#5, BONUS#6], RelationV2[DEPT#3, NAME#4, SALARY#5, BONUS#6] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession@463a1f47,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions@47224d5d)
cast, CheckOverflow, and promote_precision are not supported in aggregate push-down.
I updated the PR description and added a plan.
       val translatedAggregates = DataSourceStrategy.translateAggregation(
         normalizedAggregates, normalizedGroupingExpressions)
-      val (finalResultExpressions, finalAggregates, finalTranslatedAggregates) = {
+      val (selectedResultExpressions, selectedAggregates, selectedTranslatedAggregates) = {
why do we rename these?
It's not confirmed yet.
            val newGroupingExpressions = groupingExpressions.map {
              case e: NamedExpression => replaceAliasWithAttr(e, aliasMap)
              case other => other
            }
Can we make the code more explicit? We need to clearly show the steps:
1. collapse aggregate and project
2. remove the alias from aggregate functions and group-by expressions (this logic should be put here instead of AliasHelper, as this is not common logic)
3. push down the aggregate
4. add back the alias for group-by expressions only.
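These steps can be sketched over a toy expression model (the names and types below are illustrative, not Catalyst's):

```scala
sealed trait Expr
case class Attr(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr
case class Sum(child: Expr) extends Expr

// Substitute alias names with their origin expressions, recursing into
// aggregate functions and nested aliases.
def substitute(e: Expr, aliasMap: Map[String, Expr]): Expr = e match {
  case Attr(n) if aliasMap.contains(n) => aliasMap(n)
  case Sum(c)      => Sum(substitute(c, aliasMap))
  case Alias(c, n) => Alias(substitute(c, aliasMap), n)
  case other       => other
}

// Step 1: collapse Aggregate over Project by collecting the project's aliases.
val aliasMap: Map[String, Expr] =
  Map("mySalary" -> Attr("SALARY"), "myDept" -> Attr("DEPT"))
// Step 2: remove aliases from aggregate functions and group-by expressions.
val pushedAgg   = substitute(Sum(Attr("mySalary")), aliasMap)
val pushedGroup = substitute(Attr("myDept"), aliasMap)
// Step 3: pushedAgg / pushedGroup are what would be handed to the data source.
// Step 4: add the alias back for the group-by output only, so the plan above
// the scan still sees the name it expects.
val restoredGroup = Alias(pushedGroup, "myDept")
```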
Thank you for the good idea.
#35932 replaces this PR.
What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support a project with aliases.
Refer to spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala (line 96 in c91c2e9).
This PR makes it work with aliases.
The first example:
The original plan is shown below:
If we can completely push down the aggregate, the plan will be:
If we can partially push down the aggregate, the plan will be:
The second example:
The original plan is shown below:
If we can completely push down the aggregate, the plan will be:
If we can partially push down the aggregate, the plan will be:
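An illustrative query shape this PR targets (the table and alias names are made up for illustration, echoing the test schema seen earlier in the thread): the aggregate is defined over aliases introduced by the projection beneath it, which previously blocked aggregate push-down.

```sql
-- `myDept` / `mySalary` exist only in the projection, not in the table.
SELECT myDept, SUM(mySalary) AS total
FROM (
  SELECT DEPT AS myDept, SALARY AS mySalary
  FROM h2.test.employee
) t
GROUP BY myDept
```

With this PR, the inner aliases are resolved back to the origin columns before translation, so the aggregate can still be handed to the JDBC source.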
Why are the changes needed?
Aliases are common in projections, so supporting them makes aggregate push-down useful for more queries.
Does this PR introduce any user-facing change?
Yes.
Users will see that DS V2 aggregate push-down supports a project with aliases.
How was this patch tested?
New tests.