[SPARK-27217][SQL] Nested schema pruning with Aggregation #27056
|
@cloud-fan @HyukjinKwon @maropu kindly review this approach for nested schema pruning. |
|
ok to test |
    case _ => false
  }

  object OverAggregate {
AggregateNestedColumnAliasing?
case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) =>
  a.copy(child = prunedChild(child, a.references))
// case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) =>
// a.copy(child = prunedChild(child, a.references))
?
  checkAnswer(query, Row("Y.", 1) :: Row("X.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
}

testSchemaPruning("Spark-27217: Push nested column when used in Aggregate") {
super nit: plz capitalize Spark in the head.
|
Test build #116032 has finished for PR 27056 at commit
|
|
Test build #116034 has finished for PR 27056 at commit
|
|
@cloud-fan kindly give feedback on current approach |
|
cc @cloud-fan @wangyum |
val (nestedFieldReferences, otherRootReferences) =
  allExpressions.flatMap(collectRootReferenceAndExtractValue).partition {
    case _: ExtractValue => true
    case _ => false
  }

val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
  .filter(!_.references.subsetOf(AttributeSet(otherRootReferences)))
  .groupBy(_.references.head).flatMap {
    case (attr, nestedFields: Seq[ExtractValue]) =>
      val nestedFieldToAlias = nestedFields.distinct.map { f =>
        Alias(f, f.sql)()
      }

      if (nestedFieldToAlias.nonEmpty &&
          nestedFieldToAlias.length < totalFieldNum(attr.dataType)) {
        Some(nestedFieldToAlias)
      } else {
        None
      }
  }
val newProjectList: Seq[NamedExpression] =
This code seems to be copied from NestedColumnAliasing. I think we can reuse methods like getAliasSubMap.
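To make the guard in the snippet above concrete (nestedFieldToAlias.nonEmpty && nestedFieldToAlias.length < totalFieldNum(attr.dataType)), here is a minimal sketch in plain Scala. It assumes totalFieldNum counts the leaf fields under a root attribute, which the diff does not show. ToyType, Atomic, ToyStruct, and AliasDecision are hypothetical stand-ins, not Catalyst classes.

```scala
// Toy stand-ins for Catalyst data types; hypothetical, for illustration only.
sealed trait ToyType
case object Atomic extends ToyType
case class ToyStruct(fields: Map[String, ToyType]) extends ToyType

object AliasDecision {
  // Assumption: count the leaf (non-struct) fields reachable under a type.
  def totalFieldNum(t: ToyType): Int = t match {
    case Atomic            => 1
    case ToyStruct(fields) => fields.values.map(totalFieldNum).sum
  }

  // Aliasing is only worthwhile when the distinct accessed nested fields
  // cover strictly fewer leaves than the root attribute contains.
  def shouldAlias(accessed: Seq[String], rootType: ToyType): Boolean = {
    val distinctFields = accessed.distinct
    distinctFields.nonEmpty && distinctFields.length < totalFieldNum(rootType)
  }
}
```

Under this reading, a query that touches every leaf of a struct gains nothing from aliasing, so the guard returns None for that attribute and the plan is left alone.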
case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) =>
  a.copy(child = prunedChild(child, a.references))
Why remove that? This is for top-level column pruning.
testSchemaPruning("SPARK-27217: Push nested column when used in Aggregate") {
  val query = sql("select sum(employer.id) from contacts")
  checkScan(query, "struct<employer:struct<id:INT>>")
}
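As a rough illustration of what the test above expects, the following plain-Scala sketch prunes a nested schema down to the fields reachable through a set of access paths. It is a toy model, not Spark's implementation. FieldType, Leaf, Struct, and PruneSketch are hypothetical names, and the shape of the contacts schema in the usage note is assumed.

```scala
// Toy model of a nested schema; these types are illustrative, not Spark's.
sealed trait FieldType
case class Leaf(typeName: String) extends FieldType
case class Struct(fields: Seq[(String, FieldType)]) extends FieldType

object PruneSketch {
  // Keep only the fields reachable through the requested access paths.
  // An access path is a sequence of field names, e.g. Seq("employer", "id").
  def prune(schema: Struct, paths: Seq[Seq[String]]): Struct = {
    val kept = schema.fields.flatMap { case (name, tpe) =>
      // Remainders of the paths that start at this field.
      val tails = paths.collect { case head +: tail if head == name => tail }
      (tpe, tails) match {
        case (_, Seq()) => None // never referenced: drop the field
        case (s: Struct, ts) if ts.forall(_.nonEmpty) =>
          Some(name -> prune(s, ts)) // only sub-fields referenced: recurse
        case _ => Some(name -> tpe) // whole field referenced: keep as is
      }
    }
    Struct(kept)
  }

  // Render a schema in the struct<name:type,...> notation used by checkScan.
  def render(t: FieldType): String = t match {
    case Leaf(n) => n
    case Struct(fs) =>
      fs.map { case (n, ft) => s"$n:${render(ft)}" }.mkString("struct<", ",", ">")
  }
}
```

For an assumed contacts schema with a top-level id and an employer struct holding id and company, pruning with the single path Seq("employer", "id") renders as struct<employer:struct<id:INT>>, the scan schema the test checks for.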
I think this is not a bug. We may not need a JIRA ticket prefix.
|
Actually, I'm thinking of adding a nested column pruning rule for these logical operators. I think it should be feasible to have a more general one instead of adding them one by one for each operator. |
|
Retest this please |
dongjoon-hyun
left a comment
@amanomer . Could you fix all UT failures?
|
Test build #116122 has finished for PR 27056 at commit
|
I will make this rule general after resolving issues for Aggregate. |
|
Yea, as @viirya said above, I also prefer a more general one. |
|
Gentle ping, @amanomer . |
|
Hi Spark Team, I am inclined to add this change as a custom logical rule by copying the Aggregate nesting object into my Spark project. We are using Spark 2.3 and have no immediate plans to move to 3.x. Do you think this is a viable approach? Any guidance is much appreciated. |
|
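Probably, you'd be better off asking that on the Spark mailing list. Anyway, we already have SparkSessionExtensions (or SparkSession.experimental.extraOptimizations) for injecting custom rules in 3rd-party projects. So, you can do so by using these interfaces (they are experimental interfaces though). |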
|
Thanks, Takeshi, for the quick reply. I used the extra optimizations to include
the SPARK-4502 changes and it works pretty well. I'm very interested in including
SPARK-27217 as well.
…On Sat, Feb 22, 2020 at 7:36 PM Takeshi Yamamuro ***@***.***> wrote:
Probably, you'd be better to ask that in the spark mailing list. Anyway,
we already have SparkSessionExtensions (or
SparkSession.experimental.extraOptimizations) for injecting custom rules
in 3rd-party projects. So, you can do so by using these interfaces (they
are experimental interfaces though).
|
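The two injection points mentioned in this exchange, SparkSessionExtensions and SparkSession.experimental.extraOptimizations, can be sketched as follows. This is only an outline under stated assumptions: MyNestedPruningRule is a hypothetical placeholder that returns the plan unchanged, InjectRuleExample is an illustrative wrapper, and the local master and app name are there only to make the sketch runnable. Both interfaces are marked experimental in Spark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical placeholder rule: returns the plan unchanged.
// A real rule would pattern-match on Aggregate and rewrite its child.
object MyNestedPruningRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object InjectRuleExample {
  // Option 1: register the rule through SparkSessionExtensions at build time.
  def buildWithExtensions(): SparkSession =
    SparkSession.builder()
      .master("local[1]")
      .appName("inject-rule-sketch")
      .withExtensions(ext => ext.injectOptimizerRule(_ => MyNestedPruningRule))
      .getOrCreate()

  // Option 2: append the rule to the experimental optimizations of an
  // existing session. In practice you would pick one of the two options.
  def addExtraOptimization(spark: SparkSession): Unit =
    spark.experimental.extraOptimizations =
      spark.experimental.extraOptimizations :+ MyNestedPruningRule
}
```

Option 1 ties the rule to session construction, while Option 2 can be applied to a session that already exists, which may be easier to retrofit into an existing job.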
|
This optimization in aggregates would greatly benefit some of our most expensive queries against our nested schema. We've seen up to 8x performance improvement against the same schema outside of aggregates, and to see anywhere close to this for our aggregation queries would be amazing! There are numerous other optimizations in 3.0 that we're very excited for, but SPARK-27217 seems like the only thing left that would hold back some of those optimizations from realizing their full potential in aggregations. Thanks so much for everyone's time and work on this so far. Just patiently wondering, @amanomer, are there any plans to re-open this pull request with the requested changes in the near future? |
What changes were proposed in this pull request?
Added a new rule NestedColumnAliasing.OverAggregate, which helps push down nested columns wrapped inside Aggregate.
Why are the changes needed?
Since Spark supports nested schema pushdown when used with Project (SELECT query), we also need to support the same pushdown ability when a user performs an aggregation (such as sum) on nested columns.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added test cases.