[SPARK-27217][SQL] Nested schema pruning with Aggregation #27056

amanomer · 2019-12-31T06:39:41Z

What changes were proposed in this pull request?

Added a new rule, NestedColumnAliasing.OverAggregate, which helps push down nested columns wrapped inside an Aggregate.

Why are the changes needed?

Since Spark already supports nested schema pushdown when nested columns are used with Project (SELECT query), we also need to support the same pushdown when a user performs an aggregation (such as sum) on nested columns.
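
As a rough way to observe the behaviour described above, the sketch below writes a small nested Parquet dataset and runs the same aggregation used in the added test. The `Employer` and `Contact` case classes, the object name, and the temp path are assumptions made for illustration; only the query and the `spark.sql.optimizer.nestedSchemaPruning.enabled` setting come from Spark itself.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object NestedPruningWithAggregate {
  // Hypothetical data model, loosely mirroring the `contacts` table in SchemaPruningSuite.
  case class Employer(id: Int, name: String)
  case class Contact(id: Int, employer: Employer)

  // Writes two rows to Parquet, registers them as `contacts`, and sums employer.id.
  def sumEmployerIds(spark: SparkSession): Long = {
    import spark.implicits._
    val path = Files.createTempDirectory("contacts").resolve("data").toString
    Seq(Contact(0, Employer(10, "a")), Contact(1, Employer(20, "b")))
      .toDS().write.parquet(path)
    spark.read.parquet(path).createOrReplaceTempView("contacts")

    val query = spark.sql("select sum(employer.id) from contacts")
    // Inspect ReadSchema on the file scan node to see which fields are read.
    query.explain()
    query.head().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-pruning-aggregate")
      .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
      .getOrCreate()
    try println(sumEmployerIds(spark)) finally spark.stop()
  }
}
```

With pruning through Aggregate in place, the scan would be expected to read only `struct<employer:struct<id:int>>`, as the test added in this PR checks; without it, the whole `employer` struct is read.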

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test cases.

amanomer · 2020-01-02T08:41:58Z

@cloud-fan @HyukjinKwon @maropu kindly review this approach for nested schema pruning.

maropu · 2020-01-02T11:14:45Z

also, cc: @viirya @dbtsai

maropu · 2020-01-02T11:14:51Z

ok to test

maropu · 2020-01-02T11:16:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

    case _ => false
  }
+
+  object OverAggregate {


AggregateNestedColumnAliasing?

maropu · 2020-01-02T11:16:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-    case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) =>
-      a.copy(child = prunedChild(child, a.references))
+//    case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) =>
+//      a.copy(child = prunedChild(child, a.references))


maropu · 2020-01-02T11:17:19Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

    checkAnswer(query, Row("Y.", 1) :: Row("X.", 1) :: Row(null, 2) :: Row(null, 2) :: Nil)
  }

+  testSchemaPruning("Spark-27217: Push nested column when used in Aggregate") {


super nit: plz capitalize Spark in the head.

SparkQA · 2020-01-02T11:52:22Z

Test build #116032 has finished for PR 27056 at commit 32dd333.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-02T15:01:21Z

Test build #116034 has finished for PR 27056 at commit bb77e86.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

amanomer · 2020-01-03T10:54:45Z

@cloud-fan kindly give feedback on the current approach

amanomer · 2020-01-03T15:01:03Z

cc @cloud-fan @wangyum
Kindly review this PR

viirya · 2020-01-03T19:29:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

+      val (nestedFieldReferences, otherRootReferences) =
+        allExpressions.flatMap(collectRootReferenceAndExtractValue).partition {
+          case _: ExtractValue => true
+          case _ => false
+        }
+
+      val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
+        .filter(!_.references.subsetOf(AttributeSet(otherRootReferences)))
+        .groupBy(_.references.head).flatMap {
+        case (attr, nestedFields: Seq[ExtractValue]) =>
+          val nestedFieldToAlias = nestedFields.distinct.map { f =>
+            Alias(f, f.sql)()
+          }
+
+          if (nestedFieldToAlias.nonEmpty &&
+            nestedFieldToAlias.length < totalFieldNum(attr.dataType)) {
+            Some(nestedFieldToAlias)
+          } else {
+            None
+          }
+      }
+      val newProjectList: Seq[NamedExpression] =


This code seems to be copied from NestedColumnAliasing. I think we can reuse methods like getAliasSubMap.
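
For readers unfamiliar with the logic in the hunk above, here is a toy model of the alias-selection step in plain Scala. It is not Catalyst code: `FieldRef`, `AliasSketch`, and `aliasCandidates` are hypothetical names, and struct field counts are passed in as a map rather than derived from a data type.

```scala
// Toy model, not Catalyst: a nested access such as `employer.id`
// is represented as a root column name plus a field path.
final case class FieldRef(root: String, path: List[String])

object AliasSketch {
  /**
   * @param nested      nested field accesses found in the operator's expressions
   * @param wholeRoots  root columns that are also referenced as a whole
   * @param totalFields number of leaf fields for each root column
   * @return for each root worth pruning, its distinct nested accesses
   */
  def aliasCandidates(
      nested: Seq[FieldRef],
      wholeRoots: Set[String],
      totalFields: Map[String, Int]): Map[String, Seq[FieldRef]] =
    nested
      // If the whole column is needed anyway, pruning its fields gains nothing.
      .filterNot(ref => wholeRoots.contains(ref.root))
      .groupBy(_.root)
      .map { case (root, refs) => root -> refs.distinct }
      // Keep a root only when a strict subset of its fields is accessed.
      .filter { case (root, refs) => refs.length < totalFields.getOrElse(root, 0) }
}
```

For example, a single access to `employer.id` on a two-field `employer` struct yields one candidate, while a column that is also selected as a whole yields none.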

viirya · 2020-01-03T19:30:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

-    case a @ Aggregate(_, _, child) if !child.outputSet.subsetOf(a.references) =>
-      a.copy(child = prunedChild(child, a.references))


Why remove that? This is for top-level column pruning.

viirya · 2020-01-03T19:31:46Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala

+  testSchemaPruning("SPARK-27217: Push nested column when used in Aggregate") {
+    val query = sql("select sum(employer.id) from contacts")
+    checkScan(query, "struct<employer:struct<id:INT>>")
+  }


I think this is not a bug. We may not need a JIRA ticket prefix.

viirya · 2020-01-03T19:33:59Z

Actually, I'm thinking of adding a nested column pruning rule for these logical operators. I think it should be feasible to have a more general one instead of adding them one by one for each operator.

dongjoon-hyun · 2020-01-05T22:34:39Z

Retest this please

dongjoon-hyun

@amanomer . Could you fix all UT failures?

SparkQA · 2020-01-05T23:43:53Z

Test build #116122 has finished for PR 27056 at commit bb77e86.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

amanomer · 2020-01-06T06:33:45Z

Actually, I'm thinking of adding a nested column pruning rule for these logical operators. I think it should be feasible to have a more general one instead of adding them one by one for each operator.

I will make this rule general after resolving issues for Aggregate.

maropu · 2020-01-08T01:04:38Z

Yea, as @viirya said above, I would also prefer a more general one.

dongjoon-hyun · 2020-01-22T05:12:43Z

Gentle ping, @amanomer .

nandini57 · 2020-02-22T23:45:56Z

Hi Spark Team, I am inclined to add this change as a custom logical rule by copying the Aggregate nesting object into my Spark project. We are using Spark 2.3 and have no immediate plans to move to 3.x. Do you think this is a viable approach? Any guidance is much appreciated.

maropu · 2020-02-23T00:36:04Z

You'd probably be better off asking that on the Spark mailing list. Anyway, we already have SparkSessionExtensions (or SparkSession.experimental.extraOptimizations) for injecting custom rules in 3rd-party projects. So, I think you can do so by using these interfaces (they are experimental interfaces though).
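
A minimal sketch of the two injection routes mentioned above. `MyNestedPruningRule` is a hypothetical placeholder that returns the plan unchanged; a real rule would carry the pruning logic. Both interfaces are marked experimental in Spark, so their behaviour may differ between versions.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical placeholder rule: leaves every plan as it is.
object MyNestedPruningRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object InjectRuleExample {
  // Route 1: register the rule through SparkSessionExtensions at build time.
  def buildWithExtension(): SparkSession =
    SparkSession.builder()
      .master("local[1]")
      .withExtensions { ext: SparkSessionExtensions =>
        ext.injectOptimizerRule(_ => MyNestedPruningRule)
      }
      .getOrCreate()

  // Route 2: append the rule to the experimental list of an existing session.
  def register(spark: SparkSession): Unit =
    spark.experimental.extraOptimizations =
      spark.experimental.extraOptimizations :+ MyNestedPruningRule

  def main(args: Array[String]): Unit = {
    val spark = buildWithExtension()
    try register(spark) finally spark.stop()
  }
}
```

Route 2 needs no change to how the session is created, which may suit a project that cannot alter its session setup; route 1 keeps the registration in one place for every session built that way.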

nandini57 · 2020-02-23T01:21:31Z

Thanks, Takeshi, for the quick reply. I used extraOptimizations to include the SPARK-4502 changes and it works pretty well. I am very interested in including SPARK-27217 as well.

ntlanglois · 2020-05-12T19:08:05Z

This optimization in aggregates would greatly benefit some of our most expensive queries against our nested schema. We've seen up to 8x performance improvement against the same schema outside of aggregates, and to see anywhere close to this for our aggregation queries would be amazing!

There are numerous other optimizations in 3.0 that we're very excited for, but this SPARK-27217 seems like the only thing left that would hold back some of those optimizations from realizing their full potential in aggregations.

Thanks so much for everyone's time and work on this so far. Just patiently wondering, @amanomer, are there any plans to re-open this pull request with the requested changes in the near future?

Initial commit

32dd333

maropu reviewed Jan 2, 2020

Review comments

bb77e86

amanomer mentioned this pull request Jan 3, 2020

[SPARK-30292][SQL]Throw Exception when invalid string is cast to numeric type in ANSI mode #26933

Closed

viirya reviewed Jan 3, 2020

dongjoon-hyun requested changes Jan 5, 2020

dongjoon-hyun added the SQL label Feb 5, 2020

amanomer closed this Feb 10, 2020
