[SPARK-36646][SQL] Push down group by partition column for aggregate #34445
Conversation
Thanks @huaxingao. I think core logic looks good, just have some minor comments. cc @viirya.
(Resolved review comments on AggregatePushDownUtils.scala, OrcUtils.scala, and FileSourceAggregatePushDownSuite.scala.)
Looks good, just a few nits.
(Resolved review comments on AggregatePushDownUtils.scala, OrcUtils.scala, ParquetUtils.scala, and FileSourceAggregatePushDownSuite.scala.)
if (!partitionSchema.names.sameElements(groupByColNames)) {
  groupByColNames.foreach { col =>
    val index = partitionSchema.names.indexOf(col)
    val v = partitionValues.asInstanceOf[GenericInternalRow].values(index)
Just curious: is this always guaranteed to be GenericInternalRow?
Seems to me that the partitionValues comes from PartitionPath, which always contains GenericInternalRow.
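For context, here is a minimal, self-contained sketch of the lookup pattern discussed in this thread. The partition schema, values, and column names below are made up for illustration; only the access pattern mirrors the snippet above.

```scala
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

object PartitionValueLookupSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical partition schema with two partition columns, p1 and p2.
    val partitionSchema = StructType(Seq(
      StructField("p1", StringType),
      StructField("p2", IntegerType)))

    // Partition values as a GenericInternalRow, the representation produced
    // by partition discovery (PartitionPath) for one partition directory.
    val partitionValues = new GenericInternalRow(
      Array[Any](UTF8String.fromString("2021-11-01"), 7))

    // The group by columns may list the partition columns in a different
    // order, so each value is looked up by its index in the partition schema.
    val groupByColNames = Seq("p2", "p1")
    groupByColNames.foreach { col =>
      val index = partitionSchema.names.indexOf(col)
      val v = partitionValues.values(index)
      println(s"$col -> $v")
    }
  }
}
```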
(Resolved review comments on AggregatePushDownUtils.scala and ParquetPartitionReaderFactory.scala.)
Thanks @huaxingao, looks good, just a few more nits after which it's ready to merge, I think.
  return None
}

if (aggregation.groupByColumns.nonEmpty &&
nit: maybe add some comments explaining why we have this check and only support the case where the group by columns are the same as the partition columns. What if the number of group by columns is smaller than the number of partition columns?
    partitionNames.size != aggregation.groupByColumns.length) {
  return None
}
aggregation.groupByColumns.foreach { col =>
nit: maybe also add some comments here - it's not that easy to understand, and comments can help with maintaining this code.
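To make the two nits above concrete, here is a hedged, self-contained sketch of the guard being discussed, written against plain Scala collections rather than the connector Aggregation API; the helper name and signature are illustrative, not the PR's actual code.

```scala
// Push down is only supported when the group by columns are exactly the
// partition columns (an empty group by list is always fine). If only a
// subset of the partition columns were grouped on, rows from different
// partitions would still have to be merged after the scan, which a
// per-file pushed-down aggregate cannot do, so such queries are rejected.
def groupByMatchesPartitionColumns(
    groupByColNames: Seq[String],
    partitionNames: Seq[String]): Boolean = {
  groupByColNames.isEmpty ||
    (groupByColNames.length == partitionNames.length &&
      groupByColNames.forall(partitionNames.contains))
}

// Grouping on all partition columns (in any order) is eligible for push down;
// grouping on a strict subset of them is not.
assert(groupByMatchesPartitionColumns(Seq("p2", "p1"), Seq("p1", "p2")))
assert(!groupByMatchesPartitionColumns(Seq("p1"), Seq("p1", "p2")))
assert(groupByMatchesPartitionColumns(Seq.empty, Seq("p1", "p2")))
```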
 */
def getSchemaWithoutGroupingExpression(
    aggregation: Aggregation,
    aggSchema: StructType): StructType = {
nit: maybe swap the order of aggSchema and aggregation here, as we're modifying the schema here with the info from aggregation.
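As a rough illustration of what a helper like this does (not the PR's actual implementation), the aggregate output schema can be pruned by dropping the group by columns by name, since their values come from the partition directory rather than from the file reader:

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hedged sketch: remove the group by columns from an aggregate output schema,
// so the file-level reader only needs to produce the aggregated values; the
// group by (partition) column values are filled in from the partition values.
def schemaWithoutGroupingColumns(
    aggSchema: StructType,
    groupByColNames: Set[String]): StructType = {
  StructType(aggSchema.fields.filterNot(f => groupByColNames.contains(f.name)))
}

// Example for "SELECT count(*), max(value), p1 FROM t GROUP BY p1":
val aggSchema = StructType(Seq(
  StructField("count(*)", LongType),
  StructField("max(value)", LongType),
  StructField("p1", StringType)))
val pruned = schemaWithoutGroupingColumns(aggSchema, Set("p1"))
assert(pruned.fieldNames.toSeq == Seq("count(*)", "max(value)"))
```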
val expected_plan_fragment =
  "PushedAggregation: [COUNT(*), COUNT(value), MAX(value), MIN(value)]," +
  " PushedFilters: [], PushedGroupBy: [p1, p2, p3, p4]"
// checkKeywordsExistsInExplain(df, expected_plan_fragment)
nit: remove this?
LGTM pending CI, thanks @huaxingao! It'd be great if you could add a bit more detail in the PR description too.
withTempView("tmp") { | ||
spark.read.format(format).load(dir.getCanonicalPath).createOrReplaceTempView("tmp"); | ||
Seq("false", "true").foreach { enableVectorizedReader => | ||
withSQLConf(aggPushDownEnabledKey -> "true", |
Hmm, can you test both aggPushDownEnabledKey as true and false and see if the results are the same?
withTempView("tmp") { | ||
spark.read.format(format).load(dir.getCanonicalPath).createOrReplaceTempView("tmp"); | ||
Seq("false", "true").foreach { enableVectorizedReader => | ||
withSQLConf(aggPushDownEnabledKey -> "true", |
Here too. We should make sure aggPushDownEnabledKey won't change results.
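A hedged sketch of the comparison being requested, written in the style of the suite's tests; spark, withSQLConf, checkAnswer, aggPushDownEnabledKey, vectorizedReaderEnabledKey, and the tmp view are assumed to be provided by the surrounding suite, and the query itself is illustrative:

```scala
import org.apache.spark.sql.Row

// Collect the expected answer with aggregate push down explicitly disabled ...
val query = "SELECT count(*), max(value), min(value), p1 FROM tmp GROUP BY p1"
var expected: Seq[Row] = Seq.empty
withSQLConf(aggPushDownEnabledKey -> "false") {
  expected = spark.sql(query).collect().toSeq
}

// ... then re-run with push down enabled, for both vectorized reader settings,
// and verify the flag does not change the results.
Seq("false", "true").foreach { enableVectorizedReader =>
  withSQLConf(
      aggPushDownEnabledKey -> "true",
      vectorizedReaderEnabledKey -> enableVectorizedReader) {
    checkAnswer(spark.sql(query), expected)
  }
}
```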
  val filePath = new Path(new URI(file.filePath))

  if (aggregation.nonEmpty) {
-   return buildReaderWithAggregates(filePath, conf)
+   return buildReaderWithAggregates(file, conf)
  }
Seems filePath can be created after the if block:
if (aggregation.nonEmpty) {
  return buildReaderWithAggregates(file, conf)
}
val filePath = new Path(new URI(file.filePath))
  if (aggregation.nonEmpty) {
-   return buildColumnarReaderWithAggregates(filePath, conf)
+   return buildColumnarReaderWithAggregates(file, conf)
  }
ditto.
@@ -250,8 +261,7 @@ object ParquetUtils {
       schemaName = "count(" + count.column.fieldNames.head + ")"
       rowCount += block.getRowCount
       var isPartitionCol = false
-      if (partitionSchema.fields.map(PartitioningUtils.getColName(_, isCaseSensitive))
-        .toSet.contains(count.column.fieldNames.head)) {
+      if (partitionSchema.fields.map(_.name).toSet.contains(count.column.fieldNames.head)) {
Don't we need to check case sensitivity now?
Seems to me there's no need to check case sensitivity, because I have normalized the aggregates and group by columns in V2ScanRelationPushDown.
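For context, a simplified, self-contained stand-in for that normalization step (the real logic lives in V2ScanRelationPushDown; the helper below is only illustrative):

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Resolve a user-written column name against the schema case-insensitively and
// return the schema's exact spelling, so later comparisons can be exact.
def normalizeColumnName(name: String, schema: StructType): Option[String] =
  schema.fields.collectFirst {
    case f if f.name.equalsIgnoreCase(name) => f.name
  }

// "P1" written by the user normalizes to the schema's "p1", which is why the
// later partition-column lookups can rely on plain name equality.
val schema = StructType(Seq(StructField("p1", IntegerType)))
assert(normalizeColumnName("P1", schema).contains("p1"))
```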
spark.read.format(format).load(dir.getCanonicalPath).createOrReplaceTempView("tmp")
val query = "SELECT count(*), count(value), max(value), min(value)," +
  " p4, p2, p3, p1 FROM tmp GROUP BY p1, p2, p3, p4"
val expected = sql(query).collect
Hmm, if we enable aggregate push down by default one day, this test might quietly lose its original purpose. Should we explicitly set aggPushDownEnabledKey to false when collecting the expected results?
Looks good. One remaining question about the test.
LGTM
Last commit is test-only and GA passed. Merging to master. Thanks.
Thank you all |
What changes were proposed in this pull request?
Lift the restriction on aggregate push down for Parquet and ORC when the group by columns are the same as the partition columns.
Why are the changes needed?
Previously, if there were group by columns, we did not push down the aggregate to the data source.
After this change, if the group by columns are the same as the partition columns, we push down the aggregates.
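For illustration, a hedged end-to-end example of the kind of query this enables, assuming the Parquet DSv2 read path; the paths and data are made up, and the config keys (spark.sql.parquet.aggregatePushdown, spark.sql.sources.useV1SourceList) should be double-checked against the Spark version in use:

```scala
// Write a small Parquet dataset partitioned by p1 and p2 (illustrative data).
spark.range(0, 100)
  .selectExpr("id AS value", "id % 2 AS p1", "id % 5 AS p2")
  .write
  .partitionBy("p1", "p2")
  .parquet("/tmp/agg_pushdown_demo")

// Aggregate push down requires the DSv2 Parquet reader and the push down flag.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

spark.read.parquet("/tmp/agg_pushdown_demo").createOrReplaceTempView("t")

// The group by columns are exactly the partition columns, so COUNT/MIN/MAX are
// answered from footer statistics and partition values; the physical plan shows
// PushedAggregation and PushedGroupBy entries.
val df = spark.sql(
  "SELECT count(*), max(value), min(value), p1, p2 FROM t GROUP BY p1, p2")
df.explain()
df.show()
```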
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests.