Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-36646][SQL] Push down group by partition column for aggregate #34445

Closed
wants to merge 7 commits into from

Conversation

huaxingao
Copy link
Contributor

@huaxingao huaxingao commented Oct 30, 2021

What changes were proposed in this pull request?

lift the restriction for aggregate push down for parquet and orc if group by columns are the same as the partition cols

Why are the changes needed?

previously, if there are group by columns, we don't push down aggregate to data source.
After the change, if the group by columns are the same as the partition columns, we will push down aggregates.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new tests

@github-actions github-actions bot added the SQL label Oct 30, 2021
@SparkQA
Copy link

SparkQA commented Oct 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49242/

@SparkQA
Copy link

SparkQA commented Oct 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49242/

@SparkQA
Copy link

SparkQA commented Oct 30, 2021

Test build #144773 has finished for PR 34445 at commit 9ea1c2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @huaxingao. I think core logic looks good, just have some minor comments. cc @viirya.

@huaxingao
Copy link
Contributor Author

@c21 Thanks a lot for review! I have addressed the comments. Please take one more look when you have time. Thanks!
also cc @sunchao @viirya

@SparkQA
Copy link

SparkQA commented Nov 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49325/

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good just a few nits

@SparkQA
Copy link

SparkQA commented Nov 2, 2021

Test build #144855 has finished for PR 34445 at commit 03c2bd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49325/

@SparkQA
Copy link

SparkQA commented Nov 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49428/

@SparkQA
Copy link

SparkQA commented Nov 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49428/

@SparkQA
Copy link

SparkQA commented Nov 7, 2021

Test build #144957 has finished for PR 34445 at commit 0e655a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (!partitionSchema.names.sameElements(groupByColNames)) {
groupByColNames.foreach { col =>
val index = partitionSchema.names.indexOf(col)
val v = partitionValues.asInstanceOf[GenericInternalRow].values(index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious: is this always guaranteed to be GenericInternalRow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me that the partitionValues comes from PartitionPath, which always contains GenericInternalRow.

@SparkQA
Copy link

SparkQA commented Nov 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49673/

@SparkQA
Copy link

SparkQA commented Nov 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49673/

@SparkQA
Copy link

SparkQA commented Nov 14, 2021

Test build #145204 has finished for PR 34445 at commit 4fb313b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @huaxingao , looks good, just a few more nits after which it's ready to merge I think.

return None
}

if (aggregation.groupByColumns.nonEmpty &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add some comments explaining the reasoning why we have this check and only support the case when group by columns is the same as partition columns. What if the number of group by columns is smaller than that of partition columns?

partitionNames.size != aggregation.groupByColumns.length) {
return None
}
aggregation.groupByColumns.foreach { col =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe also add some comments here - it's not that easy to understand and can help the maintenance of this code.

*/
def getSchemaWithoutGroupingExpression(
aggregation: Aggregation,
aggSchema: StructType): StructType = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe swap the order of aggSchema and aggregation here, as we're modifying the schema here with the info from aggregation.

val expected_plan_fragment =
"PushedAggregation: [COUNT(*), COUNT(value), MAX(value), MIN(value)]," +
" PushedFilters: [], PushedGroupBy: [p1, p2, p3, p4]"
// checkKeywordsExistsInExplain(df, expected_plan_fragment)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this?

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending CI, thanks @huaxingao ! it'd be great if you can add a bit more details in the PR description too.

@SparkQA
Copy link

SparkQA commented Nov 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49764/

@SparkQA
Copy link

SparkQA commented Nov 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49764/

withTempView("tmp") {
spark.read.format(format).load(dir.getCanonicalPath).createOrReplaceTempView("tmp");
Seq("false", "true").foreach { enableVectorizedReader =>
withSQLConf(aggPushDownEnabledKey -> "true",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, can you test both aggPushDownEnabledKey as true and false and see if the results are the same?

withTempView("tmp") {
spark.read.format(format).load(dir.getCanonicalPath).createOrReplaceTempView("tmp");
Seq("false", "true").foreach { enableVectorizedReader =>
withSQLConf(aggPushDownEnabledKey -> "true",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too. We should make sure aggPushDownEnabledKey won't change results.

Comment on lines 86 to 90
val filePath = new Path(new URI(file.filePath))

if (aggregation.nonEmpty) {
return buildReaderWithAggregates(filePath, conf)
return buildReaderWithAggregates(file, conf)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems filePath can be created after the if block:

if (aggregation.nonEmpty) {
  return buildReaderWithAggregates(file, conf)
}

val filePath = new Path(new URI(file.filePath))

Comment on lines 132 to 134
if (aggregation.nonEmpty) {
return buildColumnarReaderWithAggregates(filePath, conf)
return buildColumnarReaderWithAggregates(file, conf)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@@ -250,8 +261,7 @@ object ParquetUtils {
schemaName = "count(" + count.column.fieldNames.head + ")"
rowCount += block.getRowCount
var isPartitionCol = false
if (partitionSchema.fields.map(PartitioningUtils.getColName(_, isCaseSensitive))
.toSet.contains(count.column.fieldNames.head)) {
if (partitionSchema.fields.map(_.name).toSet.contains(count.column.fieldNames.head)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need check case sensitivity now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to me no need to check case sensitivity because I have normalized aggregates and group by columns in V2ScanRelationPushDown.

@SparkQA
Copy link

SparkQA commented Nov 17, 2021

Test build #145294 has finished for PR 34445 at commit 4eeae6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

spark.read.format(format).load(dir.getCanonicalPath).createOrReplaceTempView("tmp")
val query = "SELECT count(*), count(value), max(value), min(value)," +
" p4, p2, p3, p1 FROM tmp GROUP BY p1, p2, p3, p4"
val expected = sql(query).collect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if we enable aggregate push down later one day, this test might be ignorantly missed to its original goal. Should we explicitly set aggPushDownEnabledKey as false and collect the expected results?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One remaining question about test.

@SparkQA
Copy link

SparkQA commented Nov 17, 2021

Test build #145344 has finished for PR 34445 at commit b561d09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Nov 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49824/

@SparkQA
Copy link

SparkQA commented Nov 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49824/

@viirya
Copy link
Member

viirya commented Nov 18, 2021

Last commit is test-only and GA was passed. Merging to master. Thanks.

@viirya viirya closed this in b9a0c56 Nov 18, 2021
@huaxingao
Copy link
Contributor Author

Thank you all

@huaxingao huaxingao deleted the group_by branch November 18, 2021 02:09
@SparkQA
Copy link

SparkQA commented Nov 18, 2021

Test build #145353 has finished for PR 34445 at commit 1f45a04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants