Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32788][SQL] non-partitioned table scan should not have partition filter #29637

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes a bug FileSourceStrategy, which generates partition filters even if the table is not partitioned. This can confuse FileSourceScanExec, which mistakenly think the table is partitioned and tries to update the numPartitions metrics, and cause a failure. We should not generate partition filters for non-partitioned table.

Why are the changes needed?

The bug was exposed by #29436.

Does this PR introduce any user-facing change?

Yes, fix a bug.

How was this patch tested?

new test

@cloud-fan
Copy link
Contributor Author

@wangyum @viirya @maropu

@@ -439,6 +435,9 @@ case class FileSourceScanExec(
driverMetrics("staticFilesNum") = filesNum
driverMetrics("staticFilesSize") = filesSize
}
if (relation.partitionSchemaOption.isDefined) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not needed, but I keep it here for: 1. avoid duplicated code. 2. make it super safe that we only update the numPartitions if table is partitioned.

test("SPARK-32788: non-partitioned table scan should not have partition filter") {
withTable("t") {
spark.range(1).write.saveAsTable("t")
checkAnswer(sql("SELECT id FROM t WHERE (SELECT true)"), Row(0L))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit surprised that this query failed... nice catch.

ExpressionSet(normalizedFilters
.filter(_.references.subsetOf(partitionSet)))
}

logInfo(s"Pruning directories with: ${partitionKeyFilters.mkString(",")}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is this logging still meaningful for the non-partition case? I mean, could we move it inside the else block in the line 159-162?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

@maropu
Copy link
Member

maropu commented Sep 3, 2020

LGTM

@SparkQA
Copy link

SparkQA commented Sep 3, 2020

Test build #128253 has finished for PR 29637 at commit 40c3420.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum closed this in 76330e0 Sep 3, 2020
wangyum pushed a commit that referenced this pull request Sep 3, 2020
…on filter

### What changes were proposed in this pull request?

This PR fixes a bug `FileSourceStrategy`, which generates partition filters even if the table is not partitioned. This can confuse `FileSourceScanExec`, which mistakenly think the table is partitioned and tries to update the `numPartitions` metrics, and cause a failure. We should not generate partition filters for non-partitioned table.

### Why are the changes needed?

The bug was exposed by #29436.

### Does this PR introduce _any_ user-facing change?

Yes, fix a bug.

### How was this patch tested?

new test

Closes #29637 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 76330e0)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
@wangyum
Copy link
Member

wangyum commented Sep 3, 2020

Merged to master and branch-3.0.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm too

@SparkQA
Copy link

SparkQA commented Sep 3, 2020

Test build #128259 has finished for PR 29637 at commit 3e4a6a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…on filter

### What changes were proposed in this pull request?

This PR fixes a bug `FileSourceStrategy`, which generates partition filters even if the table is not partitioned. This can confuse `FileSourceScanExec`, which mistakenly think the table is partitioned and tries to update the `numPartitions` metrics, and cause a failure. We should not generate partition filters for non-partitioned table.

### Why are the changes needed?

The bug was exposed by apache#29436.

### Does this PR introduce _any_ user-facing change?

Yes, fix a bug.

### How was this patch tested?

new test

Closes apache#29637 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 76330e0)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants