[SPARK-26893][SQL] Allow partition pruning with subquery filters on file source #23802
Conversation
cc @mgaido91 @cloud-fan as you worked on #22518
  a.withName(logicalRelation.output.find(_.semanticEquals(a)).get.name)
  }
}
val normalizedFilters =
Reviewers, please note that this change is not actually related to this PR. But I touched DataSourceStrategy.normalizeFilters(), so I thought this is a good occasion to refactor it.
@peter-toth I expressed my major concern on the JIRA actually. I don't see it addressed here, and IIRC before that change subqueries were pushed down but not actually used for filtering out partitions (I don't see this change making use of them). I have not yet carefully reviewed this PR (I will as soon as I have time), so I may be missing something; I apologize in advance if I do.
I think before your PR they were pushed down as
Do you have a prototype of how to leverage the subquery filters in the file source scan?
I might be wrong, but this PR does that.
I extended the UT a bit. Now you can see that there is no additional filter above
I don't see the subquery filters actually pushed down to the file source. Looks like you just reverted #22518.
if partitionFilters.exists(ExecSubqueryExpression.hasSubquery) &&
    fs.inputRDDs().forall(
      _.asInstanceOf[FileScanRDD].filePartitions.forall(
        _.files.forall(_.filePath.contains("p=0")))) => true
This asserts that partition p=0 is read, but doesn't assert that other partitions are not read, doesn't it?
oh, nvm. i read it wrong.
@peter-toth I checked and the subquery is executed twice in your UT. Please set a breakpoint in updateResult of the scalar subquery. I am honestly not sure why, since in the execution plan I indeed see only one subquery exec: may you please investigate this? Moreover, it would be great if we could enforce this in a UT: the only way I can think of is to create a dummy plan in the tests which extends ScalarSubquery and keeps a global counter of the number of times updateResult is invoked.
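A minimal sketch of such a counting helper, using plain Scala stand-ins (CountingSubquery and SubqueryExecutionCounter are hypothetical names modeling the idea, not Spark's real ScalarSubquery):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a ScalarSubquery that counts how many
// times updateResult is invoked, so a test can assert it runs exactly once.
object SubqueryExecutionCounter {
  val invocations = new AtomicInteger(0)
}

class CountingSubquery {
  def updateResult(): Unit = {
    SubqueryExecutionCounter.invocations.incrementAndGet()
  }
}

object CountingSubqueryDemo {
  def main(args: Array[String]): Unit = {
    val sq = new CountingSubquery
    sq.updateResult() // simulate the single expected execution
    // A UT would assert the counter equals 1 after running the query.
    println(SubqueryExecutionCounter.invocations.get)
  }
}
```

In a real test the counting subclass would wrap Spark's ScalarSubquery and the assertion would run after the query has been executed.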
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
@@ -170,7 +173,7 @@ object FileSourceStrategy extends Strategy with Logging {
    l.resolve(fsRelation.dataSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)

    // Partition keys are not available in the statistics of the files.
-   val dataFilters = normalizedFilters.filter(_.references.intersect(partitionSet).isEmpty)
+   val dataFilters = normalizedFiltersWithoutSubqueries.filter(_.references.intersect(partitionSet).isEmpty)
Why do data filters use normalizedFiltersWithoutSubqueries instead of normalizedFilters? Do subquery filters only apply to partition keys?
Looks like this assumes that data filters are only used when the data source pushes down filters. But FileIndex.listFiles also takes data filters as a parameter, and listFiles doesn't clearly define how it will use them...
Can you add a few lines of comment here and mention why we use normalizedFiltersWithoutSubqueries?
Added.
@@ -147,7 +147,7 @@ object FileSourceStrategy extends Strategy with Logging {
    // - filters that need to be evaluated again after the scan
    val filterSet = ExpressionSet(filters)

-   val normalizedFilters = DataSourceStrategy.normalizeFilters(filters, l.output)
+   val normalizedFilters = DataSourceStrategy.normalizeFilters(filters, l.output, true)
By doing this, does it mean the issue at #22518 is back again?
Please see my comment above.
Hmm, thanks. I will look into it.
@@ -433,8 +433,14 @@ object DataSourceStrategy {
   */
  protected[sql] def normalizeFilters(
      filters: Seq[Expression],
      attributes: Seq[AttributeReference]): Seq[Expression] = {
    filters.filterNot(SubqueryExpression.hasSubquery).map { e =>
Filtering out subqueries is outside the purpose of normalizeFilters; it was not good to do this implicitly. I think we can either update the method's documentation, or move the filtering of subqueries into a small helper method filterSubqueryFilters. Then we can call normalizeFilters(filters, attributes) and normalizeFilters(filterSubqueryFilters(filters), attributes) individually.
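To illustrate the suggested split, here is a toy sketch with simplified stand-in expression types (Expr, Attr, Subquery, and the hasSubquery traversal are hypothetical models, not Spark's real Expression hierarchy):

```scala
// Hypothetical, simplified expression model.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Subquery(sql: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

object FilterSplitDemo {
  // Recursively check whether an expression contains a subquery.
  def hasSubquery(e: Expr): Boolean = e match {
    case _: Subquery   => true
    case EqualTo(l, r) => hasSubquery(l) || hasSubquery(r)
    case _             => false
  }

  // The proposed small helper: drop filters containing subqueries,
  // leaving normalizeFilters itself free of this concern.
  def filterSubqueryFilters(filters: Seq[Expr]): Seq[Expr] =
    filters.filterNot(hasSubquery)

  def main(args: Array[String]): Unit = {
    val filters = Seq(
      EqualTo(Attr("p"), Subquery("SELECT max(id) FROM t2")),
      EqualTo(Attr("c"), Attr("d")))
    // Only the filter without a subquery remains.
    println(filterSubqueryFilters(filters).size)
  }
}
```

The callers can then choose explicitly whether subquery filters should be stripped before normalization.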
Done.
@peter-toth Thanks for replying to the comments. I now understand better what this fix does. #22518 removed subquery filters in Can you update the PR description? I think the current description is unclear.
Based on https://github.com/apache/spark/pull/22518/files#diff-7393042ad384133558061bee01daa3f7L166, subquery filters were never included in partition filtering. Sorry for the description, I will make it more detailed.
Well, the goal of that UT was to ensure the presence of a single scalar subquery instance in the plan, because a single subquery cannot be executed more than once. Here actually, although I see only one subquery in the plan, 2 different instances are executed. This needs further investigation in order to understand the root cause: once we know that, we can decide how to move forward, also w.r.t. the UT.
Yes, that's my understanding too.
@mgaido91 I debugged it and the double execution seems expected to me. I use
I changed the title and description to make it clear what this PR wants to do. I removed the term
ah, you're right, I checked with 2 collects and the subquery is executed only once; porting to RDD creates a new logical plan and hence a new subquery plan. I was afraid that the transformations happening when normalizing filters could cause a re-execution, but I haven't found any case where this happens.
The change makes sense to me. @cloud-fan do you think we can trigger the tests?
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
ok to test
@@ -185,6 +185,13 @@ case class FileSourceScanExec(
    ret
  }

  @transient private lazy val selectedPartitionsCount =
    if (partitionFilters.exists(ExecSubqueryExpression.hasSubquery)) {
      None
why do we do this?
This is invoked from FileSourceScanExec.metadata and is part of FileSourceScanExec's toString value too, so it would throw a

java.lang.IllegalArgumentException: requirement failed: Subquery subquery250 has not finished
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.execution.ScalarSubquery.eval(subquery.scala:95)

exception when printing the executed plan. And I didn't want to get rid of the PartitionCount metadata entirely.
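A toy model of this failure mode (LazySubquery is a hypothetical stand-in mirroring the require check in ScalarSubquery.eval, not Spark's real class):

```scala
import scala.util.Try

// Hypothetical stand-in: a subquery whose result only exists after execution.
class LazySubquery {
  private var result: Option[Int] = None

  // Called at execution time, when the subquery has actually run.
  def updateResult(v: Int): Unit = { result = Some(v) }

  // Mirrors ScalarSubquery.eval's guard: fails if the result is not ready.
  def eval(): Int = {
    require(result.isDefined, "requirement failed: Subquery has not finished")
    result.get
  }
}

object LazySubqueryDemo {
  def main(args: Array[String]): Unit = {
    val sq = new LazySubquery
    // Evaluating before execution fails, which is why metadata/toString
    // must not touch subquery-dependent partition counts eagerly.
    println(Try(sq.eval()).isFailure)
    sq.updateResult(0) // the result arrives at execution time
    println(sq.eval())
  }
}
```

This is why the PR makes the partition count None whenever the partition filters contain subqueries: anything printed before execution must not force subquery evaluation.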
Is this problem present before https://issues.apache.org/jira/browse/SPARK-25482, i.e. in Spark 2.4 and prior?
No. This is required due to this PR. Before it, partitionFilters always filtered out subqueries. SPARK-25482 just filtered out subqueries from dataFilters (pushed-down filters) because they were not used but caused double execution.
Test build #102468 has finished for PR 23802 at commit
If a subquery filter can be a partition filter, why can't it be a data filter?
Force-pushed from 4a65fad to 5f18bf0
The point is: when it is a partition filter it is present only once in the plan, while if it is a data filter it is present both in the FileSourceScanExec and in the Filter (which cannot be removed in this case, because we cannot enforce that the data source performs the filtering according to the dataFilters properly).
shall we turn subquery filters into normal filters before calling
+1, I like the idea (it should be feasible, I think). Maybe in a followup though?
…an't touch selectedPartitions before execution time if it contains subquery expression Change-Id: I0188d007c858e0dc9f885731f2f7bfda2b7834ad
@@ -221,7 +221,8 @@ case class FileSourceScanExec(
    val sortColumns =
      spec.sortColumnNames.map(x => toAttribute(x)).takeWhile(x => x.isDefined).map(_.get)

-   val sortOrder = if (sortColumns.nonEmpty) {
+   val sortOrder = if (sortColumns.nonEmpty &&
+       !partitionFilters.exists(ExecSubqueryExpression.hasSubquery)) {
why is it needed?
Because outputPartitioning and outputOrdering are used before execution (in EnsureRequirements as far as I see) and partitionFilters can't be evaluated before execution if they contain subquery expressions; otherwise we get a Subquery ... has not finished exception.
partitionFilters.exists(ExecSubqueryExpression.hasSubquery) is used more than once, maybe we can create:

private def partitionsOnlyAvailableAtRuntime: Boolean = {
  partitionFilters.exists(ExecSubqueryExpression.hasSubquery)
}
Ok, done.
Sorry, I don't get this. Can you please elaborate on it? Where should they be turned into normal filters?
Never mind. I get it now.
I might be wrong but
@peter-toth I think what @cloud-fan is suggesting is to replace the subquery with its literal value before listing the partitions from the Hive metastore (please see
Test build #102544 has finished for PR 23802 at commit
@mgaido91 thanks, I get it now and agree it should be an independent change.
Change-Id: I0a2fc1752644e1983d8bc8e1cdc9518cf0654d93
Test build #102559 has finished for PR 23802 at commit
This change seems reasonable to me; I have just one minor comment, otherwise it looks fine.
    !partitionOnlyAvailableAtRuntime) {
  metadata + ("PartitionCount" -> selectedPartitions.size.toString)
} else {
  metadata
I am wondering about adding "PartitionCount" -> "unknown" here, or something else which shows that we are reading partitioned data but don't know how many partitions. On the other hand, maybe people trust this field to be numeric, so I am honestly not sure about the best thing to do. @cloud-fan @viirya thoughts?
I think the current way is good. We don't need to include unknown metadata.
@cloud-fan @viirya do you have any comments on this PR?
retest this please
LGTM pending jenkins
@@ -185,6 +185,10 @@ case class FileSourceScanExec(
    ret
  }

  private def partitionOnlyAvailableAtRuntime: Boolean = {
    partitionFilters.exists(ExecSubqueryExpression.hasSubquery)
  }
I don't get why this is named partitionOnlyAvailableAtRuntime.
Actually it should read partitionsOnlyAvailableAtRuntime per @cloud-fan's suggestion; I will rename it soon.
It took me some rethinking to link the meaning of partitionOnlyAvailableAtRuntime to its implementation and to understand its usage below. I'd prefer to rename this to something like hasPartitionsAvailableAtRunTime and add some comments to explain its usage.
Ok. Renamed it and added some comments.
thanks. LGTM.
This LGTM, but with a comment regarding code readability.
Change-Id: I9b80799dfbc2b63328380ae690c9a43d1cd027ec
Change-Id: Ia3c2a52c659537fe65fd92545ace79e1e110e815
Change-Id: I4f4fcea6faebe04261c8b534b1f768f5067893a8
Test build #102864 has finished for PR 23802 at commit
Test build #102867 has finished for PR 23802 at commit
Test build #102868 has finished for PR 23802 at commit
thanks, merging to master!
Thanks @cloud-fan @mgaido91 and @viirya for the review!
What changes were proposed in this pull request?
This PR introduces leveraging of subquery filters for partition pruning in the file source.
Currently, subquery expressions are not allowed to be used for partition pruning in FileSourceStrategy; instead a FilterExec is added around the FileSourceScanExec to do the job. This PR optimizes the process by allowing subquery expressions as partition filters.
How was this patch tested?
Added a new UT and ran existing UTs, especially the SPARK-25482 and SPARK-24085 related ones.