-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-26736][SQL] if filter condition And
has non-determined sub function it does not do partition prunning
#24118
Conversation
…t do partition prunning
Could you describe more (e.g., an example query to reproduce the issue you described) in the PR description? |
@maropu The description was updated. Please review it, thanks~ |
@@ -63,7 +63,7 @@ object PhysicalOperation extends PredicateHelper { | |||
val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]] | |||
(Some(substitutedFields), filters, other, collectAliases(substitutedFields)) | |||
|
|||
case Filter(condition, child) if condition.deterministic => | |||
case Filter(condition, child) if condition.deterministic || condition.isInstanceOf[And] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this pattern should not return non-determinisitc exprs, but this current change does so, right?
Can we modify code to extract deterministic exprs in line 67-69?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. I will do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been updated. Please review it, thanks~~
ok to test |
Test build #103613 has finished for PR 24118 at commit
|
Could you make the PR title/description more precise before reviews? I think this pr does not target the rand() function only.... |
And
has non-determined function it does not do partition prunning
OK, The title has been updated~ |
And
has non-determined function it does not do partition prunningAnd
having non-determined sub function it does not do partition prunning
And
having non-determined sub function it does not do partition prunningAnd
has non-determined sub function it does not do partition prunning
val (fields, filters, other, aliases) = collectProjectsAndFilters(child) | ||
val substitutedCondition = substitute(aliases)(condition) | ||
(fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) | ||
case filter: Filter if filter.condition.deterministic || filter.condition.isInstanceOf[And] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PhysicalOperation
is used in many places, so have you checked this change has no side-effect for the other behivours?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have checked this change and I think that it has no side-effect for the other behaviors.
It seems you forget to update the title in the jira? |
It has been updated. |
Some(condition) | ||
} else { | ||
val andCondition = condition.asInstanceOf[And] | ||
if (andCondition.left.deterministic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we intend to handle only at the top level ? What happens when the non-deterministic predicate is nested like :
where ((c1 = 10 and rand() < 1) and (c2 = 20))
in this case, in my understanding of the code, we will ignore (c1 = 10) for pruning purposes ? cc @maropu
Edited: Another example :
where c1 = 10 and c2 = 20 and c3 = 30 and rand() < 1 and c4 = 40
in this case, we would only consider c4 = 40 for pruning, no ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, we need more general solution for that..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal @maropu It has been updated. Please review it, thanks~~
Test build #103830 has finished for PR 24118 at commit
|
(fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) | ||
} else { | ||
(None, Nil, filter, Map.empty) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests somewhere to check if this addition could extract deterministic conditions you expect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the test case named Partition pruning - with filter containing non-determined condition in sub And-expr
in PruningSuite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, but it seems they are end-to-end tests, so I think we need more fine-grained tests for collectProjectsAndFilters
.
val substitutedCondition = substitute(aliases)(condition) | ||
(fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) | ||
case filter @ Filter(condition, child) | ||
if condition.deterministic || condition.isInstanceOf[And] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the isInstanceOf
check required any more ?
@@ -91,6 +97,27 @@ object PhysicalOperation extends PredicateHelper { | |||
.map(Alias(_, a.name)(a.exprId, a.qualifier)).getOrElse(a) | |||
} | |||
} | |||
|
|||
private def getDeterminedExpression(expr: Expression): Option[Expression] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getDeterministicExpression or extractDeterministicExpression ? What do you think @maropu ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I like the name including Deterministic
@@ -91,6 +97,27 @@ object PhysicalOperation extends PredicateHelper { | |||
.map(Alias(_, a.name)(a.exprId, a.qualifier)).getOrElse(a) | |||
} | |||
} | |||
|
|||
private def getDeterminedExpression(expr: Expression): Option[Expression] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also some comments about this function at the top with small snippets of input and output.
We should also update comments here to reflect the changes ? |
I have updated some codes and add test cases. Please review it, thanks~ |
Test build #103976 has finished for PR 24118 at commit
|
* col = 1 and rand() < 1 | ||
* (col1 = 1 and rand() < 1) and col2 = 1 | ||
* col1 = 1 or rand() < 1 | ||
* (col1 = 1 and rand() < 1) or (col2 = 1 and rand() < 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we don't need to handle this case (col1 = 1 and rand() < 1) or (col2 = 1 and rand() < 1)
in this pr because DNF forms should be handled in another normalization logic (e.g., SPARK-6624). So, I think its ok to handle CFN forms only here. In fact, I think we should keep the same semantics with PushDownPredicate. cc: @gatorsmile @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to be consistent with PushDownPredicate
Test build #104131 has finished for PR 24118 at commit
|
retest this please |
Test build #104154 has finished for PR 24118 at commit
|
@maropu Is there any progress about this PR? |
Can one of the admins verify this patch? |
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
Is there any problems about the pr? We also encounter this problem, and we find that hive can hanle this well. |
this should have been fixed by a58d91b |
It seems this pr intended to target hive tables? I tried the example query in the current master;
|
Seems we need to update |
Ah, right. I'll make a pr for that later. |
…ions in Hive tables ### What changes were proposed in this pull request? This PR intends to improve partition pruning for nondeterministic expressions in Hive tables: Before this PR: ``` scala> sql("""create table test(id int) partitioned by (dt string)""") scala> sql("""select * from test where dt='20190101' and rand() < 0.5""").explain() == Physical Plan == *(1) Filter ((isnotnull(dt#19) AND (dt#19 = 20190101)) AND (rand(6515336563966543616) < 0.5)) +- Scan hive default.test [id#18, dt#19], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#18], [dt#19], Statistics(sizeInBytes=8.0 EiB) ``` After this PR: ``` == Physical Plan == *(1) Filter (rand(-9163956883277176328) < 0.5) +- Scan hive default.test [id#0, dt#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#0], [dt#1], Statistics(sizeInBytes=8.0 EiB), [isnotnull(dt#1), (dt#1 = 20190101)] ``` This PR is the rework of #24118. ### Why are the changes needed? For better performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit tests added. Closes #27219 from maropu/SPARK-26736. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
What changes were proposed in this pull request?
If filter condition
And
includes non-determined sub expression partition prunning will not work. This patch will take out the determined sub expression inAnd
to make the partition prunning work.Example:
A partitioned table definition:
create table test(id int) partitioned by (dt string);
The following sql does not do partition prunning:
select * from test where dt='20190101' and rand() < 0.5;
This PR will fix this problem.
How was this patch tested?
It will be tested in
PruningSuite
by adding test casePartition pruning - with filter containing non-determined condition
Please review http://spark.apache.org/contributing.html before opening a pull request.