
[SPARK-26736][SQL] if filter condition And has non-determined sub function it does not do partition prunning #24118

Closed
wants to merge 10 commits

Conversation

zhaorongsheng
Contributor

@zhaorongsheng zhaorongsheng commented Mar 17, 2019

What changes were proposed in this pull request?

If an And filter condition includes a non-deterministic sub-expression, partition pruning does not work. This patch extracts the deterministic sub-expressions from the And so that partition pruning can still be applied.

Example:
A partitioned table definition:
create table test(id int) partitioned by (dt string);
The following SQL does not trigger partition pruning:
select * from test where dt='20190101' and rand() < 0.5;

This PR will fix this problem.
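The extraction this PR aims for can be modeled outside Spark with a tiny stand-in for the expression tree. The `Expr`/`Leaf`/`And` names below are illustrative, not Spark's actual classes; the real helper that flattens conjunctions is `splitConjunctivePredicates` in Catalyst's `PredicateHelper`:

```scala
// Minimal model of the idea: split a conjunctive filter into its
// conjuncts and keep only the deterministic ones for partition pruning.
// Expr, Leaf, and And are illustrative stand-ins, not Spark classes.
sealed trait Expr { def deterministic: Boolean }
case class Leaf(sql: String, deterministic: Boolean) extends Expr
case class And(left: Expr, right: Expr) extends Expr {
  val deterministic: Boolean = left.deterministic && right.deterministic
}

// Flatten nested And nodes into a list of conjuncts, mirroring the
// shape of Spark's splitConjunctivePredicates.
def splitConjuncts(e: Expr): Seq[Expr] = e match {
  case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
  case other     => Seq(other)
}

// dt = '20190101' AND rand() < 0.5
val cond = And(Leaf("dt = '20190101'", deterministic = true),
               Leaf("rand() < 0.5", deterministic = false))

// Only the deterministic conjunct is usable as a partition filter;
// the rand() < 0.5 part must still run as a regular row filter.
val prunable = splitConjuncts(cond).filter(_.deterministic)
```

In the example query above, `dt = '20190101'` survives the split and can prune partitions, while `rand() < 0.5` is kept out of the pruning predicate.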

How was this patch tested?

It is tested in PruningSuite by adding the test case Partition pruning - with filter containing non-determined condition.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@maropu
Member

maropu commented Mar 17, 2019

Could you describe more (e.g., an example query to reproduce the issue you described) in the PR description?

@zhaorongsheng
Contributor Author

zhaorongsheng commented Mar 17, 2019

@maropu The description was updated. Please review it, thanks~

@@ -63,7 +63,7 @@ object PhysicalOperation extends PredicateHelper {
val substitutedFields = fields.map(substitute(aliases)).asInstanceOf[Seq[NamedExpression]]
(Some(substitutedFields), filters, other, collectAliases(substitutedFields))

-case Filter(condition, child) if condition.deterministic =>
+case Filter(condition, child) if condition.deterministic || condition.isInstanceOf[And] =>
Member

I think this pattern should not return non-deterministic exprs, but this current change does so, right?
Can we modify the code to extract the deterministic exprs in lines 67-69?

Contributor Author

Yes, you are right. I will do it.

Contributor Author

It has been updated. Please review it, thanks~~

@maropu
Member

maropu commented Mar 18, 2019

ok to test

@SparkQA

SparkQA commented Mar 18, 2019

Test build #103613 has finished for PR 24118 at commit b5d0c67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 19, 2019

Could you make the PR title/description more precise before reviews? I think this PR does not target only the rand() function...

@zhaorongsheng zhaorongsheng changed the title [SPARK-26736][SQL] if filter condition has rand() function it does not do partition prunning [SPARK-26736][SQL] if filter condition has non-determined function it does not do partition prunning Mar 19, 2019
@zhaorongsheng zhaorongsheng changed the title [SPARK-26736][SQL] if filter condition has non-determined function it does not do partition prunning [SPARK-26736][SQL] if filter condition And has non-determined function it does not do partition prunning Mar 19, 2019
@zhaorongsheng
Contributor Author

OK, the title has been updated~

@zhaorongsheng zhaorongsheng changed the title [SPARK-26736][SQL] if filter condition And has non-determined function it does not do partition prunning [SPARK-26736][SQL] if filter condition And having non-determined sub function it does not do partition prunning Mar 19, 2019
@zhaorongsheng zhaorongsheng changed the title [SPARK-26736][SQL] if filter condition And having non-determined sub function it does not do partition prunning [SPARK-26736][SQL] if filter condition And has non-determined sub function it does not do partition prunning Mar 19, 2019
val (fields, filters, other, aliases) = collectProjectsAndFilters(child)
val substitutedCondition = substitute(aliases)(condition)
(fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
case filter: Filter if filter.condition.deterministic || filter.condition.isInstanceOf[And] =>
Member

PhysicalOperation is used in many places, so have you checked that this change has no side-effect on the other behaviours?

Contributor Author

Yes, I have checked this change and I think that it has no side-effect for the other behaviors.

@maropu
Member

maropu commented Mar 19, 2019

It seems you forgot to update the title in the JIRA?

@zhaorongsheng
Contributor Author

It has been updated.

Some(condition)
} else {
val andCondition = condition.asInstanceOf[And]
if (andCondition.left.deterministic) {
Contributor

Do we intend to handle only at the top level ? What happens when the non-deterministic predicate is nested like :

where ((c1 = 10 and rand() < 1) and (c2 = 20))

in this case, in my understanding of the code, we will ignore (c1 = 10) for pruning purposes ? cc @maropu

Edited: Another example :

where c1 = 10 and c2 = 20 and c3 = 30 and rand() < 1 and c4 = 40

in this case, we would only consider c4 = 40 for pruning, no ?

Member

yea, we need a more general solution for that..
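One shape such a general solution could take, sketched with the same illustrative stand-in types rather than Spark's real `Expression`/`And` classes: recurse into each side of an And and keep whichever deterministic parts survive, so nested cases like the examples above are no longer dropped.

```scala
// Sketch of a recursive extraction that keeps all deterministic
// conjuncts even when they are nested, e.g.
//   ((c1 = 10 AND rand() < 1) AND c2 = 20)  =>  c1 = 10 AND c2 = 20
// Expr, Leaf, and And are illustrative stand-ins, not Spark classes.
sealed trait Expr { def deterministic: Boolean }
case class Leaf(sql: String, deterministic: Boolean) extends Expr
case class And(left: Expr, right: Expr) extends Expr {
  val deterministic: Boolean = left.deterministic && right.deterministic
}

def extractDeterministic(e: Expr): Option[Expr] = e match {
  // Fully deterministic subtree: keep it whole.
  case _ if e.deterministic => Some(e)
  // Mixed And: recurse and re-join whatever each side yields.
  case And(l, r) =>
    (extractDeterministic(l), extractDeterministic(r)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (a, b)             => a.orElse(b)
    }
  // Non-deterministic leaf (or other node): nothing usable for pruning.
  case _ => None
}
```

On the nested example, this recovers both `c1 = 10` and `c2 = 20` for pruning, instead of ignoring the inner And entirely.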

Contributor Author

@dilipbiswal @maropu It has been updated. Please review it, thanks~~

@SparkQA

SparkQA commented Mar 22, 2019

Test build #103830 has finished for PR 24118 at commit 39fac02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

(fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
} else {
(None, Nil, filter, Map.empty)
}
Member

Could you add tests somewhere to check if this addition could extract deterministic conditions you expect?

Contributor Author

I have added the test case named Partition pruning - with filter containing non-determined condition in sub And-expr in PruningSuite.

Member

@maropu maropu Mar 23, 2019

Yea, but it seems they are end-to-end tests, so I think we need more fine-grained tests for collectProjectsAndFilters.

val substitutedCondition = substitute(aliases)(condition)
(fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases)
case filter @ Filter(condition, child)
if condition.deterministic || condition.isInstanceOf[And] =>
Contributor

is the isInstanceOf check required anymore?

@@ -91,6 +97,27 @@ object PhysicalOperation extends PredicateHelper {
.map(Alias(_, a.name)(a.exprId, a.qualifier)).getOrElse(a)
}
}

private def getDeterminedExpression(expr: Expression): Option[Expression] = {
Contributor

getDeterministicExpression or extractDeterministicExpression? What do you think @maropu?

Member

yea, I like the name including Deterministic

@@ -91,6 +97,27 @@ object PhysicalOperation extends PredicateHelper {
.map(Alias(_, a.name)(a.exprId, a.qualifier)).getOrElse(a)
}
}

private def getDeterminedExpression(expr: Expression): Option[Expression] = {
Contributor

Also add some comments about this function at the top, with small snippets of input and output.

@dilipbiswal
Contributor

We should also update the comments here to reflect the changes?

@zhaorongsheng
Contributor Author

zhaorongsheng commented Mar 26, 2019

I have updated the code and added test cases. Please review it, thanks~

@SparkQA

SparkQA commented Mar 26, 2019

Test build #103976 has finished for PR 24118 at commit 9093df4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* col = 1 and rand() < 1
* (col1 = 1 and rand() < 1) and col2 = 1
* col1 = 1 or rand() < 1
* (col1 = 1 and rand() < 1) or (col2 = 1 and rand() < 1)
Member

IMO we don't need to handle this case (col1 = 1 and rand() < 1) or (col2 = 1 and rand() < 1) in this PR because DNF forms should be handled in another normalization logic (e.g., SPARK-6624). So, I think it's ok to handle CNF forms only here. In fact, I think we should keep the same semantics with PushDownPredicate. cc: @gatorsmile @cloud-fan

Contributor

+1 to be consistent with PushDownPredicate
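The Or cases in the list above can't contribute a partition filter from one side alone: `(a OR b)` only implies a deterministic predicate when both branches do, since a row failing the deterministic branch might still pass the non-deterministic one. A minimal sketch with illustrative `Expr`/`Leaf`/`Or` stand-ins (not Spark's classes, and not what the PR or PushDownPredicate actually does):

```scala
// For a disjunction, a pruning-safe weakening exists only when BOTH
// sides yield a deterministic part: (a OR b) implies (detA OR detB)
// only if a implies detA and b implies detB.
sealed trait Expr { def deterministic: Boolean }
case class Leaf(sql: String, deterministic: Boolean) extends Expr
case class Or(left: Expr, right: Expr) extends Expr {
  val deterministic: Boolean = left.deterministic && right.deterministic
}

def prunableFromOr(e: Expr): Option[Expr] = e match {
  // Fully deterministic subtree: usable as-is.
  case _ if e.deterministic => Some(e)
  // Or: both branches must yield something, otherwise nothing is safe.
  case Or(l, r) =>
    for { a <- prunableFromOr(l); b <- prunableFromOr(r) } yield Or(a, b)
  // Non-deterministic leaf: every row (hence every partition) may match.
  case _ => None
}
```

So `col1 = 1 or rand() < 1` yields no prunable predicate at all, which is consistent with handling only conjunctive (CNF-style) forms here and leaving disjunctions to a separate normalization pass.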

@SparkQA

SparkQA commented Mar 31, 2019

Test build #104131 has finished for PR 24118 at commit be633be.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 1, 2019

retest this please

@SparkQA

SparkQA commented Apr 1, 2019

Test build #104154 has finished for PR 24118 at commit be633be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhaorongsheng
Contributor Author

@maropu Is there any progress about this PR?

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 1, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 1, 2020
@github-actions github-actions bot closed this Jan 2, 2020
@boneanxs

Are there any problems with this PR? We also encountered this problem, and we found that Hive can handle it well.

@cloud-fan
Contributor

this should have been fixed by a58d91b

@maropu
Member

maropu commented Jan 15, 2020

It seems this PR intended to target Hive tables? I tried the example query on the current master;

// hive table
scala> sql("""create table test(id int) partitioned by (dt string)""")
scala> sql("""select * from test where dt='20190101' and rand() < 0.5""").explain(true)

== Physical Plan ==
*(1) Filter ((isnotnull(dt#19) AND (dt#19 = 20190101)) AND (rand(6515336563966543616) < 0.5))
+- Scan hive default.test [id#18, dt#19], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#18], [dt#19], Statistics(sizeInBytes=8.0 EiB)


// datasource table
sql("""create table test(id int, dt string) using parquet partitioned by (dt)""")
sql("""select * from test where dt='20190101' and rand() < 0.5""").explain(true)

== Physical Plan ==
*(1) Filter (rand(1519810875701056142) < 0.5)
+- *(1) ColumnarToRow
   +- FileScan parquet default.test[id#30,dt#31] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/test], PartitionFilters: [isnotnull(dt#31), (dt#31 = 20190101)], PushedFilters: [], ReadSchema: struct<id:int>

@cloud-fan
Contributor

Seems we need to update HiveTableScans to use ScanOperation as well.

@maropu
Member

maropu commented Jan 15, 2020

Ah, right. I'll make a pr for that later.

maropu added a commit that referenced this pull request Jan 15, 2020
…ions in Hive tables

### What changes were proposed in this pull request?

This PR intends to improve partition pruning for nondeterministic expressions in Hive tables:

Before this PR:
```
scala> sql("""create table test(id int) partitioned by (dt string)""")
scala> sql("""select * from test where dt='20190101' and rand() < 0.5""").explain()

== Physical Plan ==
*(1) Filter ((isnotnull(dt#19) AND (dt#19 = 20190101)) AND (rand(6515336563966543616) < 0.5))
+- Scan hive default.test [id#18, dt#19], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#18], [dt#19], Statistics(sizeInBytes=8.0 EiB)
```
After this PR:
```
== Physical Plan ==
*(1) Filter (rand(-9163956883277176328) < 0.5)
+- Scan hive default.test [id#0, dt#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#0], [dt#1], Statistics(sizeInBytes=8.0 EiB), [isnotnull(dt#1), (dt#1 = 20190101)]
```
This PR is the rework of #24118.

### Why are the changes needed?

For better performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit tests added.

Closes #27219 from maropu/SPARK-26736.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>