[SPARK-17712][SQL] Fix invalid pushdown of data-independent filters beneath aggregates #15289

JoshRosen · 2016-09-28T23:38:11Z

What changes were proposed in this pull request?

This patch fixes a minor correctness issue impacting the pushdown of filters beneath aggregates. Specifically, if a filter condition references no grouping or aggregate columns (e.g. WHERE false) then it would be incorrectly pushed beneath an aggregate.

Intuitively, the only case where you can push a filter beneath an aggregate is when that filter is deterministic and is defined over the grouping columns / expressions, since in that case the filter is acting to exclude entire groups from the query (like a HAVING clause). The existing code would only push deterministic filters beneath aggregates when all of the filter's references were grouping columns, but this logic missed the case where a filter has no references. For example, WHERE false is deterministic but is independent of the actual data.

This patch fixes this minor bug by adding a new check to ensure that we don't push filters beneath aggregates when those filters don't reference any columns.
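One situation where the difference is observable is a global aggregate (no `GROUP BY`), which emits one row even when its input is empty. The sketch below is a plain-Scala model rather than Spark code; `globalCount`, `alwaysFalse`, and `table` are names made up for illustration.

```scala
// Plain-Scala model of a global aggregate; this is NOT Spark code.
object GlobalAggregateDemo {
  // Models SELECT count(*) FROM rows: always emits exactly one row,
  // even when the input is empty.
  def globalCount(rows: Seq[Int]): Seq[Long] = Seq(rows.size.toLong)

  // A data-independent predicate, like WHERE false.
  val alwaysFalse: Any => Boolean = _ => false

  val table: Seq[Int] = Seq(1, 2, 3)

  // Filter(Aggregate(table)): the filter discards the single aggregate row.
  val filterAbove: Seq[Long] = globalCount(table).filter(alwaysFalse)

  // Aggregate(Filter(table)): the aggregate still emits one row (count = 0).
  val filterBeneath: Seq[Long] = globalCount(table.filter(alwaysFalse))
}
```

In this model `filterAbove` is empty while `filterBeneath` is `Seq(0L)`, so moving the filter changes the number of rows returned.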

How was this patch tested?

New regression test in FilterPushdownSuite.

JoshRosen · 2016-09-28T23:44:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -710,7 +710,7 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {

      val (pushDown, rest) = candidates.partition { cond =>
        val replaced = replaceAlias(cond, aliasMap)
-        replaced.references.subsetOf(aggregate.child.outputSet)
+        cond.references.nonEmpty && replaced.references.subsetOf(aggregate.child.outputSet)


If you use replaced.references here then this will break the "aggregate: push down filters with literal" test case. Here's that test:

    test("aggregate: push down filters with literal") {
      val originalQuery = testRelation
        .select('a, 'b)
        .groupBy('a)('a, count('b) as 'c, "s" as 'd)
        .where('c === 2L && 'd === "s")

      val optimized = Optimize.execute(originalQuery.analyze)

      val correctAnswer = testRelation
        .where("s" === "s")
        .select('a, 'b)
        .groupBy('a)('a, count('b) as 'c, "s" as 'd)
        .where('c === 2L)
        .analyze

      comparePlans(optimized, correctAnswer)
    }

If you use replaced.references then the 'd in the filter is replaced with "s" and the resulting expression no longer contains any references and thus isn't pushed.
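The effect of alias substitution on `references` can be sketched with a toy expression tree. This is a simplified model, not Catalyst: `Expr`, `Attr`, `Lit`, `EqualTo`, and this `replaceAlias` are hypothetical stand-ins that only mimic the behavior described above.

```scala
// Toy expression model; the types and replaceAlias here are illustrative
// stand-ins, not the Catalyst classes of the same or similar names.
object ReferencesDemo {
  sealed trait Expr { def references: Set[String] }
  case class Attr(name: String) extends Expr {
    def references: Set[String] = Set(name)
  }
  case class Lit(value: Any) extends Expr {
    def references: Set[String] = Set.empty
  }
  case class EqualTo(left: Expr, right: Expr) extends Expr {
    def references: Set[String] = left.references ++ right.references
  }

  // Substitute aliased attributes with the expressions they stand for.
  def replaceAlias(e: Expr, aliasMap: Map[String, Expr]): Expr = e match {
    case Attr(n)       => aliasMap.getOrElse(n, e)
    case EqualTo(l, r) => EqualTo(replaceAlias(l, aliasMap), replaceAlias(r, aliasMap))
    case lit: Lit      => lit
  }

  // Models `"s" as 'd` in the aggregate list and the filter `'d === "s"`.
  val aliasMap: Map[String, Expr] = Map("d" -> Lit("s"))
  val cond: Expr = EqualTo(Attr("d"), Lit("s"))
  val replaced: Expr = replaceAlias(cond, aliasMap)
}
```

Here `cond.references` is `Set("d")`, so the `nonEmpty` check passes, whereas `replaced.references` is empty. Checking `nonEmpty` on the replaced expression would therefore reject a filter that the existing test expects to be pushed.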

hvanhovell · 2016-09-29T01:01:46Z

LGTM - pending Jenkins.

SparkQA · 2016-09-29T02:00:44Z

Test build #66074 has finished for PR 15289 at commit 87504e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-09-29T02:02:32Z

Merging to master. Thanks!

hvanhovell · 2016-09-29T02:04:06Z

@JoshRosen I cannot cherry-pick this one into 2.0. Could you open a PR against 2.0 if you feel that this is required?

ioana-delaney · 2016-09-29T04:09:22Z

@JoshRosen Wouldn't it be a better design to push down the predicate but keep the original predicate as well? If the aggregate is above a complex join, not pushing down the predicate may have significant performance implications.

JoshRosen · 2016-09-29T18:20:52Z

@ioana-delaney, to clarify, are you suggesting that there are predicates which we could correctly be pushing down past aggregates but are not pushing? I believe that the current logic encompasses all of the cases that are safe to push.

Or are you suggesting that we're affecting correctness by not keeping the original predicate following the pushdown?

…eneath aggregates

## What changes were proposed in this pull request?

This patch fixes a minor correctness issue impacting the pushdown of filters beneath aggregates. Specifically, if a filter condition references no grouping or aggregate columns (e.g. `WHERE false`) then it would be incorrectly pushed beneath an aggregate.

Intuitively, the only case where you can push a filter beneath an aggregate is when that filter is deterministic and is defined over the grouping columns / expressions, since in that case the filter is acting to exclude entire groups from the query (like a `HAVING` clause). The existing code would only push deterministic filters beneath aggregates when all of the filter's references were grouping columns, but this logic missed the case where a filter has no references. For example, `WHERE false` is deterministic but is independent of the actual data.

This patch fixes this minor bug by adding a new check to ensure that we don't push filters beneath aggregates when those filters don't reference any columns.

## How was this patch tested?

New regression test in FilterPushdownSuite.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15289 from JoshRosen/SPARK-17712.

(cherry picked from commit 37eb918)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

JoshRosen · 2016-09-29T19:12:14Z

@hvanhovell, I've cherry-picked this fix to branch-2.0.

ioana-delaney · 2016-09-30T22:35:36Z

@JoshRosen The original predicate has to be kept above the aggregation. An optimization would be to also push down the predicate below the aggregation, lower in the plan for early filtering. In theory an always-false predicate should eliminate an entire sub-plan.

ioana-delaney · 2016-09-30T22:52:57Z

@JoshRosen In your example, we don't want to first count one million rows coming from the base table and then return zero rows based on the false predicate in the outer query block. Instead, by pushing down the predicate to the base table, you do a pre-filtering and return zero rows early in the plan. Then you apply the aggregate, followed by the original predicate, which does the final filtering. Anyway, these are just some thoughts for further optimizations of predicates pushed down through aggregation.

Also, a more realistic query would imply false predicates through predicate transitivity, e.g. a != b and a = 1 and b = 1 => 1 != 1. So there might be some real customer queries that can take advantage of these optimizations.
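The "push a copy down but keep the original" idea can be sketched with a plain-Scala model. This is not Spark code and not something this PR implements; `globalCount`, `alwaysFalse`, `table`, and the `rowsAggregated` counter are hypothetical names used only to make the cost visible.

```scala
// Plain-Scala model of Filter(Aggregate(Filter(table))); NOT Spark code.
object KeepAndPushDemo {
  // Tracks how many input rows the aggregate had to consume.
  var rowsAggregated: Long = 0L

  // Models a global count: always emits exactly one row.
  def globalCount(rows: Seq[Int]): Seq[Long] = {
    rowsAggregated += rows.size
    Seq(rows.size.toLong)
  }

  // A data-independent predicate, like WHERE false.
  val alwaysFalse: Any => Boolean = _ => false

  val table: Seq[Int] = 1 to 1000

  // The pushed-down copy prunes the input early; the original filter,
  // kept above the aggregate, still removes the single aggregate row.
  val result: Seq[Long] =
    globalCount(table.filter(alwaysFalse)).filter(alwaysFalse)
}
```

In this model `result` is empty, which matches the unoptimized answer, and `rowsAggregated` stays at 0 because the aggregate never sees the 1000 input rows.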

…ng expressions

## What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually returns one row:

```
SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z
```

The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table, so the original filter was not acting like a `HAVING` clause filtering the number of groups. If we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer.

In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks like I must have either misunderstood the comment or forgotten to follow up on the additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.

## How was this patch tested?

New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions.

(cherry picked from commit 2c73d2a)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
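Taken together, the two fixes amount to a guard on when a filter conjunct may move beneath an aggregate. The following is a rough sketch of that guard under simplifying assumptions, not the actual `PushDownPredicate` code: `Candidate` and `canPushBeneath` are hypothetical, and attributes are modeled as plain strings.

```scala
// Simplified model of the pushdown guard; NOT the actual Catalyst rule.
object PushdownGuardDemo {
  // Hypothetical summary of one filter conjunct sitting above an Aggregate.
  final case class Candidate(
      deterministic: Boolean,
      references: Set[String],         // attributes named by the original condition
      replacedReferences: Set[String]) // attributes left after alias substitution

  def canPushBeneath(
      groupingExprs: Seq[String],
      childOutput: Set[String],
      cond: Candidate): Boolean =
    groupingExprs.nonEmpty &&       // SPARK-22983: never push beneath a global aggregate
      cond.deterministic &&
      cond.references.nonEmpty &&   // SPARK-17712: skip data-independent filters
      cond.replacedReferences.subsetOf(childOutput)
}
```

For example, with grouping on `a` and child output `{a, b}`, a deterministic filter on `a` passes the guard, while `WHERE false` (no references), a filter on an aggregate result such as `c`, and any filter above an aggregate with no grouping expressions are all rejected.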


JoshRosen added 2 commits September 28, 2016 16:29

Add regression test for SPARK-17712

09870fc

Minimal fix.

87504e4


asfgit closed this in 37eb918 Sep 29, 2016

JoshRosen deleted the SPARK-17712 branch September 29, 2016 19:12

JoshRosen mentioned this pull request Jan 7, 2018

[SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions #20180

Closed
