[SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions #20180

JoshRosen · 2018-01-07T23:45:16Z

What changes were proposed in this pull request?

The following SQL query should return zero rows, but in Spark it actually returns one row:

SELECT 1 from (
  SELECT 1 AS z,
  MIN(a.x)
  FROM (select 1 as x) a
  WHERE false
) b
where b.z != b.z

The problem stems from the PushDownPredicate rule: when this rule encounters a filter on top of an Aggregate operator, e.g. Filter(Agg(...)), it removes the original filter and adds a new filter onto Aggregate's child, e.g. Agg(Filter(...)). This is sometimes okay, but the case above is a counterexample. Because there is no explicit GROUP BY, we are implicitly computing a global aggregate over the entire table, which always produces exactly one row, so the original filter was not acting like a HAVING clause that filters out groups. If we push this filter beneath the Aggregate, it no longer reduces the cardinality of the Aggregate output, leading to the wrong answer.
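
To make the cardinality argument concrete, here is a minimal Python toy model of the two plan shapes. It is not Spark code: `global_min_aggregate`, `apply_filter`, `correct_plan`, and `pushed_down_plan` are hypothetical names, and the sketch assumes that the pushed-down predicate has `z` replaced by its literal definition `1`.

```python
# Toy model (not Spark code): rows are dicts, operators are plain functions.

def global_min_aggregate(rows):
    """Aggregate with no grouping keys: always emits exactly one row,
    even for empty input (MIN over zero rows is NULL, modeled as None)."""
    xs = [r["x"] for r in rows]
    return [{"z": 1, "min_x": min(xs) if xs else None}]


def apply_filter(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def correct_plan(source):
    """Filter(Agg(...)): the outer filter runs on the aggregate's output."""
    inner = apply_filter(source, lambda r: False)           # WHERE false
    aggregated = global_min_aggregate(inner)                # one row: z=1, min_x=None
    return apply_filter(aggregated, lambda r: r["z"] != r["z"])


def pushed_down_plan(source):
    """Agg(Filter(...)): the outer filter was moved beneath the aggregate."""
    inner = apply_filter(source, lambda r: False)           # WHERE false
    # Assumption: z is inlined as the literal 1, so the predicate is 1 != 1.
    filtered = apply_filter(inner, lambda r: 1 != 1)
    return global_min_aggregate(filtered)                   # still emits one row


source = [{"x": 1}]                                         # FROM (select 1 as x) a
print(len(correct_plan(source)))      # 0
print(len(pushed_down_plan(source)))  # 1
```

In the pushed-down shape the filter can only shrink the aggregate's input, and a global aggregate emits one row regardless of its input, so the row that the outer filter should have removed survives.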

In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks like I must have either misunderstood the comment or forgotten to follow up on the additional points raised there.

This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities.
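
The guard described above can be sketched as follows. This is a simplified Python stand-in, not the actual optimizer rule: the `Aggregate` and `Predicate` classes and the `can_push_beneath` helper are hypothetical, and the sketch ignores alias substitution and any other conditions the real rule checks.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Aggregate:
    grouping_keys: List[str]   # an empty list models a global aggregate


@dataclass
class Predicate:
    references: Set[str]       # column names the predicate reads


def can_push_beneath(agg: Aggregate, pred: Predicate) -> bool:
    """Decide whether a filter above `agg` may be moved beneath it."""
    # The fix: never push when there are no grouping expressions.
    if not agg.grouping_keys:
        return False
    # Otherwise push only predicates defined purely over grouping keys.
    return pred.references <= set(agg.grouping_keys)


global_agg = Aggregate(grouping_keys=[])
grouped_agg = Aggregate(grouping_keys=["k"])

print(can_push_beneath(global_agg, Predicate(references=set())))       # False
print(can_push_beneath(grouped_agg, Predicate(references={"k"})))      # True
print(can_push_beneath(grouped_agg, Predicate(references={"min_x"})))  # False
```

The explicit emptiness check matters in this sketch because a data-independent predicate has an empty reference set, and the empty set is a subset of any set, so a subset test alone would accept it vacuously.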

How was this patch tested?

New regression tests in SQLQueryTestSuite and FilterPushdownSuite.

gatorsmile · 2018-01-08T02:11:08Z

sql/core/src/test/resources/sql-tests/inputs/group-by.sql

+  SELECT 1 AS z,
+  MIN(a.x)
+  FROM (select 1 as x) a
+  WHERE false


Many RDBMSs do not accept a bare false as a predicate. The typical way to write an always-false condition is 1 != 1.

gatorsmile · 2018-01-08T02:11:16Z

LGTM

SparkQA · 2018-01-08T02:22:30Z

Test build #85774 has finished for PR 20180 at commit 5568d55.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-01-08T02:43:20Z

retest this please

SparkQA · 2018-01-08T05:52:43Z

Test build #85783 has finished for PR 20180 at commit 5568d55.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…ng expressions ## What changes were proposed in this pull request? The following SQL query should return zero rows, but in Spark it actually returns one row: ``` SELECT 1 from ( SELECT 1 AS z, MIN(a.x) FROM (select 1 as x) a WHERE false ) b where b.z != b.z ``` The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer. In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks I must have either misunderstood the comment or forgot to follow up on the additional points raised there. This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities. ## How was this patch tested? New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions. (cherry picked from commit 2c73d2a) Signed-off-by: gatorsmile <gatorsmile@gmail.com>

gatorsmile · 2018-01-08T08:06:01Z

Thanks! Merged to master/2.3/2.2

Don't push filters beneath aggregates with empty grouping expressions

5568d55

gatorsmile reviewed Jan 8, 2018

asfgit closed this in 2c73d2a Jan 8, 2018

JoshRosen deleted the SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions branch January 8, 2018 18:37
