ensure dynamic filters are correctly pushed down through aggregations#21059
ensure dynamic filters are correctly pushed down through aggregations#21059jayshrivastava wants to merge 1 commit intoapache:mainfrom
Conversation
In plans such as the following, dynamic filters are not pushed down through the aggregation
```
CREATE TABLE data (a VARCHAR, ts TIMESTAMP, value DOUBLE)
AS VALUES
('h1', '2024-01-01T00:05:00', 1.0),
('h1', '2024-01-01T00:15:00', 2.0),
('h2', '2024-01-01T00:25:00', 3.0),
('h3', '2024-01-01T00:35:00', 4.0);
SELECT * FROM contexts c
INNER JOIN (
SELECT a, date_bin(interval '1 hour', ts) AS bucket, min(value) AS min_val
FROM (SELECT value, a, ts FROM data)
GROUP BY a, date_bin(interval '1 hour', ts)
) agg ON c.a = agg.a;
```
```
HashJoinExec: mode=Auto, join_type=Inner, on=[(a@0, a@0)]
DataSourceExec: partitions=1
ProjectionExec: [a@0, date_bin(1h, ts)@1 as bucket, min(value)@2 as min_val]
AggregateExec: mode=FinalPartitioned, gby=[a@0, date_bin(1h, ts)@1], aggr=[min(value)]
AggregateExec: mode=Partial, gby=[a@1, date_bin(1h, ts@2)], aggr=[min(value)]
ProjectionExec: [value@2, a@0, ts@1] ← reorders columns
DataSourceExec: partitions=1
```
`AggregateExec::gather_filters_for_pushdown` compared parent filter columns (output schema indices) against grouping expression columns (input schema indices). When a `ProjectionExec` below the aggregate reorders columns, the index mismatch causes filters (such as HashJoin dynamic filters) to be incorrectly blocked.
This change fixes the column index mapping in `AggregateExec::gather_filters_for_pushdown`
- `test_pushdown_through_aggregate_with_reordered_input_columns` — filter on grouping column with reordered input is pushed down
- `test_pushdown_through_aggregate_with_reordered_input_no_pushdown_on_agg_result` — filter on aggregate result column is not pushed down
- `test_pushdown_through_aggregate_grouping_sets_with_reordered_input` — GROUPING SETS: filter on common column pushed, filter on missing column blocked
- `test_hashjoin_dynamic_filter_pushdown_through_aggregate_with_reordered_input` — HashJoin dynamic filter pushes through aggregate with reordered input and is populated with values after
execution
- All tests verified to fail without the fix
No.
cba2558 to
5c792af
Compare
|
This is happening because there is no |
There was a problem hiding this comment.
👍 Looks good! left just a couple of minor comments.
EDIT: will also wait for @LiaCastaneda to approve before merging, thanks both!
| let grouping_columns: HashSet<_> = (0..self.group_by.expr().len()) | ||
| .map(|i| Column::new(output_schema.field(i).name(), i)) |
There was a problem hiding this comment.
🤔 this assumes that the output schema is going to have the grouping expressions always be the first ones.
Is this assumption correct? maybe a comment referencing where this is enforced?
| /// Regression test for https://github.com/apache/datafusion/issues/21065. | ||
| /// | ||
| /// Given a plan similar to the following, ensure that the filter is pushed down | ||
| /// through an AggregateExec whose input columns are reordered by a ProjectionExec. | ||
| #[tokio::test] |
There was a problem hiding this comment.
I see this is the only test referencing the issue? are the other tests also contributing to reproducing the bug?
If you manage to reduce the quantity of test code necessary to reproduce the issue, that would be awesome. Otherwise, if you think all four tests contribute distinct ways of reproducing the issue, let's keep them all.
LiaCastaneda
left a comment
There was a problem hiding this comment.
This makes sense to me, I was just courious on why this only happens on plans without RepartitionExec:
When there is a RepartitionExec between the ProjectionExec and the AggregateExec, the repartition normalizes the column indices — the aggregate's input schema ends up with a at index 0 matching the output schema. The bug only surfaces when the reordering projection is directly below the aggregate with no intermediate operator to "reset" the index numbering.
Which issue does this PR close?
Rationale for this change
In plans such as the following, dynamic filters are not pushed down through the aggregation
What changes are included in this PR?
AggregateExec::gather_filters_for_pushdowncompared parent filter columns (output schema indices) against grouping expression columns (input schema indices). When aProjectionExecbelow the aggregate reorders columns, the index mismatch causes filters (such as HashJoin dynamic filters) to be incorrectly blocked.This change fixes the column index mapping in
AggregateExec::gather_filters_for_pushdownAre these changes tested?
test_pushdown_through_aggregate_with_reordered_input_columns— filter on grouping column with reordered input is pushed downtest_pushdown_through_aggregate_with_reordered_input_no_pushdown_on_agg_result— filter on aggregate result column is not pushed downtest_pushdown_through_aggregate_grouping_sets_with_reordered_input— GROUPING SETS: filter on common column pushed, filter on missing column blockedtest_hashjoin_dynamic_filter_pushdown_through_aggregate_with_reordered_input— HashJoin dynamic filter pushes through aggregate with reordered input and is populated with values afterexecution
Are there any user-facing changes?
No.