
Move categorical dimension filter pushdown to PredicatePushdownOptimizer #1286

Merged
merged 3 commits on Jun 25, 2024

Conversation

@tlento (Contributor) commented Jun 17, 2024

Move categorical dimension filter pushdown to PredicatePushdownOptimizer

We are ready to move the categorical dimension filter predicate pushdown
out of the DataflowPlanBuilder and into the PredicatePushdownOptimizer.
This change makes the move with as close to parity with the existing
pushdown as possible, and adds some concrete tests of the optimizer now
that it is effecting changes on the DataflowPlan.

In theory, the outcome of this change should be that snapshots generated
from un-optimized DataflowPlans should no longer have an extra where
filter from the original build-time pushdown operation, while optimized
snapshots should be unchanged. In practice, the optimized snapshots still
have changes. These fit into two categories:

  1. Subquery ID number changes caused by the removal of the extra subquery
    in the non-optimized snapshots, since the ID numbers do not reset between
    the non-optimized and optimized runs.
  2. Conversion metric rendering for query time filters now includes an
    extra where constraint subquery due to an irrelevant pushdown operation. This is
    caused by a difference in where the "disable pushdown for conversion metrics"
    logic is applied - the optimizer cannot apply it until the join on conversion
    node, while the original builder could effectively apply the change at compute
    metric node level, and so we push down query-time filters one extra step. This
    will be reverted when we remove the duplicated filters from the pushdown
    operation.

This change also required fixes to a few bugs with the original tracking
implementation that only showed up when the full range of potential
snapshot updates had to be applied. In particular, a long-standing issue with
the previously never-called with_new_parents method of the CombineAggregatedOutputs
node has been resolved, and some missing or incorrectly applied checks to
prevent pushing time-based predicates down through offset windows and other
joins were addressed.
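The eligibility tracking described above can be sketched as a small state object the optimizer threads through the plan traversal. This is a hypothetical mock (the names `PushdownState`, `can_push_down`, and the element-type strings are illustrative, not the actual MetricFlow classes):

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class PushdownState:
    """Rough stand-in for PredicatePushdownState: tracks which element
    types a predicate may still be pushed past at this point in the plan."""

    pushdown_eligible_element_types: FrozenSet[str]

    @staticmethod
    def with_pushdown_disabled() -> "PushdownState":
        # Used when traversal crosses a boundary (e.g., an offset window)
        # that no predicate may safely cross.
        return PushdownState(pushdown_eligible_element_types=frozenset())


def can_push_down(state: PushdownState, referenced_element_types: Set[str]) -> bool:
    # A filter is pushdown-eligible only if it references at least one
    # element and every element it references is still eligible under the
    # current traversal state.
    return bool(referenced_element_types) and (
        referenced_element_types <= state.pushdown_eligible_element_types
    )


state = PushdownState(pushdown_eligible_element_types=frozenset({"categorical_dimension"}))
print(can_push_down(state, {"categorical_dimension"}))  # True
print(can_push_down(state, {"time_dimension"}))         # False
print(can_push_down(PushdownState.with_pushdown_disabled(), {"categorical_dimension"}))  # False
```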

@cla-bot cla-bot bot added the cla:yes label Jun 17, 2024
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 17, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 17, 2024 18:08 — with GitHub Actions Inactive

tlento commented Jun 17, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


@@ -284,7 +296,7 @@ def visit_where_constraint_node(self, node: WhereConstraintNode) -> OptimizeBran
for element in spec.linkable_elements
if element.element_type not in current_pushdown_state.pushdown_eligible_element_types
]
if len(semantic_models) != 1 and len(invalid_element_types) > 0:
if len(semantic_models) != 1 or len(invalid_element_types) > 0:

Big oops on this one: it was pushing down time dimensions and other ineligible element types.
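The one-character diff above is easy to misread, so here is the condition reduced to a standalone check (function name and arguments are illustrative, not the actual code):

```python
def should_skip_pushdown(num_semantic_models: int, invalid_element_types: list) -> bool:
    # Fixed condition: skip pushdown if the filter spans anything other than
    # exactly one semantic model OR references any ineligible element type.
    # The buggy `and` version only skipped when both held, so a single-model
    # filter referencing a time dimension was still (incorrectly) pushed down.
    return num_semantic_models != 1 or len(invalid_element_types) > 0


print(should_skip_pushdown(1, ["time_dimension"]))  # True under the fix; False under the buggy `and`
print(should_skip_pushdown(1, []))                  # False: single model, all elements eligible
print(should_skip_pushdown(2, []))                  # True: multi-model filters are never pushed
```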

"""
self._log_visit_node_type(node)
return self._default_handler(node)
current_pushdown_state = self._predicate_pushdown_tracker.last_pushdown_state
if node.join_type is SqlJoinType.LEFT_OUTER or node.join_type is SqlJoinType.FULL_OUTER:

Forgot to handle join types for time spines earlier, the pushdown scenarios I was seeing were tracking INNER join only, so this is an area worth testing.
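Why the join type matters: pushing a filter past the null-producing side of a LEFT OUTER JOIN changes the result set, while the same move is safe under an INNER join. A toy illustration in plain Python (not MetricFlow code):

```python
left = [{"id": 1}, {"id": 2}]
right = [{"id": 1, "v": 10}, {"id": 2, "v": 5}]


def left_join(l_rows, r_rows):
    # Minimal LEFT OUTER JOIN on "id": unmatched left rows survive with v=None.
    out = []
    for row in l_rows:
        matches = [x for x in r_rows if x["id"] == row["id"]]
        out.append({**row, "v": matches[0]["v"] if matches else None})
    return out


# Filter applied after the join: unmatched rows pass through as NULL-like rows.
filter_after = [r for r in left_join(left, right) if r["v"] is None or r["v"] > 7]
# Filter pushed below the join: id=2 loses its match and reappears with v=None,
# so the two plans produce different results.
filter_pushed = left_join(left, [x for x in right if x["v"] > 7])

print(filter_after)   # [{'id': 1, 'v': 10}]
print(filter_pushed)  # [{'id': 1, 'v': 10}, {'id': 2, 'v': None}]
```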

if any(metric_spec.has_time_offset for metric_spec in node.metric_specs):
# TODO: Allow non-time filters for offset metrics. This is for parity with the original hook preventing
# invalid pushdown for offset metrics
updated_pushdown_state = PredicatePushdownState.with_pushdown_disabled()

Without this we push categorical dimension filters past time spine joins. It's safe in conjunction with the join type handling, but it causes a bunch of weird thrash in snapshots so I'll do it as a follow-up.

Comment on lines +24 to +36
metric_time__day
, visit__referrer_id
, visits
FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, referrer_id AS visit__referrer_id
, 1 AS visits
FROM ***************************.fct_visits visits_source_src_28000
) subq_17
WHERE visit__referrer_id = 'ref_id_01'

Note for reviewers - this is the one irreconcilable difference between the existing build-time pushdown and the optimizer approach. For conversion metrics the build-time pushdown is able to short-circuit any pushdown operations for query-level filters, because we can see that the metric type is conversion in advance of our pushdown application.

With the optimizer, the first indication we get that we are dealing with a conversion metric is at the conversion metric join node. We short-circuit all where constraint pushdown at that stage, ensuring that we don't push past the conversion metric computation boundary, but we may push predicates down to the outer edge of the conversion metric join node itself. This is functionally identical to the original logic, but it adds this extra where constraint node at the moment.

The additional subquery should vanish when we update the optimizer to remove filters applied via pushdown.
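The timing difference can be made concrete with a toy traversal. This is a hypothetical sketch (node names and the `pushdown_depth` helper are made up for illustration), showing why disabling pushdown one node later lets the filter travel one extra step:

```python
# Simplified top-down traversal order for a conversion metric query.
plan = ["where_constraint", "compute_metrics", "join_on_conversion", "source"]


def pushdown_depth(plan_nodes, disable_at):
    # Count how many nodes the filter is pushed past before pushdown is
    # disabled at `disable_at`.
    depth = 0
    for node in plan_nodes:
        if node == disable_at:
            break
        if node == "where_constraint":
            continue  # the filter itself is the starting point, not a step
        depth += 1
    return depth


# Builder: metric type is known up front, so pushdown stops immediately.
print(pushdown_depth(plan, "compute_metrics"))     # 0
# Optimizer: the visitor only learns it is a conversion metric at the join
# node, so the filter has already moved one node further down.
print(pushdown_depth(plan, "join_on_conversion"))  # 1
```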

Comment on lines +22 to +38
visit__referrer_id
, visits
FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, referrer_id AS visit__referrer_id
, 1 AS visits
FROM ***************************.fct_visits visits_source_src_28000
) subq_19
WHERE (
metric_time__day BETWEEN '2020-01-01' AND '2020-01-02'
) AND (
visit__referrer_id = 'ref_id_01'
)
) subq_22

Another query-time filter on a conversion metric…

Comment on lines +25 to +41
metric_time__day
, visit__referrer_id
, visits
FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, referrer_id AS visit__referrer_id
, 1 AS visits
FROM ***************************.fct_visits visits_source_src_28000
) subq_19
WHERE (
metric_time__day BETWEEN '2020-01-01' AND '2020-01-02'
) AND (
visit__referrer_id = 'ref_id_01'
)

…and another, but this should be the last one.

@@ -100,53 +100,16 @@
<!-- include_spec = -->
<!-- TimeDimensionSpec(element_name='metric_time', time_granularity=DAY) -->
<!-- distinct = False -->
<WhereConstraintNode>

This is a pre-optimization plan, so it is expected.

MAX(subq_48.average_booking_value) AS average_booking_value
, MAX(subq_61.bookings) AS bookings
, MAX(subq_69.booking_value) AS booking_value
MAX(subq_45.average_booking_value) AS average_booking_value

The remaining snapshots should have the following pattern:

  1. The standard snapshots should lose the duplicated where constraint subqueries that the original pushdown implementation added in the DataflowPlanBuilder. I added a comment in tests_metricflow/snapshots/test_predicate_pushdown_rendering.py/SqlQueryPlan/DuckDB/test_single_categorical_dimension_pushdown__plan0.sql to illustrate this.
  2. The optimized snapshots will have a bunch of subquery ID updates due to the removal of the subqueries from the un-optimized plans. They should not include other changes.

FROM (
-- Constrain Output with WHERE

The removal of this subquery is what causes all of the ID number thrash.

@courtneyholcomb (Contributor) left a comment


LGTM - just one question about some of the optimized SQL that mayyy not be totally relevant to this PR

FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT

Curious why this subquery doesn't get collapsed into the next outer query?


I'm not 100% sure. The basic subquery reducer skips any subquery with a where expression, but the rewriting subquery reducer should do this collapsing at least for the inner where query. It's a bit odd that it doesn't, and I haven't figured out why not. My initial theory was that this is due to the nested where constraint subqueries, but if that's the case I don't understand why it wasn't collapsed in the first place. It may be an interaction between the join and filter subqueries - renaming things is harder when joins get involved.
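The conservatism being guessed at above can be sketched as two collapse rules. This is speculation consistent with the comment, not the actual MetricFlow reducer logic; all names are made up:

```python
def basic_can_collapse(inner_has_where: bool) -> bool:
    # A basic reducer's blunt rule: never merge a subquery that carries its
    # own WHERE expression into its parent.
    return not inner_has_where


def rewriting_can_collapse(
    inner_has_where: bool, outer_aggregates: bool, outer_has_join: bool
) -> bool:
    # A rewriting reducer can merge the inner filter upward, but only when
    # nothing between the two levels changes row multiplicity or grouping;
    # aggregation or a join in the outer query blocks the rewrite.
    return not (inner_has_where and (outer_aggregates or outer_has_join))


print(basic_can_collapse(True))                   # False: WHERE always blocks
print(rewriting_can_collapse(True, False, True))  # False: the join blocks it
print(rewriting_can_collapse(True, False, False))  # True: safe to merge
```

If the join-blocks-rewrite rule holds, it would explain why the nested where constraint subqueries survive here even though the rewriting reducer normally collapses filtered subqueries.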

@@ -86,47 +86,16 @@
<!-- ) -->
<!-- include_spec = TimeDimensionSpec(element_name='metric_time', time_granularity=DAY) -->
<!-- distinct = False -->
<WhereConstraintNode>

Ok so this must be the un-optimized DFP


Ah, right, that test only runs the source scan optimizer, so whether optimized or not the predicate pushdown won't be applied.

@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 17, 2024
@tlento tlento force-pushed the separate-node-resolver-ids branch from 90caf50 to 4459fb6 Compare June 25, 2024 03:12
@tlento tlento force-pushed the move-categorical-dimension-pushdown-to-optimizer branch from cb3a4ce to 3cbd865 Compare June 25, 2024 03:12
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 25, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 25, 2024 03:16 — with GitHub Actions Inactive

tlento commented Jun 25, 2024

Merge activity

  • Jun 24, 8:29 PM PDT: @tlento started a stack merge that includes this pull request via Graphite.
  • Jun 24, 9:07 PM PDT: Graphite rebased this pull request as part of a merge.
  • Jun 24, 9:11 PM PDT: @tlento merged this pull request with Graphite.

@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 25, 2024
@tlento tlento force-pushed the separate-node-resolver-ids branch from 4459fb6 to 5d4109c Compare June 25, 2024 04:01
Base automatically changed from separate-node-resolver-ids to main June 25, 2024 04:06