
Enable predicate pushdown for categorical dimension filters #1227

Status: Open. Wants to merge 7 commits into base branch add-semantic-model-origin-to-linkable-element-interface.
Conversation

tlento (Contributor) commented May 22, 2024:

Enable predicate pushdown for categorical dimension filters

We now have the ability to push down filter predicates within
the DataflowPlan. We start with categorical dimension filters,
as they are the simplest.

This change tracks the where filters applied at the measure
node and pushes all of them down to the construction of the source
node for evaluation. At this time, a filter is eligible to be applied
to the source node only if every element it references is a categorical
dimension originating from the single semantic model definition that
feeds into the source node in question.
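The eligibility rule described above can be sketched roughly as follows. This is an illustrative sketch only; the class and field names (`LinkableElementType`, `semantic_model_origin`) mirror the PR's terminology, but the signatures are assumptions, not MetricFlow's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class LinkableElementType(Enum):
    DIMENSION = "dimension"
    TIME_DIMENSION = "time_dimension"
    ENTITY = "entity"


@dataclass(frozen=True)
class LinkableElement:
    element_type: LinkableElementType
    semantic_model_origin: str


def is_eligible_for_pushdown(elements: FrozenSet[LinkableElement], source_model: str) -> bool:
    """A filter may be pushed to a source node only if every element it references
    is a categorical dimension defined in that node's single semantic model."""
    return all(
        e.element_type is LinkableElementType.DIMENSION and e.semantic_model_origin == source_model
        for e in elements
    )
```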

We do not support time dimensions at this time, as they can cause strange
interactions with things like cumulative metrics, which could result in
inappropriate input filtering that produces non-obviously censored metric
results.

We also do not support entities at this time, as entities may be defined
in multiple semantic models and as such filters must be applied with more
care to ensure we are correctly accounting for the entity link paths to
the relevant source node, if any, when we apply the filter.

Finally, we are not able to safely push predicates down for the "null value"
side of an outer join, which, in practice, restricts us to only doing predicate
pushdown to the measure source nodes.
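To see why the null-value side of an outer join is unsafe, consider a toy simulation (plain Python, not MetricFlow code): filtering the right-hand input of a LEFT OUTER join before the join converts matched rows into NULL-extended rows instead of removing them, so the pushed-down plan returns different rows than applying the predicate after the join.

```python
from typing import Dict, List, Optional, Tuple


def left_outer_join(left: List[int], right: Dict[int, str]) -> List[Tuple[int, Optional[str]]]:
    # NULL-extend (None) any left row that has no match on the right.
    return [(key, right.get(key)) for key in left]


def predicate(category: Optional[str]) -> bool:
    return category == "lux"


left_keys = [1, 2]
right_rows = {1: "lux", 2: "standard"}

# Correct: apply the predicate after the join. Row 2 is dropped.
post_join = [row for row in left_outer_join(left_keys, right_rows) if predicate(row[1])]

# Incorrect: push the predicate down to the null-supplying (right) side.
# Row 2 now survives as a NULL-extended row instead of being dropped.
filtered_right = {k: v for k, v in right_rows.items() if predicate(v)}
pushed_down = left_outer_join(left_keys, filtered_right)
```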

The snapshot test changes for existing snapshots highlight the new behavior,
while the added test snapshots demonstrate specific circumstances of
interest.

tlento (Contributor, Author) commented May 22, 2024:

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite.

@tlento tlento force-pushed the add-element-type-to-linkable-element-interface branch from 9040f9f to ee88ff9 Compare May 23, 2024 23:41
@tlento tlento force-pushed the enable-predicate-pushdown-for-categorical-dimension-filters branch from 8b41889 to fba565c Compare May 23, 2024 23:42
@tlento tlento changed the base branch from add-element-type-to-linkable-element-interface to add-semantic-model-origin-to-linkable-element-interface May 23, 2024 23:42
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 24, 2024
@tlento tlento requested a deployment to DW_INTEGRATION_TESTS May 24, 2024 00:11 — with GitHub Actions Waiting
tlento (Contributor, Author) left a comment:

Quick note - the intervening commits show where I got a little too aggressive with pushdown, and the snapshots show one of the issues with the "during construction" approach - it's exceedingly difficult to properly push the filter down without duplicating it in the query, because the recursive tracking of which filters got pushed and which ones didn't is super complicated. Some of these pushdown operations are effectively undoing the subquery reducing optimizer, but have no other practical effect.

Future updates will:

  1. Add some dedicated join + filter test cases to ensure things keep returning the same results in the face of predicate pushdown. We incidentally seem to be covering the currently operating edge cases, but having explicit tests for all of the boundary conditions will be helpful as we restructure.
  2. Finish consolidating our pushdown logic - including time window adjustments - into the pushdown state object
  3. Convert our process to an optimizer pass and remove most predicate management from the DataflowPlanBuilder
  4. Fix up the pushdown rendering so we don't double-render filters without justification

Whether those come before or after basic support for time dimensions remains to be seen, but I think it likely we'll do them first.

Comment on lines 34 to 36
) subq_21
WHERE booking__is_instant
) subq_23
tlento (Contributor, Author):

This is pretty dumb. If the filter is defined on a measure we typically apply it immediately anyway. The only way to not do this is to be more clever about storing recursive state and ensuring that predicates only get pushed past joins.

For now I think this will do, but it's something we should probably resolve when we add support for time filters, as simple queries with time filters will often do an unnecessary pushdown.

@tlento tlento marked this pull request as ready for review May 24, 2024 00:31
@tlento tlento added Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment and removed Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment labels May 24, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS May 24, 2024 00:35 — with GitHub Actions Inactive
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 24, 2024
courtneyholcomb (Contributor) left a comment:

All the snapshots and tests look good.
After reading this, I'm still confused as to how we ensure only categorical dimensions get pushed down 🤔 So trying to dig in to understand that before I approve!

@@ -680,7 +691,9 @@ def _build_plan_for_distinct_values(self, query_spec: MetricFlowQuerySpec) -> Da
required_linkable_specs, _ = self.__get_required_and_extraneous_linkable_specs(
queried_linkable_specs=query_spec.linkable_specs, filter_specs=query_level_filter_specs
)
-        predicate_pushdown_state = PredicatePushdownState(time_range_constraint=query_spec.time_range_constraint)
+        predicate_pushdown_state = PredicatePushdownState(
+            time_range_constraint=query_spec.time_range_constraint, where_filter_specs=query_level_filter_specs
+        )
Contributor:
Where does this get narrowed down to only categorical dimensions? 🤔

@@ -22,17 +22,34 @@ FROM (
-- Join Standard Outputs
-- Pass Only Elements: ['average_booking_value', 'listing__is_lux_latest', 'booking__is_instant']
SELECT
Contributor:

Unrelated to this PR, but I've been wondering if we can optimize away these interim subqueries that do nothing but select columns 🤔

tlento (Contributor, Author):

They're supposed to get optimized away, and I'm not entirely sure why they don't. There may be some column or subquery alias thing that's getting in the way.

courtneyholcomb (Contributor) left a comment:

Looooks goood!!!
Left a couple of inline nits, nothing blocking!

invalid_element_types = [
    element for element in spec.linkable_elements if element.element_type not in enabled_element_types
]
if len(semantic_models) == 1 and len(invalid_element_types) == 0:
Contributor:

Ok, so we only push down filters if ALL the filtered elements are eligible element types. Is that because we don't know if this is an AND or an OR filter at this point?

tlento (Contributor, Author):

Correct, we cannot push down a filter containing any invalid element types. The AND vs OR structure of the filter isn't relevant; the restriction exists because we don't have a way to handle those element types. Right now that's simply because we haven't implemented handling, but in the future it could be because a given query is too difficult to manage for a given element type.

For example, agg time dimension filters against a mixture of cumulative and derived offset metric inputs could get very tricky. In those cases we may not be able to push down a where filter with a time dimension.

My expectation is that this will be more refined than clobbering anything that has a time dimension of any kind in it, but for now this definitely works and we can use more finesse later.

eligible_filter_specs_by_model: Dict[SemanticModelReference, Sequence[WhereFilterSpec]] = {}
for spec in where_filter_specs:
    semantic_models = set(element.semantic_model_origin for element in spec.linkable_elements)
    invalid_element_types = [
Contributor:

Ok this is the logic I was looking for 👍

]
if len(matching_filter_specs) == 0:
    filtered_nodes.append(source_node)
    continue
Contributor:

Does this continue actually do anything?

tlento (Contributor, Author):

Oh yeah, I think originally I didn't have an else here. I've removed the redundant continue statement to keep the structure of this consistent with the time range constraint method.

matching_filter_specs = [
    filter_spec
    for filter_spec in eligible_filter_specs
    if all([spec in source_node_specs.linkable_specs for spec in filter_spec.linkable_specs])
Contributor:

Is there ever a time where this won't be true, since we get the semantic model from the linkable element above? Or is this just an extra safety check in case something gets misconfigured?

tlento (Contributor, Author):

At this time it's a safeguard against something weird happening where a given source node isn't configured correctly. However, I expect this filter to be relevant for entities, since they may be defined in multiple semantic models and we need to be able to explicitly allow or disallow pushdown in those cases. If we ever add a pre-joined source node, for example, we might encounter a scenario where the entity and dimension come from different semantic models and then we couldn't push down past this point (and maybe shouldn't push down to this point, either).
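That safeguard amounts to a simple subset check: a filter is matched to a source node only if every linkable spec it references is available at that node. A minimal standalone sketch, using a hypothetical helper rather than MetricFlow's actual API:

```python
from typing import FrozenSet


def filter_applies_to_source(
    filter_linkable_specs: FrozenSet[str], source_node_linkable_specs: FrozenSet[str]
) -> bool:
    # The filter can only be evaluated at this source node if every spec
    # it references is produced by that node.
    return all(spec in source_node_linkable_specs for spec in filter_linkable_specs)
```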

@tlento tlento force-pushed the add-semantic-model-origin-to-linkable-element-interface branch from 7672db8 to 506ca50 Compare May 29, 2024 23:24
@tlento tlento force-pushed the enable-predicate-pushdown-for-categorical-dimension-filters branch from 2e250d1 to aea5c0b Compare May 29, 2024 23:25
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 29, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS May 29, 2024 23:25 — with GitHub Actions Inactive
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label May 29, 2024
@tlento tlento force-pushed the add-semantic-model-origin-to-linkable-element-interface branch from 506ca50 to 7b8a6b8 Compare June 4, 2024 00:04
@tlento tlento force-pushed the enable-predicate-pushdown-for-categorical-dimension-filters branch from aea5c0b to 7c2c34d Compare June 4, 2024 00:04
Note for reviewers, as this will be squashed:

Pushdown on joined in dimensions was not working as expected. In
the original predicate pushdown rendering tests, most joined in
dimensions were being skipped for pushdown operations.

The root cause of the issue was the semantic model source value
we were accessing, which actually included the complete history
of all semantic model inputs for the joined in dimension.

Fixing that problem uncovered a separate issue, which is we were
inappropriately pushing filters down past outer join expressions.

This commit fixes both issues at once - we now only push down
on the "safe" side of an outer join (the left side for LEFT OUTER
and not at all for FULL OUTER joins), and we evaluate pushdown
based on the singular semantic model source where each element is
defined.