
Move categorical dimension filter pushdown to PredicatePushdownOptimizer #1286

Merged
merged 3 commits on Jun 25, 2024

Conversation

@tlento (Contributor) commented Jun 17, 2024

Move categorical dimension filter pushdown to PredicatePushdownOptimizer

We are ready to move the categorical dimension filter predicate pushdown
out of the DataflowPlanBuilder and into the PredicatePushdownOptimizer.
This change makes the move with as close to parity with the existing
pushdown as possible, and adds some concrete tests of the optimizer now
that it is effecting changes on the DataflowPlan.

In theory, the outcome of this change should be that snapshots generated
from un-optimized DataflowPlans should no longer have an extra where
filter from the original build-time pushdown operation, while optimized
snapshots should be unchanged. In practice, the optimized snapshots still
have changes. These fit into two categories:

  1. Subquery ID number changes caused by the removal of the extra subquery
    in the non-optimized snapshots, since the ID numbers do not reset between
    the non-optimized and optimized runs.
  2. Conversion metric rendering for query time filters now includes an
    extra where constraint subquery due to an irrelevant pushdown operation. This is
    caused by a difference in where the "disable pushdown for conversion metrics"
    logic is applied - the optimizer cannot apply it until the join on conversion
    node, while the original builder could effectively apply the change at compute
    metric node level, and so we push down query-time filters one extra step. This
    will be reverted when we remove the duplicated filters from the pushdown
    operation.

This change also required fixes to a few bugs with the original tracking
implementation that only showed up when the full range of potential
snapshot updates had to be applied. In particular, a long-standing issue with
the previously never-called with_new_parents method of the CombineAggregatedOutputs
node has been resolved, and some missing or incorrectly applied checks to
prevent pushing time-based predicates down through offset windows and other
joins were addressed.
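The eligibility tracking described above can be sketched as a small state object the optimizer threads through the plan traversal. This is a hypothetical mock (the names `PushdownState`, `can_push_down`, and the element-type strings are illustrative, not the actual MetricFlow classes):

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class PushdownState:
    """Rough stand-in for PredicatePushdownState: tracks which element
    types a predicate may still be pushed past at this point in the plan."""

    pushdown_eligible_element_types: FrozenSet[str]

    @staticmethod
    def with_pushdown_disabled() -> "PushdownState":
        # Used when traversal crosses a boundary (e.g., an offset window)
        # that no predicate may safely cross.
        return PushdownState(pushdown_eligible_element_types=frozenset())


def can_push_down(state: PushdownState, referenced_element_types: Set[str]) -> bool:
    # A filter is pushdown-eligible only if it references at least one
    # element and every element it references is still eligible under the
    # current traversal state.
    return bool(referenced_element_types) and (
        referenced_element_types <= state.pushdown_eligible_element_types
    )


state = PushdownState(pushdown_eligible_element_types=frozenset({"categorical_dimension"}))
print(can_push_down(state, {"categorical_dimension"}))  # True
print(can_push_down(state, {"time_dimension"}))         # False
print(can_push_down(PushdownState.with_pushdown_disabled(), {"categorical_dimension"}))  # False
```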

@cla-bot cla-bot bot added the cla:yes label Jun 17, 2024
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 17, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 17, 2024 18:08 — with GitHub Actions Inactive

tlento commented Jun 17, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


@@ -284,7 +296,7 @@ def visit_where_constraint_node(self, node: WhereConstraintNode) -> OptimizeBran
for element in spec.linkable_elements
if element.element_type not in current_pushdown_state.pushdown_eligible_element_types
]
if len(semantic_models) != 1 and len(invalid_element_types) > 0:
if len(semantic_models) != 1 or len(invalid_element_types) > 0:

Big oops on this one: it was pushing down time dimensions and other ineligible element types.
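The one-character diff above is easy to misread, so here is the condition reduced to a standalone check (function name and arguments are illustrative, not the actual code):

```python
def should_skip_pushdown(num_semantic_models: int, invalid_element_types: list) -> bool:
    # Fixed condition: skip pushdown if the filter spans anything other than
    # exactly one semantic model OR references any ineligible element type.
    # The buggy `and` version only skipped when both held, so a single-model
    # filter referencing a time dimension was still (incorrectly) pushed down.
    return num_semantic_models != 1 or len(invalid_element_types) > 0


print(should_skip_pushdown(1, ["time_dimension"]))  # True under the fix; False under the buggy `and`
print(should_skip_pushdown(1, []))                  # False: single model, all elements eligible
print(should_skip_pushdown(2, []))                  # True: multi-model filters are never pushed
```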

"""
self._log_visit_node_type(node)
return self._default_handler(node)
current_pushdown_state = self._predicate_pushdown_tracker.last_pushdown_state
if node.join_type is SqlJoinType.LEFT_OUTER or node.join_type is SqlJoinType.FULL_OUTER:

Forgot to handle join types for time spines earlier, the pushdown scenarios I was seeing were tracking INNER join only, so this is an area worth testing.
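Why the join type matters: pushing a filter past the null-producing side of a LEFT OUTER JOIN changes the result set, while the same move is safe under an INNER join. A toy illustration in plain Python (not MetricFlow code):

```python
left = [{"id": 1}, {"id": 2}]
right = [{"id": 1, "v": 10}, {"id": 2, "v": 5}]


def left_join(l_rows, r_rows):
    # Minimal LEFT OUTER JOIN on "id": unmatched left rows survive with v=None.
    out = []
    for row in l_rows:
        matches = [x for x in r_rows if x["id"] == row["id"]]
        out.append({**row, "v": matches[0]["v"] if matches else None})
    return out


# Filter applied after the join: unmatched rows pass through as NULL-like rows.
filter_after = [r for r in left_join(left, right) if r["v"] is None or r["v"] > 7]
# Filter pushed below the join: id=2 loses its match and reappears with v=None,
# so the two plans produce different results.
filter_pushed = left_join(left, [x for x in right if x["v"] > 7])

print(filter_after)   # [{'id': 1, 'v': 10}]
print(filter_pushed)  # [{'id': 1, 'v': 10}, {'id': 2, 'v': None}]
```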

if any(metric_spec.has_time_offset for metric_spec in node.metric_specs):
# TODO: Allow non-time filters for offset metrics. This is for parity with the original hook preventing
# invalid pushdown for offset metrics
updated_pushdown_state = PredicatePushdownState.with_pushdown_disabled()

Without this we push categorical dimension filters past time spine joins. It's safe in conjunction with the join type handling, but it causes a bunch of weird thrash in snapshots so I'll do it as a follow-up.

Comment on lines +24 to +36
metric_time__day
, visit__referrer_id
, visits
FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, referrer_id AS visit__referrer_id
, 1 AS visits
FROM ***************************.fct_visits visits_source_src_28000
) subq_17
WHERE visit__referrer_id = 'ref_id_01'

Note for reviewers - this is the one irreconcilable difference between the existing build-time pushdown and the optimizer approach. For conversion metrics the build-time pushdown is able to short-circuit any pushdown operations for query-level filters, because we can see that the metric type is conversion in advance of our pushdown application.

With the optimizer, the first indication we get that we are dealing with a conversion metric is at the conversion metric join node. We short-circuit all where constraint pushdown at that stage, ensuring that we don't push past the conversion metric computation boundary, but we may push predicates down to the outer edge of the conversion metric join node itself. This is functionally identical to the original logic, but it adds this extra where constraint node at the moment.

The additional subquery should vanish when we update the optimizer to remove filters applied via pushdown.
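The timing difference can be made concrete with a toy traversal. This is a hypothetical sketch (node names and the `pushdown_depth` helper are made up for illustration), showing why disabling pushdown one node later lets the filter travel one extra step:

```python
# Simplified top-down traversal order for a conversion metric query.
plan = ["where_constraint", "compute_metrics", "join_on_conversion", "source"]


def pushdown_depth(plan_nodes, disable_at):
    # Count how many nodes the filter is pushed past before pushdown is
    # disabled at `disable_at`.
    depth = 0
    for node in plan_nodes:
        if node == disable_at:
            break
        if node == "where_constraint":
            continue  # the filter itself is the starting point, not a step
        depth += 1
    return depth


# Builder: metric type is known up front, so pushdown stops immediately.
print(pushdown_depth(plan, "compute_metrics"))     # 0
# Optimizer: the visitor only learns it is a conversion metric at the join
# node, so the filter has already moved one node further down.
print(pushdown_depth(plan, "join_on_conversion"))  # 1
```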

Comment on lines +22 to +38
visit__referrer_id
, visits
FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, referrer_id AS visit__referrer_id
, 1 AS visits
FROM ***************************.fct_visits visits_source_src_28000
) subq_19
WHERE (
metric_time__day BETWEEN '2020-01-01' AND '2020-01-02'
) AND (
visit__referrer_id = 'ref_id_01'
)
) subq_22

Another query-time filter on a conversion metric…

Comment on lines +25 to +41
metric_time__day
, visit__referrer_id
, visits
FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT
DATE_TRUNC('day', ds) AS metric_time__day
, referrer_id AS visit__referrer_id
, 1 AS visits
FROM ***************************.fct_visits visits_source_src_28000
) subq_19
WHERE (
metric_time__day BETWEEN '2020-01-01' AND '2020-01-02'
) AND (
visit__referrer_id = 'ref_id_01'
)

…and another, but this should be the last one.

@@ -100,53 +100,16 @@
<!-- include_spec = -->
<!-- TimeDimensionSpec(element_name='metric_time', time_granularity=DAY) -->
<!-- distinct = False -->
<WhereConstraintNode>

This is a pre-optimization plan, so it is expected.

MAX(subq_48.average_booking_value) AS average_booking_value
, MAX(subq_61.bookings) AS bookings
, MAX(subq_69.booking_value) AS booking_value
MAX(subq_45.average_booking_value) AS average_booking_value

The remaining snapshots should have the following pattern:

  1. The standard snapshots should lose the duplicated where constraint subqueries that the original pushdown implementation added in the DataflowPlanBuilder. I added a comment in tests_metricflow/snapshots/test_predicate_pushdown_rendering.py/SqlQueryPlan/DuckDB/test_single_categorical_dimension_pushdown__plan0.sql to illustrate this.
  2. The optimized snapshots will have a bunch of subquery ID updates due to the removal of the subqueries from the un-optimized plans. They should not include other changes.

FROM (
-- Constrain Output with WHERE

The removal of this subquery is what causes all of the ID number thrash.

@courtneyholcomb (Contributor) left a comment


LGTM - just one question about some of the optimized SQL that mayyy not be totally relevant to this PR

FROM (
-- Read Elements From Semantic Model 'visits_source'
-- Metric Time Dimension 'ds'
SELECT

Curious why this subquery doesn't get collapsed into the next outer query?


I'm not 100% sure. The basic subquery reducer skips any subquery with a where expression, but the rewriting subquery reducer should do this collapsing at least for the inner where query. It's a bit odd that it doesn't, and I haven't figured out why not. My initial theory was that this is due to the nested where constraint subqueries, but if that's the case I don't understand why it wasn't collapsed in the first place. It may be an interaction between the join and filter subqueries - renaming things is harder when joins get involved.
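The conservatism being guessed at above can be sketched as two collapse rules. This is speculation consistent with the comment, not the actual MetricFlow reducer logic; all names are made up:

```python
def basic_can_collapse(inner_has_where: bool) -> bool:
    # A basic reducer's blunt rule: never merge a subquery that carries its
    # own WHERE expression into its parent.
    return not inner_has_where


def rewriting_can_collapse(
    inner_has_where: bool, outer_aggregates: bool, outer_has_join: bool
) -> bool:
    # A rewriting reducer can merge the inner filter upward, but only when
    # nothing between the two levels changes row multiplicity or grouping;
    # aggregation or a join in the outer query blocks the rewrite.
    return not (inner_has_where and (outer_aggregates or outer_has_join))


print(basic_can_collapse(True))                   # False: WHERE always blocks
print(rewriting_can_collapse(True, False, True))  # False: the join blocks it
print(rewriting_can_collapse(True, False, False))  # True: safe to merge
```

If the join-blocks-rewrite rule holds, it would explain why the nested where constraint subqueries survive here even though the rewriting reducer normally collapses filtered subqueries.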

@@ -86,47 +86,16 @@
<!-- ) -->
<!-- include_spec = TimeDimensionSpec(element_name='metric_time', time_granularity=DAY) -->
<!-- distinct = False -->
<WhereConstraintNode>

Ok so this must be the un-optimized DFP


Ah, right, that test only runs the source scan optimizer, so whether optimized or not the predicate pushdown won't be applied.

@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 17, 2024
@tlento tlento force-pushed the separate-node-resolver-ids branch from 90caf50 to 4459fb6 Compare June 25, 2024 03:12
@tlento tlento force-pushed the move-categorical-dimension-pushdown-to-optimizer branch from cb3a4ce to 3cbd865 Compare June 25, 2024 03:12
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 25, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 25, 2024 03:16 — with GitHub Actions Inactive

tlento commented Jun 25, 2024

Merge activity

  • Jun 24, 8:29 PM PDT: @tlento started a stack merge that includes this pull request via Graphite.
  • Jun 24, 9:07 PM PDT: Graphite rebased this pull request as part of a merge.
  • Jun 24, 9:11 PM PDT: @tlento merged this pull request with Graphite.

@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Jun 25, 2024
@tlento tlento force-pushed the separate-node-resolver-ids branch from 4459fb6 to 5d4109c Compare June 25, 2024 04:01
Base automatically changed from separate-node-resolver-ids to main June 25, 2024 04:06