Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable DataflowPlanOptimizers for query rendering tests #1263

Open
wants to merge 3 commits into
base: add-tracking-only-pushdown-optimizer
Choose a base branch
from

Conversation

tlento
Copy link
Contributor

@tlento tlento commented Jun 12, 2024

Enable DataflowPlanOptimizers for query rendering tests

The metricflow query rendering tests do snapshot generation and comparison
for standard rendering and optimized rendering. However, these plans only
run the SqlQueryPlanOptimizers - they do not use the DataflowPlanOptimizers.

This means our optimized plans were only partially optimized. Now with
predicate pushdown it would be helpful to see the complete optimization
effect on query plan rendering to SQL.

This change makes that possible by including DataflowPlanOptimizers in
the comparison helper function. For the time being, and to minimize
thrash in snapshot plans, we only include the no-op PredicatePushdownOptimizer.
This will allow us to track the impact of enabling predicate pushdown via
that optimizer through query plan snapshot changes.

A later change will add the branch combiner and update snapshot rendering
accordingly.

Note the distinct values tests needed a quick hack to keep working, which proved less
silly than a local refactor of the helper method.

Snapshot changes should be limited to ID number updates.

The metricflow query rendering tests do snapshot generation and comparison
for standard rendering and optimized rendering. However, these plans only
run the SqlQueryPlanOptimizers - they do not use the DataflowPlanOptimizers.

This means our optimized plans were only partially optimized. Now with
predicate pushdown it would be helpful to see the complete optimization
effect on query plan rendering to SQL.

This change makes that possible by including DataflowPlanOptimizers in
the comparison helper function. For the time being, and to minimize
thrash in snapshot plans, we only include the no-op PredicatePushdownOptimizer.
This will allow us to track the impact of enabling predicate pushdown via
that optimizer through query plan snapshot changes.

A later change will add the branch combiner and update snapshot rendering
accordingly.
Quick hack to get these working. Note for reviewers, this was less
silly than factoring out the common internal logic in this method
and splitting the entry-point - it's a whole lot of duplication of
mf_test_configuration and stuff no matter what, so we might as well
just jam in this conditional.
@cla-bot cla-bot bot added the cla:yes label Jun 12, 2024
Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

Copy link
Contributor Author

tlento commented Jun 12, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @tlento and the rest of your teammates on Graphite Graphite

@tlento tlento added Skip Changelog Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment labels Jun 12, 2024
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 12, 2024 02:18 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 12, 2024 02:18 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 12, 2024 02:18 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS June 12, 2024 02:18 — with GitHub Actions Inactive
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing this!!

dimension_specs=(),
time_dimension_specs=(
TimeDimensionSpec(
element_name="ds",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not really related to the PR, but we don't allow non-metric_time time dims without entity links so it would make this more realistic to add one here.

time_dimension_specs=(
TimeDimensionSpec(
element_name="ds",
entity_links=(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same nit here

dimension_specs=(),
time_dimension_specs=(
TimeDimensionSpec(
element_name="ds",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

COALESCE(subq_27.visit__referrer_id, subq_38.visit__referrer_id) AS visit__referrer_id
, MAX(subq_27.visits) AS visits
, MAX(subq_38.buys) AS buys
COALESCE(subq_31.visit__referrer_id, subq_42.visit__referrer_id) AS visit__referrer_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confused how we ended up with MORE aliases after adding more optimization, which I would expect to reduce aliases 🤔 but alas, not particularly important!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's not aliases, it's subgraph ID numbers. The reason for the change has to do with the shift in the number of calls to the DataflowPlanBuilder.build_plan() methods.

Basically, we add subquery aliases inside the DataflowToSqlQueryPlanConverter. This happens in like 20 different places in there, because it makes a lot of subqueries. Fair enough.

The thing is, we have another class - the DataflowPlanNodeOutputDatasetResolver - which we use both for creating source nodes and also for doing certain operations inside the DataflowPlanBuilder. This class is a direct subclass of the DataflowToSqlQueryPlanConverter, and when it does its work it basically creates a whole bunch of new subqueries, which means every time we add calls to it we change these subquery numbers.

That's also why we have these huge ID number values - we've basically gone over the input DAG a whole bunch of times, incrementing ID numbers all the way, before we produce the final query output. So an unoptimized query still has high numbers and a bunch of hard to explain gaps in the sequence.

Adding more optimizers to the set won't change this going forward, since we will still be generating the same number of plans (although the plan emitted by the optimizer might have totally different ID number ranges, because optimizers actually have their own ), but the update in ID numbers is one of the reasons why I didn't add the branch combiner to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants