
Remove Dataflow Plan Node Types #1205

Merged: 8 commits merged into main on May 16, 2024

Conversation

@plypaul (Contributor) commented May 11, 2024

Description

The original class hierarchy for the DataflowPlanNodes included types that described the data output by each node. However, those types turned out not to be useful in practice (e.g. BaseOutput covered the majority of use cases), so this PR removes them.
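
Roughly, the change collapses the hierarchy as in the sketch below. This is a simplified illustration rather than the actual diff: DataflowPlanNode, BaseOutput, and WhereConstraintNode are names that appear in this PR, but the class bodies are elided and the exact set of marker types differs in the real code.

# Before: intermediate marker types described the kind of data a node outputs.
class DataflowPlanNode: ...                 # abstract base for all dataflow plan nodes
class BaseOutput(DataflowPlanNode): ...     # marker for "ordinary" output (the common case)

class WhereConstraintNode(BaseOutput): ...  # concrete nodes subclassed one of the markers

# After: the marker layer is removed; concrete nodes derive from the base class directly.
class WhereConstraintNode(DataflowPlanNode): ...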

@tlento (Contributor) left a comment

I love this PR. It is third in the set of cleanup changes I was planning to make after predicate pushdown, but that ordering had more to do with personal preferences than anything else.

I have not really read it because it's, well, hard to read, and I'm trying to get another feature out the door.

The main problem I ran into here is that this PR represents, conceptually, three or four changes in one:

  1. removing useless type marker nodes
  2. removing SinkNodes
  3. simplifying the optimizer return types (which is probably fine to consolidate into the SinkNode removal, and is a separate commit here)
  4. doing the natural consolidation of BaseOutput into DataflowPlanNode.

If this were split along these - or other equally readable - lines, it'd be a "scroll through and make sure nothing looks weird" review. As it is now, it's quite a bit more daunting.

Similarly, if this causes some weirdness, then viewing it via git diff or whatever is going to be a headache since we squash and merge, and it'd be more accessible in every piece of tooling if it were broken into more readily addressable chunks.

Anyway, you can do what you will with this. If you're going to break this up let me know and I won't review it until you're done, otherwise I'll take a pass through when I have more time.

Comment on lines +121 to +184
def visit_source_node(self, node: ReadSqlSourceNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_join_on_entities_node(self, node: JoinOnEntitiesNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_aggregate_measures_node(self, node: AggregateMeasuresNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_compute_metrics_node(self, node: ComputeMetricsNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_order_by_limit_node(self, node: OrderByLimitNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_where_constraint_node(self, node: WhereConstraintNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_filter_elements_node(self, node: FilterElementsNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_combine_aggregated_outputs_node(self, node: CombineAggregatedOutputsNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_constrain_time_range_node(self, node: ConstrainTimeRangeNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_join_over_time_range_node(self, node: JoinOverTimeRangeNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_semi_additive_join_node(self, node: SemiAdditiveJoinNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_metric_time_dimension_transform_node(
self, node: MetricTimeDimensionTransformNode
) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_join_to_time_spine_node(self, node: JoinToTimeSpineNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_min_max_node(self, node: MinMaxNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_add_generated_uuid_column_node(self, node: AddGeneratedUuidColumnNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError

@override
def visit_join_conversion_events_node(self, node: JoinConversionEventsNode) -> ConvertToExecutionPlanResult:
raise NotImplementedError
@tlento (Contributor) commented:

Speaking of things looking weird, big blocks of green inside mechanical change diffs always make an impression. Now, if anybody attempts to convert a dataflow plan to an execution plan using the wrong node type, the runtime will blow up with a NotImplementedError. This seems undesirable. While I was never a fan of the existence of the SinkNodeVisitor interface, its one redeeming feature was preventing this from happening. Indeed, I originally suggested it as an "if you must use a visitor to do this graph-level property access then please at least make it a different type".

Do you have a follow-up planned where you get rid of this one way or another? My planned stack was going to involve replacing the visitor itself with a property on the DataflowPlan, and pausing on removing the execution plan stuff until later just because I don't want to deal with the MFS changes right now, but removing execution plans altogether (so much for our earlier execution plan aspirations...) would be welcome.

@tlento (Contributor) commented:

Consolidating to the one execution plan type we actually use looks like a win to me. Implementing a DagWalker that does not - and in fact absolutely SHOULD NOT - walk the DAG is confusing. Maybe just update that to be a regular class?
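
As a rough, hedged sketch of what that could look like (combining this suggestion with the property idea from the earlier comment; the class name is assumed from context, and checked_sink_node is the accessor discussed later in this PR):

# Sketch: the dataflow-to-execution-plan converter as a regular class. It no longer
# implements the dataflow plan node-visitor interface, so there are no visit_* methods
# that exist only to raise NotImplementedError, and nothing pretends to walk the DAG.
class DataflowToExecutionPlanConverter:  # class name assumed
    def convert(self, dataflow_plan: DataflowPlan) -> ConvertToExecutionPlanResult:
        # Only the sink node matters for building an execution plan, so read it via
        # the plan's accessor instead of visiting every node type.
        sink_node = dataflow_plan.checked_sink_node
        ...  # build and return the execution plan from sink_node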

@plypaul (Contributor Author) commented May 12, 2024

> Anyway, you can do what you will with this. If you're going to break this up let me know and I won't review it until you're done, otherwise I'll take a pass through when I have more time.

Unfortunately, breaking out that change might be a bit tough, so let's go with the assumption that the commits will be as is. If that changes, I'll let you know.

@tlento (Contributor) left a comment

Thanks for splitting out the rename; that made a surprisingly big difference!

At some point we should really reconsider the multiple sink nodes with the runtime errors. I don't think we're likely to need multiple sinks anymore, since all we do is render queries to a single output stream (i.e., the outer SELECT statement), and any forking of that output data stream is probably best handled outside of MetricFlow. Not sure if you're doing that upstack or not, but it's a thing to consider.

-        raise RuntimeError("Can't create a dataflow plan without sink node(s).")
-        self._sink_output_nodes = tuple(sink_output_nodes)
+    def __init__(self, sink_nodes: Sequence[DataflowPlanNode], plan_id: Optional[DagId] = None) -> None:  # noqa: D107
+        assert len(sink_nodes) == 1, f"Exactly 1 sink node is supported. Got: {sink_nodes}"
@tlento (Contributor) commented:

Out of curiosity, are we going to formalize this via the type system?

@plypaul (Contributor Author) replied:

I haven't had a chance to think about how to simplify this, but yeah, that sounds like a good idea.
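
One possible way to formalize it, as a hedged sketch rather than anything from this PR: have the constructor accept exactly one node, so the "exactly 1 sink" rule lives in the signature instead of an assertion (DataflowPlanNode and DagId are the existing MetricFlow types referenced above).

from typing import Optional

class DataflowPlan:
    def __init__(self, sink_node: DataflowPlanNode, plan_id: Optional[DagId] = None) -> None:
        # Taking a single node makes "exactly one sink" part of the type signature;
        # the runtime length assertion on a sequence is no longer needed.
        self._sink_node = sink_node

    @property
    def sink_node(self) -> DataflowPlanNode:
        return self._sink_node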

-    def sink_output_node(self) -> DataflowPlanNode:  # noqa: D102
-        assert len(self._sink_output_nodes) == 1, f"Only 1 sink node supported. Got: {self._sink_output_nodes}"
-        return self._sink_output_nodes[0]
+    def checked_sink_node(self) -> DataflowPlanNode:
@tlento (Contributor) commented:

Can this just be sink_node? We already have the assertion in the initializer.

@plypaul (Contributor Author) replied:

Yeah, updated.

@plypaul plypaul enabled auto-merge (squash) May 16, 2024 02:20
@plypaul plypaul merged commit 8ba3897 into main May 16, 2024
15 checks passed