Add a Resolver for Saved-Query Dependencies #1152

plypaul · 2024-04-24T22:15:06Z

Description

This PR adds a resolver to get the dependencies of a saved query.

github-actions · 2024-04-24T22:15:23Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

This was moved to address circular-dependency issues.

This was moved to address circular-dependency issues. Note: `GroupByMetricSpec` had to be moved back in a later commit.

…turn a result object.

This was moved to address circular-dependency issues. Some methods were removed / changed to help support the split. The general issue was that because the set classes had bidirectional dependences with the spec classes, circular dependencies could be easily formed when imported by other modules.

`LinkableElementSet` will be used in group-by resolution to keep track of the semantic models that a query would read.

… classes. Using `LinkableElementSet` allows retrieval of the semantic models that are needed for computation.

tlento

Seems reasonable! I can make use of the LinkableELementSet return value - I was going to make a similar set of changes before all this started anyway.

Note: this approval assumes the large blocks of green in cases where things existed before are copy/paste. I did not read those sections carefully.

Lots of small things inline. I think the main things are some added test coverage and the usual "I don't like visitors" thing. I'd like the tests done either separately as a follow-up or prior to merge, it's up to you.

tlento · 2024-04-26T22:46:12Z

metricflow/dataflow/builder/dataflow_plan_builder.py

@@ -812,6 +813,14 @@ def _build_measure_spec_properties(self, measure_specs: Sequence[MeasureSpec]) -
            non_additive_dimension_spec=non_additive_dimension_spec,
        )

+    def _query_spec_for_source_node(self, group_by_metric_spec: GroupByMetricSpec) -> MetricFlowQuerySpec:


attn @courtneyholcomb, just FYI, this function moved into the DataflowPlanBuilder.

tlento · 2024-04-26T22:48:11Z

metricflow-semantics/metricflow_semantics/query/query_parser.py


-    @log_runtime()
+    # @log_runtime()


Will this come back later or be deleted?

That's a typo - updated.

tlento · 2024-04-26T22:49:16Z

metricflow-semantics/metricflow_semantics/query/query_parser.py

@@ -115,15 +116,19 @@ def parse_and_validate_saved_query(
        if where_filter is not None:
            where_filters.append(where_filter)

-        return self.parse_and_validate_query(
+        return self._parse_and_validate_query(


Ok, so this isn't applied to parse_and_validate_query more broadly just yet.

tlento · 2024-04-26T22:52:49Z

metricflow-semantics/metricflow_semantics/specs/spec_set.py

+class _GroupSpecByTypeVisitor(InstanceSpecVisitor[None]):
+    """Groups a spec by type into an `InstanceSpecSet`."""
+
+    metric_specs: List[MetricSpec] = dataclasses.field(default_factory=list)
+    measure_specs: List[MeasureSpec] = dataclasses.field(default_factory=list)
+    dimension_specs: List[DimensionSpec] = dataclasses.field(default_factory=list)
+    entity_specs: List[EntitySpec] = dataclasses.field(default_factory=list)
+    time_dimension_specs: List[TimeDimensionSpec] = dataclasses.field(default_factory=list)
+    group_by_metric_specs: List[GroupByMetricSpec] = dataclasses.field(default_factory=list)
+    metadata_specs: List[MetadataSpec] = dataclasses.field(default_factory=list)
+
+    @override
+    def visit_measure_spec(self, measure_spec: MeasureSpec) -> None:
+        self.measure_specs.append(measure_spec)
+
+    @override
+    def visit_dimension_spec(self, dimension_spec: DimensionSpec) -> None:
+        self.dimension_specs.append(dimension_spec)
+
+    @override
+    def visit_time_dimension_spec(self, time_dimension_spec: TimeDimensionSpec) -> None:
+        self.time_dimension_specs.append(time_dimension_spec)
+
+    @override
+    def visit_entity_spec(self, entity_spec: EntitySpec) -> None:
+        self.entity_specs.append(entity_spec)
+
+    @override
+    def visit_group_by_metric_spec(self, group_by_metric_spec: GroupByMetricSpec) -> None:
+        self.group_by_metric_specs.append(group_by_metric_spec)
+
+    @override
+    def visit_metric_spec(self, metric_spec: MetricSpec) -> None:
+        self.metric_specs.append(metric_spec)
+
+    @override
+    def visit_metadata_spec(self, metadata_spec: MetadataSpec) -> None:
+        self.metadata_specs.append(metadata_spec)


Seriously though, the type refinement standard in python is the isinstance check. If that's unpalatable then that indicates we should avoid the need to perform this operation. Looking above I think the real issue is we've been disguising this for a very long time, so it'll be challenging to undo it.

Given that, I'd rather we follow the language standard here than add a visitor implementation. The drawback is if someone forgets to add a new spec type it's possible they'll forget to update the instance check logic, but that is an exceedingly rare case and it's something anybody adding a new spec has to deal with everywhere else in the InstanceSpecSet class anyway.

It's up to you if you want to change this here, or if you want me to do it when I rip out the InstanceSpecVisitor as a construct.

The rule I've been following is: no isinstance calls unless it's in a equals or comparison method. It's a code-smell, and when it's in the code base, it tends to spread.

The rule I've been following is: no isinstance calls unless it's in a equals or comparison method. It's a code-smell, and when it's in the code base, it tends to spread.

This rule is too rigid for a gradually typed language like Python where type refinement is typically done via isinstance(). Assertions to hint to the typechecker are a thing, as are instance checks to classify an object by type in the (hopefully exceedingly rare) cases where you need to do so. As such, this approach leads to a proliferation of code that does basically nothing.

In this specific case, you could've put ~10 lines of logic into the existing method with a comment indicating that we are aware of the type deficiencies and do not want these checks to proliferate through the codebase at large, and an assertion to cover completeness in runtime tests. Instead you've added 2 new module-scope functions, a module-private helper class, and ~100 lines of boilerplate to accomplish the same end.

I don't think this is a good trade, especially given that the code smell I keep running into in our codebase is calcified generic abstractions that serve no functional purpose other than to hide the fact that the custom type foundation of the system is inadequate. The InstanceSpecVisitor is a perfect example of this, with the SinkNodeVisitor (and the existence of the DataflowPlanSinkNodes or whatever they're called themselves) representing another one.

You are correct that isinstance calls tend to spread. They must, if the custom type hierarchy we're relying on isn't working well. The thing is, these visitor implementations that make the codebase fundamentally harder to reason about and more awkward to work with also tend to spread in these situations. I don't find the latter more palatable than the former - rather the opposite, in the specific case of MetricFlow, because we routinely tie ourselves into knots to satisfy the constraints imposed by these generic abstractions and they provide absolutely nothing of value in return.

I feel the correct way to avoid spread is through continuous improvement of type specifications, and rigor and encapsulation around the isinstance checks themselves. Relying on inappropriate object oriented design patterns that hide the mess behind generics and piles of boilerplate represents a huge step backwards in terms of readability in the codebase and communication around what we intend vs what exists today and requires change.

Since we had had this discussion a few times and to avoid having this thread be the place where we debate, let me just say that I see your points and I'm up for continued discussion on the issue.

tlento · 2024-04-26T23:12:25Z

metricflow-semantics/metricflow_semantics/model/semantics/linkable_element.py


 @dataclass(frozen=True)
-class LinkableEntity:
+class LinkableEntity(LinkableElement, SemanticModelDerivation):


You don't need both of these, right, just LinkableElement?

Right - updated.

tlento · 2024-04-26T23:12:34Z

metricflow-semantics/metricflow_semantics/model/semantics/linkable_element.py


 @dataclass(frozen=True)
-class LinkableMetric:
+class LinkableMetric(LinkableElement, SemanticModelDerivation):


Same thing here, just LinkableElement?

tlento · 2024-04-26T23:16:56Z

metricflow-semantics/metricflow_semantics/model/semantics/linkable_element_set.py

+                entity_links=path_key.entity_links,
+            )
+        elif path_key.element_type is LinkableElementType.TIME_DIMENSION:
+            assert path_key.time_granularity is not None


nit: error message, maybe something like "type refinement, should have been checked in dataclass validation"

tlento · 2024-04-26T23:20:39Z

metricflow-semantics/metricflow_semantics/model/semantics/linkable_element_set.py

+
+    @property
+    @override
+    def derived_from_semantic_models(self) -> Sequence[SemanticModelReference]:


Could we add some test cases for this and the filter by spec pattern operations? They're quite a bit more complex than the direct to spec conversion.

Add some test cases.

tlento · 2024-04-26T23:34:48Z

metricflow-semantics/metricflow_semantics/api/v0_1/saved_query_dependency_resolver.py

+        try:
+            return self._resolve_dependencies(saved_query_name)
+        except Exception:
+            logger.exception(


Ok. We might need to do something about this log level for deployment - if it turns out to be common we'll want it to show up with level WARN or INFO instead of EXCEPTION - but I think we can do that via datadog instead of here.

This is a fail-safe and shouldn't really be hit - if it is, it means there's a bug that we need to fix.

plypaul · 2024-04-27T00:40:44Z

Note: this approval assumes the large blocks of green in cases where things existed before are copy/paste. I did not read those sections carefully.

That's correct. There shouldn't be significant logic changes in those sections.

tlento

The updates look good, thank you, let's get this merged as soon as it's ready. The only update you might want to make before merge is the more_itertools change but even that I can do later as I grind my way through this thing I'm trying to get done. The useless visitor chafes but I'm just going to start removing stuff like that so one extra diff in that stack won't matter much to me one way or the other.

tlento · 2024-04-28T23:58:40Z

metricflow-semantics/metricflow_semantics/specs/spec_set.py

+class _GroupSpecByTypeVisitor(InstanceSpecVisitor[None]):
+    """Groups a spec by type into an `InstanceSpecSet`."""
+
+    metric_specs: List[MetricSpec] = dataclasses.field(default_factory=list)
+    measure_specs: List[MeasureSpec] = dataclasses.field(default_factory=list)
+    dimension_specs: List[DimensionSpec] = dataclasses.field(default_factory=list)
+    entity_specs: List[EntitySpec] = dataclasses.field(default_factory=list)
+    time_dimension_specs: List[TimeDimensionSpec] = dataclasses.field(default_factory=list)
+    group_by_metric_specs: List[GroupByMetricSpec] = dataclasses.field(default_factory=list)
+    metadata_specs: List[MetadataSpec] = dataclasses.field(default_factory=list)
+
+    @override
+    def visit_measure_spec(self, measure_spec: MeasureSpec) -> None:
+        self.measure_specs.append(measure_spec)
+
+    @override
+    def visit_dimension_spec(self, dimension_spec: DimensionSpec) -> None:
+        self.dimension_specs.append(dimension_spec)
+
+    @override
+    def visit_time_dimension_spec(self, time_dimension_spec: TimeDimensionSpec) -> None:
+        self.time_dimension_specs.append(time_dimension_spec)
+
+    @override
+    def visit_entity_spec(self, entity_spec: EntitySpec) -> None:
+        self.entity_specs.append(entity_spec)
+
+    @override
+    def visit_group_by_metric_spec(self, group_by_metric_spec: GroupByMetricSpec) -> None:
+        self.group_by_metric_specs.append(group_by_metric_spec)
+
+    @override
+    def visit_metric_spec(self, metric_spec: MetricSpec) -> None:
+        self.metric_specs.append(metric_spec)
+
+    @override
+    def visit_metadata_spec(self, metadata_spec: MetadataSpec) -> None:
+        self.metadata_specs.append(metadata_spec)


The rule I've been following is: no isinstance calls unless it's in a equals or comparison method. It's a code-smell, and when it's in the code base, it tends to spread.

This rule is too rigid for a gradually typed language like Python where type refinement is typically done via isinstance(). Assertions to hint to the typechecker are a thing, as are instance checks to classify an object by type in the (hopefully exceedingly rare) cases where you need to do so. As such, this approach leads to a proliferation of code that does basically nothing.

In this specific case, you could've put ~10 lines of logic into the existing method with a comment indicating that we are aware of the type deficiencies and do not want these checks to proliferate through the codebase at large, and an assertion to cover completeness in runtime tests. Instead you've added 2 new module-scope functions, a module-private helper class, and ~100 lines of boilerplate to accomplish the same end.

I don't think this is a good trade, especially given that the code smell I keep running into in our codebase is calcified generic abstractions that serve no functional purpose other than to hide the fact that the custom type foundation of the system is inadequate. The InstanceSpecVisitor is a perfect example of this, with the SinkNodeVisitor (and the existence of the DataflowPlanSinkNodes or whatever they're called themselves) representing another one.

You are correct that isinstance calls tend to spread. They must, if the custom type hierarchy we're relying on isn't working well. The thing is, these visitor implementations that make the codebase fundamentally harder to reason about and more awkward to work with also tend to spread in these situations. I don't find the latter more palatable than the former - rather the opposite, in the specific case of MetricFlow, because we routinely tie ourselves into knots to satisfy the constraints imposed by these generic abstractions and they provide absolutely nothing of value in return.

I feel the correct way to avoid spread is through continuous improvement of type specifications, and rigor and encapsulation around the isinstance checks themselves. Relying on inappropriate object oriented design patterns that hide the mess behind generics and piles of boilerplate represents a huge step backwards in terms of readability in the codebase and communication around what we intend vs what exists today and requires change.

tlento · 2024-04-29T06:51:40Z

metricflow-semantics/metricflow_semantics/collection_helpers/dedupe.py

+def ordered_dedupe(*iterables: Iterable[IterableT]) -> Tuple[IterableT, ...]:
+    """De-duplicates the items in the iterables while preserving the order."""
+    ordered_results: Dict[IterableT, None] = {}
+    for iterable in iterables:
+        for item in iterable:
+            ordered_results[item] = None
+
+    return tuple(ordered_results.keys())


This is a less robust custom implementation of the (admittedly poorly named) unique_everseen function in more-itertools.

Since dbt-semantic-interfaces depends on more-itertools we should add the dependency with the same version range and use it here as well.

Actually I just noticed this takes *iterables, so it's a less robust implementation of unique_everseen(itertools.chain(iterables)). The point still stands, we should use the libraries, especially where iterator collection helpers are concerned.

Sure, we can make that change, but let me check about the "minimize dependencies" request and check how that method works with types.

I already looked at the dependencies, that's not an issue. The typing is another matter, more-itertools is an old library and these methods date back a long way.

cla-bot bot added the cla:yes label Apr 24, 2024

plypaul added the Skip Changelog label Apr 24, 2024

plypaul force-pushed the plypaul--12--saved-query-resolver branch 2 times, most recently from 11787e3 to ebfe930 Compare April 24, 2024 22:24

plypaul force-pushed the plypaul--11--repackage branch from 0cea41f to e626208 Compare April 24, 2024 22:43

plypaul force-pushed the plypaul--12--saved-query-resolver branch 2 times, most recently from 1a36bf8 to 3018cdf Compare April 24, 2024 22:51

plypaul force-pushed the plypaul--11--repackage branch from e626208 to 4afbd09 Compare April 24, 2024 22:51

plypaul force-pushed the plypaul--12--saved-query-resolver branch from 3018cdf to e14c2b2 Compare April 25, 2024 17:09

plypaul force-pushed the plypaul--11--repackage branch 3 times, most recently from d62f62c to 47d7057 Compare April 26, 2024 09:59

Base automatically changed from plypaul--11--repackage to main April 26, 2024 10:07

plypaul force-pushed the plypaul--12--saved-query-resolver branch 3 times, most recently from 257fda7 to d9749ad Compare April 26, 2024 10:29

Move LinkableElementProperty.

9d73ca5

This was moved to address circular-dependency issues.

plypaul force-pushed the plypaul--12--saved-query-resolver branch from 31937ea to 46fe4ac Compare April 26, 2024 21:26

plypaul added 10 commits April 26, 2024 14:29

Move query-related specs to separate file.

0fa7db7

This was moved to address circular-dependency issues. Note: `GroupByMetricSpec` had to be moved back in a later commit.

Remove MetricFlowQuerySpec dependency from GroupByMetricSpec.

c270b8b

Change MetricFlowQueryParser.parse_and_validate_saved_query() to re…

4367174

…turn a result object.

Update LinkableElementSet to include semantic model and spec methods.

0673c22

`LinkableElementSet` will be used in group-by resolution to keep track of the semantic models that a query would read.

Use LinkableElementSet instead of specs in group-by-item resolution…

9cb1650

… classes. Using `LinkableElementSet` allows retrieval of the semantic models that are needed for computation.

Associated import and method changes in the rest of the codebase.

5ed211e

Add interface for SavedQueryDependencyResolver.

847a4d2

Update snapshots for where filter changes.

d31029e

Update snapshots to include spec -> element set updates.

76a11f9

plypaul force-pushed the plypaul--12--saved-query-resolver branch from 46fe4ac to 76a11f9 Compare April 26, 2024 21:30

Add change log for #1155.

4a19e0f

plypaul marked this pull request as ready for review April 26, 2024 21:45

plypaul changed the title ~~Add an Interface to Resolve Saved-Query Dependencies~~ Add a Resolver for Saved-Query Dependencies Apr 26, 2024

plypaul mentioned this pull request Apr 26, 2024

Interface for SavedQueryDependencyResolver dbt-labs/dbt-semantic-interfaces#278

Closed

4 tasks

tlento approved these changes Apr 26, 2024

View reviewed changes

plypaul added 2 commits April 26, 2024 16:41

Move _path_key_to_spec to ElementPathKey.

bcbf1bf

Address comments.

07761a3

tlento reviewed Apr 29, 2024

View reviewed changes

plypaul merged commit f54eeed into main Apr 29, 2024
15 checks passed

plypaul deleted the plypaul--12--saved-query-resolver branch April 29, 2024 16:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a Resolver for Saved-Query Dependencies #1152

Add a Resolver for Saved-Query Dependencies #1152

plypaul commented Apr 24, 2024 •

edited

Loading

github-actions bot commented Apr 24, 2024

tlento left a comment •

edited

Loading

tlento Apr 26, 2024

tlento Apr 26, 2024

plypaul Apr 27, 2024

tlento Apr 26, 2024

tlento Apr 26, 2024

tlento Apr 26, 2024

plypaul Apr 27, 2024

tlento Apr 28, 2024

plypaul Apr 29, 2024

tlento Apr 26, 2024

plypaul Apr 27, 2024

tlento Apr 26, 2024

plypaul Apr 27, 2024

tlento Apr 26, 2024

plypaul Apr 27, 2024

tlento Apr 26, 2024

plypaul Apr 27, 2024

tlento Apr 26, 2024

plypaul Apr 27, 2024

plypaul commented Apr 27, 2024

tlento left a comment

tlento Apr 28, 2024

tlento Apr 29, 2024

tlento Apr 29, 2024

plypaul Apr 29, 2024

tlento Apr 29, 2024

Add a Resolver for Saved-Query Dependencies #1152

Add a Resolver for Saved-Query Dependencies #1152

Conversation

plypaul commented Apr 24, 2024 • edited Loading

Description

github-actions bot commented Apr 24, 2024

tlento left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

plypaul commented Apr 27, 2024

tlento left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

plypaul commented Apr 24, 2024 •

edited

Loading

tlento left a comment •

edited

Loading