[SL-1235] [Feature] Ambiguous Group-By-Item Resolution #887

Closed
3 tasks done
plypaul opened this issue Nov 17, 2023 · 1 comment
Labels: enhancement (New feature or request), High priority (Created by Linear-GitHub Sync), linear, triage (Tasks that need to be triaged)

Comments

plypaul commented Nov 17, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing metricflow functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Background

Previously, group-by-items were input by the user in a relatively specific form. For example, the group-by-item:

guest__listing__created_at__month

refers to the created_at time dimension at a month grain that is resolved by joining the measure source to the dimension sources by the guest and listing entities.
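
As a rough illustration of how such a name decomposes, here is a minimal sketch of a parser for the dunder format. The `ParsedName` class, the `parse_dunder_name` function, and the `KNOWN_GRAINS` tuple are hypothetical names introduced for this example and are not part of MetricFlow.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed set of grain names for this sketch; the supported grains may differ.
KNOWN_GRAINS = ("day", "week", "month", "quarter", "year")


@dataclass(frozen=True)
class ParsedName:
    """Hypothetical container for the parts of a dunder-format name."""

    entity_links: Tuple[str, ...]
    element_name: str
    grain: Optional[str]


def parse_dunder_name(name: str) -> ParsedName:
    """Split a dunder-format name into entity links, element name, and grain."""
    parts = name.split("__")
    grain = None
    # A trailing part that is a known grain is treated as the time grain.
    if len(parts) > 1 and parts[-1] in KNOWN_GRAINS:
        grain = parts.pop()
    # Everything before the last remaining part is an entity link.
    return ParsedName(entity_links=tuple(parts[:-1]), element_name=parts[-1], grain=grain)


parsed = parse_dunder_name("guest__listing__created_at__month")
print(parsed)
```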

We have since migrated from that interface to allow for additional naming formats for specifying group-by-items, and to allow for a more ambiguous specification.

Specification Updates

Additional Naming Format (Object Builder)

The object builder format uses notation similar to constructing an object in Python (or a similar language) with the builder pattern. For example:

Dimension('metric_time').grain('month')

specifies the metric_time dimension at the month grain.
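
To make the builder notation concrete, the sketch below shows one way such an object could behave in plain Python. This `Dimension` class and its `time_grain` field are assumptions for illustration only and do not reproduce MetricFlow's own classes.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Dimension:
    """Hypothetical stand-in for an object-builder style dimension reference."""

    name: str
    time_grain: Optional[str] = None  # None means the grain was left unspecified

    def grain(self, grain_name: str) -> "Dimension":
        # Builder step: return a copy with the grain set instead of mutating.
        return replace(self, time_grain=grain_name)


ambiguous = Dimension("metric_time")
specific = Dimension("metric_time").grain("month")
print(ambiguous, specific)
```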

Ambiguous Specification of Group-By-Items

  • If the grain of a time dimension in a query is not specified, then the grain of the requested time dimension is resolved to be the finest grain that is available for the queried metrics.
  • In a metric filter, if a time dimension does not specify the grain, and all semantic models used to compute the metric define the time dimension with the same grain, MetricFlow should assume that the time dimension has that grain.
  • When querying for a group-by-item, the entity links in the input no longer need to specify the full join path. If there is a shortest and unique entity-join path between the measure source and the dimension source, only the primary entity is required.
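
The entity-link rule can be read as "pick the shortest join path, but only if exactly one path has that length". The sketch below encodes that reading; the function name and the tuple representation of a join path are assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

EntityPath = Tuple[str, ...]


def resolve_entity_path(candidate_paths: Sequence[EntityPath]) -> Optional[EntityPath]:
    """Return the shortest join path if it is unique, otherwise None."""
    if not candidate_paths:
        return None
    shortest_length = min(len(path) for path in candidate_paths)
    shortest = [path for path in candidate_paths if len(path) == shortest_length]
    # Two or more paths of the same minimal length remain ambiguous.
    return shortest[0] if len(shortest) == 1 else None


print(resolve_entity_path([("listing",), ("guest", "listing")]))
print(resolve_entity_path([("guest", "listing"), ("host", "listing")]))
```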

Challenges

  • Multiple ways of specifying a group-by-item need to be consolidated before resolving the ambiguity, to reduce complex branching and the testing load.
    • In filters, the object builder format (e.g. {{ Dimension('listing__country') }} ) is used.
    • On the command line, the dunder format (e.g. listing__country) is used.
    • For saved queries, the object builder format is used.
  • Metrics defined from other metrics produce a recursive structure that makes it complex to resolve ambiguous group-by-items.
  • If the specified group-by-item is ambiguous and it cannot be resolved by MetricFlow according to the rules above, error messages with sufficient context that explain the problem should be generated for the user. The context should relate to the recursive nature of the metric definition. e.g.

The ambiguous group-by-item Dimension('user__country') cannot be resolved for metric3 because metric3 is derived from metric1 and metric2, and metric2 does not have a queryable group-by-item that matches the request.

  • It’s likely there will be future changes to the interface, and we would want to make the querying interface amenable to those changes.
  • A common mechanism should be used to solve these problems.

Proposed Approach

To support these changes, we need a query resolver that can figure out the mapping between an ambiguous group-by-item that the user has specified to a concrete dimension in a semantic model. The proposal for building the query resolver is to:

  • Model user inputs as patterns / filters that encapsulate the desired selection behavior.
  • Model the structure of the group-by-item resolution process as a DAG using the recursive definition of metrics.
  • Resolve ambiguous group-by-items as a push-down process where candidates are adjusted as they pass from measures, to metrics, and to the input query.

Model Group-By-Item Inputs as Patterns / Filters

The first part of the proposed solution is to introduce a set of pattern classes that capture the desired request from the user. With the pattern classes, we can map varied user input (which can be strings in different naming formats or interface objects) to a single type of input into the resolver.

# Pseudocode.
class Pattern:
  def match(self, candidates: Sequence[GroupByItem]) -> Sequence[GroupByItem]:
    ...

# Map string user inputs to patterns.
"user__country" -> Pattern(
    type=ANY,
    element_name='country',
    entity_links=['user'],
    grain=ANY,
)
"TimeDimension('metric_time')" -> Pattern(
    type=TIME,
    element_name='metric_time',
    entity_links=[],
    grain=ANY,
)

# Map API objects to patterns.
TimeDimensionQueryParameter(element_name="metric_time", grain=DAY) -> Pattern(
    type=TIME,
    element_name='metric_time',
    entity_links=ANY,
    grain=DAY,
)

This layer of indirection provides a single type of input into the query resolver, which reduces the conditionals required and therefore the cases that need to be written and tested. These patterns describe how to select a group-by-item from a list of available ones.

For example:

# Pseudocode.

# From a list of available group-by-items for querying a metric:

candidates = [
  TimeDimension('metric_time', 'day'),
  TimeDimension('metric_time', 'month'),
  TimeDimension('user__created_at', 'month'),
  Dimension('listing__country'),
  Entity('listing__host'),
]

pattern = Pattern(type=ANY, element_name='metric_time', entity_links=ANY, grain=ANY)

pattern.match(candidates) == [
  TimeDimension('metric_time', 'day'),
  TimeDimension('metric_time', 'month'),
]
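
A runnable version of the matching example might look like the sketch below. The `GroupByItem` and `Pattern` dataclasses here are simplified, hypothetical shapes (with `None` standing in for `ANY`), not the classes that were eventually implemented.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GroupByItem:
    """Hypothetical concrete group-by-item."""

    item_type: str  # "dimension", "time_dimension", or "entity"
    entity_links: Tuple[str, ...]
    element_name: str
    grain: Optional[str] = None


@dataclass(frozen=True)
class Pattern:
    """Each field set to None plays the role of ANY in the pseudocode."""

    item_type: Optional[str] = None
    entity_links: Optional[Tuple[str, ...]] = None
    element_name: Optional[str] = None
    grain: Optional[str] = None

    def match(self, candidates: Sequence[GroupByItem]) -> List[GroupByItem]:
        def matches(item: GroupByItem) -> bool:
            return (
                (self.item_type is None or self.item_type == item.item_type)
                and (self.entity_links is None or self.entity_links == item.entity_links)
                and (self.element_name is None or self.element_name == item.element_name)
                and (self.grain is None or self.grain == item.grain)
            )

        # Preserve the input order so results are deterministic.
        return [item for item in candidates if matches(item)]


candidates = [
    GroupByItem("time_dimension", (), "metric_time", "day"),
    GroupByItem("time_dimension", (), "metric_time", "month"),
    GroupByItem("time_dimension", ("user",), "created_at", "month"),
    GroupByItem("dimension", ("listing",), "country"),
    GroupByItem("entity", ("listing",), "host"),
]
matched = Pattern(element_name="metric_time").match(candidates)
print(matched)
```

Keeping every field optional means a fully specified input and an ambiguous one go through the same `match` call, which is the point of the indirection.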

Once the user input is mapped to pattern instances, resolution of ambiguous group-by-items can be handled via a DAG that models the resolution behavior.

Model Group-By-Item Resolution As A DAG

The resolution of available / valid group-by-items for a metric query can be modeled as a DAG. For a node in the DAG, the parent nodes represent the set of objects that provide the valid group-by-items that are then intersected to determine the available group-by-items for that node.

More specifically, the available group-by-items for a query are the intersection of the available group-by-items for the metrics in the query. Likewise, following the recursive definition, the available group-by-items for a metric are the intersection of the available group-by-items for its constituent metrics. For a base metric, the available group-by-items are the intersection of the group-by-items available for its constituent measures.
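
The recursive definition can be sketched directly as code. The dictionaries below (`MEASURE_ITEMS`, `METRIC_MEASURES`, `METRIC_INPUTS`) hold made-up data for illustration, and group-by-items are represented as plain strings to keep the example short.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical group-by-items available for each measure.
MEASURE_ITEMS: Dict[str, FrozenSet[str]] = {
    "measure_0": frozenset({"metric_time__day", "metric_time__month", "listing__country"}),
    "measure_1": frozenset({"metric_time__month", "listing__country"}),
}
# Base metrics map to measures; derived metrics map to other metrics.
METRIC_MEASURES: Dict[str, List[str]] = {
    "simple_metric_0": ["measure_0"],
    "simple_metric_1": ["measure_1"],
}
METRIC_INPUTS: Dict[str, List[str]] = {
    "derived_metric_0": ["simple_metric_0", "simple_metric_1"],
}


def intersect_all(item_sets: Iterable[FrozenSet[str]]) -> FrozenSet[str]:
    """Intersect every set; an empty input yields an empty set."""
    sets = list(item_sets)
    return frozenset.intersection(*sets) if sets else frozenset()


def available_for_metric(metric_name: str) -> FrozenSet[str]:
    if metric_name in METRIC_MEASURES:
        # Base metric: intersect over the constituent measures.
        return intersect_all(MEASURE_ITEMS[m] for m in METRIC_MEASURES[metric_name])
    # Derived metric: recurse into the constituent metrics.
    return intersect_all(available_for_metric(m) for m in METRIC_INPUTS[metric_name])


def available_for_query(metric_names: Iterable[str]) -> FrozenSet[str]:
    return intersect_all(available_for_metric(m) for m in metric_names)


print(sorted(available_for_query(["derived_metric_0"])))
```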

The DAG helps to guide writing the recursive code to handle resolution and aids development debugging by providing a visualization of the call stack.

As an example, consider the set of metric definitions below and the associated query:

---
metric:
  name: simple_metric_0
  type: simple
  type_params:
    measure: measure_0
---
metric:
  name: simple_metric_1
  type: simple
  type_params:
    measure: measure_1
---
metric:
  name: simple_metric_2
  type: simple
  type_params:
    measure: measure_2
---
metric:
  name: derived_metric_0
  type: derived
  type_params:
    expr: simple_metric_0 + simple_metric_1
    metrics:
      - name: simple_metric_0
      - name: simple_metric_1


# Query for metrics: [derived_metric_0, simple_metric_2]

The query above would result in the resolution DAG:

[Figure: resolution_dag_complex]
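
For readers without the figure, the structure of this DAG can be reconstructed from the metric definitions. The sketch below builds it with a hypothetical `ResolutionNode` class and prints the parent-to-child edges; the node naming scheme is an assumption for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ResolutionNode:
    """Hypothetical DAG node; parents supply the candidate group-by-items."""

    name: str
    parents: List["ResolutionNode"] = field(default_factory=list)


# The metric definitions from the example, reduced to what the DAG needs.
SIMPLE_METRICS: Dict[str, str] = {
    "simple_metric_0": "measure_0",
    "simple_metric_1": "measure_1",
    "simple_metric_2": "measure_2",
}
DERIVED_METRICS: Dict[str, List[str]] = {
    "derived_metric_0": ["simple_metric_0", "simple_metric_1"],
}


def build_metric_node(metric_name: str) -> ResolutionNode:
    if metric_name in DERIVED_METRICS:
        parents = [build_metric_node(name) for name in DERIVED_METRICS[metric_name]]
    else:
        parents = [ResolutionNode(f"measure:{SIMPLE_METRICS[metric_name]}")]
    return ResolutionNode(f"metric:{metric_name}", parents)


def build_query_node(metric_names: List[str]) -> ResolutionNode:
    return ResolutionNode("query", [build_metric_node(name) for name in metric_names])


def edges(node: ResolutionNode) -> List[Tuple[str, str]]:
    """List (parent, child) edges, visiting parents before the node itself."""
    result: List[Tuple[str, str]] = []
    for parent in node.parents:
        result.extend(edges(parent))
        result.append((parent.name, node.name))
    return result


query_node = build_query_node(["derived_metric_0", "simple_metric_2"])
dag_edges = edges(query_node)
for parent_name, child_name in dag_edges:
    print(f"{parent_name} -> {child_name}")
```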

Resolve Ambiguous Group-By-Items as a Push-Down Process

As alluded to earlier, the group-by-items available for a given node in the resolution DAG are the intersection of the group-by-items available for each of the parent nodes. Resolving ambiguous group-by-items can be modeled as a push-down process where the candidate group-by-items are pushed down from the root nodes to the leaf node, and the candidates are intersected along the way.

During the push-down process, if the intersection of the candidates from the parent nodes produces an empty set, an error can be generated that includes the path from the leaf node to help the user better diagnose the issue.
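
One possible shape for that error is sketched below: the intersection step takes the path from the leaf node as an argument and includes it in the message. The exception class, the function, and the message wording are all assumptions for illustration.

```python
from typing import Dict, FrozenSet, Sequence


class GroupByItemResolutionError(Exception):
    """Raised when no candidate survives the intersection at a node."""


def intersect_parent_candidates(
    path_from_leaf: Sequence[str],
    candidates_by_parent: Dict[str, FrozenSet[str]],
) -> FrozenSet[str]:
    """Intersect the parents' candidates, or raise with the path for context."""
    candidate_sets = list(candidates_by_parent.values())
    common = frozenset.intersection(*candidate_sets) if candidate_sets else frozenset()
    if not common:
        # Name the parents that contributed nothing to help the user diagnose.
        empty_parents = [name for name, items in candidates_by_parent.items() if not items]
        raise GroupByItemResolutionError(
            "No common group-by-item at path "
            + " -> ".join(path_from_leaf)
            + f"; parents with no matching candidates: {empty_parents or 'none'}"
        )
    return common


try:
    intersect_parent_candidates(
        ["query", "metric3"],
        {"metric1": frozenset({"user__country"}), "metric2": frozenset()},
    )
except GroupByItemResolutionError as error:
    message = str(error)
    print(message)
```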

In the current proposal, the root nodes represent the measures used to compute metrics, and the leaf node is the query containing the metrics requested by the user.

Following this setup, the various conditions for ambiguous resolution can be realized by tweaking the initial set of candidate group-by-items in the root nodes (measures), and the selection behavior at the leaf node (the query).

Ambiguous Group-By-Item Specified in a Query:

  • Initial candidates at the measure nodes are all group-by-items available for a measure that match the pattern for the ambiguous group-by-item.
  • Resolution behavior at the query node / leaf node is:
    • If there is only one candidate, select it as the resolution.
    • If there are multiple time dimensions that differ only by grain, select the one with the finest grain as the resolution.
    • Otherwise, if there are multiple candidates, return an error indicating that the ambiguity cannot be resolved.
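
The three rules at the query node can be sketched as a single function. The `Candidate` class, the `GRAIN_ORDER` tuple, and the error type are hypothetical, and the grain ordering shown is an assumption about which grains exist.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed ordering from finest to coarsest grain.
GRAIN_ORDER = ("day", "week", "month", "quarter", "year")


class AmbiguousGroupByItemError(Exception):
    """Raised when the remaining candidates cannot be narrowed to one."""


@dataclass(frozen=True)
class Candidate:
    name: str
    grain: Optional[str] = None  # None for non-time items


def resolve_at_query_node(candidates: Sequence[Candidate]) -> Candidate:
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    if not unique:
        raise AmbiguousGroupByItemError("No candidates reached the query node.")
    # Rule 1: a single candidate is the resolution.
    if len(unique) == 1:
        return unique[0]
    # Rule 2: time dimensions that differ only by grain resolve to the finest.
    # A grain missing from GRAIN_ORDER would raise ValueError in this sketch.
    same_name = len({c.name for c in unique}) == 1
    if same_name and all(c.grain is not None for c in unique):
        return min(unique, key=lambda c: GRAIN_ORDER.index(c.grain))
    # Rule 3: anything else is ambiguous.
    raise AmbiguousGroupByItemError(f"Cannot choose between: {unique}")


resolved = resolve_at_query_node(
    [Candidate("metric_time", "month"), Candidate("metric_time", "day")]
)
print(resolved)
```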

A representation of the process for an ambiguous group-by-item named metric_time for metrics ['simple_metric_0', 'simple_metric_1'] in a query is shown below.

[Figure: resolution_dag_push_down]

  1. The available group-by-items for measure_0 are matched to the pattern for metric_time. The candidates at this node are [TimeDimension('metric_time', 'day'), TimeDimension('metric_time', 'month')].
  2. The available group-by-items for measure_1 are matched to the pattern for metric_time. The candidates at this node are [TimeDimension('metric_time', 'month'), TimeDimension('metric_time', 'year')].
  3. The candidate group-by-items for simple_metric_0 are the same as the parent candidates, as there is only 1 parent and no intersection is required.
  4. Likewise, the candidates for simple_metric_1 are the same as those of its single parent, measure_1.
  5. The query consists of ['simple_metric_0', 'simple_metric_1']. Intersecting the candidates from the parent nodes results in TimeDimension('metric_time', 'month') - the resolution of the ambiguous group-by-item metric_time for this query.
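
The numbered steps above can be traced with plain set operations. In this sketch the candidates are represented as `(name, grain)` tuples, which is a simplification chosen for the example.

```python
# Steps 1 and 2: candidates at the measure nodes after pattern matching.
measure_candidates = {
    "measure_0": {("metric_time", "day"), ("metric_time", "month")},
    "measure_1": {("metric_time", "month"), ("metric_time", "year")},
}

# Steps 3 and 4: each simple metric has one parent, so candidates pass through.
metric_candidates = {
    "simple_metric_0": measure_candidates["measure_0"],
    "simple_metric_1": measure_candidates["measure_1"],
}

# Step 5: the query node intersects the candidates of its parent metrics.
query_candidates = (
    metric_candidates["simple_metric_0"] & metric_candidates["simple_metric_1"]
)
print(query_candidates)
```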

Ambiguous Group-By-Item Specified in a Where-Filter:

  • Initial candidates at the measure nodes are all group-by-items available for a measure, filtered to exclude non-base grains for time dimensions, and matched to the pattern for the ambiguous group-by-item.
  • Resolution behavior at the query node / leaf node is:
    • If there is only one candidate, select it as the resolution.
    • If there are multiple candidates, return an error indicating that the ambiguity cannot be resolved.
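
The where-filter variant differs mainly in the initial candidates. The sketch below keeps only base-grain time dimensions at each measure and then requires exactly one common candidate. The `TimeCandidate` class and its `is_base_grain` flag are assumptions about how the base grain might be recorded.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


class AmbiguousFilterItemError(Exception):
    """Raised when a filter item does not resolve to exactly one candidate."""


@dataclass(frozen=True)
class TimeCandidate:
    name: str
    grain: str
    is_base_grain: bool  # True if this is the grain defined in the semantic model


def initial_filter_candidates(
    available: Sequence[TimeCandidate], element_name: str
) -> FrozenSet[Tuple[str, str]]:
    """Keep base-grain items that match the requested element name."""
    return frozenset(
        (c.name, c.grain) for c in available if c.is_base_grain and c.name == element_name
    )


def resolve_filter_item(candidate_sets: List[FrozenSet[Tuple[str, str]]]) -> Tuple[str, str]:
    common = frozenset.intersection(*candidate_sets) if candidate_sets else frozenset()
    if len(common) != 1:
        raise AmbiguousFilterItemError(f"Expected one candidate, found: {sorted(common)}")
    return next(iter(common))


measure_0_items = [
    TimeCandidate("metric_time", "day", True),
    TimeCandidate("metric_time", "month", False),
]
measure_1_items = [TimeCandidate("metric_time", "day", True)]
resolved_filter_item = resolve_filter_item(
    [
        initial_filter_candidates(measure_0_items, "metric_time"),
        initial_filter_candidates(measure_1_items, "metric_time"),
    ]
)
print(resolved_filter_item)
```

If the two measures defined `metric_time` at different base grains, the intersection would be empty and the sketch would raise, which corresponds to the rule that all semantic models must agree on the grain.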

For a query, a where-filter can occur in a few places:

  • As a filter to the entire query.
  • As a filter to an input measure as defined in a base metric.
  • As a filter to an input metric as defined in a derived metric.

The proposed approach is to collect and resolve all where-filters during the query parsing / query resolution phase, and then pass a lookup object to subsequent stages. The lookup object will ensure that the correct items will be rendered / retrieved.
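
As a sketch of what such a lookup object could look like, the example below keys each resolved item by the location of the filter and the text the user wrote, since the same ambiguous item may resolve differently in different places. All class and field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FilterLocation:
    """Hypothetical identifier for where a where-filter was defined."""

    kind: str  # "query", "input_measure", or "input_metric"
    owner: str  # metric that defines the filter, or "" for the query itself


class ResolvedFilterLookup:
    """Maps (location, item as written) to the concrete resolved item."""

    def __init__(self) -> None:
        self._resolved: Dict[Tuple[FilterLocation, str], str] = {}

    def add(self, location: FilterLocation, written_item: str, resolved_item: str) -> None:
        self._resolved[(location, written_item)] = resolved_item

    def get(self, location: FilterLocation, written_item: str) -> str:
        # A KeyError here means the item was never resolved during parsing.
        return self._resolved[(location, written_item)]


lookup = ResolvedFilterLookup()
query_location = FilterLocation("query", "")
metric_location = FilterLocation("input_measure", "simple_metric_0")
lookup.add(query_location, "metric_time", "metric_time__month")
lookup.add(metric_location, "metric_time", "metric_time__day")
print(lookup.get(query_location, "metric_time"))
print(lookup.get(metric_location, "metric_time"))
```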

Future Work

Since the resolution DAG represents how metrics are constructed from other items, there are other operations that can be performed using the DAG and that would be easier to implement because the resolution DAG is simpler than the dataflow DAG.

Common Metric Computation Optimization

Since derived metrics are computed from other metrics, it's possible that a given metric appears multiple times in a resolution DAG. It's desirable to compute a metric only once in a query for efficiency. The optimization to re-use common metric computation can be more easily implemented by representing the re-use in the resolution DAG as a common parent, instead of implementing the optimization using the dataflow DAG as is done now.

[Figure: resolution_dag_reuse]
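
One way to represent re-use as a common parent is to cache nodes by metric name while building the DAG, so that every reference to a metric points at the same node object. The sketch below shows that idea; `derived_metric_1` is a made-up metric added to create a shared parent, and measure nodes are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class MetricNode:
    """Hypothetical metric node; eq=False keeps identity-based comparison."""

    name: str
    parents: List["MetricNode"] = field(default_factory=list)


METRIC_INPUTS: Dict[str, List[str]] = {
    "derived_metric_0": ["simple_metric_0", "simple_metric_1"],
    # Hypothetical metric that uses simple_metric_0 both directly and indirectly.
    "derived_metric_1": ["derived_metric_0", "simple_metric_0"],
}


def build_node(metric_name: str, cache: Dict[str, MetricNode]) -> MetricNode:
    """Build the node for a metric, re-using any node that already exists."""
    if metric_name not in cache:
        parents = [build_node(name, cache) for name in METRIC_INPUTS.get(metric_name, [])]
        cache[metric_name] = MetricNode(metric_name, parents)
    return cache[metric_name]


node_cache: Dict[str, MetricNode] = {}
root = build_node("derived_metric_1", node_cache)
shared_via_derived = root.parents[0].parents[0]
shared_direct = root.parents[1]
print(shared_via_derived is shared_direct)
```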

Add ERD Nodes

To better represent the available group-by-items that can be retrieved for a measure, the resolution DAG can be updated to include nodes that model the entity-relationship diagram. This will aid implementation of entity roles and other related features.

Input Metric Alias Generation

There are some cases with derived metrics where using the same metric with different time offsets can produce an ambiguous column in the generated SQL if metric aliases are not provided by the user. The resolution DAG may allow for easier automatic generation of aliases in such cases.

Describe alternatives you've considered

A recursive implementation that does not create the resolution DAG. This ended up being hard to follow.

From SyncLinear.com | SL-1235

@plypaul plypaul added enhancement New feature or request triage Tasks that need to be triaged labels Nov 17, 2023
@plypaul plypaul self-assigned this Nov 17, 2023
@Jstein77 Jstein77 changed the title [WIP] [Feature] Ambiguous Group-By-Item Resolution [SL-1235] [WIP] [Feature] Ambiguous Group-By-Item Resolution Nov 21, 2023
@Jstein77 Jstein77 added High priority Created by Linear-GitHub Sync Metricflow Created by Linear-GitHub Sync Metricflow Gap Created by Linear-GitHub Sync labels Nov 21, 2023
@Jstein77

merged 2 PRs, a few more to cut this week.

plypaul added a commit that referenced this issue Nov 30, 2023
@plypaul plypaul changed the title [SL-1235] [WIP] [Feature] Ambiguous Group-By-Item Resolution [SL-1235] [Feature] Ambiguous Group-By-Item Resolution Nov 30, 2023
plypaul added a commit that referenced this issue Nov 30, 2023
courtneyholcomb pushed a commit that referenced this issue Nov 30, 2023
These pattern classes are used to model user inputs when group-by items are
specified through the query interface or specified in a filter. These patterns
allow for ambiguous user inputs of group-by items e.g. a time dimension with
an unknown grain. For that case, the ambiguous input makes it easier for the
user to author queries as figuring out the time grain requires inspection of
the configs. For more details, please see:

#887
plypaul added a commit that referenced this issue Dec 1, 2023
plypaul added a commit that referenced this issue Dec 1, 2023
plypaul added a commit that referenced this issue Dec 16, 2023
plypaul added a commit that referenced this issue Dec 16, 2023
plypaul added a commit that referenced this issue Dec 16, 2023
plypaul added a commit that referenced this issue Dec 16, 2023
plypaul added a commit that referenced this issue Dec 16, 2023
@Jstein77 Jstein77 removed Metricflow Created by Linear-GitHub Sync Metricflow Gap Created by Linear-GitHub Sync labels Feb 8, 2024