[Feature] dbt should know about metrics #4071

@drewbanin

Description

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

dbt should know about metrics. A metric is a timeseries aggregation over a table that supports zero or more dimensions. Some examples of metrics include:

  • active users
  • churn rate
  • mrr (monthly recurring revenue)

dbt should support metric definitions as a new node type. Like exposures, metrics participate in the dbt DAG and can be expressed in yaml files. By defining metrics in dbt projects, analytics engineers can encode crucial business logic in tested, version controlled code. Further, these metrics definitions can be exposed to downstream tooling to drive consistency and precision in metric reporting.

The ecosystem

There is some prior art for defining metrics in dbt projects. In particular, see

While these two implementations differ (measures vs. metrics; more on that below), there exists a need in the community for a first-class way to define these metrics in dbt code. It is really neat to see that some folks have already made these definitions possible with dbt, but it would be better if metrics were treated as well-defined nodes with field validation and helpful utilities inside of dbt Core.

Specification

A metric is a timeseries aggregation over a table that supports zero or more dimensions. These metrics can be encoded in schema.yml files. In the example below, a new_customers metric is defined as a count of customer records created in a given time grain.

# models/marts/product/schema.yml

version: 2

models:
 - name: dim_customers
   ...

metrics:
  - name: new_customers
    label: New Customers
    model: dim_customers
    description: "The number of paid customers who are using the product"

    type: count
    sql: user_id # superfluous here, but shown as an example

    timestamp: signup_date
    time_grains: [day, week, month]

    dimensions:
      - plan
      - country
    
    filters:
      - field: is_paying
        value: true

    meta: {}

Given this information, a downstream process (or a dbt macro!) can generate a SQL SELECT statement that correctly calculates this metric with a specified time grain and set of dimensions; a sketch of that generation follows the table. Here is a breakdown of supported fields:

| Field | Description | Example | Required? |
| --- | --- | --- | --- |
| name | A unique identifier for the metric | new_customers | yes |
| model | The dbt model that powers this metric | dim_customers | yes |
| label | A short name / label for the metric | New Customers | no |
| description | Long-form, human-readable description of the metric | The number of customers who.... | no |
| type | The type of calculation to perform when evaluating the metric | count_distinct | yes |
| sql | The expression to aggregate/calculate over | user_id | yes |
| timestamp | The time-based component of the metric | signup_date | yes |
| time_grains | One or more "grains" at which the metric can be evaluated | [day, week, month] | yes |
| dimensions | A list of dimensions to group or filter the metric by | [plan, country] | no |
| filters | A list of filters to apply before calculating the metric | See below | no |
| meta | Arbitrary key/value store | {team: Finance} | no |
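
To make that generation step concrete, here is a minimal sketch in Python. This is not dbt Core code: the render_metric_sql helper, the date_trunc call, and the rendered SQL shape are all assumptions about what a downstream consumer of these definitions might emit.

# Hypothetical sketch of metric-to-SQL generation; not dbt Core code
AGGREGATES = {"count": "count", "sum": "sum", "average": "avg", "min": "min", "max": "max"}

def render_metric_sql(metric, grain, dimensions=()):
    # A real implementation would resolve the model to a relation via
    # ref() and also render filters (see the Filters section below)
    assert grain in metric["time_grains"]
    assert all(d in metric.get("dimensions", []) for d in dimensions)

    if metric["type"] == "count_distinct":
        measure = f"count(distinct {metric['sql']})"
    else:
        measure = f"{AGGREGATES[metric['type']]}({metric['sql']})"

    group_cols = [f"date_trunc('{grain}', {metric['timestamp']})", *dimensions]
    select_list = ",\n    ".join([*group_cols, f"{measure} as {metric['name']}"])
    group_by = ", ".join(str(i + 1) for i in range(len(group_cols)))
    return f"select\n    {select_list}\nfrom {metric['model']}\ngroup by {group_by}"

Calling render_metric_sql with the new_customers definition above, grain "month", and dimensions ["plan"] would print:

select
    date_trunc('month', signup_date),
    plan,
    count(user_id) as new_customers
from dim_customers
group by 1, 2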

Model reference

A reference to a dbt model. This may be any materialized model or an ephemeral model. Direct table references are not allowed, and other node types (seeds, snapshots) are not supported.

Metric types

The following metric types should be supported:

  • count
  • count_distinct
  • sum
  • average
  • min
  • max

In the future, alternative metric types (ratios, deltas, etc.) should be supported under this specification.

Filters

Filters should be defined as a list of dictionaries that define predicates for the metric. Filters are ANDed together. If more complex filtering is required, users can (and should) push that logic down into the underlying model.

filters:
  - field: is_paying
    value: true
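
For illustration, a consumer could render the filter list into a WHERE clause by joining one predicate per filter with "and". A hypothetical Python sketch (the second filter is invented to show the ANDing, and the literal formatting is naive; real code would need adapter-aware quoting):

# Hypothetical rendering of metric filters into SQL predicates
filters = [{"field": "is_paying", "value": "true"}, {"field": "plan", "value": "'pro'"}]
# One predicate per filter; predicates are ANDed together
where_clause = " and ".join(f"{f['field']} = {f['value']}" for f in filters)
# -> "is_paying = true and plan = 'pro'"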

Functional requirements

  • Metrics should participate in the dbt DAG as a distinct node type
  • Metric nodes should be accessible in the dbt Core compilation context via:
    • the graph.metrics variable
    • one or more accessor functions like metrics.find_by_name('...') (exact mechanism TBD)
  • Metric nodes should be emitted into the manifest.json artifact
  • Metrics should work with partial parsing
  • Metric nodes should be supported in node selection and should be selectable with the metric: selector
    • When listing nodes, existing graph operators (+, &, etc) should be supported
  • (in a different issue) Metrics should be surfaced in the dbt Docs website

dbt Core should not, itself, evaluate or calculate metrics. Instead, dbt Core should expose the definition of metrics to downstream tools or packages for evaluation and analysis. To that end, it is critical that dbt Core provide hooks into metrics that can be leveraged both in macro code and by processes that consume dbt Core manifest.json files.
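
As a sketch of the manifest-driven path, a downstream process might read metric definitions like this. The top-level metrics key and the per-metric fields shown are assumptions about the eventual artifact schema (mirroring how exposures appear in the manifest today), not a documented contract:

import json

# Load the compiled project artifact
with open("target/manifest.json") as f:
    manifest = json.load(f)

# Assumed layout: a top-level "metrics" mapping keyed by unique id
for unique_id, metric in manifest.get("metrics", {}).items():
    print(unique_id, metric.get("label"), metric.get("time_grains"))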

Describe alternatives you've considered

Don't implement metrics as distinct node types and keep encoding them in meta properties:

  • This information is untyped and semantically unrepresented in dbt, so it would be a net improvement to instead create a first-class node type in dbt Core for these logical DAG nodes

Metrics vs. Measures

Metrics are strongly-typed objects. It is extremely common to see folks perform syntactically correct but semantically meaningless calculations over data. This looks like averaging an average, or adding two distinct counts together. You get a number back... but it's not a useful or meaningful result.

To that end, I think we should start with metrics instead of measures. The difference here (and maybe a strawperson of my own creation - tell me if you think so) is that measures are untyped aggregations, whereas metrics are rigorously defined summaries over well-defined datasets. The creation of metrics does not preclude us from teaching dbt about more generic types of aggregations in the future, but I'd prefer to start with a narrow set of functionality and expand over time. It is easy to remove constraints, but it is hard to add them 🙂

Include support for joins

  • Joins make metric calculations really complicated. dbt should absolutely know about foreign key relationships (outside of the existing relationships test) in the future, but this would be a meaningful expansion of scope for our first cut of this feature
  • While these joins would be semantically useful, they are not a blocker to defining metrics today. Join logic can be pushed down into model code (whether materialized or ephemeral). We should experiment with this single-table paradigm, see how it feels, and then consider the best approach for teaching dbt about semantic joins in the future.

Where metrics are defined

Should metrics be a property of a model? While that could be functional today, I think this would make it hard to extend metrics to work with joins (see above). Instead, declaring metrics as independent nodes that participate in the DAG is a more future-proof idea, and we'd probably do well to avoid the "patching" flow required to get schema tests (properties of models today) translated into their own independent nodes in the DAG.

Inheriting configuration from models

Should metrics be namespaced under a model? This would make it possible to define some "shared" properties for all of the metrics derived from a model (e.g. valid dimensions, the time field, supported time grains). This would be good for ergonomics, but not a huge value-add IMO. I'd like to keep this simple for the initial implementation and then make decisions like this with some more information from the community around example use-cases.

Example:

metrics:
  - model: dim_customers
    # dimensions are shared for all metrics defined in terms of this model
    dimensions:
      - country
      - plan

    definitions:
      - name: new_customers
      - name: churned_customers

SQL calculations in dimensions

Should dimensions be allowed to provide arbitrary SQL expressions? I don't think so — that SQL is best encoded in model code, and it would be confusing and dissonant to break up dimension definitions across SQL and yaml files.

Example:

metrics:
  - model: dim_customers

    # This logic should be represented in the underlying model
    dimensions:
      - field: plan_type
        sql: case when plan in ('pro', 'extra pro') then 'paid' else 'free' end
        

Who will this benefit?

Analytics engineers
- As with models, AEs will be able to define metric logic under version control. By colocating model and metric code, new metrics or changes to existing metrics can be made in a tested, versioned, documented, code-reviewed environment
- Further, dbt Core's built-in lineage can surface information about how changes to an upstream model may impact a downstream metric

BI/Analytics tooling (and therein, data consumers)
- Organizations use metrics to understand performance and make decisions. To that end, the correctness and precision of these metrics are really paramount! By defining metrics rigorously under version control, and then exposing their definitions globally, dbt Core can help ensure consistency in reporting.

The data ecosystem
- There are so many tools, both existing and yet to be created, that can benefit from an open source mechanism for defining a semantic model on top of the data warehouse. I believe that this business logic is just too valuable and strategically important for end-users to be locked up in proprietary tooling. To that end, this feature, and future types of semantic logic like this, should be addressable in an open source way

Are you interested in contributing this feature?

I sure am :)

Anything else?

From around the internet:
