Description
Is there an existing feature request for this?
- I have searched the existing issues
Describe the Feature
dbt should know about metrics. A metric is a timeseries aggregation over a table that supports zero or more dimensions. Some examples of metrics include:
- active users
- churn rate
- mrr (monthly recurring revenue)
dbt should support metric definitions as a new node type. Like exposures, metrics participate in the dbt DAG and can be expressed in yaml files. By defining metrics in dbt projects, analytics engineers can encode crucial business logic in tested, version controlled code. Further, these metrics definitions can be exposed to downstream tooling to drive consistency and precision in metric reporting.
The ecosystem
There is some prior art for defining metrics in dbt projects. In particular, see
- The MetricQL implementation (using meta fields)
- The Lightdash implementation (using meta fields)
While these two implementations differ (measures vs. metrics; more on that below), there is a clear need in the community for a first-class way to define these metrics in dbt code. It is really neat to see that some folks have already made these definitions possible with dbt, but it would be better if metrics were treated as well-defined nodes with field validation and helpful utilities inside of dbt Core.
Specification
A metric is a timeseries aggregation over a table that supports zero or more dimensions. These metrics can be encoded in schema.yml files. In the example below, a new_customers metric is defined as a count of customer records created in a given time grain.
```yaml
# models/marts/product/schema.yml
version: 2

models:
  - name: dim_customers
    ...

metrics:
  - name: new_customers
    label: New Customers
    model: dim_customers
    description: "The number of paid customers who are using the product"

    type: count
    sql: user_id # superfluous here, but shown as an example
    timestamp: signup_date
    time_grains: [day, week, month]

    dimensions:
      - plan
      - country

    filters:
      - field: is_paying
        value: true

    meta: {}
```
Given this information, a downstream process (or a dbt macro!) can generate a SQL SELECT statement that correctly calculates this metric with a specified time grain and set of dimensions. Here is a breakdown of the supported fields:
| Field | Description | Example | Required? |
|---|---|---|---|
| name | A unique identifier for the metric | new_customers | yes |
| model | The dbt model that powers this metric | dim_customers | yes |
| label | A short name / label for the metric | New Customers | no |
| description | Long form, human-readable description for the metric | The number of customers who.... | no |
| type | The type of calculation to perform when evaluating a metric | count_distinct | yes |
| sql | The expression to aggregate/calculate over | user_id | yes |
| timestamp | The time-based component of the metric | signup_date | yes |
| time_grains | One or more "grains" at which the metric can be evaluated | [day, week, month] | yes |
| dimensions | A list of dimensions to group or filter the metric by | [plan, country] | no |
| filters | A list of filters to apply before calculating the metric | See below | no |
| meta | Arbitrary key/value store | {team: Finance} | no |
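As a sketch of what a downstream tool (or macro) might generate from the `new_customers` definition above, evaluated at a monthly grain and grouped by `plan` — the schema/relation name is illustrative, not part of the spec:

```sql
-- Hypothetical generated query for the new_customers metric.
-- "analytics.dim_customers" stands in for the compiled model relation.
select
    date_trunc('month', signup_date) as period,   -- timestamp + chosen grain
    plan,                                         -- requested dimension
    count(user_id) as new_customers               -- type + sql
from analytics.dim_customers
where is_paying = true                            -- from the metric's filters
group by 1, 2
```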
Model reference
A reference to a dbt model. This model may be any "materialized" model, or a reference to an ephemeral model. Direct table references are not allowed, and alternate node types (seeds, snapshots) are not supported.
Metric types
The following metric types should be supported:
- count
- count_distinct
- sum
- average
- min
- max
In the future, alternative metric types (ratios, deltas, etc) should be supported in this model.
Filters
Filters should be defined as a list of dictionaries that define predicates for the metric. Filters are ANDed together. If more complex filtering is required, users can (and should) push that logic down into the underlying model.
```yaml
filters:
  - field: is_paying
    value: true
```
Functional requirements
- Metrics should participate in the dbt DAG as a distinct node type
- Metric nodes should be accessible in the dbt Core compilation context via:
  - the `graph.metrics` variable
  - one or more accessor functions like `metrics.find_by_name('...')` (exact mechanism TBD)
- Metric nodes should be emitted into the `manifest.json` artifact
- Metrics should work with partial parsing
- Metric nodes should be supported in node selection and should be selectable with the `metric:` selector
  - When listing nodes, existing graph operators (`+`, `&`, etc) should be supported
- (in a different issue) Metrics should be surfaced in the dbt Docs website
dbt Core should not, itself, evaluate or calculate metrics. Instead, dbt Core should expose the definitions of metrics to downstream tools or packages for evaluation and analysis. To that end, it is critical that dbt Core provides hooks into metrics that can be leveraged both in macro code and by processes that consume dbt Core `manifest.json` files.
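To make the "hooks for downstream consumers" idea concrete, here is a minimal sketch of a consumer looking up a metric definition in a manifest-like structure. The top-level `metrics` key, the node ID format, and the node fields shown are assumptions about what the eventual artifact schema could look like, not a spec:

```python
import json

# Hypothetical accessor a downstream tool might implement over manifest.json.
# The manifest layout below is an assumption, mirroring the schema.yml example.
def find_metric_by_name(manifest, name):
    """Return the metric node with the given name, or None if absent."""
    for node in manifest.get("metrics", {}).values():
        if node.get("name") == name:
            return node
    return None

# A toy manifest fragment (assumed shape, not the real artifact schema):
manifest = {
    "metrics": {
        "metric.my_project.new_customers": {
            "name": "new_customers",
            "model": "dim_customers",
            "type": "count",
            "sql": "user_id",
            "timestamp": "signup_date",
            "time_grains": ["day", "week", "month"],
        }
    }
}

metric = find_metric_by_name(manifest, "new_customers")
print(json.dumps(metric, indent=2))
```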
Describe alternatives you've considered
Don't implement metrics as distinct node types and keep encoding them in meta properties:
- This information is untyped and semantically unrepresented in dbt, so it would be a net improvement to instead create a first-class node type in dbt Core for these logical DAG nodes
Metrics vs. Measures
Metrics are strongly-typed objects. It is extremely common to see folks perform syntactically correct but semantically meaningless calculations over data. This looks like averaging an average, or adding two distinct counts together. You get a number back... but it's not a useful or meaningful result.
To that end, I think we should start with metrics instead of measures. The difference here (and maybe a strawperson of my own creation - tell me if you think so) is that measures are untyped aggregations, whereas metrics are rigorously defined summaries over well-defined datasets. The creation of metrics does not preclude us from teaching dbt about more generic types of aggregations in the future, but I'd prefer to start with a narrow set of functionality and expand over time. It is easy to remove constraints, but it is hard to add them 🙂
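A tiny worked example of the "averaging an average" trap mentioned above, with made-up numbers: averaging per-group averages gives a syntactically valid but semantically wrong answer whenever the groups have different sizes.

```python
# Two plans of different sizes: the average of per-plan averages is not
# the overall average. Numbers are illustrative.
revenue = {"pro": [100, 100, 100, 100], "free": [0, 0]}

per_plan_avgs = [sum(vals) / len(vals) for vals in revenue.values()]
avg_of_avgs = sum(per_plan_avgs) / len(per_plan_avgs)

all_values = [x for vals in revenue.values() for x in vals]
true_avg = sum(all_values) / len(all_values)

print(avg_of_avgs)  # 50.0
print(true_avg)     # 66.66... (400 / 6)
```

A strongly-typed metric (an `average` over a known column of a known table) lets tooling refuse to re-average an already-aggregated result.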
Include support for joins
- Joins make metric calculations really complicated. dbt should absolutely know about foreign key relationships (outside of the existing `relationships` test) in the future, but this would be a meaningful expansion of scope for our first cut of this feature
- While these joins would be semantically useful, they are not a blocker to defining metrics today. Join logic can be pushed down into model code (whether materialized or ephemeral). We should experiment with this single-table paradigm, see how it feels, and then consider the best approach for teaching dbt about semantic joins in the future.
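To illustrate "pushing join logic down into model code": a hypothetical model could pre-join the tables a metric needs, so the metric itself still targets a single relation. The model and column names here are invented for the example:

```sql
-- models/marts/product/dim_customers.sql (illustrative)
-- The join lives in model code; metrics defined against dim_customers
-- never need to know about it.
select
    c.user_id,
    c.signup_date,
    c.is_paying,
    p.plan,
    p.country
from {{ ref('stg_customers') }} as c
left join {{ ref('stg_plans') }} as p
    on c.plan_id = p.plan_id
```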
Where metrics are defined
Should metrics be a property of a model? While that could be functional today, I think this would make it hard to extend metrics to work with joins (see above). Instead, declaring metrics as independent nodes that participate in the DAG is a more future-proof idea, and we'd probably do well to avoid the "patching" flow required to get schema tests (properties of models today) translated into their own independent nodes in the DAG.
Inheriting configuration from models
Should metrics be namespaced under a model? This would make it possible to define some "shared" properties for all of the metrics derived from a model (eg. valid dimensions, the time field, supported time grain). This would be good for ergonomics, but not a huge value-add IMO. I'd like to keep this simple for the initial implementation and then make decisions like this with some more information from the community around example use-cases.
Example:
```yaml
metrics:
  - model: dim_customers
    # dimensions are shared for all metrics defined in terms of this model
    dimensions:
      - country
      - plan
    definitions:
      - name: new_customers
      - name: churned_customers
```
SQL calculations in dimensions
Should dimensions be allowed to provide arbitrary SQL expressions? I don't think so — that SQL is best encoded in model code, and it would be confusing and dissonant to break up dimension definitions across SQL and yaml files.
Example:
```yaml
metrics:
  - name: dim_customers
    # This logic should be represented in the underlying model
    dimensions:
      - field: plan_type
        sql: case when plan in ('pro', 'extra pro') then 'paid' else 'free' end
```
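For contrast, here is where that `plan_type` logic would live instead: in the underlying model's SQL, leaving the yaml dimension as a plain column reference. The model name is illustrative:

```sql
-- models/marts/product/dim_customers.sql (excerpt, illustrative)
-- The bucketing logic stays in model code, so the metric's dimensions
-- list can simply say "plan_type".
select
    user_id,
    signup_date,
    plan,
    case when plan in ('pro', 'extra pro') then 'paid' else 'free' end as plan_type
from {{ ref('stg_customers') }}
```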
Who will this benefit?
Analytics engineers
- As with models, AEs will be able to define metric logic under version control. By colocating model and metric code, new metrics or changes to existing metrics can be made in a tested, versioned, documented, code reviewed environment
- Further, dbt Core's built in lineage can surface information about how changes to an upstream model may impact a downstream metric
BI/Analytics tooling (and therein, data consumers)
- Organizations use metrics to understand performance and make decisions. To that end, the correctness and precision of these metrics is really paramount! By defining metrics rigorously under version control, and then exposing their definitions globally, dbt Core can help ensure consistency in reporting.
The data ecosystem
- There are so many tools, both existing and yet to be created, that can benefit from an open source mechanism for defining a semantic model on top of the data warehouse. I believe that this business logic is just too valuable and strategically important for end-users to be locked up in proprietary tooling. To that end, this feature, and future types of semantic logic like this, should be addressable in an open source way
Are you interested in contributing this feature?
I sure am :)
Anything else?
From around the internet: