Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2070] [Feature] Materialized tests #6914

Closed
Assignees
Labels
enhancement New feature or request Team:Adapters Issues designated for the adapter area of the code

Comments

@Fleid
Copy link
Contributor

Fleid commented Feb 9, 2023

This one is about making sure tests stay relevant as we expand into low latency scenarios with Materialized Views (MVs).

Prior art can be found in dbt-materialize. More details on the benefits here.

Let's say you materialized a model with a MV, this is agg_daily_active_users:

{{
    config(
        materialized='materialized_view',
        auto_refresh=true
    )
}}

select
    date_trunc('day', event_at) as date_day,
    count(distinct user_id) as daily_active_users

from raw_app_data.events

I'm defining a test to ensure data quality:

select
    date_day,
    daily_active_users,
    current_date() as test_date
from {{ ref('agg_daily_active_users' )}}
where daily_active_users < 0

Currently, when dbt run is triggered (Wednesday the 8th, 2023 at 4pm Pacific Time), dbt will execute the underlying SQL query and returns the result in the log.

date_day daily_active_users test_date
2023-02-07 -2 2023-02-08

Ok I know about this one, there was some returns that were processed in a bizarre way, it should be solved in the coming day so I'm happy to ignore it.

That's excellent when you know the data is not moving when dbt is not looking. But what happens when it is, like it does with MVs? Then I want my tests to be run continuously. What I want is to make that test queryable so I can get the current status of my MV

I could run dbt test --select agg_daily_active_users instead. But then I'm forced to switch context between where my data is flowing (my database engine, via the materialized view) and where my test results appear. The best experience for me is to stay in the database for that.

I could run dbt test --select agg_daily_active_users --store-failures instead. But that may be slow and expensive, while making the operation a 2 step process (refresh the table, query the table).

What would be nice is to tell dbt that I want my test query materialized as a view (or a MV) instead, via a configuration:

{{ config(
    materialized="view"
) }}

select
    date_day,
    daily_active_users,
    current_date() as test_date
from {{ ref('agg_daily_active_users' )}}
where daily_active_users < 0

Which allows me to issue to debug my data with the following query:

select * from dbt_test__audit.agg_daily_active_users_negative_count

Which when I query it tomorrow will give me the following alarming situation, without having to use dbt.

date_day daily_active_users test_date
2023-02-25 -547 2023-02-09
2023-02-07 -2 2023-02-09
2023-02-06 -1500 2023-02-09

Should we deprecate store-failures?

The run level flag? Certainly not. The intent is different. With --store-failures I'm deciding at run time to persist my test results, independently of my test materializations.

The test level config? I think setting materialized="table" is equivalent to store_failures = true. I don't want to start the deprecation process of store_failures = true yet though. Let's punt that for later.

In terms of precedence of configurations:

  • the materialized configuration is top level, it takes precedence over store_failures and the flag (they are ignored)
    • if the test materialized configuration is omitted, the store_failures config will take precedence over the presence or absence of the --store-failures flag (it is ignored)
      • If the materialized configuration is omitted, or the store_failures config is none or is omitted, the resource will use the value of the --store-failures flag (current behavior)

Available materializations

We should support view and table for now.

Out of scope at the moment, we will support materialized_view at some point. Because warehouses may be able to surface the history of MVs (change log), which would allow me to get a list of all the transient errors that happened when I was not looking at the test MV.

Scope of the configuration

This configuration is a "tests config". It should be available at every level where a test configuration is offered and respect normal inheritance rules.

Describe alternatives you've considered

Just use a model instead of a test. Because that's basically what we're discussing here.

But in my opinion tests are tests not because their output is a collection of rows, a csv, a table or a view. They are tests because of their specific intent, their life cycle, and the fact they don't belong on the DAG.
I still want to be able to dbt test, even when all my tests are materialized as views.

@Fleid Fleid added enhancement New feature or request triage labels Feb 9, 2023
@Fleid Fleid added this to the v1.5 milestone Feb 9, 2023
@github-actions github-actions bot changed the title [Feature] Materialized tests [CT-2070] [Feature] Materialized tests Feb 9, 2023
@Fleid Fleid self-assigned this Feb 9, 2023
@Fleid Fleid added Team:Adapters Issues designated for the adapter area of the code and removed triage labels Feb 9, 2023
@morsapaes
Copy link

But in my opinion tests are tests not because their output is a collection of rows, a csv, a table or a view. They are tests because of their specific intent, their life cycle, and the fact they don't belong on the DAG.

Agree that it makes sense to keep tests as a separate concern, even if you can materialize them. In general, this sounds like a great add-on to the existing functionality of dbt test! Taking my maintainer hat off for a minute, I wonder if it could get really complicated to persist tests as materialized views in other databases, since the refreshing strategy might need to inherit from that of the referenced objects (though it's possible I'm just missing your point here 🫠).

In dbt-materialize, a materialized view is the preferred materialization for the store_failures configuration because it'll minimize the effort required to keep the results stored and up-to-date. This would be (roughly) equivalent to using a view+index, though that would require allowing a full config block for tests, not just a materialization config.

@dbeatty10
Copy link
Contributor

Let's suppose we replace "materialized" (how it appears above) with "store_failures_as":

In terms of precedence of configurations:

  • the store_failures_as configuration is top level, it takes precedence over store_failures and the flag (they are ignored)
    • if the test store_failures_as configuration is omitted, the store_failures config will take precedence over the presence or absence of the --store-failures flag (it is ignored)
      • If the store_failures_as configuration is omitted, or the store_failures config is none or is omitted, the resource will use the value of the --store-failures flag (current behavior)

As Florian mentioned, this feature introduces a case where both store_failures (boolean) and store_failures_as (string enum) can be specified at the same time. So we need a way to decide what to do when those values are in conflict.

Here's some pseudo code to implement the precedence outlined above, even when the values conflict with each other:

// Default value in all cases
// Initialize store_failures_as with a default value of None
store_failures_as = None

// Override the default value if storing failures
// Check if should_store_failures() is True
if should_store_failures():
    store_failures_as = 'table'

// If explicitly configured to do so, override previous default values
// Default value if storing failures
// If store_failures_as is specified, it overrides any previous value
if config_is_specified('store_failures_as'):
    store_failures_as = config_specified_value_here

@dbeatty10
Copy link
Contributor

@colin-rogers-dbt raised a relevant edge case here.

The specific concern was described as:

  • Users will be unable to turn off storing failures in the model config if they specify store_failures_as at a higher level

This is essentially the following question:

  • If there's a project-wide specification of store_failures_as as view, is there any way to turn it off for a particular model?

Read on for a proposed UX for how to turn it off at a granular level.

UX

We would need some valid value that means "don't actually create any database object -- we want something only fleeting, momentary, evanescent". The value that makes most intuitive sense to me is ephemeral. Using ephemeral would align with the dictionary definition of "ephemeral" and describe a "query that runs without leaving a trace" or "a temporary object that goes away." We could theoretically choose some other value instead. See "ephemeral vs. an alternative" below for further discussion.

Assuming ephemeral, it would look like this:

Singular test configured in SQL:

# models/_models.yml

models:
  - name: my_model
    columns:
      - name: id
        tests:

          # built-in generic test
          - not_null:
              config:
                store_failures_as: ephemeral

Generic test configured in YAML:

-- tests/singular/check_duplicate_column.sql

{{ config(store_failures_as="ephemeral") }}

-- custom singular test
select 1 as id
where 1=0

ephemeral vs. an alternative

We'd want to consider two possible usages of the word "ephemeral" within dbt:

  • introspective query that hits the DWH and returns a result
  • reusable & ref'able chunk of SQL (inserted as a CTE when referenced)

Both singular and generic tests are an example of the former. The latter is described in these docs. So we'd want to update those docs to be describe and delineate each usage.

Using ephemeral would align with the dictionary definition of "ephemeral" and describe a "query that runs without leaving a trace" or "a temporary object that goes away."

@dbeatty10
Copy link
Contributor

Schrödinger's deprecation principle

  • Will we deprecate or will we not? Let the duality of Schrödinger guide our design, readying for both eventualities

Of course this a specific case of a more general design principle:

  • Embrace all potentialities in design, until the possible futures collapse into one

Applying the principle

Suppose we don't add a value that means "don't actually create any database object".

Then the only way to allow users to configure this is:

Basically you set store_failures = False and store_failures_as = None

In this case, turning off storing failures would come with at least three downsides:

  1. harder to deprecate store_failures (if we choose to do that in the future)
  2. more complicated for the user (by requiring two configs to be set)
  3. more complicated in the documentation (to explain this edge case)

It would be harder to deprecate because store_failures would continue to be load-bearing (rather than superfluous). To successfully deprecate, we'd just be back to something like the proposed UX here which avoids all three downsides.

Click to toggle config examples

Examples

Singular test configured in SQL:

# models/_models.yml

models:
  - name: my_model
    columns:
      - name: id
        tests:

          # built-in generic test
          - not_null:
              config:
                store_failures: false
                store_failures_as:

Generic test configured in YAML:

-- tests/singular/check_duplicate_column.sql

{{ config(
    store_failures=false,
    store_failures_as=none),
}}

-- custom singular test
select 1 as id
where 1=0

@colin-rogers-dbt
Copy link
Contributor

@dbeatty10 / @Fleid I think ephemeral is a great path forward here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment