
[CT-3121] [Implementation] get schema of inputs prior to executing unit test command #8649

Closed
1 task done
Tracked by #8283
graciegoheen opened this issue Sep 14, 2023 · 8 comments
Labels
enhancement New feature or request user docs [docs.getdbt.com] Needs better documentation

Comments

@graciegoheen
Contributor

graciegoheen commented Sep 14, 2023

Housekeeping

  • I am a maintainer of dbt-core

Short description

Instead of requiring folks to run dbt run --select +my_model --exclude my_model before executing unit tests on my_model, or requiring the use of deferral (which would cause issues in a CI context, where you're deferring to a production environment and may have changed your inputs as part of the PR)...

we should be able to run dbt unit-test --select my_model in a new environment without having previously run dbt run --select +my_model --exclude my_model or dbt docs generate.

Acceptance criteria

  • I can run dbt unit-test --select my_model in a new environment without having previously run dbt run --select +my_model --exclude my_model or dbt docs generate

Impact to Other Teams

None

Will backports be required?

No

Context

This is how the fixture SQL is currently being generated: adapter.get_columns_in_relation dynamically queries the existing input table (or view) during input rendering: https://github.com/dbt-labs/dbt-core/blob/unit_testing_feature_branch/core/dbt/include/global_project/macros/unit_test_sql/get_fixture_sql.sql#L6-L9
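In rough outline (a paraphrase of the linked macro, not a verbatim copy; `input_relation` is an illustrative name), the pattern is:

```sql
{#- Paraphrased sketch of the current behavior: when no data types are
    declared, introspect the already-built input relation. This live
    query is exactly why the inputs must already exist in the warehouse. -#}
{%- if not column_name_to_data_types -%}
    {%- set columns_in_relation = adapter.get_columns_in_relation(input_relation) -%}
    {%- set column_name_to_data_types = {} -%}
    {%- for column in columns_in_relation -%}
        {%- do column_name_to_data_types.update({column.name: column.dtype}) -%}
    {%- endfor -%}
{%- endif -%}
```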

@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label Sep 14, 2023
@github-actions github-actions bot changed the title [unit testing] get schema of inputs prior to executing unit test command [CT-3121] [unit testing] get schema of inputs prior to executing unit test command Sep 14, 2023
@graciegoheen graciegoheen added the enhancement New feature or request label Sep 14, 2023
@graciegoheen graciegoheen changed the title [CT-3121] [unit testing] get schema of inputs prior to executing unit test command [CT-3121] [Implementation] get schema of inputs prior to executing unit test command Sep 14, 2023
@graciegoheen
Contributor Author

graciegoheen commented Sep 18, 2023

Idea 1

If the data type is explicitly provided, let's use that:

  • data contracts
  • allow data type definition (similar to seeds) for fixtures & inputs (in yml, only relevant columns)

If there's no indication, we "guess" (similar to what we do for seeds via the agate library, or whatever the data warehouse would guess for the inputted data).
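For illustration (a hypothetical fixture rendered as plain SQL), "guessing" here just means letting the warehouse resolve the types of the mock literals:

```sql
-- Hypothetical mock-data fixture rendered as SQL: the warehouse infers
-- integer, varchar, and date from the literals themselves.
select
    1                          as order_id,
    'gerald'                   as customer_name,
    cast('2023-09-14' as date) as ordered_at
```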

But!! For columns you don't provide anything for (columns you don't care about), we can't guess.

What if we said:

  • use contracts if you don't want to have to supply data for all of your columns
  • otherwise supply mock data for all of your columns (or use NULL strings)

Future aspirations:

  • column level lineage propagated from sources (not currently possible)????

Idea 2

Can we use SQL parsing to get the types, with the schemas for sources coming from source freshness?

Idea 3

  • Cast everything as a string by default, and raise a warning (see the sketch below)
  • If the test fails, we show the CAST-ing in the diffs
  • The only way to get accurate data types is via explicit contracts for direct parents
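For illustration (hypothetical columns), the default rendering under Idea 3 would look something like:

```sql
-- Hypothetical Idea 3 rendering: every mocked value defaults to a
-- string cast, and those casts would surface in any failure diff.
select
    cast('1'          as varchar) as order_id,
    cast('2023-09-14' as varchar) as ordered_at
```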

Idea 4

What if we could somehow "clean" the SQL to only select inputted columns?

Idea 5

Can we run the catalog queries for the direct parents of the model selected for the unit test command, and make those available in the unit testing jinja context for rendering the inputs and expected outputs SQL? We could make use of the applied state work (#8648) to filter the catalog queries down to only the relevant models.

But catalog queries will fail if you haven't previously built the relevant models...

@graciegoheen
Contributor Author

graciegoheen commented Sep 19, 2023

Idea 6

cc: @jtcohen6

A dbt ci command that, in DAG order:

  • runs unit tests
  • builds models
  • runs data tests

Provided we:

  • include deferral
  • only select the state:modified+ models
  • have retry for CI
  • add an optional --sample flag to limit your data
  • allow excluding a piece of this ("don't run my data tests")

This is just dbt build + unit tests; would it be better to just do dbt build --unit-tests?

However, things that you wouldn't be able to do:

  • modify a bunch of models in dev (specifically, adding/removing/retyping columns)
  • run a single unit test in the middle of several of them
  • without materializing any of them (not even sampled)

@graciegoheen
Contributor Author

Idea 7

Can we:

  • use deferral for the data types of most of the models
  • and build a temporary --sample 0 version of all +modified models to get the data types of the relevant changed models, then drop them when we're done (sketched below)
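For illustration only (schema and column names are made up, and --sample 0 is a proposed flag, not an existing one), such a temporary build might amount to:

```sql
-- Hypothetical "--sample 0" build of a modified model: create the
-- relation with the correct column types but zero rows, introspect
-- its schema, then drop it once the unit tests are done.
create temporary table dev_schema.stg_orders as
select * from (
    -- the model's compiled SQL, with unchanged parents resolved
    -- against the production environment via deferral
    select order_id, customer_name, ordered_at
    from prod_schema.raw_orders
) as compiled_model
limit 0
```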

@graciegoheen
Contributor Author

graciegoheen commented Sep 28, 2023

Our requirement -> "We need to know the data types of our current logical state (for the direct parents of our unit tested models)"
This is because we want:

  • to be able to run unit testing in an empty warehouse (as the first step of a CI job, or in an empty development schema to save warehouse $$)
  • to only have to provide mock input data for the columns we care about (rather than all of the columns of our inputs)

Possible solutions:

  1. you give us mock data for all of your columns, and we let the warehouse "guess" their data types
  2. we use deferral + create temp tables for all of your +modified models to get the full picture of data types for your relevant models (maybe we "cache" these to save time)
  3. we only support running unit tests in the context of running your DAG (so we require that upstream models have been materialized before we run your unit test)
  4. we use column level lineage / SQL parsing from sources to get the data types of your inputs
  5. we defer to a catalog with your data types, instead of a manifest, and require that you define the data types for all of your columns in yml (a significant amount of yml to add to every model in your project that's a direct parent of a model you want to unit test)

@graciegoheen
Contributor Author

What's the simplest thing we could do here:

  • we will use the data_types defined in the direct parent's yml config
  • if no data_type is provided...
    • we will have the warehouse "presume" the data type based on the mock data provided (users could put cast statements in their fixture)
    • if no mock data is provided for a column, we will guess that the data type is a string
    • we will raise a warning during unit test execution if an input fixture specifies columns that don't have user-provided data_type values (see the sketch below)
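A hypothetical rendered input under this approach (column names and types are illustrative only):

```sql
-- Hypothetical fixture rendering under the "simplest thing" approach:
select
    cast(1 as integer)         as order_id,   -- data_type from the parent's yml
    cast('2023-09-14' as date) as ordered_at, -- user-supplied cast in the fixture
    cast(null as varchar)      as status      -- no mock data: fall back to string
```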

We will need to confirm when we should add quotes ('') around the mocked data.

Let's see how far this gets us, and if this is "good enough" or if we need to take one of the more complicated approaches listed above.

@graciegoheen
Contributor Author

graciegoheen commented Oct 3, 2023

A few follow up questions:

  • Would we still have to know what all of the columns are? So would we be requiring that they also list all of the column names in the direct parent's yml config? That seems cumbersome.

    • response: we should be able to get this information from adapter.get_column_schema_from_query without requiring users to manually input it (see the sketch after this list)
  • Does Snowflake require a typed NULL?

  • What if someone puts the wrong data_type or column name, but the contract isn't enforced? The unit test could pass, but the build could fail...

  • Should we offer multiple approaches for those that don't want us to build "temp" tables in their warehouse?

  • Let's test running a large dbt project but replacing each ref with (select * from ref limit 0), and see how long it takes
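A sketch of how the adapter.get_column_schema_from_query approach could work in the jinja context (the relation name and the limit 0 wrapper are illustrative):

```sql
{#- Sketch: derive a parent's column schema without users listing columns
    in yml. The `limit 0` query returns zero rows, so only metadata comes
    back; `stg_orders` is an illustrative relation name. -#}
{%- set schema_query -%}
    select * from {{ ref('stg_orders') }} limit 0
{%- endset -%}
{%- set columns = adapter.get_column_schema_from_query(schema_query) -%}
{%- for column in columns -%}
    {%- do log(column.name ~ ': ' ~ column.dtype, info=true) -%}
{%- endfor -%}
```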

@graciegoheen
Contributor Author

From a discussion with @dbeatty10 and @jtcohen6:

What if we supply folks 3 main options:

Option 1: supply mock data for all of your referenced columns
Option 2: supply mock data for only your relevant columns, and build the necessary upstream tables using a "sample" mode / dry run to limit results
Option 3: dbt build in DAG order, running the unit tests for a model just before the model is materialized

@graciegoheen
Contributor Author

graciegoheen commented Nov 1, 2023

@graciegoheen graciegoheen closed this as not planned Nov 1, 2023