There are two main high-level implementation approaches for unit testing in dbt:
inlining fixtures/expected outputs as CTEs in a single unit test query (to produce the 'actual' result), or
persisting all input fixtures and expected outputs, and querying the result of the model SQL run against the persisted input fixtures. Once the unit test finishes, clean up any persisted fixtures from the warehouse.
I think both are technically feasible and would actually have pretty similar implementations under the hood: either using a materialization that leverages existing ephemeral logic for the 'CTE trickery' route, or actually materializing inputs and the 'actual' test model in the warehouse using the existing seed materialization.
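To make the 'CTE trickery' route concrete, here is an illustrative sketch of what a compiled unit test query might look like (this is not dbt's actual generated SQL; table and column names are made up for the example):

```sql
-- Hypothetical compiled unit test query (CTE-based strategy).
-- Fixture rows for each ref()/source() input are inlined as CTEs,
-- the model SQL runs against them, and the diff is computed inline.
with input_orders as (
    select 1 as order_id, 100 as amount
    union all
    select 2 as order_id, 200 as amount
),
expected as (
    select 300 as total_amount
),
actual as (
    -- the model SQL, with ref('orders') rewritten to point at input_orders
    select sum(amount) as total_amount from input_orders
)
select * from actual
except
select * from expected
```

A passing test returns zero rows from the final `except`; any rows returned represent the diff between actual and expected.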
Tradeoffs:
Actually materializing the input/actual datasets is a more accurate representation of how the models are run in production in comparison to the CTE-based approach, and would support a larger set of SQL/dbt functionality than CTEs. For example, syntax used in sql_headers that may not be valid in a standalone query, or certain types that can be inserted but not declared in a standalone query (dbt-labs/dbt-project-evaluator#290). Do any other limitations come to mind? The tradeoff is performance: actually materializing fixtures/expected/actual in the warehouse, querying them to obtain a diff, and deleting them reliably at the end of the test run all add up to additional latency.
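For contrast, the seed-based strategy might look roughly like the following sequence of statements (again an illustrative sketch, not dbt's actual implementation; names are made up):

```sql
-- Hypothetical seed-based strategy: fixtures are materialized as real
-- (here temporary) tables, so sql_header syntax and types that can't
-- be declared in a standalone query still work.
create temporary table unit_test_input_orders as
    select 1 as order_id, 100 as amount
    union all
    select 2 as order_id, 200 as amount;

create temporary table unit_test_actual as
    -- the model SQL, with ref('orders') rewritten to the fixture table
    select sum(amount) as total_amount from unit_test_input_orders;

-- compare actual vs expected; zero rows means the test passes
select * from unit_test_actual
except
select 300 as total_amount;

-- cleanup step at the end of the test run
drop table if exists unit_test_input_orders;
drop table if exists unit_test_actual;
```

Each `create`/`drop` is a separate warehouse round trip, which is where the extra latency in this approach comes from.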
Next steps
So far we've started with the CTE approach, mostly for the sake of simplicity, but I do believe it'd be very worthwhile to spike the seed-based approach and quantify more precisely how much slower/more complex that approach would be. @gshank also suggested exploring implementing both strategies and either selecting the strategy based on user configuration or the presence of certain conditions (e.g. a sql_header, or a particular type on the model being tested). I think a non-CTE strategy would also be necessary to test complex or custom materializations end-to-end (#8275 (reply in thread)).
Let's implement the seed-based strategy in a spike to understand:
its technical complexity
whether/how it deviates from the CTE-based approach
performance implications in relation to the CTE-based approach (with a remote adapter rather than dbt-postgres)
Ultimately let's use those learnings to recommend whether we should implement unit tests with:
just the CTE-based approach (as it is currently)
just the seed-based approach
a combination of both - with a top-level strategy pattern that toggles between the two under appropriate conditions or user configuration.
github-actions bot changed the title to [CT-3033] [Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based on Aug 25, 2023
From the discussion thread: #8275 (reply in thread)