[CT-3033] [Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based #8499

Open
Tracked by #8283
MichelleArk opened this issue Aug 25, 2023 · 1 comment
Labels
enhancement New feature or request
MichelleArk commented Aug 25, 2023

From the discussion thread: #8275 (reply in thread)

There are two main high-level implementation approaches for unit testing in dbt:

  1. inlining fixtures/expected outputs as CTEs in a single unit test query (to produce the 'actual' result), or
  2. persisting all input fixtures + expected outputs, and querying the result of the model SQL run against the persisted input fixtures. Once the unit test finishes, clean up any persisted fixtures from the warehouse.

I think both are technically feasible and would actually have pretty similar implementations under the hood: either using a materialization that leverages existing ephemeral logic for the 'CTE trickery' route, or actually materializing inputs and the 'actual' test model in the warehouse using the existing seed materialization.
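
To make the two routes concrete, here is a minimal sketch that runs the same model SQL both ways. It uses Python's built-in sqlite3 as a stand-in for a warehouse and is not dbt's implementation; the `orders` input, the model SQL, the fixture rows, and both function names are hypothetical.

```python
import sqlite3

# Hypothetical model SQL that reads from an input relation named "orders".
MODEL_SQL = (
    "select customer_id, sum(amount) as total "
    "from orders group by customer_id"
)

# Hypothetical fixture rows for the "orders" input: (customer_id, amount).
ORDERS_ROWS = [(1, 10), (1, 5), (2, 7)]


def run_with_cte(conn):
    """CTE route: inline the fixture as a CTE named like the input relation."""
    values = ", ".join(f"({c}, {a})" for c, a in ORDERS_ROWS)
    sql = (
        f"with orders(customer_id, amount) as (values {values}), "
        f"actual as ({MODEL_SQL}) "
        "select * from actual order by customer_id"
    )
    return conn.execute(sql).fetchall()


def run_with_persisted_fixtures(conn):
    """Seed-like route: persist the fixture, build 'actual', then clean up."""
    conn.execute("create table orders (customer_id integer, amount integer)")
    try:
        conn.executemany("insert into orders values (?, ?)", ORDERS_ROWS)
        conn.execute(f"create table actual as {MODEL_SQL}")
        return conn.execute(
            "select * from actual order by customer_id"
        ).fetchall()
    finally:
        # Drop everything the test created, even if the model SQL failed.
        conn.execute("drop table if exists actual")
        conn.execute("drop table if exists orders")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(run_with_cte(conn))
    print(run_with_persisted_fixtures(conn))
```

The CTE route issues a single query and leaves nothing behind, while the persisted route needs several statements plus reliable cleanup, which is where the extra latency discussed below would come from.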

Tradeoffs:
Actually materializing the input/actual datasets is a more accurate representation of how models run in production than the CTE-based approach, and would support a larger set of SQL/dbt functionality than CTEs. For example, syntax used in sql_headers that may not be valid in a standalone query, or certain types that can be inserted but not actually declared in a standalone query (dbt-labs/dbt-project-evaluator#290). Do any other limitations come to mind? The tradeoff is performance: actually materializing fixtures/expected/actual in the warehouse, querying them to obtain a diff, and deleting them reliably at the end of the test run all add latency.

Next steps
So far we've started with the CTE approach, mostly for the sake of simplicity, but I do believe it'd be very worthwhile to spike the seed-based approach and quantify more precisely how much slower and more complex it would be. @gshank also suggested exploring implementing both strategies and selecting the strategy based on either user configuration or the presence of certain conditions (e.g. a sql_header, or a particular type on the model being tested). I think a non-CTE strategy would also be necessary to test complex or custom materializations end-to-end (#8275 (reply in thread)).

Let's implement the seed-based strategy in a spike to understand:

  1. its technical complexity
  2. whether/how it deviates from the CTE-based approach
  3. performance implications in relation to the CTE-based approach (with a remote adapter rather than dbt-postgres)

Ultimately let's use those learnings to recommend whether we should implement unit tests with:

  1. just the CTE-based approach (as it is currently)
  2. just the seed-based approach
  3. a combination of both - with a top-level strategy pattern that toggles between the two under appropriate conditions or user configuration.
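
For option 3, the toggle could look something like the sketch below. This is only an illustration under assumptions: `ModelUnderTest`, its fields, `choose_strategy`, and the strategy names "cte" and "seed" are all hypothetical and do not exist in dbt-core.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical strategy names for this sketch.
STRATEGIES = ("cte", "seed")


@dataclass
class ModelUnderTest:
    """Hypothetical summary of the model a unit test targets."""

    name: str
    sql_header: Optional[str] = None
    custom_materialization: bool = False


def choose_strategy(model: ModelUnderTest, configured: Optional[str] = None) -> str:
    """Pick a unit test strategy; explicit user configuration wins."""
    if configured is not None:
        if configured not in STRATEGIES:
            raise ValueError(f"unknown unit test strategy: {configured!r}")
        return configured
    # A sql_header may not be valid inside a standalone CTE query.
    if model.sql_header:
        return "seed"
    # Custom materializations need real relations to be tested end-to-end.
    if model.custom_materialization:
        return "seed"
    # Otherwise default to the cheaper single-query approach.
    return "cte"


if __name__ == "__main__":
    print(choose_strategy(ModelUnderTest("orders_summary")))
    print(choose_strategy(ModelUnderTest("orders_summary", sql_header="set x = 1;")))
```

A type-based condition (the other trigger mentioned above) would slot in as one more check before the default, once the spike identifies which types actually need it.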
@github-actions github-actions bot changed the title [Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based [CT-3033] [Spike] Explore support multiple unit test materialization strategies: CTE vs 'seed'-based Aug 25, 2023
@graciegoheen graciegoheen assigned gshank and unassigned gshank Sep 12, 2023
@martynydbt martynydbt assigned gshank and unassigned gshank Sep 26, 2023
@martynydbt

this may roll to the next sprint