Run tests after model creation, persisting updates only if the tests pass #1054

Open
drewbanin opened this Issue Oct 12, 2018 · 10 comments

Comments

@drewbanin (Contributor) commented Oct 12, 2018

Feature

Feature description

There have been a bunch of suggestions in Slack to run tests during model creation, and only persist changes if the tests pass. Here is a sampling of such suggestions:

jdub:

It’d be interesting if dbt could run tests within a transformation transaction, and bail out of the transaction if the tests fail. [...] Simple things like the existing unique and null checks, but also more interesting tests like, “Did this transformation produce data within correct / sensible boundaries?”

Josh Temple:

Or create data in a staging environment, test it, and copy to production only if tests pass. That way if the tests don't pass, you have data you can reference and explore to understand why.

Leon:

I think the data lake paradigm for ETL is the ideal that we personally want to move to -> create everything in staging, run tests, update catalog conditionally on things passing

tmurphy:

The way we’re solving this is using zero-copy clones on Snowflake. Each merge request gets its own clone, and only once dbt runs successfully and passes do we merge the code in to be run on production.

Who will this benefit?

A paradigm like this would help ensure that errant transformations are caught before they are deployed to production.

Things to think about:

  • If the test fails for a model, should dbt revert the update and then SKIP all of the dependent models?
  • Should this rollback paradigm happen for every test? Or just certain designated tests?
  • How should this work for things like incremental models? They're going to be difficult to "roll back" without doing presumably expensive operations on BQ/Snowflake

There's lots more to discuss here too, so please drop your thoughts in the comments below!

@mikekaminsky (Contributor) commented Oct 13, 2018

I think this is a great idea. There are a few inter-related issues and features which I'll try to start picking apart. This would be a pretty major architecture change, so I think we will all benefit if we can get clear on the requirements, priorities, and streams of work.

Transactions

I feel strongly that DBT should be running in pseudo-transactions (henceforth "transactions") when it's running in production (i.e., when there are downstream users querying the transformations). Here's why: in the middle of a DBT run, end users could receive invalid query results if two relations have become inconsistent.

Imagine you're calculating an order rate among users as the share of users that have an order. If this query gets executed after the users relation has been created but before the orders relation has finished, you're going to get misleading results.
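
A hypothetical illustration of the failure mode (the analytics schema and table names here are assumed, not from the thread): a consumer query like the one below, issued mid-run, would mix a freshly rebuilt users relation with a stale orders relation.

```sql
-- If analytics.users has already been rebuilt but analytics.orders has not,
-- the numerator and denominator come from inconsistent snapshots and the
-- resulting rate is misleading.
select
    count(distinct o.user_id)::float / count(distinct u.id) as order_rate
from analytics.users as u
left join analytics.orders as o
    on o.user_id = u.id;
```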

Proposed architecture:

  • All transformations get written into staging schemas like $staging_my_schema_name
  • Once the DBT run completes, you do the following (maybe in a db-level transaction for seamlessness):
    • All "live" relations get renamed to "$old_"
    • All of the new relations get renamed into their production namespaces
    • All of the "$old_" relations get dropped

In this world, end users will never be querying data that are part old and part new.
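
A rough Postgres-flavored sketch of that swap step (all schema and relation names here are hypothetical; Redshift notably can't rename relations across schemas, so it would need an adapter-specific equivalent of the promotion step):

```sql
begin;

-- 1. Move the current "live" relations aside.
alter table my_schema.users  rename to "$old_users";
alter table my_schema.orders rename to "$old_orders";

-- 2. Promote the freshly built staging relations into the production schema.
alter table "$staging_my_schema".users  set schema my_schema;
alter table "$staging_my_schema".orders set schema my_schema;

-- 3. Drop the previous versions.
drop table my_schema."$old_users" cascade;
drop table my_schema."$old_orders" cascade;

commit;
```

End users querying my_schema only ever see the old set or the new set, never a mix.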

Concerns:

  • People who are using post-hooks for managing permissions may end up wanting to tweak when these permissions are getting set.

DBT Run Modes

As I alluded to above, you probably do not want this transaction-like behavior when you're developing. To support this, I'd recommend implementing DBT run "modes" that let users control how DBT handles this transaction-like behavior.

To start, I'd propose the following modes (not mutually exclusive):

  • with-staging-schemas: this gets you the transaction-like behavior described above
  • fail-fast: stop dbt run as soon as it encounters an error (in with-staging-schemas mode, this means that the new transformations will not be promoted)
  • keep-going: inverse of fail-fast -- dbt builds as many models as it can despite any errors encountered (this is the current default behavior, I believe)

Note: I'm not at all committed to these names and I'm sure we can come up with better ones.

Tests

Once the above concepts are implemented, we can add some configuration so that tests can cause "errors" which may or may not halt a DBT run depending upon which run mode we're in.

One could imagine another configuration flag that indicates whether or not tests should be counted as "errors" for the purpose of a dbt run.

The last feature, which I expect people will want, would be the ability to configure tests with severity levels. So we could introduce "log", "warn", and "error" types (or whatever) for tests, then we could update the configuration options so that different test failure levels can be handled differently. In my mind, this feature comes last.

@mikekaminsky (Contributor) commented Oct 22, 2018

See also #1005 which is relevant.

@joshtemple (Contributor) commented Nov 14, 2018

Shooting blind here, but would it be possible to use tagging under the hood somehow to implement this? After running dbt test, dbt would tag models as failed or passed, and then the user would choose to run only models tagged as passed or failed.

This would also help in cases where you run successfully, test and find a few models failing tests, fix the issues with the models, and want to only re-run the models that ran successfully but didn't test successfully. In that case you could just run with tag failed.

@drewbanin (Contributor, Author) commented Nov 14, 2018

@joshtemple I really like the workflow you're describing! I'm unsure if tags are going to be the best way to do it, but I think this is absolutely something that dbt should support.

@drewbanin (Contributor, Author) commented Nov 26, 2018

hey @mikekaminsky -- this is a pretty significant departure from how dbt currently works. I just took another scan through here, and I think your description of an all-or-nothing build is most closely aligned with how I think this should work. There are still some unanswered questions here, and the first step in making headway on this issue is probably making sure that we have most/all of our bases covered. Some questions I have:

  1. how should incremental models work in this paradigm?
  2. what do we do on BigQuery (it does not have a rename statement)?
  3. how do we interleave tests with models?
  4. at what “level” are the old/new relations swapped? Do we rename whole schemas/datasets? Or the individual database relations?

@mikekaminsky (Contributor) commented Nov 27, 2018

Great questions @drewbanin! Going to table BigQuery for now and come back to it (I've never used it, so I'm fuzzy on a bunch of important details we'd need in order to solve this).

How should incremental models work in this paradigm?

Great question, and something I hadn't thought of yet. As I've conceptualized it, there are two issues:

  • You don't want to rebuild the whole incremental model in the staging location (defeats the purpose of an incremental model!)
  • Models downstream of the incremental relation need to query from the "updated" incremental model

Options:

  • Leave incremental models in place and update them normally

Easy, but sort of defeats the purpose of this whole exercise as it doesn't ensure consistency during your builds

  • view + UNION (sketched after this list):
    • During the build, create a "temp" table that holds all of the records that need to be inserted or updated into the incremental model.
    • In the staging schema, in a view, UNION the incremental model with the temp table (applying the appropriate WHERE clause to the incremental relation)
    • At the end of the run, apply the update / insert logic to update the incremental model appropriately

There will be some performance penalty for models that are downstream of the incremental model, but it's unclear how much (depends on how big the change is and how the data are distributed).

  • Deep copy
    • In Redshift (at least) you can use CREATE TABLE ... LIKE plus an INSERT INTO ... SELECT to deep copy a table.
    • Then we can apply the update / insert in the staging location as you'd expect

This will be slow (unsure how slow?) for large tables, though notably not nearly as slow as rebuilding the whole table -- we're basically just making Redshift do a lot of extra disk I/O for nothing. Might actually be an okay compromise, but I'd want to test it with someone who has bigger data to get a sense of the performance impact.
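
For concreteness, a rough sketch of the view + UNION option above, assuming an incremental model keyed on id and filtered by event_time (every schema, table, and column name here is hypothetical):

```sql
-- 1. Build only the new/changed rows into a staging relation.
create table "$dbt_staging".events__new as
select * from source_schema.raw_events
where event_time > (select max(event_time) from prod_schema.events);

-- 2. In the staging schema, expose a view that looks like the fully updated
--    incremental model without touching the production table. Downstream
--    staging models select from this view.
create view "$dbt_staging".events as
select * from prod_schema.events
where id not in (select id from "$dbt_staging".events__new)
union all
select * from "$dbt_staging".events__new;

-- 3. Only once the whole run (and, eventually, its tests) succeeds, apply the
--    usual delete + insert against the real incremental table.
delete from prod_schema.events
where id in (select id from "$dbt_staging".events__new);

insert into prod_schema.events
select * from "$dbt_staging".events__new;
```

The deep copy option would instead copy the production table into the staging schema and apply the delete + insert there, trading the downstream view-scan penalty for extra disk I/O up front.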

How do we interleave tests with models?

I don't think we need to implement this yet. I think we should table and come back to it when we want to implement the testing part of this (phase 2).

At what “level” are the old/new relations swapped? Do we rename whole schemas/datasets? Or the individual database relations?

The way I use dbt, I could / would do it at the schema level (dbt-managed schemas are wholly separate from non-dbt-managed schemas). From what I've gathered in the Slack channel, not everyone follows this rule, so we'd probably have to do everything at the relation / model level. I'm thinking this would be trivial as long as we keep track of where everything is in the staging schema and where everything "belongs" -- but let me know if I'm missing something here!

@drewbanin (Contributor, Author) commented Nov 27, 2018

@mikekaminsky I really like your view approach to incremental models! That's a fantastic way to test changes without applying them. I think you're right about the performance penalty, but maybe that's tolerable given the other benefits of this approach. AFAIK, the deep copy approach will probably work on every database in existence (definitely BigQuery), so that could also be compelling.

I'm always super wary of modifying dbt's materialization code, as we've done seemingly innocuous but wholly incorrect things in the past. There's a whole list of things to know like:

  1. Redshift chokes if a table is dropped inside of a transaction at the same time as a user selects from that table
  2. Redshift chokes if two drop table... cascade statements run concurrently, trying to drop the same downstream view
  3. Redshift chokes if you look at it funny
  4. Snowflake's views cannot be renamed if the tables they depend on have changed in ways that make the view no longer valid
  5. BigQuery has draconian limits around DDL/DML operations, like:
     • Maximum number of table operations per day — 1,000
     • Maximum rate of dataset metadata update operations — 1 operation every 2 seconds per dataset

There is more to this list, but I just wanted to sketch out the kind of thinking that goes into a feature like this. These constraints are individually workable, but together they make it challenging to design a feature that 1) operates as intended and 2) uses each database to its fullest potential.

In the past, we've built features that weren't implemented on certain databases due to limitations of the database itself. BigQuery, for instance, only got create table as syntax within the past year! The biggest challenge here is going to be defining the desired dbt workflow, then figuring out how to make it happen for each database. So, I do really like your idea of DBT Run Modes from above. It could just be that this more advanced stuff isn't available on BQ until that particular technology reaches DDL parity with the rest of dbt's adapters.

Last, I agree that we should tackle the model building part of this first, then add testing into the mix separately. That change has more to do with how dbt's graph selector includes nodes in a given run, and should "just work" once this "atomic swap" paradigm is implemented.

PS: I think we'll need to make dbt drop these staged relations somehow. The relations from a failed run should stick around long enough to debug the problem, and then should be easy to clean up with a dbt command IMO.

@mikekaminsky (Contributor) commented Nov 28, 2018

Lots of good thoughts, @drewbanin!

I'd be interested in moving this forward without bigquery support (especially because of the draconian ddl limits!!)

You definitely have a better perspective on where / how this can go wrong given the diversity of implementations you've seen. I've only really seen a handful (and all of them on Redshift), so I just haven't seen these issues crop up. I'll defer to you on whether this is worth moving forward on!

Regarding having DBT clean up, it sort of depends on how you implement it. We could have DBT drop the old, failed relations, or we could just let those get blown away when the next successful run happens (if we build everything in a dedicated staging schema like $dbt_staging, it'd be pretty straightforward to just DROP CASCADE that schema when we want to clear out the old stuff).
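
For example, with a dedicated staging schema the cleanup could be a single statement (schema name as proposed above, quoted because of the leading $):

```sql
drop schema if exists "$dbt_staging" cascade;
```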

Another comment incoming on late-binding views ...

@mikekaminsky (Contributor) commented Nov 28, 2018

When building views in atomic mode, we need to make sure they are not late-binding. Let's imagine we have table_a and view_b in our build where view_b depends on table_a.

In atomic mode, we run CREATE TABLE $dbt_staging.table_a AS ... and then CREATE VIEW $dbt_staging.view_b AS SELECT .... FROM $dbt_staging.table_a.

At the end of the run, we run RENAME $dbt_staging.table_a TO prod_schema.table_a and RENAME $dbt_staging.view_b TO prod_schema.view_b. If the view were late-binding, after the rename it would no longer point to the correct place!

So the view would need to be "regular" in order for this to work. Since we use late-binding views to get around the DROP CASCADE that happens during a non-atomic build, I think this will be fine unless there's some other use case for late-binding views I'm not thinking of!
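
A Postgres-flavored illustration of the difference (names hypothetical): a regular view tracks its underlying table by object identity and survives the move, while a Redshift late-binding view (created WITH NO SCHEMA BINDING) resolves "$dbt_staging".table_a by name at query time and would point at the wrong -- or a missing -- relation after the swap.

```sql
-- Build into the staging schema.
create table "$dbt_staging".table_a as select 1 as id;
create view "$dbt_staging".view_b as select id from "$dbt_staging".table_a;

-- Promote (Postgres syntax; Redshift can't move relations across schemas).
alter table "$dbt_staging".table_a set schema prod_schema;
alter view "$dbt_staging".view_b set schema prod_schema;

-- A regular (bound) view now reads from prod_schema.table_a; a late-binding
-- view would still be looking for "$dbt_staging".table_a.
select * from prod_schema.view_b;
```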

@drewbanin (Contributor, Author) commented Nov 28, 2018

@mikekaminsky I totally buy the premise but I'm unsure about that implementation. For one, relations on Redshift can't be renamed across schemas :)

Will keep this in mind, but I think that we'll likely be best served by late binding views in the general case. Redshift (and Postgres) are really the only databases out there that support bound views, so I'd be hesitant to leverage them as a core part of this strategy.
