
auto full-refresh when schema changes for incremental models #320

Open
drewbanin opened this issue Mar 4, 2017 · 13 comments

Comments

@drewbanin (Contributor) commented Mar 4, 2017

No description provided.

@adamhaney commented Aug 9, 2017

I'm interested in working on this. Conceptually, I assumed that after compilation the code would need to issue a query to the target database to get the list of existing columns, compare that to the sqlparse-compiled AST of the model, and then conditionally do something like

# compare the sets of column names, not the characters of two strings
if set(db_table_columns) - set(model_columns):
    delete_existing_table()
run()

It looks like there's a before_run in dbt.node_runners.ModelRunner; would that be the correct place to "hook into" for this problem?

If this was run directly in the ModelRunner, would that take care of both the normal dbt run case and the single-model dbt run --models=$foo case? Any other edge cases I should be aware of?
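For the database side of that comparison, a minimal sketch might look like the following. This is purely illustrative: `cursor` is any DB-API cursor, and the helper names are assumptions, not dbt internals.

```python
# Illustrative sketch: fetch a table's current columns from
# information_schema and decide whether the schemas have diverged.
# These names are assumptions for this example, not dbt APIs.

def get_db_columns(cursor, schema, table):
    cursor.execute(
        "select column_name from information_schema.columns "
        "where table_schema = %s and table_name = %s",
        (schema, table),
    )
    return {row[0] for row in cursor.fetchall()}

def needs_full_refresh(db_columns, model_columns):
    # Columns that exist in the warehouse but not in the model (or the
    # reverse) mean the incremental table must be rebuilt.
    return set(db_columns) != set(model_columns)
```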

@drewbanin (Contributor, Author) commented Aug 9, 2017

@adamhaney cool! I think parsing the model SQL probably isn't the way to go here -- I can imagine things like select * being problematic.

We're in the process of making dbt less aware of the different materialization options. As such, the code for the incremental materialization is entirely defined here. So ideally, any incremental-specific code will also be defined inside of that incremental materialization definition.

If needed, we can totally make changes to dbt to provide useful context variables in the incremental materialization to support functionality like this. I've been thinking about MD5ing the model contents and comparing against the previous value stored in the graph.gpickle file. If the MD5 changed, then a full-refresh is in order. All of that logic can be encoded in jinja-land as long as the requisite context variables are available during compilation.

I'm not necessarily certain that's the best option here, but it's where my head is at. Do you buy all of that?
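A minimal sketch of that checksum comparison (pure illustration; dbt's actual context variables and graph.gpickle storage are not modeled here):

```python
import hashlib

def model_checksum(model_sql):
    # Hash the model's contents; a changed hash implies the model
    # definition changed since the last successful run.
    return hashlib.md5(model_sql.encode("utf-8")).hexdigest()

def should_full_refresh(model_sql, previous_checksum):
    # A first run (no stored value) or a changed hash both trigger a rebuild.
    return previous_checksum is None or model_checksum(model_sql) != previous_checksum
```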

@DavidAmesPup commented Mar 14, 2018

I like the idea of being able to automatically do a full refresh when a model changes. We may also want to full-refresh all downstream models too. E.g., a parent selects columns a, b, and c, then is changed to select a, b, c, and d; a child does a select * from the parent.

Having said that, I'd probably avoid doing select * in my models, so might not be required.

The proposed MD5ing of the model content and exposing the fact that it is different via jinja sounds like a great approach.

@drewbanin (Contributor, Author) commented Mar 14, 2018

thanks for your input @DavidAmesPup! I don't think this issue is on our near-term roadmap, but I think it would be super valuable nonetheless. Happy to point you (or anyone reading this comment) in the right direction if you're interested in contributing a PR!

@DavidAmesPup commented Mar 14, 2018

Contributing a PR sounds like a great way for us to get some experience with the dbt codebase during our POC phase.

Sure, please point me in the right direction. We are not a Python shop, but I'm sure we will manage :)

@DavidAmesPup commented Apr 19, 2018

Hey @drewbanin - we are going to pick this item up in our next sprint or two - do you have any pointers or design requests?

@drewbanin (Contributor, Author) commented Apr 20, 2018

hey @DavidAmesPup, awesome, glad to hear it!

To be honest, I've given this some thought and I'm not sure of the perfect answer. I'd love to hear a little more about what you're thinking too, but here are my updated thoughts:

I think the md5 option presented above is approximately correct, but I think there's a bit of an issue. The graph.gpickle file probably isn't the best way to preserve information across dbt runs. If you run dbt from Sinter, or in Docker, or in any env where the filesystem isn't persisted between runs, there will be no information about the state of the previous run to go off of.

Instead, we've been talking a lot about adding a persistence layer to dbt in the form of a database. That could either look like a SQLite db, or the data warehouse itself. This database would contain information about when models last ran, who ran them, a hash of the model SQL, etc. I think we'd be able to use this database to understand whether or not a full refresh is in order for incremental models.

Obviously adding a persistence layer is quite an undertaking! We're definitely going to work on this in the near-to-medium term, but is probably more involved than a simple graph.gpickle based implementation of this feature. You should definitely give this a shot, but I wanted to give you fair warning: I'm unsure if I'll be able to merge it. Regardless, I'd love to take a look at whatever you come up with!
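As a rough illustration of the kind of persistence layer described above, here is a SQLite sketch. The table name, columns, and helper functions are invented for this example; this is not dbt's actual design.

```python
import sqlite3

# Invented run-metadata store: one row per model, recording who last ran
# it, when, and a hash of its SQL at that time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table if not exists dbt_run_state ("
    "  model_name text primary key,"
    "  sql_md5 text,"
    "  last_run_at text,"
    "  run_by text)"
)

def record_run(conn, model, md5, who):
    # "insert or replace" keeps exactly one row per model.
    conn.execute(
        "insert or replace into dbt_run_state "
        "(model_name, sql_md5, last_run_at, run_by) "
        "values (?, ?, datetime('now'), ?)",
        (model, md5, who),
    )

def stored_md5(conn, model):
    row = conn.execute(
        "select sql_md5 from dbt_run_state where model_name = ?", (model,)
    ).fetchone()
    return row[0] if row else None
```

Comparing `stored_md5(...)` against the current model hash would then answer the "is a full refresh in order?" question without relying on a persisted filesystem.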

Happy to discuss here, and keen to hear what you think

@DavidAmesPup commented May 3, 2018

Hey @drewbanin. Agreed that graph.gpickle isn't the best place; we are currently looking at using Docker for our execution environment, which wouldn't work at all. We will also execute the same model set against multiple tenants, so the persistence of file checksums needs to be on a per-target-DW basis.

Rather than trying to hack around inside of dbt to get this to work, we are thinking of writing a simple external util that can look at a folder of models and tell which ones have changed. It will persist a hash of the previous contents to its own persistence store.

In our orchestration logic, we can run this util; if it detects any model changes, we will execute a dbt run --full-refresh with the list of models that changed, and then do a normal dbt run to get the latest data for any incremental models that weren't rebuilt.
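That external util could be sketched roughly like this (the directory layout, function names, and manifest format are assumptions for illustration, not an existing tool):

```python
import hashlib
import os

# Sketch of an external change-detection util: hash every .sql file under
# a models directory and diff against a previously stored manifest
# (a dict of relative path -> md5 hex digest).

def hash_models(models_dir):
    hashes = {}
    for root, _dirs, files in os.walk(models_dir):
        for name in files:
            if name.endswith(".sql"):
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    rel = os.path.relpath(path, models_dir)
                    hashes[rel] = hashlib.md5(f.read()).hexdigest()
    return hashes

def changed_models(current, previous):
    # Models that are new, or whose hash differs, need a --full-refresh.
    return sorted(m for m, h in current.items() if previous.get(m) != h)
```

The orchestrator would pass the changed list to dbt run --full-refresh, then do a normal dbt run, persisting the new manifest entries only for models that built successfully.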

The biggest challenge I can see is knowing which models were built successfully and which failed. Obviously, if we did a full refresh on a model and the full refresh failed, we would want to run the full refresh on that model again the next time around.

I'd rather implement this without parsing the console output from dbt. I can see a couple of options:

  • The ability for us to pass in some sort of correlation_id (invocation_id?) into dbt run, and have that exposed via a post-hook. We could then write the correlation_id into the audit logs and our process can use that to identify the last models successfully run.
  • Invoke dbt as a Python module, and have the ability to query the API to get a list of failed models.

We would like some guidance as to which you thought would be the most useful option for dbt - we can then raise an issue, submit a PR etc. We are also happy to hear other options that you may have.

Cheers,
dave

@drewbanin (Contributor, Author) commented May 3, 2018

I think that sounds like a really great solution @DavidAmesPup! You're in luck on both counts:

  1. dbt provides an invocation_id in the model context. It's a UUID that's generated for each invocation of dbt. Check out the docs here
  2. we're working hard on dbt's api, but it's not in a place where it's ready for public consumption yet. In the meantime, you can hack it using this "unofficial" api. It's pretty heavy handed, but I think it might work for your purpose:
>>> import dbt
>>> import dbt.main
>>> res, success = dbt.main.handle_and_check(['run'])
>>> res[0]
<dbt.node_runners.RunModelResult object at 0x11237b358>
>>> res[0].node
{'name': 'some_model', ........}
>>> res[0].node['config']['materialized']
'incremental'
>>> res[0].errored or res[0].failed or res[0].skipped
False
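Building on that unofficial API, here is a hedged sketch of collecting the incremental models that need another attempt. The result attributes follow the transcript above, but remember this API is unsupported and may change.

```python
def models_to_retry(results):
    # Incremental models that errored, failed, or were skipped should get
    # another --full-refresh attempt on the next orchestration pass.
    return [
        r.node["name"]
        for r in results
        if r.node["config"]["materialized"] == "incremental"
        and (r.errored or r.failed or r.skipped)
    ]
```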

We don't support this API, and it may change in a future release, but this is the best way to get pass/fail info about a run at the moment. Hope this helps!

@sphinks commented Aug 2, 2019

I'd like to revive this conversation: I'm not sure that an automatic full refresh is a good idea. It should be gated behind a flag, so the user gives explicit agreement to rebuild the whole incremental model (which can be quite a long process). To let the user know that a refresh is needed, I suggest introducing a warning indicating that the schema has changed compared to the existing schema in the database. What do you think about that as a first step?

@drewbanin (Contributor, Author) commented Aug 2, 2019

@sphinks I think that's a great idea! This approach would work well with --warn-error I think.

@sphinks commented Aug 2, 2019

@drewbanin could you guide me a little on where I should start looking in order to integrate this kind of warning? I'm not so good with Python, but I want to help.

@drewbanin (Contributor, Author) commented Aug 3, 2019

I think this might be pretty challenging to implement today! We have some code already in the incremental materialization which expands column types in the destination table if required. This could happen if dbt wants to insert a varchar(32) column into a varchar(16) column. The linked method will resize the destination column to account for the new data.

I think if I were going to implement this, I'd make a new method, migrate_destination_table, which wraps the call to expand_target_column_types. This method could:

  1. collect the schemas for both the source and target relations
  2. warn if the two schemas are incompatible (the column names or types are different)
  3. call expand_target_column_types to preserve the existing behavior
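The three steps above could be sketched roughly like this. Note that `get_columns`, the relation arguments, and the injected `expand_target_column_types` callable are stand-ins for this example, not dbt's real adapter API.

```python
import warnings

def migrate_destination_table(source, target, get_columns, expand_target_column_types):
    # Step 1: collect both schemas as {column name: data type}.
    source_schema = {c.name: c.data_type for c in get_columns(source)}
    target_schema = {c.name: c.data_type for c in get_columns(target)}
    # Step 2: warn when the column names or types differ.
    if source_schema != target_schema:
        warnings.warn(
            "Schema change detected for incremental model: "
            "{} -> {}".format(target_schema, source_schema)
        )
    # Step 3: preserve the existing column-expansion behavior.
    expand_target_column_types(source, target)
```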

If we want to get more clever with our approach, like adding or deleting columns, then we could certainly do that in this method too. This won't extend well to the approach of full-refreshing automatically, which I think is OK. The more I think about this issue, the less inclined I am to make dbt automatically full-refresh when the relevant --full-refresh flag is not provided.

Somewhat relevant: dbt's Snapshots already add columns to the destination table when columns in the source query are changed. Long-term, I'd like to combine the logic of expand_target_column_types and create_columns into a method like migrate_destination_table, then use that across incremental models and snapshots.

I don't want to discourage you from trying this out, but I do want to say: there are a lot of gnarly edge cases in this part of the codebase. The column types dbt knows about vary by database (BigQuery has STRINGs and int64s, Snowflake has varchar() and text types, for instance). If you do end up giving this a shot, please don't hesitate to let me know how I can help! Feel free to also use the #development channel in dbt Slack for more general dbt development questions if you have them :)
