Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.Sign up
Cache relations in DBT #911
Relation Caching plan
This is a general roadmap for how we can implement a relation caching mechanism for dbt adapters. Instead of
Relations will be cached on a per-schema basis.
When we create the cache, we are actually keeping track of the state of all models that have a structural dependency upon the schema.
As we perform operations, we can see what nodes will be dropped by
Really great writeup, thanks @beckjake. Some thoughts:
This was sort of a first-pass stream of consciousness response. To be sure, there's a lot of details to get right here, but I want to make sure we're thinking about this correctly at the macro-level. Let me know if you disagree with any of this: super happy to discuss!
Agree. This may inform the structure of the
I agree, but it's important that when caching "throws its hands up," it does so in a way that doesn't cause radical behavior. We have, as a rule, stayed away from SQL parsing to date, but I don't know that we'll be able to continue to do so. If we need to parse SQL to do this right, let's do it.
@beckjake I think it's important to write down the pathological cases for our materializations and caching scenarios. Materializations should, as much as possible, avoid leaving the warehouse in an incomplete state. In RDBMSes, this is easy to reason about because everything is transactional -- if the cache says a table exists, but you get an error when trying to ALTER it, you can usually roll back the entire transaction. This isn't the case for the other cloud data warehouses.
If a user of dbt were to
@beckjake to your third point:
I don't think there's a general way to find this info out cross-platform. Or, if there is, it would be no better than just running the actual introspective query to see what still exists!
I guess the only real "risk" is the risk of a query failing in the middle of a dbt run, since
I wrote this like it was a computer science paper. It's probably needlessly verbose, but I wanted to define a language that we can use to talk about caching. This is info about how we can use 1) the dbt DAG and 2) the properties of modern warehouses to cache relations intelligently:
There are three classes of database operations that are cache-mutative:
All “create” operations are “local”; that is: they only mutate the cache for the entity that is being created. Both the “drop” and “swap” classes operations are “nonlocal” in nature. For these nonlocal operations, other entities in the cache may need to be mutated as a result of the nonlocal operations. The exact cache mutation logic is dependent on 1) the type of warehouse in use and 2) the type of relations present in the database.
Dropping a relation
Database relations can be “structurally dependent” on other relations. This structural dependency is a function of one relation selecting from another relation. The term “structurally dependent” is used to denote a tighter dependency than the typical ”logical dependency” inherent in dbt.
Tables can select from other relations. When this happens, the table and the relation(s) that it selects from are logically dependent. In this scenario, the selecting table maintains its own schema and data, so there is no structural dependency present. Instead, the relationship between a table and it’s parent is a logical dependency.
Contrastingly, views can also select from other relations. A view does not maintain its own schema or data — rather, the schema and data that comprise a view are a function of 1) the view query definition and 2) the relations that it selects from. Because the definition of a view is a function of the relations that it selects from, it is said to be “structurally” dependent on its parents. More formally: structurally dependent relationships exist where the definition for one relation is inherently tied to the existence of a separate relation. Crucially, this structural dependency is recursive in nature.
Consider the following example:
Here, Table A is (by definition) not structurally dependent on any relations.
Both BigQuery and Snowflake have implemented “views” in a late-binding fashion. Views on these databases are logically dependent, but not structurally dependent.
In Redshift and Postgres, views are structurally dependent. Interestingly, Redshift supports Late Binding Views which are not structurally dependent.
When a relation is
Dropping a schema
For cache purposes, dropping a schema is equivalent to dropping all of the relations in the schema, and then dropping the schema itself. When a schema is dropped, all of the relations that structurally depend on relations inside of that schema should also be dropped.
After chatting with Jake:
The dbt graph is an approximation of the the structural dependency graph inherent in the database. In situations where users use
Instead, dbt should built a separate graph of structural dependencies by querying the database at the beginning of the run.
It's worth noting that we only have to build the graph for postgres and redshift, and for redshift we can completely ignore late-binding views (since all we really care about is "will my