
Cache relations in DBT #911

Closed
beckjake opened this Issue Aug 8, 2018 · 13 comments

beckjake commented Aug 8, 2018

Relation Caching plan

This is a general roadmap for how we can implement a relation caching mechanism for dbt adapters. Instead of list_relations() going out to the database every time, we can maintain a local cache in between adapter invocations. As long as we're careful to invalidate and update the cache as appropriate, we can save a lot of time on metadata queries.

New concepts

Relations will be cached on a per-schema basis.

When we create the cache, we are actually keeping track of the state of all models that have a structural dependency upon the schema.

  • this is basically a second DAG, much like the manifest, but instead of caring about ref it cares about actual exists-in-the-database dependencies only.
  • we have a query for this for postgres and redshift
  • we can ignore this for snowflake, bigquery, and redshift's late-binding views

As we perform operations, we can determine which nodes will be dropped by drop ... cascade (dependent views are dropped) and which nodes will be changed by alter table ... rename (views will point to the new name) by finding all dependent nodes in the DAG.

New types/modules

new type RelationsCache

Stores per-schema relations caches

methods
  • __contains__(self, schema):
    • Returns True if the schema is cached.
  • clear(self): Clears the cache for all schemas.
    • No return value
  • clear_schema(self, schema): Clears the schema if it's present, otherwise does nothing.
    • No return value.
  • rename_relation(self, from_relation, to_relation): Replace the old relation with a new one, if present. Otherwise insert it (or error?).
    • No return value.
  • clear_relation(self, relation): Invalidate a single relation by removing it from the schema cache, if present. Otherwise does nothing (or error?).
    • No return value.
  • list_relations(self, schema): See NamespaceCache.list_relations
  • get_relation(self, schema, search): See NamespaceCache.get_relation.
  • add_relation(self, relation): See NamespaceCache.add_relation.
    • Creates a new partially valid NamespaceCache if none exists.

new (internal) type NamespaceCache

Stores a single schema's relation cache. I don't think this has to be exposed anywhere outside of the RelationsCache.

attributes
  • fully_valid: bool indicating that the cache is a complete representation of the relations in the schema
methods
  • __contains__(self, relation):
    • Returns True if the relation is in the cache
  • clear_relation(self, identifier): Invalidate a single relation by removing it from the schema cache. (TODO: ensure we don't need relationship_type!)
    • No return value.
  • add_relation(self, relation): Add a new relation to the schema cache.
    • No return value.
  • list_relations(self): Get the list of relations cached for this schema
    • Returns a list of Relations
  • get_relation(self, search): Search for the relation in the cache.
    • Returns a list of Relations that match (caller must handle duplicates).
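The two types above could be sketched roughly as follows. This is a minimal illustration of the proposed API, not the eventual implementation: the Relation namedtuple is a stand-in for dbt's actual Relation class, and the "or error?" questions in the spec are resolved here as simple no-ops.

```python
from collections import namedtuple

# Stand-in for dbt's Relation; assumes schema/identifier/type fields.
Relation = namedtuple('Relation', 'schema identifier type')


class NamespaceCache:
    """Caches the relations in a single schema."""

    def __init__(self, fully_valid=False):
        # True only when the cache is a complete picture of the schema
        self.fully_valid = fully_valid
        self._relations = {}  # identifier -> Relation

    def __contains__(self, relation):
        return relation.identifier in self._relations

    def add_relation(self, relation):
        self._relations[relation.identifier] = relation

    def clear_relation(self, identifier):
        # remove-only: a missing identifier is silently ignored
        self._relations.pop(identifier, None)

    def list_relations(self):
        return list(self._relations.values())

    def get_relation(self, search):
        # returns all matches; the caller must handle duplicates
        return [r for r in self._relations.values()
                if r.identifier == search]


class RelationsCache:
    """Caches relations on a per-schema basis."""

    def __init__(self):
        self._schemas = {}  # schema name -> NamespaceCache

    def __contains__(self, schema):
        return schema in self._schemas

    def clear(self):
        self._schemas.clear()

    def clear_schema(self, schema):
        self._schemas.pop(schema, None)

    def add_relation(self, relation):
        # creates a new, partially valid NamespaceCache if none exists
        cache = self._schemas.setdefault(
            relation.schema, NamespaceCache(fully_valid=False))
        cache.add_relation(relation)

    def clear_relation(self, relation):
        if relation.schema in self._schemas:
            self._schemas[relation.schema].clear_relation(relation.identifier)

    def rename_relation(self, from_relation, to_relation):
        # from a cache perspective, a rename is a drop plus an add
        self.clear_relation(from_relation)
        self.add_relation(to_relation)

    def list_relations(self, schema):
        cache = self._schemas.get(schema)
        return cache.list_relations() if cache else []

    def get_relation(self, schema, search):
        cache = self._schemas.get(schema)
        return cache.get_relation(search) if cache else []
```

Note that rename_relation here takes the "drop the old and add the new" interpretation discussed later in the thread.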

Changes to existing code

Update DefaultAdapter and subclasses

  • We will now instantiate adapters with profile, project, and manifest in the factory
    • We'll want to make adapters singletons (classes are singletons and we use that fact)
    • If we make it a singleton, the cache can be a member of the adapter instance
      • Integration tests will have to clear this in setup/teardown, or un-set the singleton adapter.
      • We will need to create a DAG of the structural dependencies in postgres/redshift. This only impacts early-binding views, which is why we can ignore it for snowflake/bigquery and redshift's late-binding views.
  • They will now call into the RelationsCache to update it in drop_relation and rename_relation
    • add_relation is pretty trivial, it has no impact on the graph.
    • drop_relation will traverse the DAG we create above and clear all tables/views that would get hit by 'drop ... cascade'
      • note that late-binding views will not be cleared
      • this is something we'll want to unit test extensively
        - We can't just use the manifest because the manifest can include relationships that don't actually exist - for example, if a user uses ref inside a comment/conditional or if a relationship existed that the current dbt graph doesn't know about.
    • rename_relation should be simple after that - drop the old and add new.
    • check out @drewbanin's comment here for some great detail on this
  • They will look at the cache for get_relation/list_relations, catch any CacheInvalid and update the cache appropriately
  • Add an execute_model(cls, profile, schema, identifier, relation_type, sql, ...) method
    • add the new relation (it will always be an addition because we rename existing tables)
    • call execute()
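The drop_relation traversal described above might look like the sketch below. It assumes the structural-dependency DAG is held as a mapping from each relation to its direct early-binding dependents; late-binding views are simply never entered into that mapping, so they survive a cascade, matching the behavior described in the proposal. This is an illustration of the approach, not dbt's actual code.

```python
def cascade_drop(dependents, dropped):
    """Return the set of relations removed by 'drop <dropped> cascade'.

    dependents: dict mapping a relation name to the set of relation
    names structurally dependent on it (early-binding views only).
    """
    to_drop = set()
    stack = [dropped]
    while stack:
        node = stack.pop()
        if node in to_drop:
            continue
        to_drop.add(node)
        # walk to every view that would be dropped by the cascade
        stack.extend(dependents.get(node, ()))
    return to_drop
```

Every name returned would then be cleared from the cache; rename_relation can reuse this by clearing the old name and adding the new one.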

Update macros

  • Macros that create things will also need to add those relations to the cache (via execute_model, I think?):
    • create_view_as
    • create_table_as
    • create_archive_table?
drewbanin commented Aug 9, 2018

Really great writeup, thanks @beckjake. Some thoughts:

  1. I noticed that you included "truncate" in the list of cache-altering statements. Truncate will delete all of the rows in a table, but the table will still physically exist. Is there another reason to care about truncate statements here?

  2. We're almost certainly going to care about the types of relations. Most databases care about the difference between drop table public.abc and drop view public.abc, so we'll likely want to have the type of the relation handy in the cache.

  3. I don't want to try to get too cute with caching, but I do think we should leverage the properties of the DAG where appropriate. When a relation is dropped with drop ... cascade, we can statically determine which relations in the cache will need to be invalidated. Briefly, we would need to drop dependent views (not late-binding) on Redshift and Postgres. I'm not sure if it's a good idea to encode this type of logic into the caching framework, but I think that invalidating whole schemas every time a drop ... cascade is run will probably mitigate a lot of the benefit of caching in the first place!

  4. I think it would be unwise to do any sort of sql parsing at all. I think that hooks are totally appropriate for 1) granting access to relations 2) inserting records into an audit table and 3) running maintenance tasks on tables. I think that running any sort of statements that directly affect the dbt graph is an anti-pattern, and it should be ok for caching to throw its hands up in that scenario. I just don't imagine it being feasible at all for us to parse sql, understand the caching implications, and then do intelligent things in 100% of cases, especially given the different syntaxes across warehouses. If statements like this are definitely necessary, then i'd like to provide macros that return the desired SQL and also invoke the correct caching behavior.

This was sort of a first-pass stream of consciousness response. To be sure, there's a lot of details to get right here, but I want to make sure we're thinking about this correctly at the macro-level. Let me know if you disagree with any of this: super happy to discuss!

beckjake commented Aug 9, 2018

@drewbanin

  1. Nope, that was a mistake on my part.
  2. I agree. The relation field in those caches is always a Relation, which includes the relation_type field. clear_relation() does take just the identifier, but it's remove-only and only operates on the cache, so I don't think it needs the type. All it has to do is iterate over the cache members, find the entry with a matching identifier, and remove it. Is the concern that some databases allow views and tables to have the same name?
  3. Is it possible to ask the database what a cascade would delete? Or heck, even did delete? I do think this is possible though, I'll look into that some more.
  4. I'm fine with letting hooks do whatever.
cmcarthur commented Aug 9, 2018

we should leverage the properties of the DAG where appropriate.

Agree. This may inform the structure of the SchemaCache / RelationCache object -- it may end up looking more catalog-y.

I think that running any sort of statements that directly affect the dbt graph is an anti-pattern, and it should be ok for caching to throw its hands up in that scenario.

I agree, but it's important that when caching "throws its hands up," it does so in a way that doesn't cause radical behavior. We have, as a rule, stayed away from SQL parsing to date, but I don't know that we'll be able to continue to do so. If we need to parse SQL to do this right, let's do it.

cmcarthur commented Aug 9, 2018

@beckjake I think it's important to write down the pathological cases for our materializations and caching scenarios. Materializations should, as much as possible, avoid leaving the warehouse in an incomplete state. In RDBMSes, this is easy to reason about because everything is transactional -- if the cache says a table exists, but you get an error when trying to ALTER it, you can usually roll back the entire transaction. This isn't the case for the other cloud data warehouses.

If a user of dbt were to DROP SCHEMA ... CASCADE in the middle of their dbt run, I think that dbt would not do anything nasty. It'd just abort mid-run. I suspect that there are other types of queries that could cause real problems, but I don't know that for a fact. It could be that dbt would handle these situations well. Let's enumerate the error cases and only address the bad ones, and maybe we can forgo things like SQL parsing.

drewbanin commented Aug 9, 2018

@beckjake to your third point:

Is it possible to ask the database what a cascade would delete? Or heck, even did delete? I do think this is possible though, I'll look into that some more.

I don't think there's a general way to find this info out cross-platform. Or, if there is, it would be no better than just running the actual introspective query to see what still exists!

beckjake commented Aug 9, 2018

I guess the only real "risk" is the risk of a query failing in the middle of a dbt run, since drop ... cascade is only going to delete stuff and then we'd be selecting from missing tables or renaming missing tables or something. I think that would be fine, even if you can't roll back, because you could just rebuild all your models, right? It might take a long time is all.

drewbanin commented Aug 10, 2018

I wrote this like it was a computer science paper. It's probably needlessly verbose, but I wanted to define a language that we can use to talk about caching. This is info about how we can use 1) the dbt DAG and 2) the properties of modern warehouses to cache relations intelligently:

Caching Strategy

Overview

There are three classes of database operations that are cache-mutative:

  • Create a schema or relation
  • Drop a schema or relation
  • Swap a relation

All “create” operations are “local”; that is, they only mutate the cache for the entity being created. Both the “drop” and “swap” classes of operations are “nonlocal”: other entities in the cache may need to be mutated as a result. The exact cache mutation logic depends on 1) the type of warehouse in use and 2) the types of relations present in the database.

Creative operations

In the create case, a simple addition to the cache is required. No further cache mutation is required.

Destructive operations

Dropping a relation

Database relations can be “structurally dependent” on other relations. This structural dependency is a function of one relation selecting from another relation. The term “structurally dependent” is used to denote a tighter dependency than the typical ”logical dependency” inherent in dbt.

Logical Dependency

Tables can select from other relations. When this happens, the table and the relation(s) that it selects from are logically dependent. In this scenario, the selecting table maintains its own schema and data, so there is no structural dependency present. Instead, the relationship between a table and its parent is a logical dependency.

Structural Dependency

Contrastingly, views can also select from other relations. A view does not maintain its own schema or data — rather, the schema and data that comprise a view are a function of 1) the view query definition and 2) the relations that it selects from. Because the definition of a view is a function of the relations that it selects from, it is said to be “structurally” dependent on its parents. More formally: structurally dependent relationships exist where the definition for one relation is inherently tied to the existence of a separate relation. Crucially, this structural dependency is recursive in nature.

Consider the following example:

+--------------+               +------------------+          +-------------------+
|              +--------------->                  |          |                   |
| Table A      |               |   View B         +---------->    View C         |
+--------------+               +------------------+          +-------------------+

Here, Table A is (by definition) not structurally dependent on any relations.
View B is structurally dependent on Table A.
View C is structurally dependent on View B, and through recursive dependence, it is also structurally dependent on Table A.

Adapter-specific behavior

Both BigQuery and Snowflake have implemented “views” in a late-binding fashion. Views on these databases are logically dependent, but not structurally dependent.

In Redshift and Postgres, views are structurally dependent. Interestingly, Redshift supports Late Binding Views which are not structurally dependent.

When a relation is dropped with drop ... cascade, all of the structurally dependent descendants of the relation should also be marked as deleted in the cache.

Dropping a schema

For cache purposes, dropping a schema is equivalent to dropping all of the relations in the schema, and then dropping the schema itself. When a schema is dropped, all of the relations that structurally depend on relations inside of that schema should also be dropped.
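The schema-drop rule above could be sketched as follows, assuming the structural-dependency graph maps each relation (as a (schema, identifier) pair) to its direct early-binding dependents, which may live in other schemas. This is an illustrative sketch, not dbt's implementation.

```python
def schema_drop(dependents, relations, schema):
    """Relations invalidated by 'drop schema <schema> cascade'.

    dependents: dict mapping a (schema, identifier) pair to the set of
    pairs structurally dependent on it.
    relations: all known (schema, identifier) pairs.
    """
    # seed with every relation that lives in the dropped schema...
    stack = [rel for rel in relations if rel[0] == schema]
    invalidated = set()
    # ...then walk out to structural dependents in any schema
    while stack:
        node = stack.pop()
        if node in invalidated:
            continue
        invalidated.add(node)
        stack.extend(dependents.get(node, ()))
    return invalidated
```

A view in schema b that selects from a table in schema a is invalidated when schema a is dropped, even though b itself survives.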

drewbanin commented Aug 10, 2018

I didn't actually touch on the renaming piece... need to think about that one a little bit more

beckjake commented Aug 10, 2018

Renaming is just a drop + a create from a cache perspective, right?

drewbanin commented Aug 10, 2018

@beckjake that's right. I wanted to investigate if there's any sort of difference in how postgres/redshift handles the swap-and-drop if we do it inside of a transaction. I don't think it makes any difference, but I wanted to confirm.

drewbanin commented Aug 10, 2018

After chatting with Jake:

The dbt graph is an approximation of the structural dependency graph inherent in the database. In situations where users use {{ ref() }} outside of a select statement (e.g. in a comment, to hoist refs), the dbt graph and the database dependency graph might not be exactly equivalent.

Instead, dbt should build a separate graph of structural dependencies by querying the database at the beginning of the run.

beckjake commented Aug 10, 2018

It's worth noting that we only have to build the graph for postgres and redshift, and for redshift we can completely ignore late-binding views (since all we really care about is "will my drop ... cascade or alter table ... rename change anything?").
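For Postgres (and Redshift's early-binding views), the structural-dependency graph could be built from the system catalogs. The pg_depend/pg_rewrite join below is a common pattern for finding which views depend on which relations; the catalog column names are standard Postgres, but this query and the surrounding helper are a hedged sketch, not dbt's actual implementation.

```python
# Find (referenced relation, dependent view) pairs within a schema's
# relations, via the pg_depend -> pg_rewrite -> pg_class catalogs.
STRUCTURAL_DEPENDENCY_SQL = """
select distinct
    source_table.relname as referenced,
    dependent_view.relname as dependent
from pg_depend
join pg_rewrite
    on pg_depend.objid = pg_rewrite.oid
join pg_class as dependent_view
    on pg_rewrite.ev_class = dependent_view.oid
join pg_class as source_table
    on pg_depend.refobjid = source_table.oid
join pg_namespace as source_ns
    on source_table.relnamespace = source_ns.oid
where source_ns.nspname = %s
  and dependent_view.oid != source_table.oid
"""


def build_dependency_graph(cursor, schema):
    """Map each relation to the set of views structurally dependent on it.

    cursor: any DB-API cursor (e.g. psycopg2) connected to the warehouse.
    """
    cursor.execute(STRUCTURAL_DEPENDENCY_SQL, (schema,))
    graph = {}
    for referenced, dependent in cursor.fetchall():
        graph.setdefault(referenced, set()).add(dependent)
    return graph
```

Late-binding views never show up in pg_depend this way, which is exactly why they can be ignored when computing cascade effects.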

beckjake self-assigned this Aug 23, 2018

beckjake added a commit that referenced this issue Oct 12, 2018

beckjake commented Oct 15, 2018

Implemented in #1025

beckjake closed this Oct 15, 2018
