# Dynamically reference dbt models #1212
Thanks for the request @tmastny. This is something we've run into a bunch, and I agree: writing out the model names to union manually is error-prone and a chore.

I'd need to have a think about how this could work. I don't want to make dbt parse all of the models in a project twice, but I'm not sure how else to accomplish something like this. I imagine we'd need to do one pass to find all of the models, then another pass to correctly process functions like `ref()`.

Maybe there's some world in which the graph isn't fully finalized until just before "running". After parsing, dbt could conceivably create some more edges between nodes.

So, let me ponder this one for a while. Keen to hear your thoughts about any of the above!
---

I think I now understand the root of this issue: dbt has no way to dynamically generate dependencies. In other DAG tools such as make, Snakemake, and drake, dynamic dependencies are part of the design. These dynamic dependencies are found by preprocessing the workflow specification (the makefile, or the models in dbt). Let's take a look at Snakemake, which is a make-like DAG system for Python.

### Snakemake

Suppose we are in the following working directory.
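```
.
├── Snakefile
└── text
    ├── hello.txt
    └── world.txt
```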
The objective is to "build" the models in `text/` and union them together.

```python
# Snakefile: the snakemake "makefile"
path = 'text/{name}.txt'
names = glob_wildcards(path).name

rule all:
    input:
        'union.txt'

rule union:
    input:
        expand('text/{name}_compiled.txt', name=names)
    output:
        'union.txt'
    shell:
        'cat {input} > {output}'

rule build:
    input:
        path
    output:
        temp('text/{name}_compiled.txt')
    shell:
        'cat {input} > {output}'
```

This workflow specification alone doesn't determine all the dependencies, since the number of compiled files depends on the structure of the `text/` directory. For example, we can evaluate the wildcards without executing the DAG:
```bash
snakemake --dryrun
```

```
Building DAG of jobs...
Job counts:
	count	jobs
	1	all
	2	build
	1	union
	4

rule build:
    input: text/world.txt
    output: text/world_compiled.txt
    jobid: 3
    wildcards: name=world

rule build:
    input: text/hello.txt
    output: text/hello_compiled.txt
    jobid: 2
    wildcards: name=hello

rule union:
    input: text/world_compiled.txt, text/hello_compiled.txt
    output: union.txt
    jobid: 1

localrule all:
    input: union.txt
    jobid: 0
```

### dbt

In dbt, the wildcard-like functionality of Jinja templating is overloaded: the same Jinja that expands out variables also determines the shape of the DAG, via `ref`. I think your second idea is spot on. There needs to be some initial evaluation of the Jinja templating, and then a second pass to evaluate the `ref()` calls.
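For concreteness, here is a plain dbt model where `ref` plays both roles -- it fills in a relation name in the SQL and creates a graph edge (model names taken from the original request):

```sql
-- models/daily_activity.sql
-- Rendering ref() interpolates the relation name into the SQL *and*
-- registers edges: activity_comment -> daily_activity, and
-- activity_login -> daily_activity.
select * from {{ ref('activity_comment') }}
union all
select * from {{ ref('activity_login') }}
```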
---

Nice writeup, thanks @tmastny. One totally different way of pursuing something like this might be a sort of "union" materialization. Rather than collecting all of the tables to union in a single downstream model, each input model could write its own rows into a shared destination table.

Another benefit of this type of materialization is that if the logic in only one of the input models changes, only that model needs to be rebuilt; there's no reason to query all of the tables when only one of them has changed.

This approach presents other problems, but I have a feeling that they may be more tractable than dynamic graph generation at present. Do you think this approach adequately addresses your use case?
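Roughly sketched (entirely hypothetical -- no such materialization exists today, and the config keys are made up):

```sql
-- models/activity_comment.sql
-- hypothetical materialization: each input model upserts its own
-- rows into the shared daily_activity table instead of being
-- unioned together downstream
{{ config(materialized='insert_into', target='daily_activity') }}

select * from raw_app.comments  -- illustrative source table
```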
---

I'll have to do more research to understand if this could address my use case. I'm relatively new to SQL and I'm not familiar with that kind of materialization pattern.

However, I view dbt's DAG as a strength and would love to see more features around DAGs supported! Thanks for all the hard work on dbt!
---

cc @beckjake

Ok, let's figure out how to make this happen! I have a bunch of ideas bouncing around in my head, and there are a couple of constraints that will guide this feature. Thoughts:
### Proposal: `nodes.ref.resources(*selectors, resource_types=None)`

#### Overview

The `nodes.ref.resources(*selectors, resource_types=None)` function would be joined by a set of resource-specific variants:

```
nodes.ref.resources(*selectors, resource_types=None)
nodes.ref.models(*selectors)
nodes.ref.sources(*selectors)
nodes.ref.archives(*selectors)
nodes.ref.seeds(*selectors)
```

These functions should all return Relation objects that match any of the specified selectors. Example usage:
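```sql
-- union together every model tagged `activity`
-- (sketch of the proposed API -- nodes.ref.models() does not exist
--  yet, and the tag name is illustrative)
{% set relations = nodes.ref.models('tag:activity') %}

{% for relation in relations %}
select * from {{ relation }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```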
**NOTE:** I think it's important that the …
#### Node selection

The `selectors` arguments would accept dbt's existing node selection syntax -- the same selectors available on the CLI.
If it turns out that there are other selectors that are useful here (like selecting by name prefix, or regex, or similar), we should consider adding those as available selectors on the CLI too.

#### Implementation

Calls to …

#### Caveats

Ephemeral models are going to be annoying here. I personally dislike the idea of having to document this beautiful suite of functions, noting that they either return a Relation or, for ephemeral models, a reference to an injected CTE. It might be worth investigating if we can reconcile ephemeral models here.

#### Other relevant things to think about

It wouldn't be crazy to support other functions in the future, like … Would love to hear everyone's thoughts on an approach like this!
---

I use this sort of pattern a fair bit; however, I would not use this feature to dynamically get the list of models. I have done this for models that union 13 tables, and tbh writing out the names of 13 models in order to loop over a union is already very DRY -- and more so, it is explicit. The main reason I think this use case is not great is that it amounts to a hidden rule based on the existence (or not) of models within a directory. What if someone added a new model there, or deleted one? I'd want it to break if that happened, not silently succeed with bad data. Given the original example, I would just do this:
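```sql
-- sketch: explicit list of the models to union
-- (model names taken from the original example)
{% set activity_models = ['activity_comment', 'activity_login'] %}

{% for model_name in activity_models %}
select * from {{ ref(model_name) }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```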
By itself that list is not useful, so I'd also loop over it to build the union. If there are variations in the columns between source tables -- e.g. some source tables need a fake NULL column or require different column naming -- then some Jinja ternary statements can help, instead of selecting just `*`. If the rules start to get more complex, then instead of setting a list of strings up front I would use a list of dictionaries.
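For example, a sketch of that shape (the column names here are illustrative):

```sql
{% set activity_models = [
    {'name': 'activity_comment', 'has_user_id': true},
    {'name': 'activity_login',   'has_user_id': false}
] %}

{% for m in activity_models %}
select
    created_at,
    -- fake NULL column for tables that lack user_id
    {{ 'user_id' if m['has_user_id'] else 'null as user_id' }}
from {{ ref(m['name']) }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```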
On the other hand, I do see use cases for this. For example, I would like to use this feature for tests. It would be nice to define a test template or pattern to be run for all models in a directory, and additionally to be able to access config values for those models.
---

I think that's a really fair point @davehowell. I am inclined to agree with you -- making the model filepath subtly significant like this can definitely lead to confusion. I'd probably recommend that a tag selector be used here to fetch all of the models with a given tag. That feels like the right combination of "obvious" and "abstracted" to me. I think we can decouple something like …
---

As requested by @clrcrl, adding a use case here from the dbt Slack channel: https://getdbt.slack.com/archives/C0VLZPLAE/p1576167419187600

Related to @drewbanin's comment above, …
---

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
---

@drewbanin Looks like this idea went stale, but I was curious if you had thought any further about having a sub-table be able to upsert into a table, rather than forcing the union-all pattern when you want to combine multiple tables into one? I am new enough to dbt that I couldn't tell from your proposal whether it supports a dynamic referencing capability for the union all (and other patterns), or whether it would support the sub-table upsert; it seemed like the former. I was really interested in the upsert of a sub-table into a table as a pattern, for the reasons you listed in that comment (no reason to query all tables when only one table has changed), and I had similar functionality in a previous data engineering framework.
---

## Feature

### Feature description

The dbt-utils documentation notes that `get_tables_by_prefix` pairs well with the `union_tables` macro, and I agree. However, `get_tables_by_prefix` only works with tables in the database, not models within the dbt project. I propose a new feature that allows us to put model names into a list.
### Example

Suppose we have the following model directory:
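```
models
├── activity_comment.sql
├── activity_login.sql
└── daily_activity.sql
```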
`daily_activity.sql` would like to `UNION ALL` together `ref('activity_comment')` and `ref('activity_login')`. Maybe the workflow looks something like this:
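```sql
-- models/daily_activity.sql
-- hypothetical: get_models_by_prefix() does not exist in dbt or
-- dbt-utils; it is named by analogy to get_tables_by_prefix
{% set model_names = get_models_by_prefix('activity_') %}

{% for model_name in model_names %}
select * from {{ ref(model_name) }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```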
### Who will this benefit?

Anyone who wants to benefit from DRY principles. The current workflow forces users to manually write out the names of the files to be unioned, instead of depending on the natural structure of the dbt project.

@drewbanin was also interested in this feature in the dbt Slack channel, prompting me to create this feature request.