
Dynamically reference dbt models #1212

Closed
tmastny opened this issue Jan 4, 2019 · 10 comments
Labels: enhancement (New feature or request), stale (Issues that have gone stale)
@tmastny (Contributor) commented Jan 4, 2019

Feature

Feature description

The dbt-utils documentation notes that get_tables_by_prefix pairs well with the union_tables macro, and I agree. However, get_tables_by_prefix only works with tables that already exist in the database, not with models within the dbt project.

I propose a new feature that allows us to put model names into a list.

Example

Suppose we have the following model directory:

.
└── activity
    ├── activities
    │   ├── activity_comment.sql
    │   └── activity_login.sql
    └── daily_activity.sql

daily_activity.sql would like to UNION ALL ref('activity_comment') and ref('activity_login').

Maybe the workflow looks something like this:

-- in daily_activity.sql
{% for table_name in get_model_names('activities') %}

SELECT
  ...
FROM {{ ref(table_name) }}

{% if not loop.last %} UNION ALL {% endif %}
{% endfor %}
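For intuition, here is a minimal Python sketch of what a hypothetical get_model_names could do under the hood: discover model names from the file stems of the .sql files in a models subdirectory. The function name and directory layout come from the example above; this is not an existing dbt API.

```python
from pathlib import Path
import tempfile

def get_model_names(models_root, subdir):
    # Collect model names (file stems) for every .sql file in a subdirectory.
    return sorted(p.stem for p in Path(models_root, subdir).glob("*.sql"))

# Recreate the example layout in a temp directory, then discover the models.
with tempfile.TemporaryDirectory() as root:
    activities = Path(root, "activity", "activities")
    activities.mkdir(parents=True)
    for filename in ("activity_comment.sql", "activity_login.sql"):
        (activities / filename).touch()

    model_names = get_model_names(Path(root, "activity"), "activities")
    print(model_names)  # ['activity_comment', 'activity_login']
```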

Who will this benefit?

Anyone who wants to benefit from DRY principles. The current workflow forces users to manually write out the names of the files to be unioned, instead of relying on the natural structure of the dbt project.

@drewbanin was also interested in this feature in the DBT slack channel, prompting me to create this feature request.

@drewbanin (Contributor)

Thanks for the request @tmastny. This is something we've run into a bunch, and I agree: writing out the model names to union manually is error prone and a chore. I'd need to have a think about how this could work.

I don't want to make dbt parse all of the models in a project twice, but I'm not sure how else to accomplish something like this. I imagine we'd need to do one pass to find all of the models, then another pass to correctly process functions like get_model_names shown above.

Maybe there's some world in which the graph isn't fully finalized until just before "running". After parsing, dbt could conceivably create some more edges between nodes like daily_activity and the dynamic list of nodes that it would select from. In the example above, activity_comment and activity_login would need to run prior to daily_activity, then appear in the results of something like get_model_names.

So, let me ponder this one for a while. Keen to hear your thoughts about any of the above!

@tmastny (Contributor, Author) commented Jan 6, 2019

I think I now understand the root of this issue: dbt has no way to dynamically generate dependencies. In other DAG tools such as make, Snakemake, and drake, dynamic dependencies are a part of the design. These dynamic dependencies are found by preprocessing the workflow specification (the makefile, or the models in dbt).

Let's take a look at Snakemake, which is a make-like DAG system for Python.

Snakemake

Suppose we are in the following working directory.

.
├── Snakefile
└── text
    ├── hello.txt
    └── world.txt

The objective is to "build" the models in text/ and then "union" the contents into union.txt.

# Snakefile: the snakemake "makefile"
path = 'text/{name}.txt'
names = glob_wildcards(path).name

rule all:
    input:
        'union.txt'

rule union:
    input:
        expand('text/{name}_compiled.txt', name=names)
    output:
        'union.txt'
    shell:
        'cat {input} > {output}'

rule build:
    input:
        path
    output:
        temp('text/{name}_compiled.txt')
    shell:
        'cat {input} > {output}'

This workflow specification alone doesn't determine all the dependencies, since the number of compiled files depends on the structure of text/. However, Snakemake preprocesses the Snakefile and evaluates wildcards like {name}. From there, Snakemake determines the DAG.
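As a rough illustration of that preprocessing step, here is a Python sketch approximating what Snakemake's glob_wildcards does for a single {name} wildcard. This is a simplification for intuition, not Snakemake's actual implementation.

```python
import re
import tempfile
from pathlib import Path

def glob_wildcard_names(pattern, root):
    # Turn a 'text/{name}.txt'-style pattern into a regex, then collect the
    # fragment matched by {name} for every existing file under root.
    regex = re.compile(re.escape(pattern).replace(r"\{name\}", r"(?P<name>[^/]+)"))
    names = []
    for path in Path(root).rglob("*"):
        match = regex.fullmatch(path.relative_to(root).as_posix())
        if match:
            names.append(match.group("name"))
    return sorted(names)

# Recreate the example working directory and extract the wildcard values.
with tempfile.TemporaryDirectory() as root:
    Path(root, "text").mkdir()
    Path(root, "text", "hello.txt").touch()
    Path(root, "text", "world.txt").touch()
    wildcard_names = glob_wildcard_names("text/{name}.txt", root)
    print(wildcard_names)  # ['hello', 'world']
```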

For example, we can evaluate the wildcards without executing the DAG:

# in bash
snakemake --dryrun
# output
Building DAG of jobs...
Job counts:
	count	jobs
	1	all
	2	build
	1	union
	4

rule build:
    input: text/world.txt
    output: text/world_compiled.txt
    jobid: 3
    wildcards: name=world


rule build:
    input: text/hello.txt
    output: text/hello_compiled.txt
    jobid: 2
    wildcards: name=hello


rule union:
    input: text/world_compiled.txt, text/hello_compiled.txt
    output: union.txt
    jobid: 1


localrule all:
    input: union.txt
    jobid: 0

dbt

In dbt, the wildcard-like functionality of Jinja templating is overloaded: not only is Jinja used to expand variables like {name} in the example above, it is also used to determine the DAG via ref.

I think your second idea is spot on. There needs to be an initial evaluation of the Jinja templating, and then a second pass to evaluate the refs. As I mentioned above, I realize this is difficult because right now the two are so intertwined.

@drewbanin (Contributor)

Nice writeup, thanks @tmastny. One totally different way of pursuing something like this might be a sort of upsert materialization?

Rather than collecting all of the tables to union in daily_activity, it might be interesting to have sub-tables like activity_login and activity_comment upsert their data into a table. This way, each model is responsible for itself and we can side step the whole question of dynamic graph building.

Another benefit of this type of materialization is that if the logic in activity_login changes but activity_comment does not, you can just run activity_login to refresh those specific rows. The naive version would delete all of the records from activity_login before inserting the full dataset. A smarter version would work more like dbt's incremental models. I think there's a way to guarantee idempotence here.

This approach presents other problems, but I have a feeling that they may be more tractable than dynamic graph generation at present. Do you think this approach adequately addresses your use case?
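To make the idempotence point concrete, here is a small Python/SQLite sketch of the naive delete-then-insert flavor of the idea: each sub-model refreshes only its own slice of the combined table, keyed by a hypothetical source_model column, so re-running a model produces no duplicates. This is an illustration of the pattern, not dbt materialization code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table daily_activity (source_model text, user_id int)")

def upsert(conn, model_name, rows):
    # One transaction: drop the model's old slice, then re-insert it.
    with conn:
        conn.execute("delete from daily_activity where source_model = ?",
                     (model_name,))
        conn.executemany(
            "insert into daily_activity values (?, ?)",
            [(model_name, user_id) for user_id in rows],
        )

upsert(conn, "activity_login", [1, 2])
upsert(conn, "activity_comment", [3])
upsert(conn, "activity_login", [1, 2])  # re-run: idempotent, no duplicates

row_count = conn.execute("select count(*) from daily_activity").fetchone()[0]
print(row_count)  # 3
```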

@tmastny (Contributor, Author) commented Jan 6, 2019

I'll have to do more research to understand whether this could address my use case. I'm relatively new to SQL and I'm not familiar with upsert. I'm also on Redshift, which this article indicates might be an additional source of difficulty.

However, I view dbt's DAG as a strength and would love to see more features around DAGs supported!

Thanks for all the hard work on dbt!

@drewbanin drewbanin added the enhancement New feature or request label May 16, 2019
@drewbanin (Contributor)

cc @beckjake

Ok, let's figure out how to make this happen! I have a bunch of ideas bouncing around in my head, and there are a couple of constraints that will guide this feature.

Thoughts:

  • This needs to be some sort of post-processing step that happens after parsing but before running
  • We'd need to do something special for ephemeral models
  • ref returns a Relation for most models, but a CTE string name for ephemeral models
  • Do we also need to defer ephemeral CTE injection until after parsing? I think that might actually be how it already works...
  • There's merit to returning ParsedNode representations in addition to just Relations
    • I don't know that we need to do that in pursuit of this particular feature, but it's certainly adjacent!
    • This wouldn't return a Relation (or build an edge), but would instead be a sort of Manifest API
    • At some point, we'll probably want to think about how to import dbt from Python and do something similar...
  • Each of models, sources, seeds, and archives should be selectable

Proposal:

nodes.ref.resources(*selectors, resource_types=None)

Overview

The nodes.ref family of functions exist to 1) find resources that exist in the graph and 2) dynamically ref them, returning a Relation. The full family is:

nodes.ref.resources(*selectors, resource_types=None)
nodes.ref.models(*selectors)
nodes.ref.sources(*selectors)
nodes.ref.archives(*selectors)
nodes.ref.seeds(*selectors)

These functions should all return Relation objects that match any of the specified selectors. The models/sources/archives/seeds variants are sugar over nodes.ref.resources.

Example usage:

{% for model in nodes.ref.models("events.*") %}
  select * from {{ model }} {% if not loop.last %} union all {% endif %}
{% endfor %}

**NOTE:** I think it's important that the ref() (or source()) happens implicitly in the nodes.ref.resources method. We need to capture these references, then post-process them before runtime. An alternative implementation which returns resource names would be hard for two reasons:

  1. the syntax for referencing models and sources differs between resource types (ref(name) vs. source(source_name, table_name)). Pushing this logic to the caller would be unduly difficult/error prone, I think
  2. dbt needs to build graph edges in the post-processing step. Ideally, it could do this directly from the arguments passed to invocations of nodes.ref.models. An alternative implementation which returned model names and left refing to the caller would need to re-parse the whole model!

Node selection

The nodes.ref.* family of functions should accept a series of node selectors as varargs. These node selectors should use the same selection syntax as provided on the CLI. This includes:

  • selecting by fqn (eg. package.*, package.dir.*, dir.path.to.models)
  • selecting by tag (eg. event)
  • selecting by name (eg. stg_snowplow_event)
  • selecting by source (eg. source:snowplow.*+)

If it turns out that there are other selectors that are useful here (like selecting by name prefix, or regex, or similar), we should consider adding those as available selectors on the CLI too.
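As a sketch of how selector matching might work, here is a toy Python version that handles an fqn-style wildcard selector and a tag: selector over a few fake parsed nodes. The node shape and matching rules are assumptions for illustration; dbt's real node-selection logic is considerably more involved.

```python
from fnmatch import fnmatch

# Fake parsed nodes; the shape (fqn list plus tags) is an assumption for
# illustration, not dbt's real manifest format.
nodes = [
    {"name": "activity_comment",
     "fqn": ["my_project", "activity", "activities", "activity_comment"],
     "tags": ["event"]},
    {"name": "activity_login",
     "fqn": ["my_project", "activity", "activities", "activity_login"],
     "tags": ["event"]},
    {"name": "daily_activity",
     "fqn": ["my_project", "activity", "daily_activity"],
     "tags": []},
]

def select(nodes, selector):
    if selector.startswith("tag:"):
        tag = selector[len("tag:"):]
        return [n["name"] for n in nodes if tag in n["tags"]]
    # fqn-style selector: match against the dotted fqn, allowing * wildcards.
    return [n["name"] for n in nodes
            if fnmatch(".".join(n["fqn"]), "*" + selector)]

print(select(nodes, "activities.*"))  # ['activity_comment', 'activity_login']
print(select(nodes, "tag:event"))     # ['activity_comment', 'activity_login']
```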

Implementation

Calls to nodes.ref.resources should return an empty list at parse-time. The function call should be stored for post-processing. When the function is called at run-time, dbt should evaluate the specified selectors and return a list of matched Relations.

Caveats

Ephemeral models are going to be annoying here. I personally dislike the idea of having to document this beautiful suite of functions, noting that they either return a Relation for literally every resource type, or a string for ephemeral models :)

It might be worth investigating if we can reconcile ephemeral models here. A ref of an ephemeral model returns the name of a CTE as a string. Maybe we can encode this as a Relation with its type set to CTE, excluding a database and schema. We might already be doing something like this in some places?
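A minimal sketch of that reconciliation, assuming an illustrative (not actual) Relation shape: an ephemeral model's CTE becomes a Relation whose type is "cte" and which renders as a bare, unqualified name, while ordinary relations render fully qualified. The field names and the CTE identifier here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative Relation shape; field names and the "cte" type are
# assumptions, not dbt's actual Relation API.
@dataclass
class Relation:
    identifier: str
    type: str = "table"
    database: Optional[str] = None
    schema: Optional[str] = None

    def render(self):
        if self.type == "cte":
            # Ephemeral models: a bare CTE name, no database/schema.
            return self.identifier
        return ".".join(p for p in (self.database, self.schema, self.identifier) if p)

table = Relation("daily_activity", "table", "analytics", "prod")
cte = Relation("__cte__activity_login", "cte")
print(table.render())  # analytics.prod.daily_activity
print(cte.render())    # __cte__activity_login
```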

Other relevant things to think about

It wouldn't be crazy to support other functions in the future, like nodes.find.resources, which returns ParsedNode objects (or their dict representations). This would not build an edge in the graph, and would probably be most useful in contexts like run-operations

Would love to hear everyone's thoughts on an approach like this!

@davehowell commented Sep 18, 2019

I use this sort of pattern a fair bit; however, I would not use this feature to dynamically get the list of models. I have done this for models that union 13 tables, and tbh writing out the names of 13 models in order to loop over a union is already quite DRY, and more importantly it is explicit.

The main reason I think this use case is not great is that it amounts to a hidden rule based on the existence or not of models within a directory. What if someone added a new model there, or deleted one? I'd want it to break if that happened, not silently succeed with bad data.

Given the original example:

.
└── activity
    ├── activities
    │   ├── activity_comment.sql
    │   └── activity_login.sql
    └── daily_activity.sql

I would just do this:

{%-
    set source_models = [
         'activity_comment',
         'activity_login'
    ]
-%}

WITH all_sources AS (
{% for source in source_models %}
    SELECT
       col1
      ,col2
      ,col3
    FROM {{ ref(source) }}

    {%- if not loop.last %}
    UNION ALL
    {%- endif %}

{%- endfor %}
)
SELECT * FROM all_sources

By itself that is not useful, so I'd also use materialized='incremental' in the config and add a unique_key. That could be the individual models' primary keys if they don't collide; more likely they do collide, so use something like the dbt-utils surrogate_key function to concatenate some useful type/categorization of the source tables with their primary key.
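For intuition, here is a rough Python analogue of the surrogate-key idea: hash the concatenation of a source-table label and that table's primary key, so keys coming from different source tables cannot collide. The '-' separator and md5 mirror the general dbt-utils approach, but the exact dbt-utils behavior may differ.

```python
import hashlib

def surrogate_key(*parts):
    # Join the parts with '-' (None becomes empty string) and md5-hash them.
    joined = "-".join("" if p is None else str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

key_login = surrogate_key("activity_login", 42)
key_comment = surrogate_key("activity_comment", 42)
print(key_login != key_comment)  # same id, different source: distinct keys
print(key_login == surrogate_key("activity_login", 42))  # deterministic
```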

If there are variations in the columns between source tables, e.g. some source tables need a fake NULL column or require a different column naming, then some jinja ternary statements can help.

e.g. instead of selecting just col2 in the SQL you could do:
{{ 'NULL' if source in ['activity_login'] else 'col2' }} AS col2

If the rules start to get more complex then instead of setting a list of strings up front I would use a list of dictionaries e.g.

{%-
set source_model_rule = [
    dict(name = 'activity_comment',
         type = 'small',
         depth = 'mid'
    ),
    dict(name = 'activity_login',
         type = 'extra-medium',
         depth = 'deep'
    )
]
-%}

On the other hand, I do see use cases for this. For example, I would like to use this feature for tests: it would be nice to define a test template or pattern to be run for all models in a directory, and additionally to be able to access config values for models, e.g. the start_date of the dbt-utils insert_by_period materialization.

@drewbanin (Contributor)

I think that's a really fair point @davehowell. I am inclined to agree with you -- making the model filepath subtly significant like this can definitely lead to confusion. I'd probably recommend using a tag selector here to fetch all of the models with a given tag. That feels like the right combination of "obvious" and "abstracted" to me.

I think we can decouple something like nodes.find.resources() (or nodes.find.resource()) from the larger question of dynamic dag references. It would also be compelling to make the Relation object returned by ref() include config info about the node itself, but I think we're less well suited to implement that in dbt today. I'll have a further think about this and follow up here with any comments!

@emily-hawkins

As requested by @clrcrl, adding a use case here from the dbt Slack channel: https://getdbt.slack.com/archives/C0VLZPLAE/p1576167419187600

Related to @drewbanin's comment above: the use case is conditional post-hooks based on model tags (i.e. models tagged with bi_tool should get different grants than those tagged with something else), rather than having to put all grants separately under each model section or in each config.

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Dec 29, 2021
@github-actions github-actions bot closed this as completed Jan 5, 2022
@HintonBR commented Mar 4, 2022

@drewbanin Looks like this idea went stale, but I was curious whether you thought any further about having a sub-table upsert into a table, rather than forcing the union-all pattern when you want to combine multiple tables into one. I am new enough to dbt that I couldn't tell from your proposal whether it supports dynamic referencing for the union-all (and other patterns) or the sub-table upsert; it seemed like the former. I was really interested in the upsert of a sub-table into a table as a pattern, for the reasons you listed in that comment (no reason to query all tables when only one has changed), and I had similar functionality in a previous data engineering framework I have used.
