Dynamic source definitions #5686
Replies: 2 comments
---
@fivetran-joemarkiewicz Thanks for opening — this sounds a lot like #3862! Though that issue has been closed as stale, it's never really left my mind. I don't have any perfect remedies to share, but I can say that this adds onto an interesting conversation of late: How should users—or people, such as package authors, building more complex products on top of dbt—programmatically generate dbt project resources?

## Patterns that exist today

### yaml anchors

Worth stating for anyone totally new: In the simplest case, where you're defining your own sources that happen to have most of their properties in common, we do support slightly DRYer source creation, by means of yaml anchors:

```yaml
sources:
  - &xero_base
    name: xero_mx
    database: "{{ var('xero_database', target.database) }}"
    tables:
      - name: account
      - name: contact
      - name: ...
  - <<: *xero_base
    name: xero_ca
  - <<: *xero_base
    name: xero_another_one
```

That doesn't give you, as package author, the programmatic means to pull just a variable list from the user (…)

### externalize it?

If the column spec for every (…) This sounds a lot like the Union Feature you've built, though I'm thinking it could be done one step further up the stack: while staging/loading an external table. Most data warehouses can create external tables that map to specific files in S3/GCS/Blob/etc, with specific file name patterns to identify candidates. Imagine, for Snowflake (using the syntax from the dbt-external-tables package):

```yaml
sources:
  - name: xero
    database: "{{ var('xero_database', target.database) }}"
    tables:
      - name: account
        external: "@raw.fivetran.xero_stage"
        pattern: ".*account"
        partitions:
          - name: xero_connector
            data_type: string
            expression: substr(metadata$filename, ..., ...) # no need to specify — Snowflake will infer
        columns:
          # common for all Xero connectors
          - name: ...
      - name: contact
        external: ...
```

This would yield one dbt source, mapped to one external table, for each Xero object (account, contact, invoice, etc), including the data from all Xero connectors.

## Future

### "union feature" for sources?

A future way to do this, without requiring an external table in the mix, might be to enable dbt to create its own simple "union" feature for sources:

```yaml
## xero_schema = ['xero_mx','xero_ca']
sources:
  - name: xero
    database: "{{ var('xero_database', target.database) }}"
    union_schemas: ['xero_mx', 'xero_ca']
    tables:
      - name: account
      - name: contact
```

Then, dbt would "create" each source as:
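As a sketch of what such a "created" union source might compile to (purely illustrative: `union_schemas`, the `_dbt_source_schema` column, and this compilation strategy are assumptions, not actual dbt behavior):

```python
# Hypothetical sketch: expand one "union" source table into a UNION ALL
# across schemas. All names here are illustrative assumptions.
def compile_union_source(union_schemas, table_name):
    selects = [
        f"select *, '{schema}' as _dbt_source_schema from {schema}.{table_name}"
        for schema in union_schemas
    ]
    return "\nunion all\n".join(selects)

sql = compile_union_source(["xero_mx", "xero_ca"], "account")
```

Each branch tags its rows with the schema it came from, so downstream models can still tell the connectors apart after the union.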
We could use the slightly trickier logic in (…). I'd want to be quite limited about this functionality, to maintain the clear distinction between sources as simple "pointers" to raw data, versus models as vehicles for data transformation.

### External tables as programmatic resource creation

If you really do want one project resource per unique table, and want to avoid 1000s of repetitive lines of code, then there really is no solution other than programmatically generating them. There's a lot of prior discussion about this in: #5101
I have a belief, ever-more strongly held, that while we allow Jinja expressions to programmatically configure project resources, we should not extend Jinja to programmatically create project resources. This is a subtle distinction. Where this is allowed:

```yaml
models:
  - name: my_model
    tests:
      - unique
      - not_null:
          severity: "{{ var('severity') }}"
```

This should not be, because the (…):

```yaml
models:
  - name: my_model
    tests: "{{ var('tests') }}"
```

The same would go for:

```jinja
{% for var in [some, list] %}
  {% model model_{{ var }} %}
    ...
  {% endmodel %}
{% endfor %}
```

At this point, I just don't see us ever supporting that, to be honest. Of course, there are people who have found the need to programmatically template / generate dbt project code. They've done so using their own scripting tools, and with dbt project files (yaml + SQL) as their compile target.

Something we have been thinking about more recently: What if this is a case for real code generation? Another interface into dbt project resources, written in Python, such that you (as an end user) could tap into the same "internal manifest" APIs that dbt uses when it parses the code in your project? Imagine something like:

```python
xero_schema = ['xero_mx','xero_ca']
xero_tables = ['account', 'contact', ...]

# absolutely imaginary code
for schema in xero_schema:
    for table in xero_tables:
        new_source = (
            source_name = schema,
            name = table,
            database = var('xero_database', target.database)
            ...
        )
        dbt.Manifest.sources.add(new_source)
```

This will be tricky to think through—parsing order becomes all the more confusing; it significantly expands the boundaries of what "dbt-only" code can do—but there are enough legitimate use cases among "superusers" that it's not something I'm willing to dismiss out of hand.

In any case, I'm going to convert this one to a discussion. I don't foresee us taking any immediate steps in these directions, except by continuing to invest in the programmatic interfaces that would make more feasible some version of the last wild-eyed idea.
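A runnable toy version of that idea, purely as a sketch (the `Source` and `Manifest` classes below are invented stand-ins, not dbt's actual internal APIs):

```python
from dataclasses import dataclass, field

# Invented stand-ins for the imagined manifest API; dbt's real internals differ.
@dataclass(frozen=True)
class Source:
    source_name: str  # e.g. the schema the connector landed in
    name: str         # the table name
    database: str

@dataclass
class Manifest:
    sources: list = field(default_factory=list)

    def add(self, source: Source) -> None:
        self.sources.append(source)

manifest = Manifest()
xero_schemas = ["xero_mx", "xero_ca"]
xero_tables = ["account", "contact"]
for schema in xero_schemas:
    for table in xero_tables:
        manifest.add(Source(source_name=schema, name=table, database="raw"))
```

A real version would also have to settle parsing order and uniqueness checks, which is part of why this is tricky to think through.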
---
FYI, a potential follow-up is discussed at #7145
---
### Is this your first time submitting a feature request?

### Describe the feature

#### Context
Since the rollout of the Union Feature in a number of Fivetran dbt packages, I've seen users have an increasing appetite for adopting the feature in the compatible packages, as well as requesting the feature in others. This feature is great for users who have multiple similar datasets (maybe the user is an agency, consultancy, or company with many separate business units). However, there is one major flaw with this feature in the Fivetran dbt packages: it breaks dbt convention by not generating a source.yml for each schema/database the user passes through. Therefore, the user is not able to effectively test freshness or generate an accurate DAG 😞.
The Fivetran dbt packages with this feature use only one source.yml as a template, which is then leveraged in a custom macro that ultimately passes a list of relations to `dbt_utils.union_relations()`. This works great during `dbt run` as I mentioned; however, the DAG then shows no relation to the sources (see the image below). However, when manually adding multiple sources and naming them properly, you can see the sources are properly defined! 🎉
This is great, but I only had to define two sources. What happens when tomorrow we add another Xero dataset (in this case)? How about in a month when our company grows and we add 100 other Xero datasets?! Managing all those sources and creating the same source.yml time after time can become quite unwieldy and is not ideal. It would be great if we could somehow dynamically generate sources from a single yml file 🤯!
#### Feature Request
It would be great if there was a way to use Jinja (or some other tool) to dynamically generate the same source.yml for multiple datasets that are the same. I am envisioning something like the following:
The above does not work, but I imagine some form of dynamic Jinja could allow a single source.yml to be reused multiple times if the structure of the schema is the same.
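In the meantime, one workaround in this spirit is external code generation rather than Jinja: a small script that writes the repeated source YAML before dbt ever parses the project. A minimal sketch, assuming hypothetical schema/table lists and output path:

```python
# Hypothetical pre-processing script: emit one dbt source entry per schema.
# The schema list, table list, and target file are all assumptions.
xero_schemas = ["xero_mx", "xero_ca"]
xero_tables = ["account", "contact"]

def render_sources(schemas, tables):
    lines = ["version: 2", "sources:"]
    for schema in schemas:
        lines.append(f"  - name: {schema}")
        lines.append("    tables:")
        for table in tables:
            lines.append(f"      - name: {table}")
    return "\n".join(lines)

yaml_text = render_sources(xero_schemas, xero_tables)
# A real script would then write this to e.g. models/staging/src_xero.yml
```

Because every schema gets a real source entry, source freshness tests and the DAG both work, at the cost of a generation step outside dbt.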
### Describe alternatives you've considered
Manually writing the source.yml for each source. This does work, but is not scalable.
### Who will this benefit?
dbt users whose companies follow the consultancy, agency, or high-growth model and see the same datasets being added over time.
### Are you interested in contributing this feature?
I don't feel equipped to solve this issue, but open to helping however I can.
### Anything else?
Here is the fivetran_utils macro that is called within the custom macro identified in the "Describe the feature" section. You can see the bulk of the efforts are in place to pass a list of relations to `dbt_utils.union_relations()`.