Dynamic source definitions #5686
Replies: 2 comments
---
@fivetran-joemarkiewicz Thanks for opening — this sounds a lot like #3862! Though that issue has been closed as stale, it's never really left my mind. I don't have any perfect remedies to share, but I can say that this adds onto an interesting conversation of late: How should users—or people, such as package authors, building more complex products on top of dbt—programmatically generate dbt project resources?

## Patterns that exist today

### yaml anchors

Worth stating for anyone totally new: In the simplest case, where you're defining your own sources that happen to have most of their properties in common, we do support slightly DRYer source creation, by means of yaml anchors:

```yaml
sources:
  - &xero_base
    name: xero_mx
    database: "{{ var('xero_database', target.database) }}"
    tables:
      - name: account
      - name: contact
      - name: ...
  - <<: *xero_base
    name: xero_ca
  - <<: *xero_base
    name: xero_another_one
```

That doesn't give you, as package author, the programmatic means to pull just a variable list from the user (…)

### externalize it?

If the column spec for every (…) This sounds a lot like the Union Feature you've built, though I'm thinking it could be done one step further up the stack: while staging/loading an external table. Most data warehouses can create external tables that map to specific files in S3/GCS/Blob/etc, with specific file name patterns to identify candidates. Imagine, for Snowflake (using the syntax from the dbt-external-tables package):

```yaml
sources:
  - name: xero
    database: "{{ var('xero_database', target.database) }}"
    tables:
      - name: account
        external: "@raw.fivetran.xero_stage"
        pattern: ".*account"
        partitions:
          - name: xero_connector
            data_type: string
            expression: substr(metadata$filename, ..., ...) # no need to specify — Snowflake will infer
        columns:
          # common for all Xero connectors
          - name: ...
      - name: contact
        external: ...
```

This would yield one dbt source, mapped to one external table, for each Xero object (account, contact, invoice, etc), including the data from all Xero connectors.

## Future

### "union feature" for sources?

A future way to do this, without requiring an external table in the mix, might be to enable dbt to create its own simple "union" feature for sources:

```yaml
## xero_schema = ['xero_mx','xero_ca']
sources:
  - name: xero
    database: "{{ var('xero_database', target.database) }}"
    union_schemas: ['xero_mx', 'xero_ca']
    tables:
      - name: account
      - name: contact
```

Then, dbt would "create" each source as:
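As a sketch of what such a "created" union source might compile to (purely illustrative: `union_schemas`, the `_dbt_source_schema` column, and this compilation strategy are assumptions, not actual dbt behavior):

```python
# Hypothetical sketch: expand one "union" source table into a UNION ALL
# across schemas. All names here are illustrative assumptions.
def compile_union_source(union_schemas, table_name):
    selects = [
        f"select *, '{schema}' as _dbt_source_schema from {schema}.{table_name}"
        for schema in union_schemas
    ]
    return "\nunion all\n".join(selects)

sql = compile_union_source(["xero_mx", "xero_ca"], "account")
```

Each branch tags its rows with the schema it came from, so downstream models can still tell the connectors apart after the union.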
We could use the slightly trickier logic in (…). I'd want to be quite limited about this functionality, to maintain the clear distinction between sources as simple "pointers" to raw data, versus models as vehicles for data transformation.

### External tables as programmatic resource creation

If you really do want one project resource per unique table, and want to avoid 1000s of repetitive lines of code, then there really is no solution other than programmatically generating them. There's a lot of prior discussion about this in: #5101
I have a belief, ever-more strongly held, that while we allow Jinja expressions to programmatically configure project resources, we should not extend Jinja to programmatically create project resources. This is a subtle distinction. Where this is allowed:

```yaml
models:
  - name: my_model
    tests:
      - unique
      - not_null:
          severity: "{{ var('severity') }}"
```

This should not be, because the (…):

```yaml
models:
  - name: my_model
    tests: "{{ var('tests') }}"
```

The same would go for:

```jinja
{% for var in [some, list] %}
  {% model model_{{ var }} %}
    ...
  {% endmodel %}
{% endfor %}
```

At this point, I just don't see us ever supporting that, to be honest. Of course, there are people who have found the need to programmatically template / generate dbt project code. They've done so using their own scripting tools, and with dbt project files (yaml + SQL) as their compile target.

Something we have been thinking about more recently: What if this is a case for real code generation? Another interface into dbt project resources, written in Python, such that you (as an end user) could tap into the same "internal manifest" APIs that dbt uses when it parses the code in your project? Imagine something like:

```python
xero_schema = ['xero_mx','xero_ca']
xero_tables = ['account', 'contact', ...]

# absolutely imaginary code
for schema in xero_schema:
    for table in xero_tables:
        new_source = (
            source_name = schema,
            name = table,
            database = var('xero_database', target.database)
            ...
        )
        dbt.Manifest.sources.add(new_source)
```

This will be tricky to think through—parsing order becomes all the more confusing; it significantly expands the boundaries of what "dbt-only" code can do—but there are enough legitimate use cases among "superusers" that it's not something I'm willing to dismiss out of hand.

In any case, I'm going to convert this one to a discussion. I don't foresee us taking any immediate steps in these directions, except by continuing to invest in the programmatic interfaces that would make more feasible some version of the last wild-eyed idea.
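A runnable toy version of that idea, purely as a sketch (the `Source` and `Manifest` classes below are invented stand-ins, not dbt's actual internal APIs):

```python
from dataclasses import dataclass, field

# Invented stand-ins for the imagined manifest API; dbt's real internals differ.
@dataclass(frozen=True)
class Source:
    source_name: str  # e.g. the schema the connector landed in
    name: str         # the table name
    database: str

@dataclass
class Manifest:
    sources: list = field(default_factory=list)

    def add(self, source: Source) -> None:
        self.sources.append(source)

manifest = Manifest()
xero_schemas = ["xero_mx", "xero_ca"]
xero_tables = ["account", "contact"]
for schema in xero_schemas:
    for table in xero_tables:
        manifest.add(Source(source_name=schema, name=table, database="raw"))
```

A real version would also have to settle parsing order and uniqueness checks, which is part of why this is tricky to think through.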
---
FYI, a potential follow-up is discussed at #7145
---
### Is this your first time submitting a feature request?

### Describe the feature

#### Context
Since the rollout of the Union Feature in a number of Fivetran dbt packages, I've seen users have an increasing appetite for adopting the feature in the compatible packages, as well as requesting the feature in others. This feature is great for users who have multiple similar datasets (maybe the user is an agency, consultancy, or company with many separate business units). However, there is one major flaw with this feature in the Fivetran dbt packages: it breaks dbt convention by not generating a source.yml for each schema/database the user passes through. Therefore, the user is not able to effectively test freshness or generate an accurate DAG 😞.
The Fivetran dbt packages with this feature use only one source.yml as a template, which is then leveraged in a custom macro that ultimately passes a list of relations to `dbt_utils.union_relations()`. This works great during `dbt run` as I mentioned; however, the DAG then shows no relation to the sources (see the image below). However, when manually adding multiple sources and naming them properly, you can see the sources are properly defined! 🎉
This is great, but I only had to define two sources. What happens when tomorrow we add another Xero dataset (in this case)? How about in a month when our company grows and we add 100 other Xero datasets?! Managing all those sources and creating the same source.yml time after time can become quite unwieldy and is not ideal. It would be great if we could somehow dynamically generate sources from a single yml file 🤯!
#### Feature Request
It would be great if there was a way to use Jinja (or some other tool) to dynamically generate the same source.yml for multiple datasets that are the same. I am envisioning something like the following:
The above does not work, but I imagine some form of dynamic Jinja could allow a single source.yml to be reused multiple times if the structure of the schema is the same.
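In the meantime, one workaround in this spirit is external code generation rather than Jinja: a small script that writes the repeated source YAML before dbt ever parses the project. A minimal sketch, assuming hypothetical schema/table lists and output path:

```python
# Hypothetical pre-processing script: emit one dbt source entry per schema.
# The schema list, table list, and target file are all assumptions.
xero_schemas = ["xero_mx", "xero_ca"]
xero_tables = ["account", "contact"]

def render_sources(schemas, tables):
    lines = ["version: 2", "sources:"]
    for schema in schemas:
        lines.append(f"  - name: {schema}")
        lines.append("    tables:")
        for table in tables:
            lines.append(f"      - name: {table}")
    return "\n".join(lines)

yaml_text = render_sources(xero_schemas, xero_tables)
# A real script would then write this to e.g. models/staging/src_xero.yml
```

Because every schema gets a real source entry, source freshness tests and the DAG both work, at the cost of a generation step outside dbt.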
### Describe alternatives you've considered
Manually writing the source.yml for each source. This does work, but is not scalable.
### Who will this benefit?
dbt users whose companies follow the consultancy, agency, or high-growth model and see the same datasets being added over time.
### Are you interested in contributing this feature?
I don't feel equipped to solve this issue, but open to helping however I can.
### Anything else?
Here is the fivetran_utils macro that is called within the custom macro identified in the "Describe the feature" section. You can see the bulk of the efforts are in place to pass a list of relations to `dbt_utils.union_relations()`.