Defining vars, folder-level configs outside dbt_project.yml #2955

Open
jtcohen6 opened this issue Dec 15, 2020 · 19 comments
Labels
enhancement New feature or request paper_cut A small change that impacts lots of users in their day-to-day Refinement Maintainer input needed vars

Comments

@jtcohen6
Contributor

Describe the feature

From @benjaminsingleton:

I’d like to use project-level variables more, but I’m concerned about bloat in my already large dbt_project.yml file. I think it would be helpful if I could create a variables.yml file that could be imported in dbt_project.yml. And for that matter, the same could be done for other configurations in the dbt_project.yml file. I think having the ability to separate configurations into different files might make for improved modularity / separation of concerns (particularly for large projects), not to mention fewer merge conflicts. CC @jrandrews

Describe alternatives you've considered

  • We're already thinking of enabling some configs in resource-YAML files (Set configs in schema.yml files #2401), but these would be at the level of the individual resource (model/seed/snapshot/etc) only
  • The dbt_project.yml gets really really big??

Additional context

  • I don't think this has any correspondence to v1.0. It's a nice thing to have, and we could do it before, after, or any time without it being a breaking change in any way.

Who will this benefit?

  • Developers and maintainers of increasingly big dbt projects
@jtcohen6 jtcohen6 added the enhancement New feature or request label Dec 15, 2020
@codigo-ergo-sum

codigo-ergo-sum commented Dec 16, 2020

Some additional thoughts/failure modes/concerns to think about with this:

  1. In large dbt projects, one problem with project-wide variables is the potential for developers to "step" on each other by editing, overwriting, or conflicting with each other's var declarations. If devs were meticulous in looking at the project-wide relevance of a given var, then this might happen less or not at all, but that is unfortunately often not the case. If we allow variable declaration (and other things) outside of just one file (say dbt_project.yml), and it is an arbitrary number of files, then I can see Dev1 defining my_var_p in variable_file_1.yml and then Dev2 defining my_var_p in variable_file_2.yml. I suppose/hope that dbt would detect that and throw an error, but there are still some clunky workflow issues in allowing variable declarations in multiple different .yml files.
  2. Vars need to be parsed before other .yml files for declarations around models, tests, etc. are parsed. Right now this problem is handled by having one hard-coded .yml file (that is, dbt_project.yml) to be parsed before the other .yml files, but if we loosen this then dbt still needs a way to be able to determine how/what to parse in "pass 1" of parsing for vars (and I am sure a lot of other things that other, smarter people than I know already happen first :) ) versus "pass 2" of parsing for other things like tests, models, etc. And not just dbt -- this understanding of what .yml file gets parsed when needs to be not-too-hard to quickly understand for average devs. Otherwise people will be just littering random var declarations mixed in with tests and model config and then getting confused why things don't work.
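The duplicate-declaration check hoped for in point 1 could look roughly like the following. This is only a sketch, not anything dbt does today: the function name find_conflicting_vars is hypothetical, and it assumes each vars file has already been parsed into a plain dict.

```python
from collections import defaultdict


def find_conflicting_vars(files):
    """Return vars declared in more than one file.

    `files` maps a file name to the dict of vars declared in it
    (assumed to be parsed from YAML already; this is a hypothetical helper).
    """
    declared_in = defaultdict(list)
    for file_name, declared in files.items():
        for var_name in declared:
            declared_in[var_name].append(file_name)
    # Keep only the names that appear in two or more files.
    return {
        name: sorted(paths)
        for name, paths in declared_in.items()
        if len(paths) > 1
    }


conflicts = find_conflicting_vars({
    "variable_file_1.yml": {"my_var_p": 1, "start_date": "2020-01-01"},
    "variable_file_2.yml": {"my_var_p": 2},
})
print(conflicts)
```

A parser could raise an error listing every file that declares the same name, which at least makes the collision visible at parse time rather than at run time.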

One thought that comes to me - what if we had another separate set of subdirectories that were specifically dedicated to .yml files for config, like we already have a subdirectory declaration/space for snapshots, models, etc. Only things related to/extracted from dbt_project.yml could be put in there, and any other things related to models, snapshots, etc., would trigger a parsing error. This doesn't fix problem 1 above but it at least helps with problem 2. What say ye?

P.S. Also, I know var namespacing was removed for dbt_project.yml v2 config in .17 for some good reasons but it's also pretty hard not to have any way to do variable scoping in larger projects.

@jtcohen6
Contributor Author

@codigo-ergo-sum I don't think I've ever seen you post from this alt account before. It goes without saying that I like the handle.

what if we had another separate set of subdirectories that were specifically dedicated to .yml files for config, like we already have a subdirectory declaration/space for snapshots, models, etc.

This is along the lines of what I was thinking: either an explicit set of subdirectories, or an explicit set of named files. I've been around just long enough to remember when packages was a special dict in dbt_project.yml rather than its own file; we split it out because we expected it to grow in size, and because it served a distinct purpose. We made the same choice for selectors.yml.

I'd be especially keen on a vars.yml: variables have a slightly different parsing context, we can be strict about accepting only literal values, and we could even do a better job of parsing vars.yml before parsing dbt_project.yml. That would make default values of vars called in dbt_project.yml work the way folks expect, rather than how it is today. I like that correspondence between vars.yml and CLI --vars, similar to how env vars can be sourced from an *.env file or prepended to a CLI command.
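To make the layering concrete, here is a minimal sketch of how lookups might resolve if a vars.yml existed. CLI --vars taking precedence over dbt_project.yml matches current behaviour; placing vars.yml underneath dbt_project.yml is purely an assumption for illustration, and resolve_vars is a hypothetical name, not a dbt function.

```python
from collections import ChainMap


def resolve_vars(cli_vars, project_vars, vars_file):
    """Layer var sources so that earlier mappings win on lookup.

    Assumed precedence (highest first): CLI --vars, then vars in
    dbt_project.yml, then a hypothetical vars.yml.
    """
    return ChainMap(cli_vars, project_vars, vars_file)


vars_file = {"stage": "dev", "lookback_days": 3}   # hypothetical vars.yml
project_vars = {"lookback_days": 7}                 # vars: block in dbt_project.yml
cli_vars = {"stage": "prod"}                        # --vars '{"stage": "prod"}'

resolved = resolve_vars(cli_vars, project_vars, vars_file)
print(resolved["stage"], resolved["lookback_days"])
```

Whatever order is chosen, the point is that the defaults are known before dbt_project.yml is rendered, so a var referenced there always has a value.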

Configurations feel a bit trickier, because these can be especially verbose. How to coordinate hierarchies across multiple files without someone tripping over someone else? I'm honestly not sure. The cleanest separation I can envision would be allowing a project to have one each of models.yml, seeds.yml, etc.

P.S. Also, I know var namespacing was removed for dbt_project.yml v2 config in .17 for some good reasons but it's also pretty hard not to have any way to do variable scoping in larger projects.

This is fair. I wonder if the ability to scope vars differently for different model subsets may ultimately serve as a valid reason to split very big projects up into multiple sub-projects, installed as packages. That's regardless of whether they live in the same or separate repositories.

@codigo-ergo-sum

Thanks for the compliment on the username @jtcohen6 :).

I think a vars.yml file would be a definite improvement over the current situation.

Would it be required, or could vars also still be defined in dbt_project.yml? If required, then that probably requires a new version 3 of the dbt_project.yml schema, which is a significant change for existing users, right?

If not required, then what would the behavior look like if vars are defined in both places, and if they conflict? And are you suggesting that vars.yml would be parsed before dbt_project.yml is parsed? Allowing full, "no-gotcha" usage of vars in dbt_project.yml?

@danielefrigo
Contributor

I absolutely support the idea of parsing the vars before the dbt_project.yml.
This would enable leveraging vars in many additional ways, e.g. enabling or disabling subfolders or defining schemas from vars, without losing the ability to simply run a model using the vars' default values.

@moltar

moltar commented Oct 25, 2021

Having outside vars available in dbt_project.yml would be a huge improvement.

We have a complex dbt_project.yml, with lots of repetition and using Jinja a lot.

For example:

source-paths:
  - modules/shared/models
  - modules/module1/models
  - modules/module2/models
  - stages/{{ env_var('DBT_STAGE', '@fake@') }}/models

We then enable a particular stage via an env var during deployment.

Would be great to set the stage once in the vars file, and then just use the var itself in the config. Also be able to define module names / prefixes, or even an entire array of modules to loop over.

@krazavet-tinyclues

Hello 👋

In our company, we use DBT a lot in a multi-tenant context. For that purpose, we rely heavily on DBT variables, which we use to propagate the client configuration. Those configurations can differ a lot from one client to another.
We have sometimes hit the error "argument list too long: dbt", which is caused by the large config payload (e.g. some of them can reach more than 600Kb).

We have not found a proper workaround so far. Passing a file path instead of a payload for our variables would probably solve our issue. This is why we are keen to know whether there is any chance you would consider such a feature for DBT? (cc. @jtcohen6)

Thank you in advance 🙏

@ybressler

++ this feature. The solution implemented by Jekyll (with _data directory) comes to mind as suitable.

@ciprian-mandras

Hi all,

I agree that a vars.yml would be a good boost, but it would only give you global variables. Based on my experience, I think you should also consider a solution for local variables: some way for a model's configuration to point to a model_vars.yml, so that the model uses the variables defined there.

Thank you.

@itechprasanth

I agree. I hope this gets implemented soon, as it is always good practice to modularize configurations rather than keeping everything in a single file.

@jtcohen6 jtcohen6 added paper_cut A small change that impacts lots of users in their day-to-day Refinement Maintainer input needed labels Nov 28, 2022
@vitorefazevedo

Does dbt still only allow global variables defined in dbt_project.yml?

@codigo-ergo-sum

Just looking through issue backlogs and wanted to bump this... Would be great as we are working with projects that have tens or even hundreds of variables now. Also the lack of ability to namespace them is still challenging.

@apolorei

apolorei commented Jun 2, 2023

I'm also facing this problem and would very much love some ideas of how to tackle it!

@timvw

timvw commented Jun 5, 2023

Currently we're trying to work around this issue by using environment variables (with tooling via direnv and a .envrc file).

@jeremyyeo
Contributor

Sneaky workaround whilst waiting for this to be built into core. Basically, move var declarations into macro files:

https://gist.github.com/jeremyyeo/06d552ee8facc8100416655ebc25d9b9

@krazavet-tinyclues

krazavet-tinyclues commented Jun 12, 2023

Sneaky workaround whilst waiting for this to be built into core. Basically, move var declarations into macro files:

https://gist.github.com/jeremyyeo/06d552ee8facc8100416655ebc25d9b9

This is exactly what we started to POC in our DBT stack: using a dedicated macro file to load a bigger JSON payload.
The idea is to generate a macro file containing all DBT variables. In the end it should look something like this 👇

{% macro get_config() %}
  {{ return(fromjson("<JSON_CONTENT_HERE>")) }} 
{% endmacro %}

And then you can use it in your model:

{% set some_var = get_config().get(...) %}

That's a workaround that should do the job.
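The generation step could be sketched like this. It is a rough illustration rather than our actual tooling: render_config_macro is a hypothetical name, and it assumes that encoding the JSON payload a second time yields a string literal that Jinja parses back to the original text, which is worth verifying in a real project before relying on it.

```python
import json

# Doubled braces are str.format escapes; the rendered text uses normal Jinja tags.
MACRO_TEMPLATE = """{{% macro get_config() %}}
  {{{{ return(fromjson({literal})) }}}}
{{% endmacro %}}
"""


def render_config_macro(config):
    """Render the text of a macro file exposing `config` through get_config()."""
    payload = json.dumps(config, sort_keys=True)
    # Encode a second time to get a double-quoted literal with quotes and
    # backslashes escaped (assumption: Jinja unescapes it the same way).
    literal = json.dumps(payload)
    return MACRO_TEMPLATE.format(literal=literal)


config = {"client": "acme", "features": {"enabled": True}}
macro_text = render_config_macro(config)
print(macro_text)
```

The resulting text would be written to a .sql file under the project's macro path before invoking dbt, so the payload never has to travel through the command line.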

@markproctor1

markproctor1 commented Aug 7, 2023

Folder-level configs would make a huge difference on my project. With 50+ developers and growing we don't want anyone to modify project-level files day-to-day, but we do want them to manage many files and folders in their subject area.

Folder-level configs would do this. Clearly the need is there, which is why they are featured in dbt_project.yml, but this causes governance and git conflict problems when many teams are trying to make changes to project-level files at the same time.

Basically, I need to treat our subject areas as mini-projects, each mini-project having its own configuration.

@dbeatty10
Contributor

Within #8869, @slotrans described var() not being able to see vars defined in dbt_project.yml for the purposes of configuring query-comment.

If this feature request were added, then it would solve that use-case.

Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Apr 16, 2024
@rlh1994
Contributor

rlh1994 commented Apr 16, 2024

[Image: Apu Jumps]

@dbeatty10 dbeatty10 removed the stale Issues that have gone stale label Apr 16, 2024