[CT-777] Default to current working directory for profiles.yml #5411

Closed
dbeatty10 opened this issue Jun 27, 2022 · 9 comments · Fixed by #5717
Labels
cli enhancement New feature or request

Comments

@dbeatty10
Contributor

dbeatty10 commented Jun 27, 2022

Summary of the feature

Look for profiles.yml within the current working directory first. Fall back to the ~/.dbt/ directory.

I will submit an experimental/draft PR that shows how this could be implemented.

Who will this benefit?

Instigating use-case

The instigating use-case:

  • shipping a self-contained database + dbt project -- batteries-included!

Embedded databases like DuckDB and SQLite can utilize the same compute resources as dbt-core + dbt-{adapter}. In the case of non-sensitive data, profiles.yml defaulting to the current working directory would enable projects to work without mucking with environment variables.

Summary of proposal for unified conventions

This proposal would align the search-precedence conventions for the profiles.yml and dbt_project.yml directories:

  1. command line flag option
  2. environment variable
  3. one or more paths to search

Pros

  • Projects that have no secret values can work out of the box without any further intervention on the part of the user -- batteries included! 🔋
    • This can be especially useful for use cases like demo projects or projects exclusively containing public data sets.
  • There is no additional tooling to install and manage (in contrast to most of the alternatives)
  • Works great with existing conventions:
    • Merely don't create a profiles.yml in the current working directory if a centralized config in ~/.dbt/ is desired instead.

It also supports this exotic option:

  • If profiles.yml does need to contain plain-text secrets for some reason, you can still safely check it into version control using a tool like BlackBox 🤯

Cons

  • This could be considered breaking behavior for any projects currently storing a profiles.yml in the project root.
    • However, this would only be breaking in the case that the local profiles.yml is non-functional / undesired (which feels unanticipated and unlikely)
    • Any local profiles.yml is most likely already used by the project anyway, via DBT_PROFILES_DIR or by copying it into ~/.dbt/

Background context

dbt has a solid foundation of convention over configuration (CoC), and this proposal would lean into this further.

Current behavior

dbt needs a profiles.yml configuration file for database connection info. I believe the current order of precedence of ... is:

  1. --profiles-dir option
  2. DBT_PROFILES_DIR environment variable
  3. ~/.dbt/ directory

dbt also needs a dbt_project.yml. The current order of precedence is:

  1. --project-dir option
  2. current working directory

Desired behavior

Search order for profiles.yml:

  1. --profiles-dir option
  2. DBT_PROFILES_DIR environment variable
  3. current working directory (NEW)
  4. ~/.dbt/ directory

Search order for dbt_project.yml:

  1. --project-dir option
  2. DBT_PROJECT_DIR environment variable (NEW)
  3. current working directory
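
The two search orders above follow the same pattern (flag, then environment variable, then one or more search paths). A minimal sketch of that pattern, not dbt's actual implementation: the function names `resolve_dir`, `resolve_profiles_dir`, and `resolve_project_dir` are hypothetical, and falling back to the last search path when nothing is found is an assumption made for illustration.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def resolve_dir(
    filename: str,
    cli_value: Optional[str],
    env_value: Optional[str],
    search_paths: Iterable[Path],
) -> Path:
    """Pick a directory: CLI flag, then environment variable, then search paths."""
    if cli_value:  # 1. command line flag option
        return Path(cli_value)
    if env_value:  # 2. environment variable
        return Path(env_value)
    paths = list(search_paths)  # must be non-empty
    for directory in paths:  # 3. first search path that contains the file
        if (directory / filename).is_file():
            return directory
    # Assumption: if no path contains the file, report the last search path.
    return paths[-1]


def resolve_profiles_dir(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Path:
    """Proposed order for profiles.yml: flag, DBT_PROFILES_DIR, cwd, ~/.dbt/."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    return resolve_dir(
        "profiles.yml", cli_value, env.get("DBT_PROFILES_DIR"), [cwd, home / ".dbt"]
    )


def resolve_project_dir(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[Path] = None,
) -> Path:
    """Proposed order for dbt_project.yml: flag, DBT_PROJECT_DIR, cwd."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else cwd
    return resolve_dir("dbt_project.yml", cli_value, env.get("DBT_PROJECT_DIR"), [cwd])
```

With this shape, a project that ships its own profiles.yml is picked up from the working directory, while a user with no local file still lands on ~/.dbt/.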

General design requirements

There are two necessary pieces for dbt to use a profile to connect to a target database:

  • property definitions for the desired target database
  • secrets to plug into the definition slots

There are two main design requirements in terms of discoverability and accessibility:

  1. property definitions are easily discoverable (since they are not sensitive)
  2. secrets are non-discoverable and access is restricted per security policies

Approach 1

A reason given for the current order of precedence (emphasis mine):

This file generally lives outside of your dbt project to avoid sensitive credentials being checked in to version control. By default, dbt expects the profiles.yml file to be located in the ~/.dbt/ directory.

Using a ~/.dbt/profiles.yml file is a solution that:

  • can combine property definitions and secrets within a single file (but leaves optionality for them to be separated)
  • is stored outside of the project directory to guarantee that it is not tracked in version control (VCS)

Pros

  • It can optionally use the same environment-variable approach as Approach 2 (below).
  • Can theoretically re-use the same profile (including hard-coded secrets) across multiple projects (or multiple instances of a project)

Cons

  • It is a little less obvious to see which environment variables need to be set -- need to actively search through the profiles.yml file

Approach 2

  • store property definitions in sample.profiles.yml (or just profiles.yml)
  • store secret definitions in:
    • test.env.example for local development
    • environment variables within the continuous integration (CI) environment
    • environment variables within non-CI environments (like production)

This requires doing all of the following for local development:

  1. Copy the sample.profiles.yml file into profiles.yml (within the desired profiles directory)
  2. Set the DBT_PROFILES_DIR environment variable or the --profiles-dir command-line interface (CLI) flag if the profiles directory is not ~/.dbt/
  3. Copy test.env.example file to test.env
  4. Add secret values to test.env
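
The copy steps above can be scripted. This is a rough sketch only: `copy_if_missing` and `parse_env_file` are hypothetical helpers, and the parser handles only plain KEY=VALUE lines, not the full dotenv syntax.

```python
import shutil
from pathlib import Path
from typing import Dict


def copy_if_missing(template: Path, target: Path) -> bool:
    """Copy template to target unless target already exists; return True if copied."""
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

After filling in test.env by hand, something like `os.environ.update(parse_env_file(Path("test.env")))` would load the values for the current process. The manual steps remain, which is the point of the "Cons" below.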

Pros

  • It is obvious to see which environment variables need to be set by looking at test.env.example

Cons

  • Still need to configure the location of profiles.yml somehow (~/.dbt/ or DBT_PROFILES_DIR or --profiles-dir)
    • There is no way to clone a repo and have it "just work"

Alternatives considered

  1. Use existing DBT_PROFILES_DIR environment variable
  2. Curated personal ~/.dbt/profiles.yml
  3. Put it in priority behind ~/.dbt/
  4. Add a setting in dbt_project.yml (similar to seed-paths)
  5. python-dotenv
  6. direnv
  7. Docker

DBT_PROFILES_DIR environment variable

The most straightforward solution today is to set the DBT_PROFILES_DIR environment variable to the root of the project (or some subdirectory).

Pros

  • This functionality is already available and many people and systems use it

Cons

  • Loading and unloading environment variables is neither simple nor common knowledge.
  • You have to remember to unset DBT_PROFILES_DIR when switching to a different project that doesn't have a profiles.yml in its working directory.
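
One way to sidestep the "remember to unset it" problem is to scope the variable to a block of work. A minimal sketch, assuming the commands run inside the same Python process; `temporary_env` is a hypothetical helper, not part of dbt.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable for the duration of a with-block, then restore it."""
    previous: Optional[str] = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier value, or remove the variable if it was unset before.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Usage would look like `with temporary_env("DBT_PROFILES_DIR", "."): ...`, where DBT_PROFILES_DIR is the variable dbt reads for the profiles directory. It still requires the user to know about environment variables at all, so it does not remove the first con.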

Curated personal ~/.dbt/profiles.yml

Pros

  • definitely not checked into VCS by accident
  • can contain secrets
  • doesn't have to contain secrets -- can also utilize environment variables

Cons

  • hard to read beyond a single project
  • doesn't help for the case of continuous integration (CI) or production deployment
  • undermines 12-factor app principles
    • twelve-factor apps store varying config in environment variables (rather than configuration files) (III. Config)
    • internal application config that does not vary between deploys is best done in the code (checked into version control)
    • the local ~/.dbt/profiles.yml will surely be managed differently than the CI and production versions of the same file (I. Codebase and X. Dev/prod parity)

In priority behind ~/.dbt/

Pros

  • 100% certain to be non-breaking behavior

Cons

  • Not actually the priority order we want
  • Would require changing the behavior to the desired priority order upon the next major release.

Add a setting in dbt_project.yml

Pros

  • 100% certain to be non-breaking behavior
  • Allows the behavior to be configured on a per-project basis rather than a global basis

Cons

  • Wouldn't be able to determine the final directory for profiles.yml until dbt_project.yml was found and parsed.
    • Maybe that's okay?
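
To make the ordering dependency concrete: the profiles directory could only be computed from an already-parsed dbt_project.yml. A sketch under loud assumptions: the `profiles-dir` key is hypothetical (it is not an existing dbt setting), and `profiles_dir_from_project` is an illustrative name.

```python
from pathlib import Path
from typing import Any, Mapping


def profiles_dir_from_project(
    project_config: Mapping[str, Any], project_root: Path, default: Path
) -> Path:
    """Resolve a hypothetical `profiles-dir` setting relative to the project root."""
    # `profiles-dir` is an assumed key name, modeled on settings like seed-paths.
    setting = project_config.get("profiles-dir")
    if not setting:
        return default
    # Relative values are resolved against the project root; absolute values win.
    return project_root / Path(str(setting)).expanduser()
```

Note that `project_config` has to exist before this function can run, so the project directory must be resolved and its YAML parsed first.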

python-dotenv

Pros

  • Works well for loading environment variables for testing and unloading when finished
  • Can set the location of profiles.yml via the DBT_PROFILES_DIR environment variable

Cons

  • Only works in the context of an invocation of the test suite
  • Still requires manual file creation and editing

direnv

Pros

Cons

  • Requires additional installation steps -- not possible for a "batteries-included" dbt project
  • For security reasons, it requires running direnv allow the first time it is executed for a directory, and re-running it every time the .envrc file is updated

Docker

Pros

  • Docker containers can handle setting environment variables like DBT_PROFILES_DIR
  • They are isolated from each other

Cons

  • The end user needs to install Docker for their host operating system

I do think that analytics engineers should get comfortable with manually loading/unloading environment variables, using Docker images, and even loading/unloading environment variables with tools like direnv and python-dotenv. But it's preferable to minimize additional non-Python dependencies (to the extent possible).

@dbeatty10 dbeatty10 added enhancement New feature or request triage labels Jun 27, 2022
@github-actions github-actions bot changed the title Default to current working directory for profiles.yml [CT-777] Default to current working directory for profiles.yml Jun 27, 2022
@leahwicz
Contributor

@dbeatty10 this is a great write-up!!! Would you want to add this to your PR as a doc in our ADR directory? That way we can preserve this decision in the code. Also discussion can take place in the PR if anyone has questions on parts of this. I can help you if you have questions on the format but you are pretty close to having every section already taken care of here

@dbeatty10
Contributor Author

🎉 @leahwicz I took a swing at adding this to the ADR directory. See here.

@jtcohen6
Contributor

Thanks for the amazing and thorough write-up @dbeatty10, and for spearheading the team-wide conversation since opening. Excited for this to be the first of many such UX improvements :)

@joellabes
Contributor

A safety comment, only half-baked as I have only skimmed the backstory of this.

I love batteries included experiences for quickstarts. Personally, I then take whatever I was given and modify it to build what I'm actually trying to do.

I'm worried that by shipping something like the duck-db-in-a-minute project – which will hopefully be the first experience many people have with dbt – with a profiles.yml inside of a source-controlled folder, someone who takes it and reuses it for their cloud DWH connection might do the same thing. Suddenly we have people checking passwords into GitHub and the hackers are mining bitcoin inside of a JavaScript UDF on a 6XL Snowflake warehouse.

Should we consider an additional property in profiles.yml, along the lines of dangerouslySetInnerHTML?

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: alice
      password: <password>
      port: 5432
      dbname: jaffle_shop
      schema: dbt_alice
      threads: 4

      allow_unsafe_profile_location: true

Then when dbt is invoked, it raises an exception if that key isn't present when the profile comes from the current working directory?
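
A rough sketch of what that check might look like, purely to make the suggestion concrete. Nothing here exists in dbt: `allow_unsafe_profile_location` is the key proposed in this comment, and `UnsafeProfileLocationError` and `check_profile_location` are hypothetical names.

```python
from typing import Any, Mapping


class UnsafeProfileLocationError(Exception):
    """Raised when a profile found in the working directory has not opted in."""


def check_profile_location(target_config: Mapping[str, Any], found_in_cwd: bool) -> None:
    """Raise unless a profile discovered in the current working directory opts in.

    `found_in_cwd` should be True only when profiles.yml was picked up implicitly
    from the working directory, not via --profiles-dir or DBT_PROFILES_DIR.
    """
    opted_in = target_config.get("allow_unsafe_profile_location") is True
    if found_in_cwd and not opted_in:
        raise UnsafeProfileLocationError(
            "profiles.yml was loaded from the current working directory; "
            "set allow_unsafe_profile_location: true to confirm this is intended"
        )
```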

For backwards compatibility, we'd perhaps also have to treat use of --profiles-dir or DBT_PROFILES_DIR as allowed. I'm OK with that because it's not the sort of thing you do on your first invocation, so you'd hit the error if you didn’t know what you were doing.

@joellabes
Contributor

@jtcohen6 @dbeatty10 ☝️ in case your GH notifications only alert you for things you're tagged in.

(It is also ok if you were just ignoring me and hoping I'd go away 😉)

@dbeatty10
Contributor Author

TL;DR

🙏 Don't check secrets into version control

Current context

Currently, users can specify the location of a profiles.yml file via --profiles-dir or DBT_PROFILES_DIR. There are no limitations on the location of this file -- it can reside anywhere either inside or outside the dbt project directory. Like any other file that can be stored in version control, there are no restrictions on the contents of this file [1].

To separate configurable and/or secret content from the file itself, env_var and/or "secret" env vars come to the rescue.

Crucial skills

Crucial skills [2] every git committer should possess:

  • be cognizant of anything checked in and be okay with it being publicly viewable by the world [3]

e.g., recognize secrets and don't check them into your version control system.

This is relevant regardless of programming language or library. Conversely, if a user doesn't yet have security wherewithal, they will have problems with systems well beyond dbt [4].

Example

You provided a good example of why secret hygiene for git is the crucial move.

Imagine that the allow_unsafe_profile_location is added as you described with a simplified version of profiles.yml as:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      allow_unsafe_profile_location: false

To get the provided example to work, the user will need to update it to allow_unsafe_profile_location: true. After a few more Levenshteinian tweaks, we are right back to the initial situation you described, where the user has the ability to commit a file like this:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      password: MySecr3tPa$$word
      allow_unsafe_profile_location: true

Hints within profiles.yml

We could provide some helpful reminders of what to do (and not!) within any of our relevant tutorials by providing a profiles.yml similar to this one:

# You should __NEVER__ check credentials into version control. Thanks for reading :)
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      # NOTE: Use an environment variable rather than a hard-coded password
      # https://docs.getdbt.com/reference/dbt-jinja-functions/env_var
      # https://cgjennings.ca/articles/environment-variables/
      # password: "{{ env_var('DBT_ENV_SECRET_PASSWORD') }}"
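
Beyond comments in the file, a tutorial or CI job could also flag hard-coded values mechanically. A minimal sketch, assuming the profiles.yml has already been parsed into a dictionary; `find_hardcoded_secrets` and the `SENSITIVE_KEYS` list are hypothetical, and the substring test for `env_var(` is a heuristic, not a real Jinja parse.

```python
from typing import Any, Iterator, Mapping, Tuple

# Illustrative list of key names to treat as sensitive; an assumption, not exhaustive.
SENSITIVE_KEYS = {"password", "token", "private_key"}


def find_hardcoded_secrets(profiles: Mapping[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (profile, target, key) for sensitive values that do not use env_var()."""
    for profile_name, profile in profiles.items():
        outputs = profile.get("outputs", {}) if isinstance(profile, Mapping) else {}
        for target_name, target in outputs.items():
            for key, value in target.items():
                # Heuristic: a sensitive value is fine only if it is rendered from env_var().
                if key in SENSITIVE_KEYS and "env_var(" not in str(value):
                    yield profile_name, target_name, key
```

A check like this only catches the obvious cases; it does not replace the habit described in the next section.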

Ultimate solution that never goes out of style

Don't check secrets into version control 🧠


  1. git will happily track files containing your database password, SSN, or the keys to your cryptocurrency wallets.
  2. Same general category as:
    • look both ways before crossing the street
    • don't leave your keys in the ignition of your unlocked car
    • don't stick things in electrical outlets, etc
  3. Discussion of private repos is omitted for brevity.
  4. For example, AWS has the shared responsibility model that makes it clear that the customer is responsible for "Security in the Cloud."

@jaklan

jaklan commented Oct 15, 2022

Hi, I've just discovered that part of the feature wasn't actually implemented:

Search order for dbt_project.yml:

1. --project-dir option
2. DBT_PROJECT_DIR environment variable (NEW)
3. current working directory

and that's pretty disappointing, because we've been missing that env var for such a long time... Is there any chance of reviving that idea?

@mishamsk

Same here. Curious if there are any plans to support DBT_PROJECT_DIR anytime soon?

@jtcohen6
Contributor

jtcohen6 commented Oct 17, 2022

You're right! I think that may have been missed in the shuffle here. I'm going to open a new issue to reflect that change.

Update: #6078

6 participants