Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3296] [Feature] Option to enforce 2-argument ref() when referencing models from other packages/projects #8951

Open
2 tasks done
Tracked by #10125
jaklan opened this issue Oct 31, 2023 · 9 comments
Labels
enhancement New feature or request multi_project

Comments

@jaklan
Copy link

jaklan commented Oct 31, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

As stated in dbt docs here:

We especially recommend using two-argument ref to avoid ambiguity, in cases where a model name is duplicated across multiple projects or installed packages. If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if it finds none, it will raise an error.

The last sentence is incorrect though - if you use 1-argument ref and a model doesn't exist in the same package (namespace), but exists in a package installed as a dependency - no error is raised.

Expected Behavior

I expect the behaviour specified in docs, so 1-argument ref should only work with models from the same package (namespace), but not with models (even the public ones) from installed packages. Such models should be referenced exclusively with two-argument ref, otherwise - an error should be raised.

If it's not a bug, but a mistake in documentation and you don't want to enable such behaviour by default - there should be an option provided to enforce it.

Steps To Reproduce

  • create package_a
  • create model_a (public or protected) in package_a
  • create package_b
  • specify package_a as a dependency in package_b
  • create model_b in package_b as select * from {{ ref('model_a') }}
  • run dbt compile in package_b
  • compilation succeeds instead of raising an error

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.10
- dbt: both 1.6 & 1.7

Which database adapter are you using with dbt?

redshift

Additional Context

No response

@jaklan jaklan added bug Something isn't working triage labels Oct 31, 2023
@github-actions github-actions bot changed the title [Bug] dbt doesn't enforce 2-argument 'ref()' when using models from installed packages [CT-3296] [Bug] dbt doesn't enforce 2-argument 'ref()' when using models from installed packages Oct 31, 2023
@jaklan
Copy link
Author

jaklan commented Oct 31, 2023

cc: @jtcohen6, as we discussed package / project namespaces in the past

@dbeatty10 dbeatty10 changed the title [CT-3296] [Bug] dbt doesn't enforce 2-argument 'ref()' when using models from installed packages [CT-3296] [Bug] dbt doesn't enforce 2-argument ref() when using models from installed packages Nov 1, 2023
@dbeatty10 dbeatty10 self-assigned this Nov 1, 2023
@dbeatty10
Copy link
Contributor

dbeatty10 commented Nov 1, 2023

@jaklan would you mind double-checking this?

When I tried this with dbt-core 1.6.6, two-argument ref within package_b raised an error when restrict-access: True within package_a. Otherwise, it passed without an error. See below for details.

We should update the docs

We should update the docs for two-argument refto specifically call-out how restrict-access in the upstream project affects two-argument ref in the downstream project.

Project setup

Here's the files that I used:

Directory structure

package_a
├── dbt_project.yml
├── models
│   └── model_a.sql
└── profiles.yml
package_b
├── dbt_project.yml
├── dependencies.yml
├── models
│   └── model_b.sql
└── profiles.yml

Project A

package_a/dbt_project.yml

name: "package_a"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

restrict-access: True  # default is False

package_a/models/model_a.sql

select 1 as id

Project B

package_a/dbt_project.yml

name: "package_b"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

package_b/models/model_b.sql

select * from {{ ref('model_a') }}

package_b/dependencies.yml

packages:
  - local: ../package_a

Try to compile in Project B

cd package_b
dbt deps
dbt compile

Error log output

(postgres_1.6) $ dbt compile

21:01:15  Running with dbt=1.6.6
21:01:16  Registered adapter: postgres=1.6.6
21:01:16  Unable to do partial parsing because saved manifest not found. Starting full parse.
21:01:16  Encountered an error:
Parsing Error
  Node model.package_b.model_b attempted to reference node model.package_a.model_a, which is not allowed because the referenced node is protected to the 'package_a' package.

When I removed restrict-access: True from package_a/dbt_project.yml, then it compiled just fine though.

@jaklan
Copy link
Author

jaklan commented Nov 1, 2023

@dbeatty10 if I understand correctly, you haven't specified model access, so by default it's protected. So in your case it failed not because of dbt enforcing 2-argument ref(), but because of referring to protected model with restrict-access: True. Please try again following that step:

  • create public model model_a in package_a

Actually that part:

create package_a with restrict-access: True

should be replaced with:

create package_a

as restrict-access: True is not really relevant for the actual issue. Sorry for the confusion, I have just updated the original description - without that config you should be able to refer to both protected and public models from package_a with 1-argument ref() with no error.

@dbeatty10
Copy link
Contributor

Yeah, I see what you are saying!

Indeed, in contraction to the original issue in #7446 and the docs here, the following code in the reprex below works fine instead of erroring.

i.e. this says that it shouldn't:

If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if it finds none, it will raise an error.

And #7446 also says it shouldn't:

  1. model in root (running) project has a ref('model_a')
    • if it defines its own model_a node → model.root_project.model_a
    • otherwise: raise an error indicating an ambiguous ref

The first two scenarios already work! But the third is broken because dbt resolves “model_a” to an arbitrary project.

Reprex

Here's the files that I used to demonstrate the issue

Directory structure

package_a
├── dbt_project.yml
├── models
│   └── model_a.sql
└── profiles.yml
package_b
├── dbt_project.yml
├── dependencies.yml
├── models
│   └── model_b.sql
└── profiles.yml

Project A

package_a/dbt_project.yml

name: "package_a"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

package_a/models/model_a.sql

select 1 as id

Project B

package_a/dbt_project.yml

name: "package_b"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

package_b/dependencies.yml

packages:
  - local: ../package_a

package_b/models/model_b.sql

select * from {{ ref('model_a') }}

Compile in Project B

cd package_b
dbt deps
dbt compile --s model_b

Output

(postgres_1.6) $ dbt compile --select models/model_b.sql
00:00:15  Running with dbt=1.6.6
00:00:16  Registered adapter: postgres=1.6.6
00:00:16  Found 2 models, 0 sources, 0 exposures, 0 metrics, 352 macros, 0 groups, 0 semantic models
00:00:16  
00:00:16  Concurrency: 5 threads (target='postgres')
00:00:16  
00:00:16  Compiled node 'model_b' is:
select * from "postgres"."dbt_dbeatty"."model_a"

@dbeatty10 dbeatty10 removed the triage label Nov 2, 2023
@dbeatty10 dbeatty10 removed their assignment Nov 2, 2023
@jaklan
Copy link
Author

jaklan commented Nov 2, 2023

@dbeatty10 thank you for the answer! I have also done a few additional tests in various combinations, here are the outcomes:

Click to expand test scenarios

package_b installs package_a, package_c installs package_b

  1. package_a with model_a + package_b with model_a:

    • select * from {{ ref('model_a') }} works in package_b - model_a is taken from package_b (✅)
  2. package_a with model_a + package_b without model_a:

    • select * from {{ ref('model_a') }} works in package_b - model_a is taken from package_a (🚨)
  3. package_a with model_a + package_b with model_a + package_c with model_a:

    • select * from {{ ref('model_a') }} works in package_c, model_a is taken from package_c (✅)
  4. package_a with model_a + package_b with model_a + package_c without model_a:

    • select * from {{ ref('model_a') }} doesn't work in package_c (✅)

      Compilation Error in model model_c1 (models/model_c1.sql)
        When referencing 'model_a', dbt found nodes in multiple packages: 'model.project_b.model_a', 'model.project_a.model_a'
        To fix this, use two-argument 'ref', with the package name first: 'project_b' or 'project_a'
      
  5. package_a with model_a + package_b without model_a + package_c without model_a:

    • select * from {{ ref('model_a') }} works in package_c - model_a is taken from package_a (🚨)
  6. package_a without model_a + package_b with model_a + package_c without model_a:

    • select * from {{ ref('model_a') }} works in package_c - model_a is taken from package_b (🚨)
  7. package_a without model_a + package_b with model_a + package_c with model_a:

    • select * from {{ ref('model_a') }} works in package_c, model_a is taken from package_c (✅)
  8. package_a with model_a + package_b without model_a + package_c with model_a:

    • select * from {{ ref('model_a') }} works in package_c, model_a is taken from package_c (✅)

To sum up - an error is only raised in a situation when both conditions are met:

  • a) model_a is not found in the current project
  • b) model_a is found in more than one installed package

When model_a is found in one installed package exclusively then there's no ambiguity, so that's why dbt doesn't fail currently. But in my opinion - ambiguity shouldn't be the only reason to raise such error.

Why? Let's imagine the 6. scenario described above - you work on package_c without model_a, but such model is in package_b, so ref('model_a') works with no error. But after some time - maintainers of package_a decide to add model_a to their package. This way our scenario turns into the 4. one - so package_c starts to fail although it wasn't modified.

Another example - we again start with the 6. scenario, but this time model_a was added to package_c by another developer working on that project. This way we get into the 7. scenario. All models which were previously using ref('model_a') to refer to model_a from package_b - now quietly start to refer to the new model from package_c!

The only way to prevent such situations is to use 2-argument ref() from the very beginning. But when you can't enforce it - it's just a convention that developers may simply not follow, we have no real control of that. And that's why I find it really crucial that dbt allows to enforce explicit, 2-argument ref() when using models from installed packages - either as a default behaviour or as a config in package_c.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Nov 29, 2023

After reading this a few times, I've finally understood!

The docs here say:

We especially recommend using two-argument ref to avoid ambiguity, in cases where a model name is duplicated across multiple projects or installed packages. If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if it finds none, it will raise an error.

When perhaps we should more clearly say:

In cases where a model name is duplicated across multiple projects or installed packages, it is always recommended to use two-argument ref to avoid ambiguity. If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if dbt finds none in the same namespace, and multiple candidates in other namespaces, dbt will raise an error.

@jaklan It sounds like you want a compiler setting that is: "All references to models in other namespaces (projects/packages) must use two-argument ref." It would need to be opt-in, to avoid breaking existing projects. Would you expect to set this:

  1. As a config within the referenced project (similar to require-access; could be different for different projects/packages being referenced)
  2. As a config within the referencing project (could be different for different referencing projects)
  3. As a runtime config (applies to all projects/packages)

Personally, I'd vote for (3) — it's by far simplest.

I'm going to relabel this from bug to enhancement, and add it to my list of multi-project follow-ups (left over from #7979).

@jtcohen6 jtcohen6 added enhancement New feature or request multi_project and removed bug Something isn't working labels Nov 29, 2023
@jtcohen6 jtcohen6 changed the title [CT-3296] [Bug] dbt doesn't enforce 2-argument ref() when using models from installed packages [CT-3296] [Enhancement] Option to enforce 2-argument ref() when referencing models from other packages/projects Nov 29, 2023
@jtcohen6 jtcohen6 changed the title [CT-3296] [Enhancement] Option to enforce 2-argument ref() when referencing models from other packages/projects [CT-3296] [Feature] Option to enforce 2-argument ref() when referencing models from other packages/projects Nov 29, 2023
@jaklan
Copy link
Author

jaklan commented Nov 29, 2023

Hi @jtcohen6, thanks for the reply. Yes, we have two issues here:

  1. docs clarification
  2. option to enforce 2-arg ref()

Ad. 1: Looking at your proposal, I believe we can still improve it a bit.

If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if dbt finds none in the same namespace, and multiple candidates in other namespaces, dbt will raise an error.

That part imho doesn't cover the case when given model is not in the same namespace, but one candidate is in another namespace, so that's for example that scenario:

  1. package_a with model_a + package_b without model_a:
    • select * from {{ ref('model_a') }} works in package_b - model_a is taken from package_a

Maybe we should describe the resolution mechanism in a step-by-step form like:

  1. dbt looks for the model in the current namespace
  2. If it's found - the model is selected
  3. If it's not found, dbt looks for the model in another namespaces:
    3a. If dbt finds one candidate - that candidate is selected
    3b. If dbt finds multiple candidates - the error is raised

Ad. 2: Exactly, that's the option I'm looking for. And fully agree we can't introduce breaking changes. About the 3 proposed variants - I would definitely vote for the second approach, because defining that as a runtime config could lead to completely different behaviour of given project depending on local settings. Let's look at that scenario:

  1. package_a with model_a + package_b without model_a + package_c without model_a:
    • select * from {{ ref('model_a') }} works in package_c - model_a is taken from package_a

If you enable the new runtime config - that's gonna fail (as expected). But if some developers in the project don't do that - it would still work. So we end up in a situation when different developers can use different ref() syntax within the same project, which creates a lot of confusion and personally I don't see any added value here. That's why I think the goal of that setting should be to enforce 2-arg ref() at the project level, so all people working on given project have to use it.

@jtcohen6
Copy link
Contributor

That's why I think the goal of that setting should be to enforce 2-arg ref() at the project level, so all people working on given project have to use it.

I agree with this!

I should clarify what I meant in option (3) above: I don't think it should be possible to enforce this differently based on the configs of the referencing project. It should be the same for all models in an invocation. And it should be a config that's defined, in version control, in the project code, so that it does not differ for different users/environments.

@jaklan
Copy link
Author

jaklan commented Nov 29, 2023

@jtcohen6

And it should be a config that's defined, in version control, in the project code, so that it does not differ for different users/environments.

So we agree on that one 😄 I just still miss what's the difference between 2nd and 3rd variant then, could you provide some example?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request multi_project
Projects
None yet
Development

No branches or pull requests

3 participants