Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better node identification #3348

Closed
iknox-fa opened this issue May 12, 2021 · 6 comments · Fixed by #4898
Closed

Better node identification #3348

iknox-fa opened this issue May 12, 2021 · 6 comments · Fixed by #4898
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request
Milestone

Comments

@iknox-fa
Copy link
Contributor

iknox-fa commented May 12, 2021

Describe the feature

As proven by the necessity to add a hash to a test node's unique_id, the current model of node identification could use some tweaking.

This was discussed in previous ticket comments:
#3335 (comment)
#3335 (comment)

Describe alternatives you've considered

This isn't well fleshed out yet, but the ideas that come immediately to mind are:

  • re-work the existing human readable unique_id so the naming scheme is actually unique
  • replace the unique_id with a full hash and include enough tooling that we don't have to manage hashes by hand in tests, etc.

Who will this benefit?

This will make development simpler and eliminate the need for post-parsing checks like this.

Are you interested in contributing this feature?

image

@iknox-fa iknox-fa added enhancement New feature or request triage labels May 12, 2021
@jtcohen6 jtcohen6 added dbt tests Issues related to built-in dbt testing functionality and removed triage labels May 12, 2021
@jtcohen6
Copy link
Contributor

Thanks for this @iknox-fa! I'm tagging this tests because this is most relevant to schema/generic test nodes, for which the node name is auto-generated. (Every other node name is user-specified, often based on the file name.)

There was a great thread over in Slack just now about what might make sense from a user's point of view.

@iknox-fa
Copy link
Contributor Author

iknox-fa commented May 12, 2021

@jtcohen6 I see where you're coming from, but it's actually more about the unique_id than the name. For that reason I'm thinking we should keep all nodes in-play here.

Right now we have three+ ways to ID a node: unique_id unique_id + hash, and name. I'd love it if we could get that down to one human readable and unique identifier, but I rather doubt it will stay human readable* as we add more to the node content.

If that is the case, uniform logic to handle unique_ids as a hash and name as a human readable label seems like a likely solution and would include all nodes.

*unless you want a Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

@jtcohen6 jtcohen6 added the 1.0.0 Issues related to the 1.0.0 release of dbt label Jun 24, 2021
@leahwicz
Copy link
Contributor

leahwicz commented Jul 8, 2021

  • Human readable vs hash?
  • If we do a hash we should be able to translate it back to the original name
  • We need a design discussion on this

@leahwicz
Copy link
Contributor

leahwicz commented Jul 8, 2021

Let's do an ADR on this first and then scope: #3548

Also we need to talk to the metadata team to see if this would impact them

@nathaniel-may
Copy link
Contributor

nathaniel-may commented Feb 7, 2022

This came up again while triaging #4684 and while doing some research for making a new ticket, I found this one. The rest of this comment is a proposal for an alternative identification strategy as it pertains specifically to tests:

Goals

  • Tests need to be uniquely identifiable so they can all be run with their results mapped back to the test definition.
  • Tests should be able to be tracked across runs. Users need a way to identify which test from a current run maps to a specific test from a previous run.

Today

The way we do test identification today involves a "name mangling" strategy which is defined in core/dbt/parser/generic_test_builders.py::get_nice_generic_test_name. This strategy produces a string with the model name, test type, test name, test arguments, and a hash of the previous values to disambiguate different tests that may have elements that contain the delimiter in them.

Implications:

  • Tests on the same model with the same arguments but with different configs are identified as the same test. Tests where the only difference is their configs cannot be run together.
  • Sometimes arguments can be large such as with the accepted_values test which makes test identifiers to be an unmanageable length.

Proposed Change

Computing function equality is undecidable in the general case, and even though we have a smaller subset of functions to work with when it comes to dbt tests, mechanically deciding which tests are equivalent seems like the wrong approach. Instead, I propose allowing users to name tests if they would like to track tests between runs. For example:

tests:
    - accepted_values:
           name: my_col_value_test
           values: [1, 2, 3, 4]
           quote: false

I propose that this test name must be unique across each model which we can think of as namespaces for test definitions. This means our name mangling strategy can be simplified to a pairing of "{model_name}_{test_name}". If two tests have the same name within the same model, an exception must be raised to identify the name collision and require the user to resolve the conflict before running tests.

Since requiring the name field on all tests would be a difficult breaking change for large projects with many tests, we will need a mechanically identified name that can be deterministically built. I propose pulling from the previous solution and using the model, test type, and values to identify the test. However, to keep the identifier a manageable size even when there are many arguments the name mangling strategy can be reduced to "{model_name}_{fixed_length_hash([test_type, ...arguments])}".

Additionally, to encourage the best practice of naming your test, in a future major version of dbt we could issue a warning for each test definition which does not have an explicit name configured.

Additional Context

@jtcohen6
Copy link
Contributor

@nathaniel-may Thank you again for this excellent comment. While thinking about this on the train, I realized that the code to fix this one is super straightforward and self-contained, so I gave it a go in #4898

@jtcohen6 jtcohen6 added this to the v1.1 milestone Mar 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants