Better node identification #3348

iknox-fa · 2021-05-12T19:18:38Z

Describe the feature

As proven by the necessity to add a hash to a test node's unique_id, the current model of node identification could use some tweaking.

This was discussed in previous ticket comments:
#3335 (comment)
#3335 (comment)

Describe alternatives you've considered

This isn't well fleshed out yet, but the ideas that come immediately to mind are:

re-work the existing human readable unique_id so the naming scheme is actually unique
replace the unique_id with a full hash and include enough tooling that we don't have to manage hashes by hand in tests, etc.

Who will this benefit?

This will make development simpler and eliminate the need for post-parsing checks like this.

Are you interested in contributing this feature?


jtcohen6 · 2021-05-12T19:32:54Z

Thanks for this @iknox-fa! I'm tagging this with the tests label because it is most relevant to schema/generic test nodes, for which the node name is auto-generated. (Every other node name is user-specified, often based on the file name.)

There was a great thread over in Slack just now about what might make sense from a user's point of view.

iknox-fa · 2021-05-12T19:46:05Z

@jtcohen6 I see where you're coming from, but it's actually more about the unique_id than the name. For that reason I'm thinking we should keep all nodes in-play here.

Right now we have three+ ways to ID a node: unique_id, unique_id + hash, and name. I'd love it if we could get that down to one human-readable and unique identifier, but I rather doubt it will stay human-readable* as we add more to the node content.

If that is the case, uniform logic to handle unique_ids as a hash and name as a human readable label seems like a likely solution and would include all nodes.

*unless you want a Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

leahwicz · 2021-07-08T17:40:25Z

Human readable vs hash?
If we do a hash we should be able to translate it back to the original name
We need a design discussion on this

leahwicz · 2021-07-08T17:42:54Z

Let's do an ADR on this first and then scope: #3548

Also we need to talk to the metadata team to see if this would impact them

nathaniel-may · 2022-02-07T19:32:09Z

This came up again while triaging #4684 and while doing some research for making a new ticket, I found this one. The rest of this comment is a proposal for an alternative identification strategy as it pertains specifically to tests:

Goals

Tests need to be uniquely identifiable so they can all be run with their results mapped back to the test definition.
Tests should be able to be tracked across runs. Users need a way to identify which test from a current run maps to a specific test from a previous run.

Today

The way we do test identification today involves a "name mangling" strategy which is defined in core/dbt/parser/generic_test_builders.py::get_nice_generic_test_name. This strategy produces a string with the model name, test type, test name, test arguments, and a hash of the previous values to disambiguate different tests that may have elements that contain the delimiter in them.

Implications:

Tests on the same model with the same arguments but with different configs are identified as the same test. Tests where the only difference is their configs cannot be run together.
Sometimes arguments can be large, such as with the accepted_values test, which makes test identifiers an unmanageable length.
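The first implication can be illustrated with a simplified stand-in for the current scheme. This is only a sketch and not the actual get_nice_generic_test_name: the function sketch_test_id, the model name, and the argument values are all hypothetical. It assumes only what is described above, namely that the identifier is derived from the model name, test type, and arguments, and not from the test's configs.

```python
import hashlib
import json


def sketch_test_id(model_name, test_type, arguments, config=None):
    """Hypothetical stand-in for the current scheme.

    The identifier is built from the model name, test type and arguments.
    `config` is accepted but deliberately ignored, which is the behavior
    the first implication describes.
    """
    parts = [test_type, model_name]
    parts += [f"{key}_{value}" for key, value in sorted(arguments.items())]
    readable = "_".join(parts)
    # Short hash of the same parts, appended for disambiguation.
    digest = hashlib.md5(json.dumps(parts).encode("utf-8")).hexdigest()[:10]
    return f"{readable}.{digest}"


args = {"column_name": "status", "values": ["a", "b"]}

# Two test definitions that differ only in their configs.
id_error = sketch_test_id("orders", "accepted_values", args, config={"severity": "error"})
id_warn = sketch_test_id("orders", "accepted_values", args, config={"severity": "warn"})

# Both definitions end up with the same identifier, so they collide.
print(id_error == id_warn)
```

Under this assumption, no amount of hashing helps, because the configs never enter the hashed values in the first place.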

Proposed Change

Computing function equality is undecidable in the general case, and even though we have a smaller subset of functions to work with when it comes to dbt tests, mechanically deciding which tests are equivalent seems like the wrong approach. Instead, I propose allowing users to name tests if they would like to track tests between runs. For example:

tests:
  - accepted_values:
      name: my_col_value_test
      values: [1, 2, 3, 4]
      quote: false

I propose that this test name must be unique within each model, so that each model acts as a namespace for its test definitions. This means our name mangling strategy can be simplified to a pairing of "{model_name}_{test_name}". If two tests have the same name within the same model, an exception must be raised to identify the name collision and require the user to resolve the conflict before running tests.
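A minimal sketch of that rule, assuming the test names for one model have already been collected into a list. The names DuplicateTestNameError and build_named_test_ids are hypothetical and are not part of dbt.

```python
from collections import Counter


class DuplicateTestNameError(Exception):
    """Hypothetical exception for two tests sharing a name within one model."""


def build_named_test_ids(model_name, test_names):
    """Return "{model_name}_{test_name}" for each test, or raise on a collision."""
    counts = Counter(test_names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise DuplicateTestNameError(
            f"Model '{model_name}' defines duplicate test names: {', '.join(duplicates)}"
        )
    return [f"{model_name}_{name}" for name in test_names]


ids = build_named_test_ids("orders", ["my_col_value_test", "id_not_null"])
print(ids)
```

Because the check is scoped to a single model, the same test name can be reused on a different model without raising.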

Since requiring the name field on all tests would be a difficult breaking change for large projects with many tests, we will need a mechanically identified name that can be deterministically built. I propose pulling from the previous solution and using the model, test type, and values to identify the test. However, to keep the identifier a manageable size even when there are many arguments, the name mangling strategy can be reduced to "{model_name}_{fixed_length_hash([test_type, ...arguments])}".
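One way the fallback could look, as a sketch only. The function name fallback_test_id, the choice of SHA-256, the JSON serialization, and the 10-character hash length are all assumptions made for illustration; the proposal itself does not specify them.

```python
import hashlib
import json


def fallback_test_id(model_name, test_type, arguments, hash_length=10):
    """Build "{model_name}_{fixed_length_hash([test_type, ...arguments])}".

    The arguments are serialized with sorted keys so the result does not
    depend on the order in which the arguments were written.
    """
    payload = json.dumps([test_type, arguments], sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:hash_length]
    return f"{model_name}_{digest}"


short_id = fallback_test_id("orders", "accepted_values", {"values": [1, 2, 3, 4]})
long_id = fallback_test_id("orders", "accepted_values", {"values": list(range(10_000))})

# The identifier length depends on the model name only, not on the arguments.
print(len(short_id), len(long_id))
```

A truncated hash trades readability and a small collision risk for a bounded length, which is the part of the proposal aimed at the "File name too long" problem.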

Additionally, to encourage the best practice of naming your test, in a future major version of dbt we could issue a warning for each test definition which does not have an explicit name configured.

Additional Context

There is an existing alias configuration, but it's currently only used when storing test results in the warehouse. See: dbt 0.20.0 tries to create two tables with the same name when --store-failures is used with similar-looking expression_is_true tests #3626
Long names can sometimes make writing files impossible on some file systems. See: [CT-166] [Bug] accepted_values test causing OSError: File name too long #4684

jtcohen6 · 2022-03-18T07:58:35Z

@nathaniel-may Thank you again for this excellent comment. While thinking about this on the train, I realized that the code to fix this one is super straightforward and self-contained, so I gave it a go in #4898

iknox-fa added enhancement New feature or request triage labels May 12, 2021

iknox-fa mentioned this issue May 12, 2021

Feature/schema tests are more unique #3335

Merged

4 tasks

jtcohen6 added dbt tests Issues related to built-in dbt testing functionality and removed triage labels May 12, 2021

jtcohen6 mentioned this issue Jun 24, 2021

[spike] Allow generic data tests to be documented #2578

Open

jtcohen6 added the 1.0.0 Issues related to the 1.0.0 release of dbt label Jun 24, 2021

leahwicz mentioned this issue Jun 29, 2021

Detail and scope 1.0.0 issues #3370

Closed

18 tasks

jtcohen6 mentioned this issue Jul 26, 2021

dbt 0.20.0 tries to create two tables with the same name when --store-failures is used with similar-looking expression_is_true tests #3626

Closed

5 tasks

jtcohen6 mentioned this issue Oct 21, 2021

[Feature] Allow multiple unique tests for the same column #4102

Closed

1 task

jtcohen6 removed the 1.0.0 Issues related to the 1.0.0 release of dbt label Nov 11, 2021

nathaniel-may mentioned this issue Feb 7, 2022

[CT-166] [Bug] accepted_values test causing OSError: File name too long #4684

Closed

1 task

jtcohen6 mentioned this issue Mar 18, 2022

Custom names for generic tests #4898

Merged

7 tasks

jtcohen6 added this to the v1.1 milestone Mar 25, 2022

jtcohen6 added the Team:Language label Mar 25, 2022

jtcohen6 closed this as completed in #4898 Mar 25, 2022
