[CT-1923] [Spike] Enforce model contracts for incremental materializations #6755

MichelleArk · 2023-01-26T16:20:18Z

When contract is true, model contracts are enforced prior to table creation for table materializations.

Similarly, Incremental models would need to enforce contracts as part of on_schema_change handling, and before the existing table is modified.


jtcohen6 · 2023-02-01T10:09:00Z

@MichelleArk Shall we open a separate issue for enforcing model contracts on view materializations? (Thoughts below, which either should or shouldn't land in that new issue.)

Views would only support dbt's "preflight" checks—column names & types—since data platforms don't actually enforce not null/nullable, check, or any other content-related constraints at view creation. I still think there's value in supporting the more limited contracts on views, too, and documenting the differences.

A thing worth considering, when we get to model versions: If you switch a contracted model's materialization from table to view, and it has constraints that used to be enforced at table creation time, but can no longer be enforced at view creation time—is that a breaking change to the contract, requiring a version bump?

sungchun12 · 2023-02-01T15:48:29Z

+1 for version bump

A thing worth considering, when we get to model versions: If you switch a contracted model's materialization from table to view, and it has constraints that used to be enforced at table creation time, but can no longer be enforced at view creation time—is that a breaking change to the contract, requiring a version bump?

jtcohen6 · 2023-02-15T12:38:34Z

Also: Python models. This issue or a separate issue? We either get this working, for some/all materializations, or we leave this validation error in place:

dbt-core/core/dbt/parser/schemas.py

Lines 972 to 979 in b5b1699

    
    def constraints_language_validator(self, patched_node):
        language_error = {}
        language = str(patched_node.language)
        if language != "sql":
            language_error = {"language": language}
        language_error_msg = f"\n    Language Error: {language_error}"
        language_error_msg_payload = f"{language_error_msg if language_error else None}"
        return language_error_msg_payload

some/all materializations

Python models don't currently support view.
In subsequent runs of incremental models, this is quite simple, since we're upserting/merging into an existing table.
For table and first run of an incremental model, however, we'd need some way to check the schema of the returned dataframe before overwriting. Simplest (and slowest) is to always save the data into a temp table first. Then if the schema matches, create or replace.
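
The "stage first, then replace only if the schema matches" idea can be sketched in plain Python. This is a minimal illustration, not dbt-core code: `schema_mismatches`, `write_with_contract`, and the `replace_target` callback are hypothetical names, and schemas are modelled as simple name-to-type dictionaries rather than real dataframe or warehouse metadata.

```python
from typing import Callable, Dict, List


def schema_mismatches(contract: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return human-readable differences between contracted and actual columns."""
    problems = []
    for name, expected_type in contract.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name].lower() != expected_type.lower():
            problems.append(
                f"type mismatch for {name}: expected {expected_type}, got {actual[name]}"
            )
    for name in actual:
        if name not in contract:
            problems.append(f"unexpected column: {name}")
    return problems


def write_with_contract(
    contract: Dict[str, str],
    staged_schema: Dict[str, str],
    replace_target: Callable[[], None],
) -> None:
    """Replace the target only if the staged (temp table) schema matches the contract."""
    problems = schema_mismatches(contract, staged_schema)
    if problems:
        # The existing target is left untouched when the contract is not met.
        raise ValueError("contract not satisfied: " + "; ".join(problems))
    replace_target()
```

The cost of this approach is the extra write to the temp table on every run, which is why it is the simplest but slowest option.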

MichelleArk · 2023-02-27T15:48:57Z

View issue: #7034
Python issue (spike): #6984

MichelleArk · 2023-02-27T15:50:56Z

Thinking through this more, incremental materializations might "just work" already because both the initial table creation and temp table creation for updates go through the create_table_as macros, which have contract enforcement + constraint ddl generation. This may be a matter of adding tests (for a matrix of incremental strategies and on_schema_change values if possible). Will spike to confirm!
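
A rough sketch of how such a test matrix could be enumerated. The `on_schema_change` values are the four accepted by the incremental materialization; the strategy names are an assumption here, since the set of supported strategies differs per adapter, and `contract_test_matrix` is a hypothetical helper rather than anything in dbt's test suite.

```python
from itertools import product
from typing import Dict, List, Sequence

# Assumption: a representative set of strategy names; each adapter supports its own subset.
INCREMENTAL_STRATEGIES = ("append", "merge", "delete+insert", "insert_overwrite")
ON_SCHEMA_CHANGE_VALUES = ("ignore", "fail", "append_new_columns", "sync_all_columns")


def contract_test_matrix(
    strategies: Sequence[str] = INCREMENTAL_STRATEGIES,
    schema_change_values: Sequence[str] = ON_SCHEMA_CHANGE_VALUES,
) -> List[Dict[str, str]]:
    """Build one model config per (strategy, on_schema_change) combination."""
    return [
        {"incremental_strategy": strategy, "on_schema_change": value}
        for strategy, value in product(strategies, schema_change_values)
    ]


if __name__ == "__main__":
    for case in contract_test_matrix():
        print(case)
```

Each dictionary could then be fed to a parametrized test as the config of a contracted incremental model.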

jtcohen6 · 2023-03-01T14:44:51Z

@MichelleArk From our conversation on Monday: This mostly "just works"!

One gotcha: We should enforce that all incremental models with contract: true also turn on on_schema_change: "append_new_columns" (docs). Why?

Imagine:

- You add a new column to both the SQL and the yaml spec
- You don't set on_schema_change, or you set on_schema_change: 'ignore'
- dbt doesn't actually add that new column to the existing table, and the upsert/merge still succeeds, because it does that upsert/merge on the basis of the already-existing "destination" columns only (this is long-established behavior)
- The result is a delta between the yaml-defined contract and the actual table in the database, which feels like a big no-no for contracted models!
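
The scenario above can be reproduced with a toy model of the destination table's columns. This is only an illustration of the drift, under the assumption that 'ignore' leaves the destination columns untouched and 'append_new_columns' adds any missing ones; `columns_after_run` and `contract_drift` are hypothetical helpers, not dbt macros.

```python
from typing import List


def columns_after_run(
    existing: List[str], model: List[str], on_schema_change: str = "ignore"
) -> List[str]:
    """Columns of the destination table after one incremental run (toy model)."""
    if on_schema_change == "append_new_columns":
        # New columns are added to the destination; nothing is dropped.
        return existing + [c for c in model if c not in existing]
    if on_schema_change == "ignore":
        # The merge only touches columns that already exist in the destination.
        return list(existing)
    raise NotImplementedError(on_schema_change)


def contract_drift(contract: List[str], table: List[str]) -> List[str]:
    """Contracted columns that are absent from the actual table."""
    return [c for c in contract if c not in table]


existing = ["id", "updated_at"]
contract = ["id", "updated_at", "status"]  # "status" was added to the SQL and the yaml

ignored = columns_after_run(existing, contract, "ignore")
appended = columns_after_run(existing, contract, "append_new_columns")

print(contract_drift(contract, ignored))   # ['status']
print(contract_drift(contract, appended))  # []
```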

Why append_new_columns, rather than sync_all_columns? Because removing existing columns is a brrrrrrreaking change for contracted models! We'll aim to catch that during node selection (#7065), but why even let people define it as an intended behavior in the first place?

Upside of this approach: Easy to reason about, document, explain.

Downsides of this approach:

- Some adapters may not support on_schema_change, or the support varies (e.g. dbt-spark supports this for some file formats, not others)
- Harder to start contracting & sharing existing incremental models, if they don't have this behavior defined. But users could still wrap an incremental model in a contracted & public view (knowing that it can't make use of the platform's support for constraints).

MichelleArk · 2023-03-01T16:40:13Z

We should enforce that all incremental models with contract: true also turn on on_schema_change: "append_new_columns"

Makes sense 👍

Some adapters may not support on_schema_change, or the support varies (e.g. dbt-spark supports this for some file formats, not others)

We should call this out in the migration guide for adapter maintainers. What should the expected behaviour be if on_schema_change is not supported and a model has contract: true? I'm thinking that if the materialization can't guarantee that the materialized dataset will have the same columns as defined by the SQL for a model with contract: true, an error should be raised.

Does the variability in adapter support necessarily mean we need to implement this validation in the jinja, so that adapters can overwrite the behaviour (as opposed to earlier on during parsing, perhaps in the NodeConfig class itself)?

MichelleArk · 2023-03-01T18:34:03Z

I also played around with an alternative approach to assert correct model contracts after process_schema_changes in the incremental materialization. We'd need to defer model contract validation in the create table statements to avoid running it twice, and call the assertion once we've got dest_columns available:

{% if config.get('contract', False) %}
  {{ get_assert_columns_equivalent(get_select_cols_from_relation_query(temp_relation, dest_columns)) }}
{% endif %}

{% macro get_select_cols_from_relation_query(temp_relation, dest_columns) %}

    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
    select {{ dest_cols_csv }}
    from {{ temp_relation }}

{% endmacro %}

This feels possible but ultimately I prefer the configuration enforcement approach you've outlined @jtcohen6 for a couple reasons:

- Validating model contracts during the incremental materialization still has the downside of adapter-specific support for on_schema_change.
- The end result for a user is the same: they see an error because the model contract can't be satisfied, and the resolution should be to set on_schema_change to append_new_columns.
- Raising an error on invalid configuration (contract: true + on_schema_change != append_new_columns) detects the problem earlier in the development lifecycle: at modelling / contracting time, rather than down the line, when the model is actually being modified and the problem is more difficult to resolve.

jtcohen6 · 2023-03-01T20:24:24Z

What should the expected behaviour be if on_schema_change is not supported and a model has contract: true?

Hmm, not totally sure. I guess I'd hope/expect the adapter to raise an exception or warning (probably in Jinja, within the materialization) if it doesn't support on_schema_change, and an on_schema_change value is set.

We could:

- Enable defining custom versions of the incremental_validate_on_schema_change macro, by adding some dispatch logic so it can be adapter-specific
- Change that from a macro to a Python adapter method, similar to what we did with valid_incremental_strategies

dbt-core/core/dbt/include/global_project/macros/materializations/models/incremental/on_schema_change.sql

Line 3 in 7efb6ab

    
    {% if on_schema_change not in ['sync_all_columns', 'append_new_columns', 'fail', 'ignore'] %}

Does the variability in adapter support necessarily mean we need to implement this validation in the jinja, so that adapters can overwrite the behaviour (as opposed to earlier on during parsing, perhaps in the NodeConfig class itself)?

I think we could enforce the validation during parsing, if the only thing we're validating is:

if (contract == "true" and materialized == "incremental") and on_schema_change != "append_new_columns" then ValidationError

One callout is that the valid options of on_schema_change are just defined in the Jinja macro above, rather than a Python adapter method or in our dataclass validation logic (StrEnum).
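
A minimal sketch of what that parse-time check could look like if the valid options lived in a Python enum. Everything here is hypothetical: `OnSchemaChange`, `ContractConfigError`, and `validate_incremental_contract` are illustrative names, not existing dbt-core classes, and the 'ignore' default is an assumption based on the behavior described earlier in this thread.

```python
from enum import Enum


class OnSchemaChange(str, Enum):
    """The four values accepted by the incremental materialization."""

    IGNORE = "ignore"
    FAIL = "fail"
    APPEND_NEW_COLUMNS = "append_new_columns"
    SYNC_ALL_COLUMNS = "sync_all_columns"


class ContractConfigError(ValueError):
    """Raised when a contracted model has an incompatible configuration."""


def validate_incremental_contract(
    contract: bool, materialized: str, on_schema_change: str = "ignore"
) -> None:
    # The enum lookup rejects any value outside the four listed options.
    try:
        mode = OnSchemaChange(on_schema_change)
    except ValueError:
        raise ContractConfigError(
            f"invalid on_schema_change: {on_schema_change!r}"
        ) from None

    if (
        contract
        and materialized == "incremental"
        and mode is not OnSchemaChange.APPEND_NEW_COLUMNS
    ):
        raise ContractConfigError(
            "incremental models with contract: true must set "
            "on_schema_change: 'append_new_columns' "
            f"(got {mode.value!r})"
        )
```

Because this runs on config values alone, it would not account for adapters that lack on_schema_change support; that case would still need handling in the adapter or the materialization.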

MichelleArk · 2023-03-10T20:41:53Z

Closing in favor of implementation issue: #7154

MichelleArk added Team:Language multi_project labels Jan 26, 2023

github-actions bot changed the title ~~Enforce model contracts for incremental materializations~~ [CT-1923] Enforce model contracts for incremental materializations Jan 26, 2023

MichelleArk mentioned this issue Jan 26, 2023

[CT-1915] [Epic] Multi-project collaboration - Milestone 1 #6747

Closed

jtcohen6 mentioned this issue Feb 1, 2023

dbt Constraints / model contracts dbt-labs/dbt-spark#574

Merged

6 tasks

MichelleArk changed the title ~~[CT-1923] Enforce model contracts for incremental materializations~~ [CT-1923] [Spike] Enforce model contracts for incremental materializations Feb 27, 2023

MichelleArk self-assigned this Feb 28, 2023

MichelleArk added the spike label Mar 3, 2023

emmyoop mentioned this issue Mar 8, 2023

support contracts on models materialized as view dbt-labs/dbt-spark#670

Merged

6 tasks

MichelleArk mentioned this issue Mar 10, 2023

[CT-2293] Enforce model contracts for incremental materializations #7154

Closed

MichelleArk closed this as completed Mar 10, 2023
