Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1923] [Spike] Enforce model contracts for incremental materializations #6755

Closed
MichelleArk opened this issue Jan 26, 2023 · 10 comments
Closed

Comments

@MichelleArk
Copy link
Contributor

MichelleArk commented Jan 26, 2023

When contract is true, model contracts are enforced prior to table creation for table materializations.

Similarly, Incremental models would need to enforce contracts as part of on_schema_change handling, and before the existing table is modified.

@github-actions github-actions bot changed the title Enforce model contracts for incremental materializations [CT-1923] Enforce model contracts for incremental materializations Jan 26, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 1, 2023

@MichelleArk Shall we open a separate issue for enforcing model contracts on view materializations? (Thoughts below, which either should or shouldn't land in that new issue.)

Views would only support dbt's "preflight" checks—column names & types—since data platforms don't actually enforce not null/nullable, check, or any other content-related constraints at view creation. I still think there's value in supporting the more limited contracts on views, too, and documenting the differences.

A thing worth considering, when we get to model versions: If you switch a contracted model's materialization from table to view, and it has constraints that used to be enforced at table creation time, but can no longer be enforced at view creation time—is that a breaking change to the contract, requiring a version bump?

@sungchun12
Copy link
Contributor

+1 for version bump

A thing worth considering, when we get to model versions: If you switch a contracted model's materialization from table to view, and it has constraints that used to be enforced at table creation time, but can no longer be enforced at view creation time—is that a breaking change to the contract, requiring a version bump?

@jtcohen6
Copy link
Contributor

Also: Python models. This issue or a separate issue? We either get this working, for some/all materializations, or we leave this validation error in place:

def constraints_language_validator(self, patched_node):
language_error = {}
language = str(patched_node.language)
if language != "sql":
language_error = {"language": language}
language_error_msg = f"\n Language Error: {language_error}"
language_error_msg_payload = f"{language_error_msg if language_error else None}"
return language_error_msg_payload

some/all materializations

  • Python models don't currently support view.
  • In subsequent runs of incremental models, this is quite simple, since we're upserting/merging into an existing table.
  • For table and first run of an incremental model, however, we'd need some way to check the schema of the returned dataframe before overwriting. Simplest (and slowest) is to always save the data into a temp table first. Then if the schema matches, create or replace.

@MichelleArk
Copy link
Contributor Author

View issue: #7034
Python issue (spike): #6984

@MichelleArk
Copy link
Contributor Author

MichelleArk commented Feb 27, 2023

Thinking through this more, incremental materializations might "just work" already because both the initial table creation and temp table creation for updates go through the create_table_as macros, which have contract enforcement + constraint ddl generation. This may be a matter of adding tests (for a matrix of incremental strategies and on_schema_change values if possible). Will spike to confirm!

@MichelleArk MichelleArk changed the title [CT-1923] Enforce model contracts for incremental materializations [CT-1923] [Spike] Enforce model contracts for incremental materializations Feb 27, 2023
@MichelleArk MichelleArk self-assigned this Feb 28, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 1, 2023

@MichelleArk From our conversation on Monday: This mostly "just works"!

One gotcha: We should enforce that all incremental models with contract: true also turn on on_schema_change: "append_new_columns" (docs). Why?

Imagine:

  • You add a new column to both the SQL and the yaml spec
  • You don't set on_schema_change, or you set on_schema_change: 'ignore'
  • dbt doesn't actually add that new column to the existing table — and the upsert/merge still succeeds, because it does that upsert/merge on the basis of the already-existing "destination" columns only (this is long-established behavior)
  • The result is a delta between the yaml-defined contract, and the actual table in the database — which feels like a big no-no for contracted models!

Why append_new_columns, rather than sync_all_columns? Because removing existing columns is a brrrrrrreaking change for contracted models! We'll aim to catch that during node selection (#7065), but why even let people define it as an intended behavior in the first place?

Upside of this approach: Easy to reason about, document, explain.

Downsides of this approach:

  • Some adapters may not support on_schema_change, or the support varies (e.g. dbt-spark supports this for some file formats, not others)
  • Harder to start contracting & sharing existing incremental models, if they don't have this behavior defined. But users could still wrap an incremental model in a contracted & public view (knowing that it can't make use of the platform's support for constraints).

@MichelleArk
Copy link
Contributor Author

We should enforce that all incremental models with contract: true also turn on on_schema_change: "append_new_columns"

Makes sense 👍

Some adapters may not support on_schema_change, or the support varies (e.g. dbt-spark supports this for some file formats, not others)

We should call this out in the migration guide for adapter maintainers. What should the expected behaviour be if on_schema_change is not supported and a model has contract: true? I'm thinking that if the materialization can't guarantee that the materialized dataset will have the same columns as defined by the SQL for a model with contract: true, an error should be raised.

Does the variability in adapter support necessarily mean we need to implement this validation in the jinja, so that adapters can overwrite the behaviour (as opposed to earlier on during parsing, perhaps in the NodeConfig class itself)?

@MichelleArk
Copy link
Contributor Author

I also played around with an alternative approach to assert correct model contracts after process_schema_changes in the incremental materialization. We'd need to defer model contract validation in the create table statements to avoid running it twice, and call the assertion once we've got dest_columns available:

{% if config.get('contract', False) %}
      {{ get_assert_columns_equivalent(get_select_cols_from_relation_query(temp_relation, dest_columns)) }}
 {% endif %}
{% macro get_select_cols_from_relation_query(temp_relation, dest_columns) %}

    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
    select {{ dest_cols_csv }}
    from {{ temp_relation }}

{% endmacro %}

This feels possible but ultimately I prefer the configuration enforcement approach you've outlined @jtcohen6 for a couple reasons:

  1. Validating model contracts during the incremental materialization still has the downside of adapter-specific support for on_schema_change.
  2. The end result for a user is the same - they see an error because the model contract can't be satisfied, and the resolution should be to set on_schema_change to append_new_columns.
    • Raising an error on invalid configuration (contract: true + on_schema_change != append_new_columns) detects it earlier in the modelling development lifecycle - during model / contracting time, as opposed to down the line when the model is actually being modified and is more difficult to resolve.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 1, 2023

What should the expected behaviour be if on_schema_change is not supported and a model has contract: true?

Hmm, not totally sure. I guess I'd hope/expect the adapter to raise an exception or warning (probably in Jinja, within the materialization) if if doesn't support on_schema_change, and an on_schema_change value is set.

We could:

{% if on_schema_change not in ['sync_all_columns', 'append_new_columns', 'fail', 'ignore'] %}

Does the variability in adapter support necessarily mean we need to implement this validation in the jinja, so that adapters can overwrite the behaviour (as opposed to earlier on during parsing, perhaps in the NodeConfig class itself)?

I think we could enforce the validation during parsing, if the only thing we're validating is:

if (contract == "true" and materialized == "incremental") and on_schema_change != "append_new_columns" then ValidationError

One callout is that the valid options of on_schema_change are just defined in the Jinja macro above, rather than a Python adapter method or in our dataclass validation logic (StrEnum).

@MichelleArk
Copy link
Contributor Author

Closing in favor of implementation issue: #7154

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants