[CT-2460] [Feature] Infer schema from prod, to enforce contract & detect breaking changes in dev #7432

Tracked by #7979 ...
jtcohen6 opened this issue Apr 21, 2023 · 15 comments · May be fixed by #8339


jtcohen6 commented Apr 21, 2023

In v1.5, we added support for:

  • Declaring model contracts in yaml (which can require a lot of yaml!)
  • Enforcing that contract at build time, and raising a nice error message if there's a mismatch

What if we could do that without requiring all that yaml to be defined upfront? We would infer the contract from the current version of this model in production. In that way, we also marry together the contract’s enforcement and its breaking-change detection.

This would require a mechanism similar to what's proposed in #7256 / #7258. If provided with a manifest for deferral / state comparison, we should store a state_relation associated with the prod counterpart of each node, in the current manifest.

During the model’s materialization:

  1. Dynamically access the columns of the “stateful” (prod) relation
  2. Use those columns, instead of user-defined config (yaml), as the basis for comparison. Crucially, we should allow column additions, but raise a ContractBreakingChangeError if an existing column is renamed/removed/retyped.
models:
  - name: my_model_with_dynamic_contract
    contract:
      enforced: true
      inferred: true

Pseudo logic:

{% if state_relation and contract.inferred %}
    {# Schema of the model as currently defined in dev, from its compiled SQL #}
    {% set dev_columns = get_column_schema_from_query(sql) %}
    {# Schema of the prod counterpart, via an "empty" introspective query against state_relation #}
    {% set prod_columns = get_column_schema_from_query(get_empty_subquery_sql("select * from " ~ state_relation)) %}
    {% set dev_formatted = format_columns(dev_columns)['formatted'] %}
    {% set prod_formatted = format_columns(prod_columns)['formatted'] %}

    {# Column additions are allowed; any prod column missing from dev (removed, renamed, or retyped) is a breaking change #}
    {% for prod_column in prod_formatted %}
        {% if prod_column not in dev_formatted %}
            {%- do exceptions.raise_contract_breaking_change_error(dev_columns, prod_columns) -%}
        {% endif %}
    {% endfor %}
{% endif %}

Considerations

We'd need some special handling when a model is being renamed or created for the first time. If there is no state_relation, there is no contract to infer or enforce; the contract is whatever the model SQL says.

We would need to do more work to support partial contract definitions—a mix of explicit & inferred—whereby a user could define certain columns with constraints, dynamic data masking rules (!), etc—but wouldn’t have to define them all.

dbt would still need to have access to the full column spec during create table DDL. How would we get the names and data types of columns that are not defined in yaml?

  • A lot of that work might be around type aliasing, which gets tricky quickly. Many database cursors return column types with a genuinely different naming convention from the one the database accepts in DDL. Since we know what the state_relation is, we should be able to just run a describe statement against it (adapter.get_columns_in_relation) to get back data types that we can use within the create table DDL (a rough sketch appears at the end of this comment).
  • If push comes to shove, we could switch out the current implementation of the "preflight" check (run a single "empty" query and get the schema from the cursor) for an even-more-naive one, where we create a temp table and run a describe statement / information schema query against it.
models:
  - name: my_model_with_partial_contract
    contract:
      enforced: true
      inferred: true # but i'll declare *some* columns below
    columns:
      - name: column_a
        data_type: int
        constraints:
          - type: not_null
      - name: column_b
        data_type: # leave this to be inferred?
      # other columns not declared -- to be inferred from prod
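
To make the describe-based idea concrete, a minimal macro sketch (hypothetical macro name; assumes state_relation has already been resolved, and relies only on the standard adapter.get_columns_in_relation to read column names and data types from the prod relation):

{% macro infer_undeclared_column_types(state_relation, declared_column_names) %}
    {# Hypothetical helper: describe the prod relation and return {name: data_type}
       for every column the user did not declare in yaml, so the create table DDL
       can still include a full column spec. #}
    {% set inferred = {} %}
    {% if state_relation is not none %}
        {% for column in adapter.get_columns_in_relation(state_relation) %}
            {% if column.name not in declared_column_names %}
                {% do inferred.update({column.name: column.data_type}) %}
            {% endif %}
        {% endfor %}
    {% endif %}
    {{ return(inferred) }}
{% endmacro %}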

jtcohen6 commented Jun 9, 2023

There's more to figure out here, but an immediate action we can take is revisiting this in light of the implementation of #7796. My sense is that it will only be called by certain commands, whereas we would need state_relation to be available any time we want to infer/enforce contracts.


ghost commented Jun 12, 2023

I love this idea. I think it lends itself well to a new implementation of dbt, and it could be used as a check that the logic now spelled out in dbt code will in fact output data in the same format as the existing data.

What I'd really love to see, though, is a column-level contract to enable models that are only partially covered by a contract. All-or-nothing makes adoption of contracts cumbersome and time-consuming, especially the data types requirement, IMO.

@jtcohen6

Relevant follow-up work from the defer_relation implementation for dbt clone:

@rflitcroft

This would be a great addition. I have a use case where I would like to ensure that a subset of columns is present with the correct data_type, whilst being lenient about any others undeclared in the contract.

Perhaps a simple contract setting at the model level, such as enforce_undeclared_columns: false?
...Unless, of course, this would be a natural behaviour of what is already proposed.
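
A minimal sketch of what that hypothetical setting could look like in yaml (enforce_undeclared_columns does not exist today; the shape simply mirrors the examples above):

models:
  - name: my_model
    contract:
      enforced: true
      enforce_undeclared_columns: false  # hypothetical: be lenient about columns not declared below
    columns:
      - name: column_a
        data_type: int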


ghost commented Aug 4, 2023

I'm also very much interested in the partial contracts feature of this ticket. It would be really useful for ensuring that a given subset of a model conforms to the intended shape, while allowing divergence on any other customizations implemented in a data-mesh-y environment.

@MichelleArk MichelleArk linked a pull request Aug 8, 2023 that will close this issue
@jtcohen6

We've started spiking this over in #8339, with some promising results!

This is an "inferred" contract, not a "partial" contract

I want to distinguish between two ideas that I'm seeing in the comments above: an "inferred" model contract is different from a "partial" model contract. With an "inferred" contract, every column is contracted, but only some columns are explicitly specified in yaml; the names and data types of the other columns are inferred from the columns in production. A "partial" contract would mean that only some of the columns in this model are actually reliable guarantees for downstream consumers. I think that's a much trickier undertaking, requiring careful thinking about column-level documentation and access policies. (How can a downstream consumer reliably know which columns are "under contract," and which ones might go away at any moment?)

The scope of this issue is "inferred" contracts only. As soon as a model is pushed into production, every column is "under contract," and dbt will raise warnings/errors if any column is renamed/retyped/removed. The result is the same; what differs is the input: users only need to specify columns if those columns also have constraints, tests, descriptions, etc. (In the current spike implementation, if you want to specify anything about a column, it needs to be fully specified, including a data_type. We could look into supporting a mix of explicit and inferred attributes.)
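
For illustration, a sketch of what an inferred contract might look like under the spike's approach (the yaml shape follows the proposal earlier in this issue; this is not a released feature):

models:
  - name: my_inferred_model
    contract:
      enforced: true
      inferred: true
    columns:
      # declared only because it carries a constraint; all other columns
      # (names and data types) are inferred from the prod relation
      - name: id
        data_type: int
        constraints:
          - type: not_null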

What about when deferral is not available?

One more thing: Inferred contracts can only be enforced when a manifest is provided and --defer is enabled. Otherwise, the behavior is to fail the contract check:

$ dbt run --defer --state state/
...
Completed successfully

$ dbt run
...
Compilation Error in model my_model (models/my_model.sql)
  This model has an enforced contract that failed.
  Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.

  | column_name | definition_type | contract_type | mismatch_reason     |
  | ----------- | --------------- | ------------- | ------------------- |
  | test        | TEXT            |               | missing in contract |
  | test2       | INT             |               | missing in contract |

Does that feel right? Should we instead raise a warning, saying that the contract could not be fully enforced, but allowing the user to continue for now? I foresee some users switching the contract to enforced: false in development, and then forgetting to switch it back on before it runs in CI. Disabling the contract enforcement should be caught as an error there as well.


ghost commented Aug 14, 2023

@jtcohen6 re: the split of partial vs. inferred, it might be worth separating the two definitions into additional issues here: https://docs.getdbt.com/docs/collaborate/govern/model-contracts#can-i-define-a-partial-contract

I would be interested in the "partial" definition as you've outlined here, rather than the inferred one, but at the moment I think there is a single issue for both :)

mirnawong1 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Sep 11, 2023

@charles-astrafy

+1 for this feature, but I still feel that one of the major issues with the current implementation of data contracts within dbt is that the data contract can be modified unilaterally by one party. That goes against the idea of a "contract" that both parties agree upon, and that can only be changed if both parties agree to change it.

@Ashish-Agrawal-tfs

+1

@arasdk

arasdk commented Jan 22, 2024

+1


kmarq commented Jan 29, 2024

Wanted to drop some thoughts on the enforced component of contracts. Currently, in order to use constraints, you have to fully define the table and enforce a contract. We typically like to define important columns and metadata comments, but do not define their types within the yaml. Typing is done in the SQL models, with some additional macro support for custom logic.

Constraints can provide a lot of capability beyond just enforcing a contract (such as defining PK/FK relationships within the DWH for BI tools). I'd really like to see the ability to apply constraints without having to fully implement the other requirements of a contract (defining each column and its data type, for example).

I'd really appreciate being able to have constraints applied without having to manage all of the other components that are currently required for a contract.
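
A rough sketch of what this request might look like (hypothetical: today, constraints require an enforced contract with every column and its data_type declared):

models:
  - name: fct_orders
    # hypothetical: no enforced contract, and not every column declared,
    # yet the declared constraints are still applied in the create table DDL
    columns:
      - name: order_id
        constraints:
          - type: primary_key
      - name: customer_id
        constraints:
          - type: foreign_key
            expression: "analytics.dim_customers (customer_id)"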


rumbin commented Jan 30, 2024

The proposed mechanism seems to extend the existing one of the --state-based backward-compatibility check.
The latter often got in my way in models that I did not actually want compatibility-checked, as they are located in less exposed modeling layers where we allow ourselves more freedom of change.
The only reason for defining contracts on these models is to automatically have FK/PK constraints set, so the query engine (here Snowflake) can perform join elimination.

BTW, at least for Snowflake, dbt doesn't respect the run order of the models based on the FK relationships. This can lead to failing FK declaration queries, as the PK of the referenced model may not be declared yet. Reason: the referenced model is not an upstream model of the one that declares the FK relationship. And explicitly declaring it as such via depends_on alters the DAG in a way that makes stateful runs execute too many models.

So in conclusion, I second @kmarq's remarks on constraints, and I'd like to suggest that their execution order needs a dependency graph of its own, which cannot be reflected efficiently by existing dbt model dependencies.


emirkmo commented Apr 4, 2024

Warning! Negative but hopefully constructive feedback:

With contracts, explicit is better than implicit and having inferred contracts is going to add a lot of headaches.

I would like to suggest instead tooling to generate the yaml, so that contracts are always explicit, as a more sustainable long-term approach. This has the added benefit of being helpful outside of contracts as well, solving one pain point of extensive dbt usage.
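
One existing approach along these lines is the dbt-codegen package's generate_model_yaml macro, which emits a starter yaml spec (including data types) that can then be committed and code-reviewed; rough usage, as I understand it:

$ dbt run-operation generate_model_yaml --args '{"model_names": ["my_model"]}'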

I believe creating/merging this “implicit contract” will be a footgun.

Hope examples/details below are helpful:

Currently, we use a variety of ways to generate more explicit, non-dbt contracts from the dbt manifest and yaml files, and it would be great if contracts were a true first-class citizen of dbt itself for our chosen adapters. But making the contract optional, or tying it "to the state in prod", is very dangerous and has not been thought through.

For example, what should happen to an implicit contract upon database failure or data issues/type corruption etc.? How should contracts behave when migrating models to a new database or tool? How much can users rely on an implicit contract that’s not in version control but is state-dependent in a (likely distributed) DBMS? What about datalake or columnar analytics databases/DWs? How useful are contracts when they are essentially a Describe query on a database instead of yaml that can be consumed by various tooling/services in data platforms?

Benefits of explicit YAML: One can have code review over yaml. One can have meetings with non-technical stakeholders over yaml, easily translating it to a myriad of consumable formats. Downstream users can read and trust it, without having to understand database internals. This is not the case for an implicit contract as it is being designed here. The whole name seems to be a misnomer, as something implicit is not enforceable as a contract.

Perhaps you can call it “contract generation”, and look at it from that angle? Of course we want to minimize manual work but that can be achieved without sacrificing the core feature of a data contract.

The reason I leave this feedback here instead of in a feature request is because I believe having non-explicit pseudo contracts will be actively harmful and create difficulties for us and our users down the line, and for dbt practitioners worldwide. It would be better to not have it than have it be implicit.

Besides what I enumerated here, there are a lot of problems with tying the desired state (the contract) to the actual state (the database). dbt is already the tool to enforce this. Contracts should not assume the database to be valid; it may not be. They should be a contract that defines the expected/desired state, and dbt should help validate this expectation. dbt shouldn't make the mistake of giving false confidence in the data by having implicit contracts that simply don't happen to fail during changes/disruptions/etc.

I think the request here around using constraints would also be handled better without an implicit contract, but instead with generated yaml.

Finally, these data contracts are producer-owned data contracts, so they are one party. That is fine, as long as it is explicit 😊. But they are not merely the current API definition (DBMSs have this feature). Users should be able to rely on the contract only changing when the producer intends it to, not when someone sneezes at the database. That way, users get value, and there is a concept of breaking changes that can be handled at the contract definition/code level, giving a platform for consumer concerns as well. Making contracts implicit would erode their value for users/consumers too.


d1manson commented May 1, 2024

In terms of partial contracts for constraints, I wonder if it's worth bringing this concept into the mix here, as it would solve for both the zero-cost aspect of constraints and the only-build-if-valid aspect, but via a more generic route that doesn't fragment yaml between tests and constraints. That option also works for views.

@cunhanator

I really like emirkmo's yaml code-generator idea. Is this something that could be scoped into a v2 of dbt Assist?
