Better mechanisms for cross-project lineage #5244

jtcohen6 · 2022-05-13T13:06:47Z

jtcohen6
May 13, 2022
Maintainer

Newer discussion: #6725

Projects should be smaller.

Cross-project lineage should just work.

Alt title: "Down with the monorepo."

Context:

This isn't something we can make immediate progress on—but it's something I've been thinking a lot about, and I want to share that thinking to hear more folks' thoughts. Hence a discussion for now, with issues to follow when we're ready.

Strong premises, loosely stated

Everyone should get to work in a project that’s a few hundred models strong. We know for a fact that dbt is a delightful experience at this scale: dbt parsing is suuuuuper fast. Any IDE (local, VS Code, dbt Cloud) can handle your file system with ease. You never accidentally type dbt run and see 1 of 5000 START
Even in a very large project, >90% of dbt development touches only a handful of related models. (Exactly how those models are grouped is an interesting question, which will vary project to project / org to org.)
Developers must be able to view full lineage: the entire DAG, in dbt Cloud. Put another way: My colleague’s model is not a source; it is a ref.

What is it?

There's a lot more to say here, and many implementation details I'm still trying to figure out, but I think it comes down to:

Support models with the same name in different projects.
- That's the original ask in Add namespacing for dbt resources #1269; we got stuck around how to disambiguate entries in property (.yml) files for the same-named model. That property disambiguation is work we should still pursue, and we should also think about namespacing support for two models with the same names. Those aren't blockers for this work in particular, though, so they're outside scope for this proposal.
Require two-argument ref() for models from other projects.
- dbt-core skips resolving those refs at parse time, and resolves them as the first step of execution instead.
Those other projects are not installed as packages, in the sense of dbt deps to access their source code.
- No need to mess with git auth for private packages.
- That also means, we're not solving here for the ability to override/reconfigure models from those projects (as in [Feature] Package Model Overrides #4157), which is still work we should pursue, but outside the scope of this proposal.
dbt-core gets access to a limited amount of metadata, which is more 1:1 mapping table than it is big bad manifest.json. This metadata could be extracted from those other projects, and provided as some sort of artifact.
You can't run models from those other projects. You don't want to! They're someone else's models, and you want to build on top of them.

What might it look like?

In my project code:

-- models/eligible_recipients.sql
select

	user_id,
	email_address

from {{ ref('upstream_project', 'dim_users') }}
where first_login_date > dateadd(day, -7, current_date)
  and plan_type = 'self-serve'

There's no package named another_package in my packages.yml, and there's no file dim_users in the dbt_packages directory. That's ok; dbt observes this at parse time, and leaves itself a reminder for later.

At runtime, dbt gets access to a "mapping" artifact, akin to:

{
  "upstream_project": {
		"model.upstream_project.dim_users": "analytics.acme.dim_users",
		"model.upstream_project.fct_activity": "analytics.acme.fct_activity",
	},
  "upstream_project_two": {
		"seed.upstream_project_two.country_codes": "seed_db.seed_schema.iso2_country_codes",
		...
  },
	...
}

And dbt compiles accordingly:

-- models/eligible_recipients.sql
select

	user_id,
	email_address

from analytics.acme.dim_users
where first_login_date > dateadd(day, -7, current_date)
  and plan_type = 'self-serve'

Considerations

How is this different from deferral? Deferral helps when running pieces of the same project, by resolving references across different environments. I want to build one model in the middle of a complex DAG, in my dev schema; I “defer” to the production version. It more closely ties into concepts like versioning models and zero-copy cloning. This is a “run mode,” and it makes sense to turn on in dev + CI environments, but not in production. By contrast, the capability discussed in this issue helps you run your project, by resolving references across different projects. It would be turned on all the time: dev + CI + production. **You get to treat upstream models as if they’re sources, but without breaking the chain of lineage—and getting dynamic updates, should the “source” model ever move. Deferral could definitely use a UX refresh (Review UX for --defer #5095), but it's outside the scope of this proposal.
It’s pretty easy to imagine that any sophisticated project, even if it isn’t huge—such as dbt Labs' own internal analytics project—should actually be split into 3-5 different ones. They may want to keep living in the same repo; but splitting them up would make development easier, PR review more straightforward (CODEOWNERS), slim CI even slimmer.
"Public" models vs. "private" models. If there are models I can reference from an upstream project, a natural extension is the models I can’t reference. These "private" models should be excluded from the mapper, and their source code should be hidden from me. (I think ephemeral models are always private!) dbt should tell me that reference does not exist OR is not authorized :)
Is this the end of packages as we know them? No! Packages are and will remain tremendously useful as a means of sharing open source dbt code with the larger community. That goes for both "model" and "macro" packages (e.g. Fivetran's offerings and dbt-utils, respectively). Within an org, private packages will continue to serve the purpose of configuring.
Metadata unification is really important, to get the full DAG view. Is that one big "docs" job, or an external service that unifies metadata from each project's production deployment?
Model versioning? If you’re ref'ing a model, dbt will tell you where it lives in the database. That database location could include a version specifier in its schema/alias. As these cross-team and cross-project relationships mature, "public" models will naturally constitute a contracted API, and it will be important to retain the ability to make breaking and backwards-compatible changes. As the maintainer of downstream project B, I could continue referencing a versioned deployment of upstream project A, and then migrate to the new version at my own readiness (within some acceptable window).

gwenwindflower · 2022-05-17T22:02:03Z

gwenwindflower
May 17, 2022

I honestly was not on board with this at all until this line:

It’s pretty easy to imagine that any sophisticated project, even if it isn’t huge—such as dbt Labs' own internal analytics project—should actually be split into 3-5 different ones. They may want to keep living in the same repo; but splitting them up would make development easier, PR review more straightforward (CODEOWNERS), slim CI even slimmer.

In practice, I see so many people screw themselves up trying to split up projects for the wrong reasons. They quickly blow up the great unifying ability of dbt to create a single source of truth, and lose the most positive effect of dbt's current limitations in this area, which is forcing you to collaborate across functions, create unified definitions, and break down silos (it's a feature 🦋 not a 🐛!). The easier we make this, the more likely analytics teams are to end up with two different orders models for two different teams and find themselves hating life.

That said! -- housing these projects in a unified repo would help mitigate a lot of the worst effects of separation while gaining the benefits of increased modularity. This actually is the idea of the modern monorepo, and what you're trying to break down is the monolith -- so I would say your alternative title should be 'Down with monolith, up with the monorepo'.

0 replies

bashyroger · 2022-05-18T12:33:28Z

bashyroger
May 18, 2022

@jtcohen6 , this makes a lot of sense and nicely fits into the whole #DataMesh concept that is emerging. If you want multiple domain teams to each be responsible for a part of the code, those individual code bases should still work together nicely / know of each others existence and dependencies.
If we would have this functionality now, we would start using it immediately!

By doing this, dbt Labs could also start selling dbt as a tool that fits nicely into the 'Data Mesh hype train', which will for sure attract new customers! :-)

0 replies

jtcohen6 · 2022-05-19T12:43:55Z

jtcohen6
May 19, 2022
Maintainer Author

More thoughts on model versioning, after a great conversation this afternoon with some interested folks. There are two options here:

Model-level: I make breaking changes to my dim_users model, so I add a version specifier to that one model's name: dim_users_v2. I communicate the change, and the deprecation window for dim_users_v1. Both run in my project until the deprecation date is reached. I can access metadata that tells me if dim_users_v1 is still being accessed (ref'd) in downstream projects. At that point, I communicate with any stragglers, and remove dim_users_v1.
"Module"-level: I have one version for the entire project. When I make breaking changes to any "public" interfaces in the project, I communicate the change and the deprecation window. Until then, I maintain simultaneous deployments of both the new and old project versions in the data warehouse. The naive version of this could be, all the models written to multiple schemas with version identifiers. The versioned schema names might be super ugly, but that's ok, because no human will be writing them—dbt will be resolving any refs downstream. Since projects are smaller in this vision of the future, that double-deployment might not be too "heavy" a proposition. The cleverer (and more cost-effective) version would take advantage of cloning or --defer, in concert with state:modified+ or other data-diffing tools, to deploy twice only those models which are actually different across versions.

1 reply

gwenwindflower Jun 2, 2022

oooh i like the module level w/ cloning and deferment idea a lot, you get the best of both there 🔥

jthandy · 2022-07-11T12:44:38Z

jthandy
Jul 11, 2022
Maintainer

I am extremely interested in this topic. I think there are very few topics being discussed right now that are more impactful for the long-term experience of dbt project development. We have always wanted to provide the modularity of software development when it comes to building knowledge graphs, but dbt in its current form does not yet provide this. Being able to take large chunks of functionality and call them services (or whatever!), which means that one team can understand their internals while all other teams can simply rely on their functionality as a black box, is how mature software projects and teams scale. Without this ability, you hit very real complexity taxes as a team/project that make it tremendously hard to continue to move forwards with velocity. And technically, there are very real things that we need to enable in order to make this type of organization possible. The biggest two I think about are:

private vs. public methods
API versioning

There are a million other things that are important in this as well (i.e. how does all of this get represented in a visual DAG? i don't believe the correct answer is always just "explode everything and show all of the detail!!") but I think the above two things are the biggest items and other stuff must fall out of the approaches we take there.

2 replies

leoebfolsom Apr 23, 2023

re: "how do we represent it in the DAG," it would be cool if you had the option of exploding (ok, exploring) black boxes of upstream projects, and then closing them up once you've seen what you need to see. Just like a collapsible nested menu on a website, or a file system with subdirectories.

sungchun12 Apr 24, 2023

I hear you! Something like this diagram? https://whimsical.com/public-private-dbt-models-monorepo-edition-2NxeKuU8oUiPxpKrkWpsvo@2Ux7TurymMZyQVhtBJxg

karunpoudel · 2022-08-25T05:47:42Z

karunpoudel
Aug 25, 2022

@jtcohen6, what if we treat cross project objects just like a source? My colleague's model is a source. (I think, atleast, separate project/repo by team)

For all of development purposes they act like a source: you can't (shouldn't) schedule them; you don't need to rebuild them for your development or PR checks.
Our main goal is to see the lineage between projects. We could add a dbt-project field when defining source like this:

sources:
  - name: hr
    dbt-project: hr_project
    database: hr
    schema: main
    tables:
      - name: users

Only when visualizing the lineage or building the dbt docs, dbt would need to extract metadata from the other project and get mapping between database object name and model name to show the upstream lineage. The core of dbt could stay the same.

It make sense to use database object name as an interface between teams rather than dbt model name because most (95%+) of the downstream user are your analyst that using various reporting tools. Also, if teams are using alias in models, then you don't have to confirm individual model name. Whereas, if you use model name for reference between dbt project, then model name will also be tied to dependencies along with database object name.

Main advantage for customer with existing data transformation that are gradually switching to dbt is: As other teams migrate their code to dbt, your lineage would automatically grow. You don't have to update your code to switch from source() to ref(). If your database has mix of tables build using dbt (from source data already replicated into your database server) and other external integration, then other teams using your table don't need to know how individual tables are built beforehand. Lineage or dbt docs can highlight such tables differently depending on whether they exists in the dbt project or not.

Regarding getting the downstream dependencies between project: Like you said, there should be a purpose built "one big docs" jobs that combines metadata from all project. It is safe to assume that this job will have access to metadata from all projects in an account.

Versioning has always been tricky with data analytics because you have to continually refresh data in old and new versions. As long as the grain of a table is staying the same, it is efficient manage it within single model (by adding new column for change in logic or if you have to drop a column, wait for all downstream to remove their dependencies). So i think, versioning would be a separate topic on it own.

2 replies

jtcohen6 Aug 31, 2022
Maintainer Author

@karunpoudel Really appreciate the comment! I have two main points of disagreement:

These things are dbt models, not sources — even if they're not my dbt models. They are not being loaded in an un-opinionated manner from third-party APIs, and dropped into the warehouse unchanged. They are being produced by transformation logic defined by colleagues on other teams. I think it's really important that we build this feature in a way that allows us to call things as they are.
I disagree with database locations being a better contract between teams than model names. Renaming a "public" model should be a breaking change; changing its alias config should not be. If anything, I would like this feature to enable dbt to abstract away database locations even more. What if I want to coordinate a change across projects, and need to reference a model project deployed in a given environment? What if I want to "upgrade" the version of the upstream model I'm referencing? (Versioning does not feel like a separate topic to me, but an essential ingredient in this recipe.)

I do take your point that downstream BI users reference dbt models by their database locations today, not by their model names. That doesn't need to be the case forever, though—and it's been one of our motivations for developing (separately) a semantic layer that can translate between BI queries and dbt project context, and unify disjoint lineage.

jtcohen6 Sep 6, 2022
Maintainer Author

Good feedback from @lostmygithubaccount: Whether this is a source-that's-actually-a-model (inferred by dbt after the fact), or a public-model-from-someone-else's-project (resolved by cross-project ref), we'd want some way to uniquely identify these, as distinct from both traditional sources (raw, untransformed) and "my" models (the ones I'm defining + building myself).

Maybe a distinct color in the dbt-docs DAG viz, other than source green and model blue...?

sungchun12 · 2022-08-25T21:29:52Z

sungchun12
Aug 25, 2022

I'm happy to own the blueprints and mechanical discovery of how exactly this can be a reality in dbt. Going to draft my ideas and work internally at dbt Labs before sharing complete thoughts in this discussion 🧃

1 reply

emekdahl Aug 29, 2022

very interested in this topic and will be eagerly waiting to hear more!

sungchun12 · 2022-09-07T16:54:51Z

sungchun12
Sep 7, 2022

Okay Community!

This is going to be a juicy 2 part post.

The first is philosophical and the second has code snippets to crystallize the philosophy with something tangible to discuss!

Goals:

Are we heading in the right direction?
Are the code snippets crystallizing the vision for you?

Note: Portions of this are thematically consistent with @jtcohen6's original post. I do this to prevent you scrolling up and down to understand the full story :)

Part 1: Old Problems with New Paradigms

The Problems

Having a monorepo is overwhelming to newcomers and maintainers of a dbt project.

Jobs take too long to run and too many subfolders/files to keep track of
Requires tech leads to babysit PRs everyday
dbt docs lineage is a lot to digest
Monorepos are hard in general, so hard that entire companies are built around solving related problems

Using a multirepo approach is less overwhelming to reason about files but brittle in building lineage and confident dependencies across repos

Too much brainpower required to reason about lineage and maintaining interfaces across dbt projects(think: 10 github tabs open at a time)

Both approaches in their current state hit a ceiling and requires tech lead heroes to roll up their sleeves and offer themselves as PR babysitter tributes.

New Paradigms

Data Contracts are the API analogy for data

The mechanical scaffolding of how APIs work should come to data
- BUT it requires a deep empathy for how fixing, maintaining AND rebuilding trust is a harder, stateful exercise
Instead of a JSON payload, the contract provides the ability to ref the upstream project
- Note: data contracts enforce one way data flows to prevent circular logic
- Defaults to assume data already exists OR terminal output informs the user that code exists but data does not exist yet
- There needs to be a dbt state file similar to terraform state to map data contracts to real-world data use/reference
Something similar is happening at Coinbase for internal python modules

Working with data is very different from APIs

JSON payloads are small and when things go wrong, an API can do a quick fix and return the expected JSON payload
- When data is broken, dependencies may wait HOURS before a fix cascades to the relevant code/data(think: schema changes)
Good data looks the same as bad data
- Metadata(think: testing, row counts, freshness, last time run) required to determine if data is good or bad

Values of this system

Require coordination on breaking changes
Enforce only ONE version of a data contract is allowed at any given time
- Video snippet on how the R-language ecosystem handles versioning
Strong gatekeeping and tooling to enforce
- Ex: old version expiration dates
The most recent version of all packages shall be compatible
Expose ONLY the data I want for downstream use
- Ex: core-only exposes 2/10 models for downstream use

Use Cases/ Who is this for?

Core data mart dbt project that serves as the upstream dbt project to downstream finance, marketing, operations dbt projects. That will be 4 dbt projects in total and will visually represent a hub(core) and spoke(finance, marketing, operations) project workflow.
- This should work well in BOTH a monorepo and multi-repo mechanic.

I work at ABC Company and we are (finally) switching from an on-prem, Spark based DWH to BigQuery + dbt. We are with >100 data people, and we currently have many problems with dependencies on other teams and other teams’ code / pipelines
So we would like to switch to a more data mesh oriented approach, where we work in multiple dbt projects - however, we still have logic that we need to share, so that’s where we are right now.
We just started architecting our solution, and migration will only start somewhere in the new year, so it will take some time before we actually get to the point where we actually need this
But i think in the mean time, we can find some hacky workaround with packages, tags, and IAM permissions.
Rebuilding data models will be too expensive so that’s something that i’d like to prevent haha

Community Member

9 replies

Junobijlard Sep 8, 2022

Some comments regarding data contracts - might be worth to have another discussion on that:

Is dbt really the tool to define data contracts? We should be able to have data contracts with data producers (e.g. product engineers) as well, since changes in backend might break DWH jobs (and hence ML models and production). In my opinion, dbt should be part of a data contract, but i think data contracts have to be wider than what dbt manages.
What do you see as the interface of a contract? Is it a table? A view? An API? I think all have pros and cons (think engineering effort, duplicated data, performance)

Junobijlard Sep 8, 2022

Should this config include imports only(think: upstream) OR both upstream/downstream projects?

I would suggest to infer downstream dependencies for a project (through the unified lineage) so that in case of a version upgrade, you can notify the people who depend on you. Explicitly specifying downstream dependencies means that we need to keep dependencies in sync in multiple places, so that would be hard to maintain.

jtcohen6 Sep 8, 2022
Maintainer Author

@sungchun12 Lots of good stuff in these comments!! I agree with lots of the conceptual things you've laid out here. Any of the specific code or implementation details can change between now and later; it's helpful to make the discussion concrete.

What are we going to call this thing? I've heard "upstream namespace," "upstream/data contract"... for now I'm going to refer to this thing as ATM (Another Team's Models), until we have a final name for it. I think an ATM could include more than just models — metrics in particular come to mind — but models are the main event.

I want to make a few strong claims to help us clarify points of potential disagreement. These claims aren't final, they're intended to spark further discussion!

Access to source code

Unlike other modes of installing packages, I believe that adding an ATM should not require access to the source code for the project generating those models. As a developer in the downstream project, I should know which public models exist, be able to read their documentation, and see the tests / metadata on their columns. But I should not be able to see its source code or its upstream dependencies. I should not be able to see the existence of private models at all — I shouldn't even know their names, let alone not having the database permissions to query them.

A few corollaries:

Because I don't need (or have) source-code access to the upstream project, I wouldn't add my ATM via git. I would need some other mechanism to get an up-to-date artifact, from that project's deployment, containing the mapping between public model names and database locations. This feels similar to how accessing a past manifest.json is the prerequisite to Slim CI. (Of course, if I do have source-code access, I could create that artifact myself via cloning — but git is only in the picture if I want it to be.)
Because I don't have the source code for upstream models, I cannot run them. (I probably wouldn't have the database permissions to do that, anyway.)
As a developer in that downstream project, my DAG would include only resources from my project, and public models from all the ATMs I've added.

In order to generate a full DAG, containing every model (private + public) from every single project/namespace, I'd need either:

Access to the source code of every project, such that I could install every namespace and perform a root-level docs generate ("god mode")
A service that ingests metadata from every project, and the ability to serve a DAG viz / documentation from that metadata service. Who gets to see which models in that DAG viz would require identity, permissions, ... That full experience is something we'd want to build in dbt Cloud. (Wouldn't be possible without it!) It should also be possible for users to construct specific pieces of that functionality with homegrown logic (e.g. by collecting + merging manifests from separate project invocations).

Data contracts

@Junobijlard Love the questions you're asking!

At their simplest, for dbt's purposes, I've been thinking about data contracts, for data produced by a data team, as comprised of:

A set of one or more "public" relational objects (views or tables), exposed in a data platform / warehouse
The columns in those objects and their data types: we need a mechanism in dbt to declare these proactively, or at least detect divergences during CI
The content of those columns — tests to assert expectations / quality! metrics that can measure / detect anomalies!
Freshness of the data flowing into those objects/columns (sources + scheduling)
The good word of the team maintaining that table. If someone fundamentally redefines the calculation for ARR, in a way that will implicitly break downstream users (in Sung's words, "Good data looks the same as bad data"), then it's on them to bump the version number. There's no way around that.

I think contracts are and remain an essential part of the discussion. At the same time, I also recognize that there have been broader conversations floating around about the need for data contracts further up the ELT stack, starting with the application developer who's performing a database migration. API middleware to handle those renames?). I don't think dbt can solve for that all by itself, not nearly as well as it can help the (data) team creating models depended on by (another data) team. We need other things: API middleware, between app db and EL tool?

I'm open to thoughts on how the solution we pursue in dbt could fit into a larger narrative. (What if the the .yml config for a dbt source, or even the SQL written in a dbt staging model, could actually define the expected set of columns in that source table? And thereby control which columns get synced by the extract + load tool? That could prevent unexpected / unannounced changes brought on by the EL tool, but still doesn't get to the root cause.)

I'm also open to the argument that there's a fundamental difference between contracts for "raw" data (true sources), and "transformed" data produced by another data team — and we'd need different mechanisms to define and enforce each one.

I would suggest to infer downstream dependencies for a project (through the unified lineage) so that in case of a version upgrade, you can notify the people who depend on you.

Agreed! This feels totally plausible with a unified metadata service, whether in dbt Cloud or home-grown. Much, much trickier to do in dbt Core only. As the upstream project maintainer, you need access to metadata / logs from downstream projects one way or another.

sungchun12 Sep 8, 2022

Is dbt really the tool to define data contracts?
Yes, in the scope of transformed data for consumption. This is solving a problem internal to the dbt experience. dbt isn't quite ready to boil the ocean of data contracting across any and all data producers/consumers, only what's in dbt's control.

What do you see as the interface of a contract?

Entrypoint: dbt_contracts.yml to define the contract and make requirements clear
Artifact: dbt_contracts.json to map contract to real world use
Tracking: table, view, and/or API that power a status page visual. All should be available similar to how saas have status pages

Should this config include imports only(think: upstream) OR both upstream/downstream projects?
I definitely see the merits of keeping this DRY and defining it once. We'll have to figure out the gatekeeping mechanics to ensure BOTH a core-only and finance-only contract have power to negotiate contract terms.

Access to source code
This should be up to the upstream dbt project owner to enable/disable as desired. This is NOT the same as permission to invoke that code. Example below.

Core-only has a data contract with Finance-only and it saves everyone time and energy to share "how the data sausage is made" via nodes in the DAG. Some of those nodes expose all the code(fact and dimension). Some nodes are fully blank(staging models for cleanup). Some nodes don't appear at all(think: social security number transformation logic). All read-only information. This evolves the dbt docs mechanism when you import a dbt package.

prratek Sep 30, 2022

I'm also open to the argument that there's a fundamental difference between contracts for "raw" data (true sources), and "transformed" data produced by another data team — and we'd need different mechanisms to define and enforce each one.

@jtcohen6 I tend to think that instead of the distinction being between contracts for "raw" data and those for "transformed" data, there is a fundamental difference between the "promises" a data producer makes about the data assets they make available and the "expectations" a data consumer has of the assets they're referencing from these upstream producers.

Applying that to your example, those two feel different because in one case the dbt project is a consumer of data being produced by an external service or API and in the other it is a producer for a downstream dbt project. I'd argue that these two are fundamentally the same:

The "promises" a third-party API makes about the data it makes available
The "promises" a dbt project makes for models it makes available for downstream consumption

These two, on the other hand, are both examples of the opposite relationship:

The "expectations" a dbt project has of raw data (true sources)
The "expectations" a downstream dbt project has of data produced by an upstream one

While both are "contracts" in some sense, I think one can exist without the other. In my ideal world, a dbt project would be able to make freshness, schema, test coverage promises to downstream consumers just as it would be able to declare expectations of upstream sources (whether true sources or other dbt projects). I imagine you can do the latter with some combination of dbt tests, source freshness, and zero copy clones as @sungchun12 describes here. Importantly, you could do either one without doing the other!

What do you think?

Junobijlard · 2022-09-08T06:54:20Z

Junobijlard
Sep 8, 2022

This assumes I have the appropriate database permissions across both core and finance projects as a finance-only analytics engineer

In the 'exposing a model' interface, you should be able to specify who (e.g. google group) you want to expose your model to. Ideally, we also automatically apply the permissions to access the table to that group

2 replies

jtcohen6 Sep 8, 2022
Maintainer Author

IMO this is exactly what grants are good for, as an out-of-the-box model config! https://docs.getdbt.com/reference/resource-configs/grants

(And on BigQuery, google groups are indeed supported as recipients of grants)

In addition to "publishing" a model with a contract, you'd also coordinate database-level permissions so that the right folks can actually use it.

sungchun12 Sep 8, 2022

The base mechanics of granting permissions is definitely something that should be configurable in the data contract!

Junobijlard · 2022-09-08T06:55:17Z

Junobijlard
Sep 8, 2022

How do I as a finance-only person invoke core-only dbt models if they need fresher data?

I'm not sure if you should allow an 'outsider' to trigger an upstream model to build

If we allow any downstream consumer to trigger a build (in the upstream environment) that means that we will have multiple (expensive) builds for everybody's personal preference (all at different build times)
Alternatively, some kind of data freshness SLA should be part of the data contract. If people would like to change the freshness, they can open a PR and discuss that with the team that owns the upstream model

2 replies

jtcohen6 Sep 8, 2022
Maintainer Author

Deployment is a super interesting question here! As a general rule, I do not think we should allow downstream consumers to run models from upstream namespaces. Access to source code for those models should be absolutely optional, and same for database permissions on their upstream dependencies.

So, I think there are three potential mechanisms for deployment:

Agnostic and uncoordinated. Each project deploys itself, with its own freshness SLAs. Simplest.
"God mode." In production, a single service adds/installs every project and performs a root-level dbt run of the whole DAG. (Or pieces of it, in multiple steps.) Parity with the monolith pattern that exists currently.
Metadata-driven. A wraparound service that ingests information from everyone's project, then identifies the right set of sub-DAGs to execute based on their upstream/downstream dependencies. This would be most powerful if it also had access to user-provided SLAs, historical run times for approximate estimations, information about source freshness / loading patterns ... that is, more than what dbt-core would have access to itself.

sungchun12 Sep 8, 2022

Sounds like we generally agree that strong gatekeeping to invoke any upstream/downstream projects is necessary. I also see a future where BOTH Agnostic and uncoordinated and God mode happen. Metadata-driven I'm fully aligned and there needs to be a persisted database behind the scenes to track this info!

I'm aligned that in general, downstream projects should NOT invoke upstream projects.

Agnostic and uncoordinated:

Marketing-Only depends on an upstream Finance-Only dbt project, but as a whole they work independently from each other and are happy. They should NOT have access to invoke each other's projects and rely on the async promise of the data contract.

God mode:

Core-Only is the bedrock for all downstream dbt projects in an internal-only set of dbt projects. I make a fix to ensure revenue by month is correct for all downstream projects and invoke a root-level dbt build. I'd love a mechanism to send a notification to downstream dbt project owners and have them say "yes" to being invoked similar to how terraform apply asks you permission before applying changes. But that sounds messy to maintain at the same time.

$ terraform apply
aws_instance.app_server: Refreshing state... [id=i-01e03375ba238b384]

Terraform used the selected providers to generate the following execution plan.
Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_instance.app_server must be replaced
-/+ resource "aws_instance" "app_server" {
      ~ ami                          = "ami-830c94e3" -> "ami-08d70e59c07c61a3a" # forces replacement
      ~ arn                          = "arn:aws:ec2:us-west-2:561656980159:instance/i-01e03375ba238b384" -> (known after apply)
##...

Plan: 1 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:

christineberger · 2022-09-08T14:50:13Z

christineberger
Sep 8, 2022
Collaborator

Thanks for laying this out @sungchun12 ! It definitely helps to visualize it. I concede that others may be more “in the thick of it” than I ever have been, but here’s my thoughts:

Setup

Should this be additional configs within dbt_project.yml OR its own config like dbt_contracts.yml?
It’s own config file! The examples are proof enough to me of how cluttered the project file will get. I think it as it’s own separate dedicated file is both visible and tidy - I’d vote to avoid a situation like we have with vars.
Should this config include imports only(think: upstream) OR both upstream/downstream projects?
I vote for upstream only for now - I’m a fan of doing this in pieces until we know more about what else we should support with real use cases to account for. I admittedly need an explanation on how downstream contracts would be beneficial or differ from having all upstream references - does this mean that you can reference models from a downstream project of your own? Or does it mean that you have more control over which projects can use your project as an upstream reference?

Upgrading contracts

Should this file live in a single git repo to centralize contracting?
I feel that every project should own their own contracts file - for now. When I think about the contracts living in a central place, I think about the haul of every team needing to update their models (if needed) to ensure things don’t break. By having separate contact files for each project, each team can decide when to upgrade their models much like packages - we are able to flag that there is a new version available, but it will be up to the processes of the org to make that happen across all teams.

Invoking upstream contract models

How do we handle reconciling separate manifest.json files? 🤷‍♀️
How do we handle permissions to allow/disable execution for specific projects?
If I’m running this command starting from the core-only project, how does core-only know about downstream projects?
I’m handling 2 and 3 here to ask for a clarification. I might have missed it somewhere in the write up, but would the idea be that each project referencing an upstream project builds the resources from the upstream project as their own objects in the warehouse (like a package)? Or does it reference the objects that are already built somewhere in the warehouse from the core project to build their models (which I'm sure might need more configuration to know the "right place" to reference from, in a core-only downstream contract config)? If the second, I don't think any team should be able to re-run models from a project that's not their own - I know it adds a wall back between teams, but each team should completely own the process to update their own data sets.

Limiting the use of models

There’s no questions to answer here, but this section has satisfied the use case of the downstream contracts thought above. I wonder if you could also just reference a folder path here, as I’m imagining a scenario where you want to share most models except a few. Maybe something like:

from dbt_contracts.yml:

models:
    # Model folder level
    +shared: true
    +except: 
       - ref('my_model')
    # Sub-folder level
    marts:
       +shared: true

or sharing from the dbt_project.yml level after defining contracts (which might be a messy idea):

from dbt_project.yml:

models:
   marts:
       +share_with: [contract('finance-only'), contract('marketing-only')]

@_@ big brain thinking with too smol brain

1 reply

sungchun12 Sep 8, 2022

@christineberger

Should this be additional configs within dbt_project.yml OR its own config like dbt_contracts.yml?

Yay! Glad we're aligned

Should this config include imports only(think: upstream) OR both upstream/downstream projects?

Sounds like upstream only is building consensus similar to @Junobijlard's points above
The intention of a downstream config is to control the limits of what's consumed vs. default opening up everything to anyone. But I see the merits of researching a new mechanism to solve this purpose while upstream only configs work at the same time.

Should this file live in a single git repo to centralize contracting?
Great call out on reducing human load on constantly updating a single, centralized file. Each downstream project should own their own contract file similar to packages.yml

If I’m running this command starting from the core-only project, how does core-only know about downstream projects?

You hit the nail on the head in the latter half of your response. This assumes the upstream data is already built and ready for use, NO recreation of the data.

Limiting the use of models
LOVE these example code snippets. So valid that an +except: config is a normal way think about sharing data, "Everything except this sensitive one."

sungchun12 · 2022-09-08T22:22:06Z

sungchun12
Sep 8, 2022

Quick and Dirty Thoughts

ability to preview upstream data rows before you let it flow through your downstream project
basic profiling of upstream data like how rill-developer enables in real-time(think: nulls, averages, median, etc.): here

1 reply

christineberger Sep 9, 2022
Collaborator

I was talking about a similar idea to your first point this morning to @lbenezra-FA! It would also be great to be able to return the code for a particular model, to see what it's doing vs. navigating to the repo to see it! Currently for packages you could go look in the repo at the code, but when you install the package you can also navigate directly to the code in your dbt_packages folder. It's a small but super helpful time saver not having to switch windows.

sungchun12 · 2022-09-13T19:04:22Z

sungchun12
Sep 13, 2022

Hey Everyone!

Please follow progress here: https://github.com/orgs/dbt-labs/projects/24

2 replies

sungchun12 Sep 23, 2022

A nice weekend gift for everyone! Please let me know if I'm going the right way or this is so wrong you want me to throw away this code now.

Demo video on consuming an upstream project's nodes: here

sungchun12 Sep 23, 2022

Example Terminal Output

sungchun12 · 2022-09-22T16:39:08Z

sungchun12
Sep 22, 2022

In conversation with dbt users:

Background:

700 models split over 8 projects
Projects by domain(think: marketing, finance, etc.)
Library project with shared macros, packages, sources that serves as the producer that other projects consumes
Have one docs-only project to unify lineage
Only models from standardized are shared with domain projects

Downside:

lineage is fragmented across domain projects, source chaining across different projects and causes duplicates and is error prone
projects are imported multiple times over across domains
highly dependent on sources to chain projects together and need to build redundant test configs(think: unique, not null, freshness)

Considerations:

How to handle packages at the producer being imported into the consumer and they are duplicates? Quick thought: separate package dependencies and have the producer packages housed in their own subfolder
get a report card of the consumed project nodes, get a preview of the data similar to how API docs do the same: https://docs.getdbt.com/dbt-cloud/api-v2#tag/Runs/operation/getStep

2 replies

brandon-segal Nov 4, 2022

@sungchun12 Our team has a similar solution where we have an upstream dbt repo with two dbt projects. An orchestrated project that takes the common sources across our teams, cleans them and produces the data marts. Then, an external project defines the data marts from the first project as sources and creates ephemeral models from them. Downstream projects then consume that external project.

I wrote a proposal for a blog post on sharing dbt models across multi-repos that goes over this approach that includes using code cleanliness tools like commitizen to properly version your dbt projects as a way of communicating how up to date those imported models are. Is this how teams are commonly doing this now?

sungchun12 Nov 4, 2022

@brandon-segal Thanks for sharing this. This is the first time I've seen ephemeral models and commitzen in use, but the mechanics are spiritually aligned to feedback I got from dbt users today.

If this works for your team, that's so great, but the common pattern is what I illustrated above!

jonathanneo · 2022-10-12T09:25:10Z

jonathanneo
Oct 12, 2022

Hey @jtcohen6 , I agree with your assessment of the problem you are trying to solve.

Context and problem

To add more context, I work in an organisation where we have thousands of nodes in a single dbt DAG.

This is a new and natural phenomenon because of how dbt enables data analysts and analytics engineers to easily contribute models to the dbt project DAG. #power-to-the-people.

And yes, at some point, those DAGs get large and unwieldy, resulting in long build times.

So the next logical option is to break the big project down into sub-projects, effectively creating a DAG of DAGs. That's where the nightmare begins:

How do we orchestrate between different projects? Time-based triggers or dependency-based triggers?
We are breaking the critical path between nodes using a DAG of DAGs approach.

I have summarised these issues and more in my blog post.

Clarifying question

I like your proposed solution of having cross-project dependencies using refs: {{ ref('upstream_project', 'dim_users') }}.

This implies that there would need to be two dbt run modes in the future:

dbt run --project my_project # builds the DAG and runs only for the selected project, and treats any upstream project ref as dbt sources.
dbt run # builds the DAG for all projects using ref('upstream project', 'upstream_model') and runs all nodes.

dbt run would be executed at the root of some kind of folder that sits outside of a dbt project folder.

Are you thinking along the same lines too?

My proposed solution

An alternative and adjacent approach I would like to suggest is to de-couple the DAG and have each node of the DAG execute independently based on a freshness or condition-based trigger. I have written a blog to propose this approach here.

The key highlights are:

Create an event-loop that checks every 5 minutes to see if the run conditions are True. The run condition in the example below is that Task 1 and Task 2 both have at least a 60-minute window of unprocessed data, and would the node would run as soon as this condition is True.

This is similar to the approach of a distribution centre. Packages arrive from multiple sources, however each package is not shipped to a customer immediately. Instead, the distribution centre waits for more packages to arrive before sending a batch of packages at once to maximise delivery. In our context, each node is its own distribution centre, and checks for upstream dependencies before triggering itself.

Benefits:

Prevents joins on stale data
Upstream tables can be updated at different intervals. If one of the upstream tables did not have fresh data when running the current node, then it would result in semi-stale data being produced.

Loose coupling and independent sync frequency
This design pattern results in loose coupling between the nodes. Each node will execute as long as its upstream conditions evaluate True.

This means that each node has its own execution frequency, which allows each node to be executed independently from other nodes at varying intervals.

This is in stark contrast to the mono-dag where the entire DAG is triggered on one schedule (e.g. 00:00 daily).

Faster build and test times
This solves the problem of having a physical mono-dag which can have very long build and test times, as now each node can be built and tested based on the node’s immediate upstream dependencies. A logical mono-dag is still useful for checking that there are no cyclical dependencies between all nodes, and for visualising the lineage graph.

Uses critical path
This approach certainly removes the need to have a DAG of DAGs, which results in nodes not being executed using the critical path.

I am open to anyone's thoughts and critique of this approach.

1 reply

jonathanneo Oct 13, 2022

Airflow has recently released something similar in their Sept 2022 2.4.0 release called data-aware scheduling which uses a push mechanism as opposed to a poll mechanism in my blog article.

sungchun12 · 2022-10-13T22:09:21Z

sungchun12
Oct 13, 2022

Hey Folks!

I’m here to wrap up this month-long R&D effort for dbt Contracts.

There’s been a lot of work tracked across slack, github, demo videos, draft PRs that deserve to be consolidated and digestible for you all. Here it goes!

TLDR: This work was so rewarding because in the midst of navigating this giant ocean of a problem, the community rose to the occasion to provide constructive feedback, and solidified this problem is worth solving for large data teams. Together, we parsed what building block efforts are needed to make dbt contracts work cohesively. These are efforts that are a win in the short term AND re-usable for contracts work in the long term. Although this research phase is over(at least Sung leading it), the work continues with statically typed SQL coming to dbt(think: enforce data types on columns). Read below to see more :)

Goals:

Accelerate timelines on this feature becoming a reality, even if it's just requirements
Uncover "gotcha" implementations earlier rather than later
Get you, community, involved in shaping dbt's future

Relevant Videos: https://loom.com/share/folder/30c6ce127a6143e0b28a6720ffe1ca9b

Github Project: https://github.com/orgs/dbt-labs/projects/24/views/1

Relevant Docs:

Monorepo Blueprint with public/private ref permissions: https://whimsical.com/public-private-dbt-models-monorepo-edition-2NxeKuU8oUiPxpKrkWpsvo

Relevant Slack Threads:

https://getdbt.slack.com/archives/C033K8SF2CF/p1664560849333259

Draft PRs as Research:

ref works across two projects: #5966

ref permissions within a monorepo: #6007

Exposures as dbt Contracts configs: #5944

Building Block Efforts: These are necessary components to make dbt Contracts function as a whole, integrated experience. Instead of boiling the ocean at once, these should be built and then reused for dbt Contracts when the time comes!

Add column type configs(see schema constraints loom demo below) as native to dbt
Add test and freshness coverage statistics at the project level similar to this:
https://github.com/dbt-labs/dbt-project-evaluator#test-coverage
Make dbt docs faster and collapse nodes
Public/Private models in a monorepo

Next Steps:

Sung will contribute to making this native within dbt, BigQuery macro example here: https://www.loom.com/share/1f1f190e66254d12984962c613e8082d
- Another example in the wild: https://gist.github.com/bstancil/ee9ee57743e7423741fef3c0a3cc669d
Community, what building block efforts are worth it in the short AND long term worth building in dbt now?

6 replies

sungchun12 Oct 14, 2022

@jonathanneo check this one to see if it works! https://loom.com/share/folder/30c6ce127a6143e0b28a6720ffe1ca9b

jonathanneo Oct 14, 2022

Thanks @sungchun12 the link works.

I've had a look at the recordings to understand what you've defined as dbt contracts. This one in particular.

I don't really agree with your definition of a dbt contract at the moment. In your demo, your implementation of a dbt contract seems to cater more towards solving the problem of cross-project dependencies between nodes, rather than implementing a data contract.

The term 'contract' implies an agreement between a producer and a consumer. A data contract should have the following characteristics (taken from Chad's blog article):

(X) Contract is enforced - Yes, the command dbt contracts enforces that. The owner of a producer dbt model would need to now run this command as part of their development workflow to ensure that they are not making breaking changes for downstream consumers.
(X) Contracts are public - Yes, in your demo, you used an S3 bucket to store contracts. I would prefer if contracts are stored as part of the git repo using a mono-repo but multiple project folder approach. This allows contracts to be easily searchable in the repo itself rather than jumping between S3 and the repo.
( ) Data contracts cover schemas - No. it doesn't look like dbt contracts checks for the following: (1) column names match the contract definition, (2) column position match the contract definition, (3) column data types match the contract definition. See protobuf as an example of the fields used for a contract definition. For example:
```
model Person {
 string name = 1;
 int32 id = 2;
 string email = 3;
}
```
(X) Data contracts cover semantics - Yes, the command dbt contracts also seems to execute dbt test and dbt freshness under the hood, which checks that the data quality test passes and that the data is fresh.

In conclusion:
I like your implementation of cross-project dependency referencing using {{ ref('upstream_project', 'upstream_node') }}, however I don't agree with calling that functionality a 'contract' because it's not yet implementing schema enforcement.

jonathanneo Oct 14, 2022

I see that you have also implemented a table_constraints_schema.yml in this demo.

models:
  - name: my_table
    config: 
      has_constraints: true 
    columns: 
      - name: id 
        data_type: int64
     - name: color: 
       data_type: string

That is currently used by the producer to generate the table DDLs.

What is missing is using that as a contract for each producer-consumer relationship.

For example:

graph LR;
  producer_A-->consumer_B;
  producer_A-->consumer_C;
  producer_A-->consumer_D;

Producer_A has 3 consumers: consumer_B, consumer_C, consumer_D. Therefore, there will need to be 3 data contracts that exist:

contract for producer_A <--> consumer_B
contract for producer_A <--> consumer_C
contract for producer_A <--> consumer_D

Each contract should contain specifications for:

Schema: column names, column position (possibly?), column data type
Semantics: data quality tests (dbt test), data freshness

sungchun12 Oct 14, 2022

@jonathanneo Thanks so much for the thoughtful feedback and explaining why you disagree!

I don't really agree with your definition of a dbt contract at the moment. In your demo, your implementation of a dbt contract seems to cater more towards solving the problem of cross-project dependencies between nodes, rather than implementing a data contract.

Yes, the purpose of this specific demo was primarily focused on cross-project dependencies between nodes. To build confidence in these dependencies, I envision we take components of the data contracts concepts and/or mechanisms. You already listed out a couple of them. I agree with all of the concepts. The implementations are up for debate!

(X) Contracts are public - Yes, in your demo, you used an S3 bucket to store contracts. I would prefer if contracts are stored as part of the git repo using a mono-repo but multiple project folder approach. This allows contracts to be easily searchable in the repo itself rather than jumping between S3 and the repo.

OOoo. On a fist glance, I'd like this approach as it reduces the need to manage another credential. But taking a pause, there may be compiled SQL with multiple environment variables that make this approach prohibitive. I could see a producer create multiple contracts with the same dbt code that compiles to many different targets(think: multiple databases/schemas). In addition, the S3 approach mirrors the battle-tested mechanisms Terraform is successful with today. I recommend watching that here: https://www.loom.com/share/34d29d1151444ac9be98053612da0994

( ) Data contracts cover schemas - No. it doesn't look like dbt contracts checks for the following: (1) column names match the contract definition, (2) column position match the contract definition, (3) column data types match the contract definition.

You're right! Our research doesn't cover this in too much detail, that's where we realize the short term work needs to happen! Column names and data types are critical to verifying data is looking and feeling as expected. I also like your idea on column position! I didn't think of that.

My Ask: Want to work on bringing data types as native to the dbt experience together?
Let me know, and I'll reach out dbt public slack :)

jonathanneo Oct 15, 2022

Sounds good! Let's work on this together, I'm happy to contribute where I can.

belasobral93 · 2022-11-08T21:37:24Z

belasobral93
Nov 8, 2022
Collaborator

Hi team, want to pass on feedback I'm hearing from customers (incl. an organization that could have 100+ developers contributing and where they need to share code across teams while maintaining control on who can modify that code, when and where.)

It feels like a black box when I override global macros, to determine where models are built (ex: generate_schema_name). Because it isn’t immediately clear where each model is being deployed to - I don’t have confidence that duplicative transformations aren’t being done.
In my development environment as I ref models from an upstream package I want to interact with the dag as if the models were built in my project, i.e. being able to click into the node and see the sql behind it (today this only seems possible looking at the documentation site).

0 replies

paulbakkerbloom · 2023-01-06T08:29:08Z

paulbakkerbloom
Jan 6, 2023

Hi all, hope I'm not late to the party but wanted to bring up some suggestions for the contract feature:

In the context of data privacy / retention, I think it'd be very useful if the contracts would allow for different column / row level permissions, depending on the consumer. That way we could very easily ensure that people within the company don't have access to data they don't have to use. Is this something you're looking at?

Thanks in advance!

1 reply

jtcohen6 Jan 6, 2023
Maintainer Author

Hey @paulbakkerbloom - to keep this work within a reasonable scope, we're keeping our thinking here at a model level. And I don't anticipate us attempting any magic to implicitly coordinate between a model being public/private (can or cannot be referenced in another project), and actual underlying database permissions — though you'd very much have the ability to configure this yourself, via grants!

Certain data platforms support column-level policies for dynamic data masking/hashing, and row-level policies for restricting access. There are already community members applying these policies via dbt today, though it requires a bit of custom code (macros + hooks).

I agree that it's compelling to envision a future where a single "public" model being able to serve multiple downstream dbt consumers, with differential access to rows/columns, all configurable natively from within dbt.

jameszhenginsta · 2023-03-16T17:43:23Z

jameszhenginsta
Mar 16, 2023

Hi @jtcohen6 and dbt team,
I am happy to see the work on this project.
Does this project assume that all teams will be materializing in the same schema? We have a use case where each team materializes in their own schema within the same environment. For example enabling cross lineage on an upstream dbt model in schema_a and a downstream dbt model in schema_b. I would imagine for the downstream the interface would be something like

dbt run --select downstream_model --target prod --upstream-target(core, prod)

0 replies

Better mechanisms for cross-project lineage #5244

jtcohen6 May 13, 2022 Maintainer

Strong premises, loosely stated

What is it?

What might it look like?

Considerations

Replies: 18 comments · 33 replies

jtcohen6 May 19, 2022 Maintainer Author

jthandy Jul 11, 2022 Maintainer

jtcohen6 Aug 31, 2022 Maintainer Author

jtcohen6 Sep 6, 2022 Maintainer Author

Part 1: Old Problems with New Paradigms

The Problems

New Paradigms

Use Cases/ Who is this for?

jtcohen6 Sep 8, 2022 Maintainer Author

Access to source code

Data contracts

jtcohen6 Sep 8, 2022 Maintainer Author

jtcohen6 Sep 8, 2022 Maintainer Author

christineberger Sep 8, 2022 Collaborator

Setup

Upgrading contracts

Invoking upstream contract models

Limiting the use of models

christineberger Sep 9, 2022 Collaborator

Context and problem

Clarifying question

My proposed solution

belasobral93 Nov 8, 2022 Collaborator

jtcohen6 Jan 6, 2023 Maintainer Author

jtcohen6
May 13, 2022
Maintainer

Replies: 18 comments 33 replies

jtcohen6
May 19, 2022
Maintainer Author

jthandy
Jul 11, 2022
Maintainer

jtcohen6 Aug 31, 2022
Maintainer Author

jtcohen6 Sep 6, 2022
Maintainer Author

jtcohen6 Sep 8, 2022
Maintainer Author

jtcohen6 Sep 8, 2022
Maintainer Author

jtcohen6 Sep 8, 2022
Maintainer Author

christineberger
Sep 8, 2022
Collaborator

christineberger Sep 9, 2022
Collaborator

belasobral93
Nov 8, 2022
Collaborator

jtcohen6 Jan 6, 2023
Maintainer Author