We are re-building dbt-docs for speed & scale #13080
Replies: 6 comments 10 replies
-
I'm commenting to advocate for a static option. We work in a very constrained environment (state department of health) and we don't have access to compute to run the docs server. We can host static sites via the forge (GitLab Pages in our case) which is what we're doing now on dbt-core. Thanks for your consideration!
-
Will this be open source? The current docs in dbt Core v1 are sluggish, since the site is a single massive file, and as far as I'm aware it is pretty much unmaintained. For customers who want schema documentation, we ended up generating it with SchemaSpy, which compiles into a static site split across multiple files. Not everything needs to be a single page!! It's really basic, but it's good enough. Context: where I work, we have a feature where we deliver tables built with dbt into customers' databases, synced via Fivetran. Having looked at OpenMetadata and seen how they structure tables, columns, etc. as a standardised schema, here is what I've been thinking lately:
-
My Criticism
I will be blunt: I am once again getting a lot of "trust me I know best" energy from this post and docs v2 in general. This post feels like an attempt to justify a decision, not a discussion. The previous site was an artifact produced by dbt that could be branded and modified. The new site is part of dbt and can only be modified if I fork the repo and compile dbt. That is a rather big regression in my opinion. Needing to have and maintain hardware that not only acts as a server but can also run dbt is a regression compared with the many ways to host a static website. Why are there features of docs v2 that are only useful if you have a way to combine all of your run data together? Yes, dbt Platform can do that, but there has been no guidance or support provided on how one would do that without dbt Platform. The features that require dbt Platform to work sit as unusable clutter in the user experience. It feels like friction has been added for the express purpose of driving people to the managed service of dbt Platform. That does not feel like being a good steward of an open source tool.
My Supportive Feedback
I think that the dbt Index data is very nice, but I feel like there are ways that it can be used with a static site. Can that be an option? Maybe bring back the
I have other feedback about the functioning of the site that I have already shared on the Slack channel and I will link that here.
-
@dtaniwaki I saw you opened a PR to add a new page to dbt-docs to display config information about a given node. Including it in this discussion to get feedback on what other community members think!
-
We're gonna host a community feedback / office hours session coming up in two weeks, and we'd love to have you join and hear your feedback! Wednesday, 24 June, 9am Pacific: dbt docs for v2 👆 Click here to register, and we'll email you a link to join.
-
I just wanted to mention that this sounds extremely exciting. Documentation is very important, and dbt docs v1 has been a great way of providing very good automated documentation in our company. We have a growing project (500+ models) where all these new features sound very exciting. What I want to bring to attention is that it is important to keep providing a static version of the documentation, even if it lacks some of the exciting features you are planning, which the customer can easily host on GitLab or GitHub Pages or wherever. I'm confident that most data teams won't have the resources, capacity or budget to get a server running to host the dbt docs. It would really be a pity if you spent all this great effort on dbt docs v2 just to have users not use it because they are forced to host it themselves. The same goes for getting users to migrate from v1 to v2. Btw, we cannot confirm the browser freezing or anything similar in our dbt docs, even with 50-100 nodes in the lineage.

-
Drew built the first version of `dbt-docs` in 2018, and how it’s worked! You run `dbt docs generate && dbt docs serve` on your command line, and dbt generates metadata about your project and an HTML site to explore it. “Auto-generated and easy-to-host data documentation” was one of the biggest reasons that teams have adopted dbt over the years, as well as motivating them to add `description` and other metadata in their version-controlled project code, much to the benefit of LLMs today.
At the same time, the vast majority of dbt developers are generating dbt code with the help of AI, and dbt projects are larger than ever before (mo’ data mo’ problems). For human members of data teams, project documentation is more than just a useful resource for visually inspecting the DAG and searching across business logic; it’s an essential way to see what their agents are up to.
But you and your agents liked adding to the docs so much that the original design simply didn’t scale. Docs worked by loading the entire `manifest.json` and `catalog.json` into the browser, then rendering the static website based on all the valuable metadata packed into those artifacts. When Drew first built it, nobody imagined a single manifest could be 50 MB, let alone a gigabyte. But that is exactly where many large dbt projects have ended up. We have heard from countless teams running dbt at scale that docs can become slow, unstable, and difficult to use as projects grow.
Today, we open sourced the Fusion runtime as Core v2.0, and that made two things clear:
`dbt-docs` for speed and scale.
Fusion has the performance characteristics needed for large projects, and it still produces the same rich manifest data people rely on. Beyond Fusion being a Rust-based runtime built for scale, another key difference is that docs no longer depend on loading massive `manifest.json` and `catalog.json` blobs in the browser. Instead, the same metadata is emitted as Parquet artifacts that are scalable, joinable, and analyzable, and the UI can query only what it needs.
That’s why, this time around, Grace and I built dbt docs for v2. And oh how we think it will work!
“This is the most fun I had in the last month” - Elias
Goodbye static site, hello server
I want to analyze the dependencies for my project with thousands of models, without crashing my browser or bloating my agent’s context window
dbt-docs was originally powered by raw project artifacts (aka giant JSON blobs like `manifest.json`), loaded directly into your browser by inlining those entire JSON blobs into the HTML payload. This was expedient, and made it easy to self-host, but it stopped working well once a project reached even a medium size.
In 2023, we solved the scaling problem for customers of dbt platform with dbt Catalog, featuring a new faster React frontend and powered by the dbt platform Discovery API on the backend. This was great for solving the immediate need of exploring big projects with speed, but there wasn’t a natural pathway from the small self-hosted thing to the large distributed solution.
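To make the inlining problem concrete, here is a rough sketch of how a single-file page grows with project size. The node fields, the `build_fake_manifest` helper, and the HTML template are all invented for illustration; they are not the real `manifest.json` schema or the actual dbt-docs template.

```python
import json


def build_fake_manifest(node_count: int) -> dict:
    """Build a toy manifest. The fields are illustrative, not the real schema."""
    nodes = {}
    for i in range(node_count):
        unique_id = f"model.demo.model_{i}"
        nodes[unique_id] = {
            "name": f"model_{i}",
            "description": "Example description " * 5,
            "raw_code": "select * from upstream " * 20,
            # Each model depends on the previous one, forming a simple chain.
            "depends_on": [f"model.demo.model_{i - 1}"] if i else [],
        }
    return {"nodes": nodes}


def inline_into_html(manifest: dict) -> str:
    """Embed the whole manifest in the page, as a single-file static site would."""
    blob = json.dumps(manifest)
    return f"<html><body><script>const MANIFEST = {blob};</script></body></html>"


def payload_bytes(node_count: int) -> int:
    """Size of the page a browser must download and parse before rendering."""
    html = inline_into_html(build_fake_manifest(node_count))
    return len(html.encode("utf-8"))


if __name__ == "__main__":
    for count in (100, 1_000, 10_000):
        megabytes = payload_bytes(count) / 1_000_000
        print(f"{count:>6} nodes -> {megabytes:.2f} MB")
```

The payload grows roughly linearly with the number of nodes, and the browser has to parse all of it even when the reader only wants one model's page.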
The next generation of dbt-docs must instead be powered by a real unified backend that stores project context and metadata in a way that’s easily consumable for browsers, agents, and humans alike.
It must also include richer (bigger) metadata, including column-level lineage, inferred column grains, and type-aware impact analysis powered by the native SQL comprehension in the dbt Fusion engine.
dbt-docs will look less like a static web page and more like a database and server. It should be easy to get started by running it on your own laptop, and then to scale up naturally to a stateful production deployment.
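As a rough illustration of what "more like a database" buys you, the sketch below answers a lineage question with a query instead of loading every node into memory. It uses an in-memory SQLite table purely as a stand-in; the `edges` table, its columns, and the helper names are assumptions made for this example, not the schema or storage engine that docs v2 actually uses.

```python
import sqlite3

# Hypothetical (parent, child) dependency edges for a tiny project.
EXAMPLE_EDGES = [
    ("raw_orders", "stg_orders"),
    ("raw_customers", "stg_customers"),
    ("stg_orders", "orders"),
    ("stg_customers", "orders"),
    ("orders", "revenue"),
    ("raw_events", "stg_events"),  # unrelated branch, never touched below
]


def build_index(edges: list[tuple[str, str]]) -> sqlite3.Connection:
    """Load dependency edges into an indexed in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (parent TEXT NOT NULL, child TEXT NOT NULL)")
    conn.execute("CREATE INDEX edges_by_child ON edges (child)")
    conn.executemany("INSERT INTO edges (parent, child) VALUES (?, ?)", edges)
    return conn


def upstream_of(conn: sqlite3.Connection, node: str) -> list[str]:
    """Return every ancestor of `node`, walking the graph inside the database."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestors(name) AS (
            SELECT parent FROM edges WHERE child = ?
            UNION  -- UNION (not UNION ALL) deduplicates, so shared parents appear once
            SELECT e.parent
            FROM edges AS e
            JOIN ancestors AS a ON e.child = a.name
        )
        SELECT name FROM ancestors ORDER BY name
        """,
        (node,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = build_index(EXAMPLE_EDGES)
    print(upstream_of(connection, "orders"))
```

Only the rows on the path to `orders` are read, so the client (a browser or an agent) receives a short list of names, not the whole project graph.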
Why a database? Three reasons:
Docs v2 gets better when it is connected
One important thing about docs v2: it still works locally, and it is still something you can self-host. That matters. A lot of people love dbt Docs precisely because it is generated from their project and easy to serve wherever they want.
But there is also a reality we should be honest about: the best documentation experience is not always possible from local artifacts alone.
Some context lives outside your repo. Some of it lives across projects. Some of it lives in BI tools downstream of dbt. Some of it lives in orchestration history, once dbt can see how your project is actually being run over time.
When docs v2 is connected to dbt Platform, we can start bringing that context into the experience. That unlocks things like:
We are going to build out these connections intentionally. Not because we want self-hosted docs to feel incomplete, but because we do not want “self-hosted” to mean “cut off from the best experience.” If your team wants to host your own catalog, great. If you also want to connect it to Platform and get richer context, that path should be easy too.
You may also see a few places in the product that encourage you to try the connected experience. We know that can feel like advertising space, so let me say the quiet part plainly: We believe the connected experience can be better, and we want people to be able to discover those additional capabilities, even if they haven’t seen a demo or read our roadmap.
A simple early example is column-level lineage. In v2, dbt-docs can visualize column lineage when Fusion’s SQL comprehension has produced the column lineage artifact locally. This feature is free, but it does require you to be on Fusion and logged in.
The goal is not to make docs less open. The goal is to make docs much more useful, whether you run it locally, connect it to Platform, or eventually decide the managed experience is the right fit for your team.
The fun part: try it out!
Here’s the big idea: this still feels like `dbt docs generate && dbt docs serve`.
`dbt parse|compile|run|build --write-index` == `dbt docs generate`. In addition to (and maybe someday instead of) generating metadata artifacts as giant JSON files, any command that uses `--write-index` will push your project metadata into scalable Parquet artifacts (on your laptop, a database you’re hosting yourself, or a data store hosted in dbt platform). One key update here is that different commands produce different types and amounts of metadata, all of which can be used to enrich the documentation experience. (Hint: use `--static-analysis strict` in Fusion to generate column-level lineage metadata.)
`dbt docs serve` → Load up the new, beautiful, and performant UI, served directly from the dbt Core v2.0 CLI.
Here’s what it looks like in action today:
models_tab.mov
macros_tab.mov
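For anyone scripting the two-step workflow described above (say, in a CI job), a small wrapper might look like the sketch below. The flags are the ones quoted in this post; since docs v2 is in alpha, they may change. The helper names are hypothetical, and the script only prints the commands so nothing runs by accident.

```python
import shlex

# Subcommands the post lists as accepting --write-index.
INDEX_SUBCOMMANDS = ("parse", "compile", "run", "build")


def write_index_command(subcommand: str = "build", column_lineage: bool = False) -> list[str]:
    """Build the argument list for the index-writing step (hypothetical helper)."""
    if subcommand not in INDEX_SUBCOMMANDS:
        raise ValueError(f"expected one of {INDEX_SUBCOMMANDS}, got {subcommand!r}")
    args = ["dbt", subcommand, "--write-index"]
    if column_lineage:
        # Per the hint in the post: strict static analysis in Fusion
        # produces column-level lineage metadata.
        args += ["--static-analysis", "strict"]
    return args


def serve_command() -> list[str]:
    """Build the argument list for serving the docs UI."""
    return ["dbt", "docs", "serve"]


if __name__ == "__main__":
    for args in (write_index_command("build", column_lineage=True), serve_command()):
        # Print only; pass `args` to subprocess.run(args, check=True) to execute.
        print(shlex.join(args))
```

Keeping the commands as argument lists avoids shell-quoting surprises if they are later handed to `subprocess.run`.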
We're in alpha right now, and there's more work ahead before this is ready for primetime. The biggest things we’re looking for feedback on:
Plus here’s a sneak peek at a new feature we’re thinking about that relates to question two… 🙂
query_tab_demo.mp4