-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create ERD from relationships in schema.yml files #84
Comments
We're thinking about this too. dbt docs are great, but a lot of the data in there isn't for the right downstream audience. If you have business users that want to know what to query and where, having a clear ERD can be more useful than saying "go read the dbt docs". I agree that this should just be done in the schema.yml files, probably using the ref syntax with some specific variables for PKs, FKs, etc. |
We've tried a lot of solutions for this, none of which quite fit and none of which supported the schema.yml spec directly. https://github.com/BurntSushi/erd, https://gojs.net/latest/index.html, https://pypi.org/project/ERAlchemy/, http://www.nomnoml.com/, https://dbdocs.io/ Right now, we have a functional tool that parses the schema.yml, specifically the relationships tests and columns to auto-generate an ERD I've never worked in the dbt codebase, and there is a fair amount of work to make this something that will work generally. I'm interested in seeing this through, but would need some help |
This is really, really cool! Since dbt-docs is already built out of artifacts produced by E.g. grab the to/from components of all jq '.nodes | .[] | select(.test_metadata.name=="relationships") | .test_metadata.kwargs' target/manifest.json ...
{
"to": "ref('snowplow_web_page_context')",
"field": "page_view_id",
"column_name": "page_view_id",
"model": "{{ ref('snowplow_web_events') }}"
}
... So my questions are:
|
@jtcohen6 there is no benefit to parsing schema.yml directly. We don't have any experience working in the dbt docs code, so we just took the path of least resistance. My preference would be to use the existing specification if possible. The column names, unique tests, and relationship tests provide the necessary metadata. If we are able to use the existing specification, then something like this would just work. Maybe we would need to add a toggle in dbt_project.yml to enable the feature, or something to that effect. However, I can see a lot of edge cases that my initial assumptions are not good for. We stick to Kimball and star schemas pretty heavily, but what about people who do a normalized model and don't have a visually appealing star schema? What about the use case to see the entire data model in one visual? |
My first ideas for solutions typically lean more explicit than implicit. But if we're able to parse out relevant joins just based on testing I think it would make sense. Like @aroder mentioned there are corner cases on that.
Stuff like that. I think there are some clear easy wins / iterations that can happen on this which can certainly be improved over time. We're just doing things in LucidChart right now which makes me sad 😆 https://about.gitlab.com/handbook/business-ops/data-team/platform/edw/#first-iteration-in-gitlab-dwh |
This is a fantastic idea and would certainly decrease the fragmentation of documentation. I think defining the explicit configuration would be preferrable, if I was to add an opinion to the mix, since you could then define the relationship style for the ERD connecting lines. I'm wondering if using the With regards to limiting columns in the tables, perhaps you could have max limit of 50 columns with a collapse/expand option to hide the excess columns interactively. Also, imagine being able to hover over the column in the ERD and seeing a tooltip of the docs description or a link to docs page. |
I want this feature as much as anyone, and having said that, I'm not so sure that the data in the schema.yml files provides the full information needed to do this. A couple of specific items that you need to get in order to build an ERD that are not present in schema.yml files:
|
To get around this, you could create some custom schema tests in a package that are reserved for things that you want to include in the ERD. They'd be custom tests that didn't really do anything interesting, but they could serve as a hook for helping decide what to pick up. This feels like it would be a nice balance between not having to define everything twice and still getting to a more explicit definition for the ERD, to address @tayloramurphy 's comment. |
I really like this. We could put an ERD right on the exposure's page in docs site.
We have access to column datatypes in |
We wanted to know if you have any plans to release capabilities for data modeling with GUI to reverse and forward engineer DDL. |
Hi we are also looking for this functionality! Would love to hear if there is a timeline on this. |
Also very interested in this functionality. While waiting for an official implementation, I created this PowerShell solution that parses |
@jtcohen6, I'm interested in trying to go about developing the functionality here for injecting Right now, I'm starting some initial prototyping with @aroder who has an early stage POC that generates the WIth that, do you have recommendations on how I might be able get started to either A) add in this existing script to the dbt core system or B) provide further guidance on what is needed to get started on this? From there, I'm happy to carry on to the next phase of development that would leverage the manifest.json file like you recommended earlier up in this post. Let me know what you think on next steps for this. |
@jtcohen6 @drewbanin (?) just wanted to follow-up on this and see if there are any insights on next steps for this. |
To the great dbt community - wondering if 2022 brings any update on this 😄 |
@joeyfezster I'm looking for collaborators to make this more robust. It's working for us but could use some work #84 (comment) |
@aroder we have some code (written for other purposes) that lets us load up the network file and comprehend the compiled graph. If you have the visualization side, we might be able to back it with data in the way that Jeremy suggested above. Is that the part that you're looking for help with? |
I like this open source package for the R programing language: |
I would solve adding a yml for tuning the docs generation with multiple options like so
|
Hey folks, I just published first version CLI for generating ERD-as-code in DBML from manifest.json, which can be easily used together in the generating dbt docs pipeline. https://github.com/datnguye/dbterd Check it out! |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
unstale |
Thanks @eitsupi! Perhaps not surprising given that I opened the ticket initially, but I still think having a solution built into dbt would be great. |
This would be a great feature to update in dbt |
This would be a fantastic addition to DBT |
A solution like this built into dbt would be awesome |
I would love to have this feature in dbt-docs. Will help to convince teams to move to DBT |
Yes, we would really appreciate the ability to generate an ERD based on our models - what a timesaver in being able to get everyone "on the same page" without having to manually update when the data model changes. Great idea! |
This would be amazing, crossing my fingers this gets delivered eventually |
For complicated table relationships, it's nice to have an ERD (i.e. schema diagram) to help understand how everything fits together.
Schema.yml files already have relationship tests built into them that show the relationship between tables, so those could be parsed to get the data needed for the "keys," which may not actually be keys in most warehouse systems.
A first implementation of this functionality might just look at the schema.yml file and get any columns listed there so that only one data source is needed. A fuller implementation would get all columns from the tables.
Other thoughts:
The text was updated successfully, but these errors were encountered: