Slow response times interacting with data lineage chart on large projects #170

Closed
1 of 5 tasks
vogt4nick opened this issue Jan 21, 2021 · 22 comments
@vogt4nick

Describe the bug

Users open the data lineage graph by clicking the button at the bottom-right corner of the page. It takes 2-3 seconds for the graph to load. The issue persists beyond the initial load, too: most interactions take 2-3 seconds to complete when several nodes are selected in the lineage graph. Response times are also slow when using selectors; the page becomes briefly unresponsive and keystrokes don't appear in the text field immediately.

Steps To Reproduce

Serve the data catalog for a large dbt project with relatively large manifest.json and catalog.json files; in my example, 300+ models and 1800+ tests generate a 6.2 MB manifest.json and a 1.2 MB catalog.json.

Click the "data lineage chart" button on the bottom-right corner of the page.

See profiling output below for benchmark.
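To check whether a project is in the same size range as the one described above, a small Node.js script can report the manifest's size and node counts. This is a minimal sketch, assuming the manifest has a top-level `nodes` object whose values carry a `resource_type` field; the function names are hypothetical.

```javascript
// Sketch: summarize a dbt manifest by resource type.
// Assumption: `manifest.nodes` maps unique ids to objects with a `resource_type` field.
const fs = require("node:fs");

function countByResourceType(manifest) {
  const counts = {};
  for (const node of Object.values(manifest.nodes ?? {})) {
    const type = node.resource_type ?? "unknown";
    counts[type] = (counts[type] ?? 0) + 1;
  }
  return counts;
}

// Reads the file at `path` and returns its size in bytes plus per-type counts.
function summarizeManifestFile(path) {
  const raw = fs.readFileSync(path);
  const manifest = JSON.parse(raw.toString("utf8"));
  return { bytes: raw.length, counts: countByResourceType(manifest) };
}
```

For example, `summarizeManifestFile("target/manifest.json")` would report on a manifest at that path, assuming the file exists there.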

Expected behavior

The data lineage chart should load in under 500 ms (or some other arbitrary threshold determined by users' tolerance).

Screenshots and log output

The "View Lineage Chart" button:

image

My profiling output:

image

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.0
   latest version: 0.18.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 0.18.0
  - snowflake: 0.18.0
  - redshift: 0.18.0
  - postgres: 0.18.0

The operating system you're using:

The data catalog is served with the base Docker image library/nginx:1.19.0-alpine.

The documentation is generated with the base Docker image library/python:3.7.7-slim-buster.

The output of python --version: 3.7.7

@vogt4nick
Author

Profiling highlighted that garbage collection events make up most of the interaction time. I believe (please feel free to correct me if you think I'm wrong) that the JavaScript is loading every node from disk and disposing of the buffer separately. There are over 300 nodes in the graph, so such behavior would cause a slowdown without sufficient threads to process everything asynchronously.

I'm not proficient with JavaScript, but my coworkers and I identified what we think is a relevant StackOverflow post, where user Abe shared:

Something to consider is that readFile buffers the entire file! This will cause the huge memory bloat. Better alternative is to implement fs.createReadStream() which will only buffer the part of the file you're currently reading. Unfortunately, implementing that solution may require a full rewrite of your code as it returns fs.ReadStream which won't behave the way you're currently handling files Checkout this link and read the bottom of the section to see what I'm referencing

The JS in index.html is very large and looks obfuscated to my eyes. I believe the fix will require knowledge from the dbt core team.

@drewbanin
Contributor

big + 1 on this one! The website loads the manifest.json file over the network, so I don't believe that any fs/streaming solution is going to be appropriate here. I do think that the way we're loading/creating nodes for the DAG is synchronous and "blocking" in the frontend, and it makes the website pretty unusable until the entire DAG is loaded!

I think we should:

  1. Relocate this issue to the dbt-docs repo
  2. Experiment with different approaches that, e.g., let us very quickly render the chrome/nav for the website (+ search, for instance) and then build out the DAG viz more asynchronously and progressively

@jtcohen6 you buy it?
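
One way to picture the "asynchronous and progressive" idea in point 2 is to build graph elements in small batches and yield to the event loop between batches, so the page can respond to input while the DAG fills in. The following is only a sketch under that assumption, not the dbt-docs implementation; `buildGraphInChunks`, `buildElement`, and `onBatch` are hypothetical names.

```javascript
// Sketch: process nodes in batches and yield between batches.
// All names are hypothetical; this is not taken from the dbt-docs codebase.
function yieldToEventLoop() {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

async function buildGraphInChunks(nodes, buildElement, onBatch, chunkSize = 50) {
  const elements = [];
  for (let start = 0; start < nodes.length; start += chunkSize) {
    // Convert one slice of nodes into graph elements.
    const batch = nodes.slice(start, start + chunkSize).map(buildElement);
    elements.push(...batch);
    onBatch(batch); // e.g. hand this batch to the visualization
    await yieldToEventLoop(); // let pending input and paint work run
  }
  return elements;
}
```

The total work is unchanged; the difference is that no single task blocks the main thread for the whole build, and `chunkSize` trades responsiveness against total build time.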

@jtcohen6
Contributor

@drewbanin 100%

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Jan 21, 2021
@kevnoo

kevnoo commented Jun 4, 2021

I'm not sure where this issue currently stands but @vogt4nick, @drewbanin and @jtcohen6 - we (kraftheinz engineering) are just about wrapped up with converting dbt docs over to React. One of the first things we did was compress the manifest.json file during the build, which solved all of the performance issues. For context, we currently have over 115 projects and the site loads in less than 5 seconds after that change.

We will have a repo to share with everyone once we are done packaging it up for public consumption.
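
The comment above does not say how the manifest was compressed. As one illustration of a build-time step, the sketch below strips insignificant whitespace and gzips the result with Node's built-in `zlib`; the choice of gzip and the function names are assumptions.

```javascript
// Sketch: compact and gzip a manifest at build time.
// Assumption: gzip is acceptable; the thread does not state which method was used.
const zlib = require("node:zlib");

function compressManifest(manifestJsonText) {
  // Re-serializing drops any pretty-printing whitespace before compression.
  const compact = JSON.stringify(JSON.parse(manifestJsonText));
  return zlib.gzipSync(Buffer.from(compact, "utf8"));
}

function decompressManifest(gzipped) {
  return JSON.parse(zlib.gunzipSync(gzipped).toString("utf8"));
}
```

This reduces the number of bytes transferred; the decompressed JSON is the same data, so it does not by itself change how the graph is built once the manifest is loaded.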

@jtcohen6
Contributor

jtcohen6 commented Jun 6, 2021

@kevnoo Very cool!! I'd love to see, when it's ready to share

@drewbanin
Contributor

super cool and very exciting to hear @kevnoo! Anything we can help out with?

Also - wondering if you're thinking about upstreaming these changes. Did you rebuild the site from scratch, or does it look like a branch off of the existing docs site? Would love to hear more if you're able to share!

@kevnoo

kevnoo commented Jun 21, 2021

We are very close to publishing the code but would be open to chatting directly prior to (or after).

The site itself looks exactly the same (for the most part). We are using the manifest.json but do not use the catalog.json or run_results.json, because we have also integrated it directly with Snowflake via a backend API. We also had some issues with the catalog.json due to the way we have implemented DBT across all of our projects.

The biggest change was a massive revamp to the lineage graph - this is all done now using HTML Canvas.
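
The thread does not show the Canvas implementation. As a rough illustration of what drawing a lineage graph on a 2D canvas context can look like, here is a minimal sketch; the node and edge shapes, the layout coordinates, and the name `drawLineage` are all hypothetical.

```javascript
// Sketch: draw pre-positioned nodes and edges on a 2D canvas context.
// Hypothetical shapes: nodes are { id, x, y, label }, edges are { from, to }.
function drawLineage(ctx, nodes, edges, nodeWidth = 120, nodeHeight = 28) {
  const byId = new Map(nodes.map((n) => [n.id, n]));

  // Collect every edge into one path so the canvas strokes only once.
  ctx.beginPath();
  for (const { from, to } of edges) {
    const a = byId.get(from);
    const b = byId.get(to);
    if (!a || !b) continue; // skip edges that reference unknown nodes
    ctx.moveTo(a.x + nodeWidth, a.y + nodeHeight / 2);
    ctx.lineTo(b.x, b.y + nodeHeight / 2);
  }
  ctx.stroke();

  // Draw each node as an outlined box with its label.
  for (const n of nodes) {
    ctx.strokeRect(n.x, n.y, nodeWidth, nodeHeight);
    ctx.fillText(n.label, n.x + 6, n.y + nodeHeight / 2);
  }
}
```

In a browser, `ctx` would come from `document.querySelector("canvas").getContext("2d")`. Unlike an SVG or DOM-based graph, this keeps a single element on the page regardless of how many nodes are drawn.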

As for upstreaming the changes, there would be some work required to have it act just like the existing version. We have not gone through the process to allow it to be part of the dbt docs generate process - this is something we are definitely open to but would need some guidance on. And obviously there would need to be a version that does include the use of run_results and catalog. These are all things we agree with and would be open to helping out on but as of now can't prioritize it due to a lack of time.

@ajbosco

ajbosco commented Nov 8, 2021

@kevnoo any update on if/when you'll be publishing the React version of dbt docs?

@kevnoo

kevnoo commented Nov 10, 2021

Hey @ajbosco - sorry for the delay! It's ready. @AlexanderKutz is going to work on publishing it over the next couple days.

@Mr-Nobody99
Contributor

Hello @ajbosco and anyone else interested in checking it out, I have published the react version here

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 13, 2022
@saraleon1

+1 dbt Cloud user - the slow response time of the data lineage chart for their larger project made this feature relatively unusable. Any time they tried to move or modify the chart, it lags and takes a few seconds to load correctly again

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 16, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Nov 24, 2022
@StepienTomasz

StepienTomasz commented Mar 27, 2023

+1. The page crashes with our large DBT repo when interacting with the lineage chart.
Here are some numbers for reference.

Found 1042 models, 1071 tests, 0 snapshots, 0 analyses, 493 macros, 0 operations, 29 seed files, 650 sources, 10 exposures, 0 metrics

image

@sfc-gh-kumaurya

I am also facing the same issue as mentioned by @StepienTomasz while loading the lineage graphs for the "sources". The page becomes unresponsive and only loads after 3-4 minutes.

@jaredx435k2d0

Agreed it is very slow and causes a really poor user experience.

Would love to see this move forward! We're evangelizing dbt at the org and the docs are a large part of that.

@sungchun12
Contributor

FYI on what a full rewrite can look and feel like for dbt docs: https://dagster.io/blog/dbt-docs-on-react

@jtcohen6 jtcohen6 removed the triage label Aug 17, 2023
@nobgb

nobgb commented Oct 20, 2023

@sungchun12 absolutely loved the blog post!! Any info about lineage view and going to prod?
I'm familiar with so many projects that would benefit from it

@sungchun12
Contributor

@nobgb no updates as I believe official dbt docs development will go through dbt Cloud only. I'll let the dbt maintainers speak to it more!

@nobgb

nobgb commented Oct 30, 2023

I couldn't find any info about it in the dbt docs. Would you be so kind as to share it with me? I'd like to follow that.

@jtcohen6 jtcohen6 removed the triage label Nov 29, 2023
@rightx2

rightx2 commented May 20, 2024

https://dagster.io/blog/dbt-docs-on-react this is awesome except for one thing: the data lineage view is not implemented.
