
[CT-94] Holistic tracking for perceived start-up performance #4625

Closed
jtcohen6 opened this issue Jan 26, 2022 · 5 comments · Fixed by #4912

@jtcohen6
Contributor

There are three time-consuming steps that contribute to perceived slowness at the start of dbt run:

  1. Parsing: Reading files → creating the internal manifest. Scales with the size of the project, but is helped by partial + static parsing, and already has fairly solid telemetry today.
  2. Graph compilation (called here): Scales with the number of nodes in a project, and is limited by memory and the networkx algorithms involved.
  3. Adapter cache construction (called here): Once we have a database connection, run metadata queries (or call metadata endpoints) to populate the cache, saving repeated queries by subsequent materializations. Scales with the number of objects in the database, and the number of unique namespaces (databases/schemas) that need caching. Speed also varies significantly across different databases; this is particularly slow on Spark/Databricks today ([CT-202] Workaround for some limitations due to list_relations_without_caching method dbt-spark#228).

Not all commands complete all of these steps. For instance, dbt ls and dbt run-operation need only step 1. But all of these steps must run, each time and in sequence, for "graph runnable" tasks (run, test, build, etc)—and they contribute to the perceived delay between submitting the command, and seeing the first node's query hit the database.

While we don't have an appetite right now for highly involved, top-to-bottom performance work, the sooner we can add telemetry, the better served we'll be when we do. Let's start by adding high-level telemetry for steps 2 and 3.
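For concreteness, here's a minimal sketch of what "high-level telemetry" for steps 2 and 3 could look like, assuming a hypothetical fire_timing_event helper (the real tracking module may expose a different API, and the commented call sites are only indicative):

```python
import time
from contextlib import contextmanager

# Hypothetical helper; dbt's actual tracking module may expose a different API.
def fire_timing_event(step: str, elapsed_seconds: float) -> None:
    print(f"[telemetry] {step}: {elapsed_seconds:.2f}s")

@contextmanager
def track_timing(step: str):
    """Time one start-up step and emit a single event when it finishes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        fire_timing_event(step, time.perf_counter() - start)

# At the two call sites (step 2 and step 3), wherever they live today:
# with track_timing("graph_compilation"):
#     compile_graph(manifest)
# with track_timing("adapter_cache_construction"):
#     populate_adapter_cache(adapter, manifest)
```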

@jtcohen6 jtcohen6 added this to the v1.1.0 milestone Jan 26, 2022
@github-actions github-actions bot changed the title Holistic tracking for perceived start-up performance [CT-94] Holistic tracking for perceived start-up performance Jan 26, 2022
@leahwicz
Contributor

leahwicz commented Feb 28, 2022

@jtcohen6 Let's refine this to get definitive exit goals and split the work up:

  • Are we trying to improve performance? Then profiling is the better tool.
  • Are we trying to verify that customers are hitting a certain SLA? Then metrics are the way to go.

@jtcohen6
Contributor Author

jtcohen6 commented Mar 2, 2022

Really appreciate the ask for clarification here!

Last year, we set an SLA of "5 seconds start-up time," but the performance work we targeted was narrowly oriented around parsing. I'd like to capture more telemetry so we can see how that SLA holds up in the real world: what's the total number of seconds between the very first step in main.py and the first node that actually starts running?

Today, we lack visibility into:

  • Adapter-specific performance across different databases, especially when those databases are operating at scale (thousands of objects). This is the thing I'm most interested in tracking, and I'm not sure we have good ways to profile it.
  • Non-standard DAG shapes where our graph compilation algorithms perform poorly (e.g. >100 tests for every model). This is something we could better address with profiling and a wider variety of performance-testing projects; see the sketch below.
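For the DAG-shape case, a throwaway profiling script along these lines (purely illustrative; it times plain networkx traversals, not dbt's actual compilation code) makes it easy to see how traversal cost grows as each model fans out into many tests:

```python
import time
import networkx as nx

def build_synthetic_dag(n_models: int, tests_per_model: int) -> nx.DiGraph:
    """A chain of models, each fanning out into many test nodes."""
    graph = nx.DiGraph()
    for i in range(n_models):
        model = f"model.{i}"
        if i > 0:
            graph.add_edge(f"model.{i - 1}", model)
        for t in range(tests_per_model):
            graph.add_edge(model, f"test.{i}.{t}")
    return graph

for tests in (1, 10, 100):
    g = build_synthetic_dag(n_models=200, tests_per_model=tests)
    start = time.perf_counter()
    # Transitive-closure-style traversal is roughly the kind of work
    # selection and compilation pay for on a real project graph.
    for node in g.nodes:
        nx.descendants(g, node)
    elapsed = time.perf_counter() - start
    print(f"{tests:>3} tests/model, {g.number_of_nodes():>6} nodes: {elapsed:.2f}s")
```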

@jtcohen6
Contributor Author

Another important input: the number of nodes in a project, versus the number of nodes actually selected. This feels like a prerequisite to really understanding the potential impact of heavier lifts we might make to reduce start-up time, such as #4688.
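If we track this, the event payload could carry both counts alongside the step timings. The field names and values below are purely illustrative, not an actual dbt event schema:

```python
from dataclasses import dataclass, asdict

# Illustrative payload shape only; field names are not dbt's actual event schema.
@dataclass
class StartupTimingEvent:
    graph_compilation_elapsed: float  # step 2, seconds
    adapter_cache_elapsed: float      # step 3, seconds
    total_nodes: int                  # every node in the manifest
    selected_nodes: int               # nodes that survive --select / --exclude

# Example values, purely for illustration:
event = StartupTimingEvent(
    graph_compilation_elapsed=1.8,
    adapter_cache_elapsed=12.4,
    total_nodes=3200,
    selected_nodes=40,
)
print(asdict(event))
```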

@iknox-fa
Contributor

@jtcohen6 Just to clarify the scope of work: this would boil down to adding telemetry for graph compilation and adapter cache construction. That's how we're estimating this, so please let us know if that doesn't sound right.

@jtcohen6
Contributor Author

@iknox-fa Correct!
