Replies: 2 comments
-
Updated description with links to the fuller version of this concept and proof of work:
-
Thanks for such a thorough write-up here and within the series of posts, @timle2! This is a Big Idea 💡 that I'm going to promote to a Discussion.
-
Is this your first time submitting a feature request?
Describe the feature
dbt runs workflow orchestration internally without exposing much of what is happening to the user. There is currently no way to see an execution plan (i.e. which tasks of the DAG will run, and in what order) for a given dbt command. This information is only apparent during the run via console output, and after the run via the run command.
Why do we want the execution plan before a run happens?
There are many situations where having access to dbt's execution plan, before the run happens, is of value.
Having an execution plan before a job runs is valuable because it allows for better planning, validation, and optimization of the job execution.
Firstly, an execution plan outlines the tasks and their dependencies in a pipeline, providing a roadmap for job execution. This allows teams to plan their resources and schedule their jobs in advance, ensuring that there are no resource conflicts or bottlenecks during execution.
Secondly, an execution plan can be used to validate the pipeline and its individual tasks before running the job. By analyzing the execution plan, teams can identify any potential issues or errors that may arise during job execution, such as missing dependencies or circular references. This validation step helps to catch errors early on and avoid costly job failures or data corruption down the line.
Finally, an execution plan can be used to optimize job execution by identifying areas for improvement or potential bottlenecks in the pipeline.
While many of these scenarios are handled manually by visualizing the DAGs in docs, it's not simple to programmatically generate a comparable output for analysis before a run occurs.
Additionally, there are users who wish to export the dbt DAG to a different workflow orchestration platform and run each task individually there. There can be great value in pairing dbt's powerful capabilities for building and compiling complex SQL-based workflows with an external tool that offers an equally rich set of features for workflow orchestration. The goal is to leverage the strengths of both tools to create a more robust and flexible data transformation workflow that can be easily managed and maintained over time.
What about dbt list?
An obvious entry point for extracting the dbt DAG for a run would be the dbt list command. dbt list displays the resources in a project (models, tests, seeds, snapshots, sources, and so on), printing one entry per selected resource. It returns results based on a --models/--select input, but does not order them in the same way as they would run during execution. So dbt list does not output an execution plan, though it can be useful for seeing a summary of the resources a selection matches.
How dbt runs tasks based on the DAG
dbt converts the DAG for a given run into an internal graph_queue construct. It performs a topological sort to score each task, and then orders them into a queue. Each item in the queue is then run, and marked done on completion.
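To make the queueing idea concrete, here is a minimal sketch using Python's standard-library `graphlib` rather than dbt's own internals. The DAG, the model names, and the `execution_batches` helper are all hypothetical, and dbt's actual scoring logic may differ; the sketch only illustrates the "hand out ready nodes, mark them done, unblock dependents" loop.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG (not taken from dbt): each key maps to the set of
# nodes it depends on.
deps = {
    "model_a": set(),
    "model_b": {"model_a"},
    "model_c": {"model_a"},
    "model_d": {"model_b", "model_c"},
}


def execution_batches(deps):
    """Return lists of nodes that become runnable together, in order."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular references
    batches = []
    while sorter.is_active():
        # Nodes whose parents have all been marked done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        for node in ready:
            sorter.done(node)  # mark complete, unblocking dependents
    return batches


batches = execution_batches(deps)
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

Each printed batch is a group of nodes that could run in parallel once the previous batch has finished, which is the kind of information an execution plan would need to carry.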
Ideally we'd be able to get the tasks and order that exist in this queue in order to have an accurate understanding of what dbt will run.
For example, ideally we would be able to take a run command (seen here in dbt docs)
(Please note that in my example model_e is ephemeral)
and generate an execution_plan (without running any models!) like so:
(Please note that in my example model_e is ephemeral and so excluded from the plan)
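As a rough illustration of what generating such a plan could involve, the sketch below orders nodes from a manifest-like dictionary and drops ephemeral models. It is not the proposed feature and not the author's example: the `manifest` dictionary is a hypothetical, heavily trimmed stand-in for the `manifest.json` artifact dbt writes to the target directory, the `demo` project name is made up, and `execution_plan` is an illustrative helper. It also covers the whole project rather than applying any node selection.

```python
from graphlib import TopologicalSorter

# Hypothetical, heavily trimmed stand-in for target/manifest.json.
manifest = {
    "nodes": {
        "model.demo.model_a": {
            "config": {"materialized": "table"},
            "depends_on": {"nodes": []},
        },
        "model.demo.model_e": {
            "config": {"materialized": "ephemeral"},
            "depends_on": {"nodes": ["model.demo.model_a"]},
        },
        "model.demo.model_b": {
            "config": {"materialized": "view"},
            "depends_on": {"nodes": ["model.demo.model_e"]},
        },
    }
}


def execution_plan(manifest):
    """Order nodes parents-first, leaving ephemeral models out of the plan."""
    nodes = manifest["nodes"]
    # Keep only edges that point at entries in the "nodes" map
    # (sources, for example, are stored elsewhere in a real manifest).
    deps = {
        uid: {parent for parent in node["depends_on"]["nodes"] if parent in nodes}
        for uid, node in nodes.items()
    }
    ordered = TopologicalSorter(deps).static_order()
    # Ephemeral models still shape the ordering, but are never run themselves.
    return [
        uid for uid in ordered
        if nodes[uid]["config"]["materialized"] != "ephemeral"
    ]


plan = execution_plan(manifest)
print(plan)
```

The ephemeral model is kept in the graph while sorting so that its children are still ordered after its parents, and is filtered out only at the end.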
Describe alternatives you've considered
dbt list command
Who will this benefit?
Power users looking to:
For more, here is a series of articles I've written on this topic!
Are you interested in contributing this feature?
yes
Anything else?
No response