Description
dagrun.first_task_scheduling_delay measures data_interval_end → first_start_date, which conflates two things: (a) scheduler latency to enqueue the first task, and (b) executor latency to pick the task up. Splitting these helps locate where time is being spent when a DAG run starts late.
The executor-pickup portion (queued_at → first_start_date) has no metric today.
Use case / motivation
When the first task of a DAG run starts late, ops want to know: was the scheduler slow to queue it, or was the executor slow to pick it up? Having one metric per phase answers that directly.
Proposal
Add dagrun.first_task_start_delay, computed as first_start_date - queued_at when the DAG run completes. Tag by dag_id and run_type to match the existing tag shape on first_task_scheduling_delay.
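To make the arithmetic concrete, here is a rough sketch of the computation. It is not the actual Airflow code: RunTimestamps, first_task_start_delay() and metric_tags() are hypothetical names invented for illustration, and skipping missing or negative values is an assumption about how edge cases might be handled, not settled behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RunTimestamps:
    """Hypothetical stand-in for the DAG run fields the metric needs."""

    dag_id: str
    run_type: str
    queued_at: datetime | None
    first_start_date: datetime | None  # start of the earliest task in the run


def first_task_start_delay(run: RunTimestamps) -> timedelta | None:
    """Return first_start_date - queued_at, or None if it cannot be computed."""
    if run.queued_at is None or run.first_start_date is None:
        return None
    delay = run.first_start_date - run.queued_at
    # Assumption: a negative value points at clock skew or bad data,
    # so the sketch skips it rather than emitting a misleading metric.
    if delay < timedelta(0):
        return None
    return delay


def metric_tags(run: RunTimestamps) -> dict[str, str]:
    """Tags proposed in this issue: dag_id and run_type."""
    return {"dag_id": run.dag_id, "run_type": run.run_type}


example = RunTimestamps(
    dag_id="example_dag",
    run_type="scheduled",
    queued_at=datetime(2024, 1, 1, 0, 0, 5, tzinfo=timezone.utc),
    first_start_date=datetime(2024, 1, 1, 0, 0, 12, tzinfo=timezone.utc),
)
delay = first_task_start_delay(example)
if delay is not None:
    print(delay.total_seconds(), metric_tags(example))
```

The call that actually emits the timing is left out on purpose, since it depends on how the existing first_task_scheduling_delay metric is wired up.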
I expect there's a "do we want N metrics or one with M dimensions" discussion, so I'm filing this as an issue first rather than opening a PR directly. That way the shape can settle before any code is written.
Are you willing to submit a PR?
Code of Conduct