Utilize tags for metrics sent to SafeDogStatsdLogger #8743

@williamBartos

Description

A recent PR enabled dogstatsd support for Airflow metrics: #7376. While this enables the use of dogstatsd, the code sending metrics to SafeDogStatsdLogger doesn't use tagging. Instead, it sends a unique, monolithic metric name that can't be aggregated across identifiers such as <dag_id>. This doesn't scale when someone wants to monitor metrics across multiple DAGs, because each metric sent by each DAG is unique, so the number of monitors grows with the number of DAGs.

One example is the set of timer metrics sent by a DagRun, such as dagrun.duration.failed.<dag_id>. When sent by the DagRun object, <dag_id> isn't a tag but part of the metric name itself: https://github.com/apache/airflow/blob/master/airflow/models/dagrun.py#L412-L420

What is the problem here?

By sending metrics to DataDog without tags, it becomes impossible to aggregate metrics across <dag_id> because each dagrun.duration.failed.<dag_id> sent by a DAG is completely unique to that <dag_id>.

If I have 20 DAGs in production and want to monitor dagrun.duration.failed.<dag_id>, that means I'll need 20 separate monitors!

But if <dag_id> is sent as a tag, a single monitor could be used and DataDog can group the metric by <dag_id>.
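
To make the difference concrete, here is a minimal sketch of what the two variants look like as DogStatsD datagrams. The helper `statsd_timing_line` and the DAG name `example_dag` are made up for illustration and are not part of Airflow or the datadog client. The `dag_id:<value>` key:value tag form is my assumption about how the tag would be named.

```python
def statsd_timing_line(metric, millis, tags=None):
    """Format a timing datagram as text (hypothetical helper, illustration only).

    Plain StatsD timers look like '<name>:<value>|ms'. DogStatsD extends
    this with an optional '|#tag1,tag2' suffix carrying the tags.
    """
    line = "{}:{}|ms".format(metric, millis)
    if tags:
        line += "|#" + ",".join(tags)
    return line


# Current behaviour: the DAG id is baked into the metric name,
# so every DAG produces a distinct metric.
untagged = statsd_timing_line("dagrun.duration.failed.example_dag", 1500)

# Proposed behaviour: one shared metric name, with the DAG id as a tag
# that the backend can group or filter by.
tagged = statsd_timing_line(
    "dagrun.duration.failed", 1500, tags=["dag_id:example_dag"]
)
```

With the second form, every DAG reports to the same metric name, which is what lets a single monitor cover all of them.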

Use case / motivation

The current way metrics are sent to DataDog isn't scalable, as it prevents a user from aggregating common metrics across unique tags.

Following the DagRun example given above, the information needed to send this metric as a tag is available. Given this line of code: https://github.com/apache/airflow/blob/master/airflow/models/dagrun.py#L418 and the accompanying function definition: https://github.com/apache/airflow/blob/master/airflow/stats.py#L172 we can modify the function call to send <dag_id> as a tag:

toy example:

duration = (self.end_date - self.start_date)
if self.state is State.SUCCESS:
    if isinstance(Stats, SafeDogStatsdLogger):
        Stats.timing('dagrun.duration.success', duration, tags=[self.dag_id])
    else:
        Stats.timing('dagrun.duration.success.{}'.format(self.dag_id), duration)

The preference here is probably not to do type checking before submitting the metric. I'm willing to discuss other solutions here or as part of a PR, and to implement the agreed-upon solution.
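
One possible way to avoid the type check, sketched below, is to let every logger accept an optional `tags` argument and decide for itself what to do with it. Everything in this snippet is hypothetical: `PlainStatsLogger`, `TaggedStatsLogger`, and `report_dagrun_duration` are stand-ins I made up to illustrate the idea, not the real Airflow classes, and the `dag_id:<value>` tag form is an assumption.

```python
from datetime import timedelta


class PlainStatsLogger:
    """Hypothetical stand-in for a backend without tag support."""

    def __init__(self):
        self.sent = []

    def timing(self, stat, dt, tags=None):
        # No tags available: fold each tag value back into the metric
        # name, which reproduces today's 'metric.<dag_id>' behaviour.
        suffix = "".join("." + tag.split(":", 1)[-1] for tag in (tags or []))
        self.sent.append((stat + suffix, dt, None))


class TaggedStatsLogger:
    """Hypothetical stand-in for a tag-aware (DogStatsD-like) backend."""

    def __init__(self):
        self.sent = []

    def timing(self, stat, dt, tags=None):
        # Tags are passed through unchanged alongside the shared name.
        self.sent.append((stat, dt, list(tags or [])))


def report_dagrun_duration(stats, dag_id, duration, succeeded):
    """Caller side: no isinstance check, same call for every backend."""
    state = "success" if succeeded else "failed"
    stats.timing(
        "dagrun.duration.{}".format(state),
        duration,
        tags=["dag_id:{}".format(dag_id)],
    )


plain, tagged = PlainStatsLogger(), TaggedStatsLogger()
for backend in (plain, tagged):
    report_dagrun_duration(backend, "example_dag", timedelta(seconds=90), True)
```

The trade-off is that the fallback naming logic moves into the tag-less logger, so existing StatsD users keep their current metric names while DataDog users get one aggregatable metric.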

Related Issues

This is the ticket that created the SafeDogStatsdLogger class: #7376
