Conversation

@mik-laj (Member) commented Nov 7, 2020

While I was working on generating automatic reference documentation, I introduced the labels parameter so that text could be generated lazily. When I took a closer look at it, I realized that it can also be used to generate valid tags for statsd metrics.
Closes: #11463

The name of the metric is not perfect, as it contains placeholders, but it is good enough to be found in the documentation, and we avoid maintaining two different sets of keys.



@boring-cyborg boring-cyborg bot added area:dev-tools area:docs area:Scheduler including HA (high availability) scheduler labels Nov 7, 2020
@mik-laj mik-laj marked this pull request as ready for review November 7, 2020 12:13
@mik-laj mik-laj requested a review from potiuk November 7, 2020 12:17
@mik-laj mik-laj requested a review from kaxil November 7, 2020 12:33
    # TODO: Remove for Airflow 2.0
    filename = file_stat.file.split('/')[-1].replace('.py', '')
    Stats.timing(f'dag.loading-duration.{filename}', file_stat.duration)
    Stats.timing('dag.loading-duration.{filename}', file_stat.duration, labels={"filename": filename})
Member
Should we also add (on top of the existing ones) new metrics that have just a "short" name, with the variables passed only as labels?

For example in this case:

    Stats.timing('dag.loading-duration', file_stat.duration, labels={"filename": filename})

I think those will be much easier to aggregate in some tools.

@potiuk (Member) Nov 7, 2020
Or maybe we should leave the old stats as they are and only add labels to the new ones? WDYT? We could even add some prefix or indication that those stats are the "labeled" ones, and distinguish them in the stats documentation.

Member Author
The only problem is that not every client supports tags, and some clients support tags differently from the official Datadog client. For this reason, I preferred not to add more keys, but to leave these decisions to the statsd client.
If the client supports tags, it will keep metrics in the best form for itself.

We currently have two clients:

  • Classic statsd client (without tags support), which in this case will get the metric dag.loading-duration.my-awesome-file.py without tags
  • Datadog statsd (with tags support), which in this case will get the metric dag.loading-duration.filename with the tag filename:my-awesome-file.py

So you can aggregate this data if you need it.

This key name is not ideal when the client supports tags, but we don't need to generate the metrics key twice. On the other hand, writing the available parameters into the metric name is not a bad idea; it can help us find the tags more easily.

It is also worth adding that Stackdriver stores labels differently: instead of tags, it uses a dictionary of plain key-value pairs.
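To make the split concrete, here is a rough, hypothetical sketch of the two behaviors described above (the class names and return values are illustrative only; the real dispatch lives in Airflow's statsd integration):

```python
class ClassicStatsdClient:
    """No tag support: labels get baked into the dotted metric name."""

    def timing(self, stat, value, labels=None):
        if labels:
            # dag.loading-duration.{filename} -> dag.loading-duration.my-awesome-file
            stat = stat.format(**labels)
        return stat, []


class DatadogStatsdClient:
    """Tag support: keep a stable name and pass the labels as tags."""

    def timing(self, stat, value, labels=None):
        tags = [f"{key}:{val}" for key, val in (labels or {}).items()]
        # Strip the placeholder braces so the name itself never varies
        return stat.replace('{', '').replace('}', ''), tags


labels = {"filename": "my-awesome-file"}
print(ClassicStatsdClient().timing('dag.loading-duration.{filename}', 1.5, labels=labels))
# ('dag.loading-duration.my-awesome-file', [])
print(DatadogStatsdClient().timing('dag.loading-duration.{filename}', 1.5, labels=labels))
# ('dag.loading-duration.filename', ['filename:my-awesome-file'])
```

Either way the caller emits one call with one labels dict; the client decides how to encode it.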

Member Author
I have added documentation that describes this behavior.

Contributor

Hi guys.
Usually, best practice is to have generic metric names and use tags for any variable parameters, since that makes it possible to do all the breakdowns, slicing and dicing, etc. in the monitoring tool (if it supports them, of course).
With that said, even if it is a weird name, I actually do not see a problem with having a metric name like dag.loading-duration.filename with tags=['filename=(unknown)'], if this is going to simplify implementation/support/usage.

However, there is one caveat. What if, some time later, we want to extend a metric with additional tags? E.g. for the metric operator_failures_{task_type} we might want to add one more tag like dag_id. Maybe not a very good example, but I am pretty sure that for some metrics we could include more information that would be helpful for building breakdowns.
Then we would actually have to change the name of the metric, which would be a breaking change, because all existing alerts/dashboards would stop working (or we would have to support two metrics from then on, operator_failures_{task_type} + operator_failures_{task_type}_{dag_id}, which is not ideal either).

So, as for me, omitting placeholders in names like operator_failures/dag.loading-duration/... and using tags would be best for clients which support tags and want to leverage them, unless we are sure that we will never add more tags to existing metrics, which may not be true.
In that case, having maybe two types of metrics, one without tags and the other with tags and without placeholders in the names, might be a good solution.

But maybe this is not an issue. What do you think about the fact that metrics can be extended with additional tags in the future?
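To illustrate the caveat above with a hypothetical sketch (not code from this PR): under the placeholder-in-name scheme, adding a tag forces the metric name itself to change:

```python
# Today: one placeholder baked into the metric name
old = 'operator_failures_{task_type}'.format(task_type='BashOperator')
print(old)  # operator_failures_BashOperator

# After adding a dag_id tag, the name itself changes, which breaks
# any dashboard or alert keyed on the old name:
new = 'operator_failures_{task_type}_{dag_id}'.format(
    task_type='BashOperator', dag_id='example_dag'
)
print(new)  # operator_failures_BashOperator_example_dag
```

A placeholder-free name like operator_failures would stay stable no matter how many tags are added.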

@mik-laj (Member Author) commented Nov 12, 2020

@Acehaidrey do you have any thoughts on this change?

@turbaszek turbaszek requested a review from potiuk November 18, 2020 18:13
@potiuk (Member) commented Nov 22, 2020

Some conflicts to resolve, but it's good for me.

    @_format_safe_stats_logger_args
    def incr(self, stat, count=1, rate=1, labels=None):
        """Increment stat"""
        del labels
Member

What's this for?

Member Author

This prevents an unused-variable lint error, and at the same time we avoid a poorly descriptive variable name like _.

Member

Is this actually needed? Because of the _format_safe_stats_logger_args decorator, this is never called with any value:

    def func_wrap(_self, stat, *args, labels=None, **kwargs):
        if labels:
            # Remove {} from stat key e.g. {class_name} => class_name
            stat = stat.format(**labels)
        return validate_stat(func)(_self, stat, *args, **kwargs)

labels is an explicit parameter, so it won't be in the kwargs passed to this function (func in that snippet).
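A minimal, self-contained sketch of that behavior (validate_stat is dropped here, and incr is simplified to return the stat name so the effect is visible; only the decorator logic comes from the snippet above):

```python
import functools


def _format_safe_stats_logger_args(func):
    """Format any labels into the stat name, then drop them before calling func."""

    @functools.wraps(func)
    def func_wrap(_self, stat, *args, labels=None, **kwargs):
        if labels:
            # Remove {} from stat key e.g. {task_type} => BashOperator
            stat = stat.format(**labels)
        return func(_self, stat, *args, **kwargs)

    return func_wrap


class DummyStatsLogger:
    @_format_safe_stats_logger_args
    def incr(self, stat, count=1, rate=1):
        # No labels parameter needed: the decorator has already consumed it
        return stat


logger = DummyStatsLogger()
print(logger.incr('operator_failures_{task_type}', labels={"task_type": "BashOperator"}))
# operator_failures_BashOperator
```

Because labels is declared explicitly on func_wrap, it never lands in **kwargs, so the wrapped incr compiles and runs fine without a labels parameter of its own.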

    @functools.wraps(func)
    def func_wrap(_self, stat, *args, labels=None, **kwargs):
        if labels:
            # Remove {} from stat key e.g. {class_name} => class_name
Member

Suggested change:

    -    # Remove {} from stat key e.g. {class_name} => class_name
    +    # Remove {} from stat key e.g. {class_name} => SchedulerJob

    description: str


    METRICS_LIST: List[Metric] = [
Member
I'm not sure of the purpose of this list, nor of the Metric class -- it appears it is never used for emitting metrics at runtime, so I'm not sure why we need it.

What was your thinking here, please?

@Acehaidrey (Contributor)

I honestly think this is great. I'm not sure how to capture this, as we use OpenTSDB and have an internal client for it, so I had to change all of these stat calls to match the same signature and use tags instead of the dotted a.b.c naming. This will make it easier to manage. It seems Graphite uses that specific notation to do aggregation, whereas other clients use tags for the same result.

@github-actions bot commented Mar 1, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Mar 1, 2021
    Stats.incr(f'operator_successes_{self.task.task_type}', 1, 1)
    task_type = self.task.task_type
    Stats.incr('operator_successes_{task_type}', 1, 1, labels={"task_type": task_type})
    Stats.incr('ti_successes')

Would it be possible to add the dag/task tags to ti_successes and ti_failures as well? I think this would be useful for aggregation, so we can ignore or watch specific DAGs for failures.

@github-actions github-actions bot closed this Mar 9, 2021
@williamBartos
Are there plans to still merge this?


Labels

area:dev-tools area:monitoring area:Scheduler including HA (high availability) scheduler stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tags rather than names in variable parts of the metrics

7 participants