Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statsd metrics dagrun.schedule_delay sends time in different format against the seconds specified in docs #33426

Closed
2 tasks done
Bowrna opened this issue Aug 16, 2023 · 13 comments · Fixed by #36404
Closed
2 tasks done
Assignees
Labels
kind:bug This is a clearly a bug kind:documentation

Comments

@Bowrna
Copy link
Contributor

Bowrna commented Aug 16, 2023

What do you see as an issue?

For the airflow statsd metric received at the statsd client, the metric dagrun.schedule_delay described in airflow docs says it sends the seconds of delay between the scheduled DagRun start date and the actual DagRun start date.
Screenshot 2023-08-16 at 10 47 30 AM

But in the code, it sends the difference of the two datetime and datetime datatype is sent to the statsd timer metrics. In the client end it is received as the float object.
Though the Airflow UI doesn't indicate any schedule delay, the metrics received has data like below

{'metric': 'airflow.dagrun.schedule_delay.test_spark', 'points': [[1692013326, [720730.214]]], 'type': 'timers', 'host': 'airflow'}

Here the metrics points received is float and it could be microsecond / millisecond value of the datetime object as there is no conversion to seconds before setting it in Statsd metrics.
Screenshot 2023-08-16 at 10 48 23 AM

Solving the problem

Either converting the datetime object to seconds / milliseconds will fix the issue. Or if we are setting the datetime object in the statsd timer having a clear docs on what to expect value and its metric at statsd client end would be useful

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@Bowrna Bowrna added kind:bug This is a clearly a bug kind:documentation needs-triage label for new issues that we didn't triage yet labels Aug 16, 2023
@potiuk potiuk removed the needs-triage label for new issues that we didn't triage yet label Aug 16, 2023
@potiuk
Copy link
Member

potiuk commented Aug 16, 2023

Feel free to fix it.

@potiuk
Copy link
Member

potiuk commented Aug 16, 2023

cc: @ferruzzi as tje all-round metrics expert to advise :)

@Bowrna
Copy link
Contributor Author

Bowrna commented Aug 16, 2023

thanks @potiuk . @ferruzzi i would like to know if the above way of fixing the issue by changing it to seconds is right or do you see a different way to solve this problem?

@ferruzzi
Copy link
Contributor

ferruzzi commented Aug 16, 2023

That's very interesting that this came up now. I was just having a chat yesterday about how some of the timers emit seconds and others emit milliseconds and we should probably standardize on one or the other.

As for this one, changing what is emitted is going to break someone but I agree that it is a bug which should be fixed. We currently have a number of metrics being emitted twice for various reasons, and adding a new metric is both easy and non-breaking. So my suggestion would be:

  1. Update the docs to indicate the right formatting, and indicate that it is deprecated.
  2. Add a new metric there which emits in milliseconds. (I'm team milliseconds; if you feel seconds is a better choice then feel free, just be sure to indicate which it is when you add it to the docs.)

Question for @potiuk : How would we go about actually deprecating a metric? We can't do the usual "print a warning" like we do in a method/class/module because that would print constantly.

@potiuk
Copy link
Member

potiuk commented Aug 19, 2023

Question for @potiuk : How would we go about actually deprecating a metric? We can't do the usual "print a warning" like we do in a method/class/module because that would print constantly.

I am not sure you can deprecate metric. The only idea I have is to provide new, consistent metrics and disable old ones by default, but allow to enable them selectively - by adding them to "enabled_deprecated_metrics" configuration list (but print deprecation warning in this case in log when they are emited).

Another option is that if we all agree that the old metrics is WRONG - we treat it as bugfix and change them (also allowing probably to turn them back to what they were before - but with emitting deprecation warnings). Might be worth to implement some small framework allowing both cases, anticipating future cleanups we might want to do .

@ferruzzi
Copy link
Contributor

That sounds like a good bit of initial work, but a solid plan if we want to start on a cleanup pass. Hmmm. I guess either way, that sounds like a separate task. I'll keep it in mind though.

For this issue, I'd say for now let's double-emit. Just add another metric which does it in milliseconds and add a comment in the code explaining why we did that. If we do a cleanup and standardizing pass, we can worry about "deprecating" the old one.

@Bowrna
Copy link
Contributor Author

Bowrna commented Sep 1, 2023

thanks @potiuk @ferruzzi i would like to join for the cleanup and standardizing work and will share information on that part. For this issue fix, i will double emit the said parameter with retaining one existing in code now and other one i will emit the milliseconds and add a fix for this issue.

@Bowrna
Copy link
Contributor Author

Bowrna commented Nov 2, 2023

This is the code I retrieved from the python StatsdClient and it shows that timer if we pass the timedelta converts it into milliseconds and push it.

With that said could I simply edit the documentation part alone saying its a milliseconds information that's being pushed to Statsd.
https://github.com/jsocol/pystatsd/blob/c687a4e6bcad1314088cf4934580a8bf46290a8a/statsd/client/base.py#L23C1-L34


    def timing(self, stat, delta, rate=1):
        """
        Send new timing information.


        `delta` can be either a number of milliseconds or a timedelta.
        """
        if isinstance(delta, timedelta):
            # Convert timedelta to number of milliseconds.
            delta = delta.total_seconds() * 1000.
        self._send_stat(stat, '%0.6f|ms' % delta, rate)

cc: @potiuk @ferruzzi

@potiuk
Copy link
Member

potiuk commented Nov 7, 2023

I am good for it - but I would have to analyze it in detail to be sure :)

@Bowrna
Copy link
Contributor Author

Bowrna commented Nov 10, 2023

sure @potiuk Please let me know if you need any details from me on this

@potiuk
Copy link
Member

potiuk commented Nov 12, 2023

Yeah. Start with the docs. I think we might want to do more of a consistency push when we add traces and then we can clean it up and make some consistent approach.

@Bowrna
Copy link
Contributor Author

Bowrna commented Dec 7, 2023

hi @potiuk did you get a chance to analyze this part in detail? I will push changes in doc saying it is milliseconds info that statsd client pushes

@potiuk
Copy link
Member

potiuk commented Dec 10, 2023

Go ahead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug This is a clearly a bug kind:documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants