Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dag task scheduled and queued duration metrics #37936

Merged
merged 5 commits into from
Jun 14, 2024

Conversation

htpawel
Copy link
Contributor

@htpawel htpawel commented Mar 6, 2024


From statsd documentation (calling-timing-manually):


# Pass milliseconds directly

start = time.time()
time.sleep(3)
# You must convert to milliseconds:
dt = int((time.time() - start) * 1000)
statsd.timing('slept', dt)

dt int time must be in milliseconds rather than seconds. You can check other examples in airflow code also. Then statsd lib converts it to seconds before exporting metric.

Copy link

boring-cyborg bot commented Mar 6, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@potiuk
Copy link
Member

potiuk commented Mar 10, 2024

cc: @ferruzzi @Bowrna -> I remember there were discussions about it in the past . Did we decide to leave it in s for back-compatibility ?

Copy link
Collaborator

@dirrao dirrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see a few metrics are measured in seconds. At some point of time we have to do unification.

@htpawel
Copy link
Contributor Author

htpawel commented Mar 11, 2024

@potiuk @dirrao
I searched entire airflow 2.7.1 code for Stats.timing and except this place only we are providing delta correctly. Only here we are calling .total_seconds() on delta which is incorrect. Current metrics documentation states that those two metrics are in seconds, but they are not. Right now to have useful data (in seconds) you need to multiply metric value by 1000 - which is obvious bug and mismatch between documentation and reality and should be fixed asap imo.

@dirrao
Copy link
Collaborator

dirrao commented Mar 11, 2024

@potiuk @dirrao
I searched entire airflow 2.7.1 code for Stats.timing and except this place only we are providing delta correctly. Only here we are calling .total_seconds() on delta which is incorrect. Current metrics documentation states that those two metrics are in seconds, but they are not. Right now to have useful data (in seconds) you need to multiply metric value by 1000 - which is obvious bug and mismatch between documentation and reality and should be fixed asap imo.

@htpawel,
I have checked the documentation. If so, then documentation needs to updated.

@htpawel
Copy link
Contributor Author

htpawel commented Mar 11, 2024

@dirrao
Nope, documentation is correct I believe, we want it to be seconds, but right now they are not seconds in result because of above bug

@ferruzzi
Copy link
Contributor

This would break the dashboard of any user currently monitoring that metric and their timers will suddenly show 1000x longer durations, right?

@htpawel
Copy link
Contributor Author

htpawel commented Mar 12, 2024

@ferruzzi
If someone is using it right now and applied temporary workaround (added *1000 to metric value) with TODO comment like we did - yes, it might break their dashboard. But I don't think it's the reason to not fix this :D First of all, documentation says it is seconds as result, so documentation is lying to us right now. Second, it will be in release notes so if someone is updating airflow without reading them - it's their fault. Third, it's new metric so I doubt many people are using it. Some of them might find its result useless (like I did at the beginning) and drop, some of them might just applied it and be happy they don't have any queues because they believe in documentation that those are seconds. In general it does not look good if you don't want to fix your code to work properly just because people might have already made ugly workarounds themselves to fix your bug IMHO.

@ferruzzi
Copy link
Contributor

I never said don't fix it. It's just a matter of if we call it a bugfix and fix it now, creating a breaking change and sending users scrambling to figure out why their queues are suddenly 1000 times longer, or if we have to go through the deprecation process so users are informed.

Since this is documented as seconds, I think we can call it a bugfix and just do it, but it's still something that needs to be decided.

@Bowrna
Copy link
Contributor

Bowrna commented Mar 22, 2024

cc: @ferruzzi @Bowrna -> I remember there were discussions about it in the past . Did we decide to leave it in s for back-compatibility ?

timedelta object statsd handles internally by converting to milliseconds. But there are some case where we explicitly pass the seconds instead of timedelta object and it is stored in statsd as seconds. so I added a fix with different key name where the newly added key name persists the data in milliseconds while the older key stays for backward compatibility with a deprecation message.

@potiuk
Copy link
Member

potiuk commented Mar 26, 2024

Hey @vandonr - since you are the original author of #30612 and likely will be easier to get all the context back - maybe you can comment on that one? Is that the right fix ? Or maybe @ferruzzi @o-nikolas - since you were around when it was implemented - is there any reason those durations should be kept as they are (and maybe there is a constructive proposal how to handle that case. It looks like we have somewhat long lasting issues that those new metrics introduced in #30612 caused and possibly we should find a way how to address it before 2.9.0 rc2 is out ?

@potiuk potiuk added this to the Airflow 2.9.0 milestone Mar 26, 2024
@vandonr
Copy link
Contributor

vandonr commented Mar 27, 2024

I don't have anything against converting this metric to milliseconds, I believe I wasn't aware of that statsd recommendation when I wrote that code.

However, there are plenty of other timers that are emitted in seconds in airflow, and I think if we make a migration effort, we might as well migrate all at once rather than little by little.

We must be aware of the high impact this can have for users as well: if they have threshold alerts on those metrics, they will certainly ring when they see the metric go x1000.

@htpawel
Copy link
Contributor Author

htpawel commented Mar 27, 2024

@vandonr You probably didn't read the topic fully or misunderstood. Airflow indeed emits most of metrics (or all of them) in seconds, and those two should also be emitted in seconds like you wanted to achieve and like it is stated in documentation and like I also want it to be (and everyone I suppose). But unfortunately they are NOT right now. It is because you need to pass milliseconds or delta time object to Statsd timing (then it will emit metric in seconds). But you are passing seconds which is incorrect. Check all other places in Airflow code - delta time object is passed. This is obvious bug which we need to fix with above bugfix.

Copy link
Contributor

@o-nikolas o-nikolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than adding a unittest to ensure this doesn't happen again. I'm happy to approve the change.

I think to me we can safely call this a bugfix and not worry about back-compat (https://xkcd.com/1172/ etc etc).

@Bowrna
Copy link
Contributor

Bowrna commented Mar 28, 2024

@vandonr You probably didn't read the topic fully or misunderstood. Airflow indeed emits most of metrics (or all of them) in seconds, and those two should also be emitted in seconds like you wanted to achieve and like it is stated in documentation and like I also want it to be (and everyone I suppose). But unfortunately they are NOT right now. It is because you need to pass milliseconds or delta time object to Statsd timing (then it will emit metric in seconds). But you are passing seconds which is incorrect. Check all other places in Airflow code - delta time object is passed. This is obvious bug which we need to fix with above bugfix.

@htpawel #33426 (comment)

Please check here. Statsd timing internally sends the stat in milliseconds. I am sharing this here as you told Statsd timing emit metric in seconds. Does statsd sets the data in milliseconds but emits in seconds?

@htpawel
Copy link
Contributor Author

htpawel commented Mar 28, 2024

Ok now I am not 100% sure it emits it in seconds (it will require further investigation), but anyway it is 100% clear that it expects time delta or milliseconds as input, not seconds (it is even said in the code you sent in comment), to work properly.

@htpawel
Copy link
Contributor Author

htpawel commented Mar 28, 2024

Ok, I did further investigation and now I know everything :D
Statsd by convention should always emit milliseconds. (See ms in their code: self._send_stat(stat, '%0.6f|ms' % delta, rate)). And that is why they are requesting time delta object or value in milliseconds: (in their code: delta can be either a number of milliseconds or a timedelta.). But later you need to do something with those metrics, so we (and probably most of the people) use statsd-exporter from prometheus. And in statsd-exporter documentation you can read Timers will be accepted with the ms statsd type. Statsd timer data is transmitted in milliseconds, while Prometheus expects the unit to be seconds. The exporter converts all timer observations to seconds.. So that is why I thought it is emitted in seconds, I was just checking statsd-exporter /metrics endpoint. So it turns out that not this, but other metrics are wrong in documentation, eg. dagrun.schedule_delay.<dag_id> as it passes time delta to Stats.timing so emits milliseconds but airflow documentation says those are seconds. And above two from this PR are indeed emitted in seconds, but obviously it is not correct either as those should be milliseconds and are emitted with ms suffix. So there are more things f*cked up generally. Timing/Timer should always receive milliseconds value or time delta object and it emits milliseconds, which can be later consumed by statsd-exporter and exported in seconds for prometheus to scrape them.

@potiuk
Copy link
Member

potiuk commented Apr 2, 2024

Sounds like a breaking bug-fix to me . I.e. yes - we know it will be breaking, but it was broken in the first place anyway. @ferruzzi - WDYT? I'd merge it and add significant note about it

@tanvn
Copy link
Contributor

tanvn commented Apr 3, 2024

I has been using Airflow with statsd-exporter for quite a long time and to be honest, while I found the document about the timers sometimes inconsistent, the values exported by Statsd which we are using look good to me (except task scheduled and queued duration metrics as I have not used them yet)
So +1 for this PR and a fix in the document 🙇

@HTRafal
Copy link

HTRafal commented Apr 4, 2024

So dag.<dag_id>.<task_id>.queued_duration documented as Seconds a task spends in the Queued state, before being Running ends up exported to Prometheus as a value in kiloseconds :D. You have to multiply it by 1000 to get a value in seconds.

@ferruzzi
Copy link
Contributor

ferruzzi commented Apr 4, 2024

KILOSECONDS? Why the heck used KILOSECONDS for anything?

It looks like consensus is that we don't have to deal with deprecation, so that's great.

@potiuk
Copy link
Member

potiuk commented Apr 4, 2024

It looks like consensus is that we don't have to deal with deprecation, so that's great.

Agree.

@ferruzzi
Copy link
Contributor

ferruzzi commented Apr 4, 2024

@Bowrna @HTRafal @tanvn - Does this proposed change look like it will address your comments? If you all confirm, I'll approve and we can get it merged. The code looks fin to me, just want to confirm that it will fix what you are seeing.

@htpawel - Can you think of some kind of unit test on (Stats.timer, satsd_logger.timer and/or otel_logger.timer) that asserts the expected output format? Maybe create a timer, sleep(1), and assert the duration is how we want it to look? I know this was a really small thing and it really dragged out, sorry about that.

@htpawel
Copy link
Contributor Author

htpawel commented Apr 22, 2024

@ferruzzi @o-nikolas
What test did you have in mind? I can't think of any to accomplish this:

Other than adding a unittest to ensure this doesn't happen again. I'm happy to approve the change.

Developers should use Stats.timer() context manager as much as they can, but there are scenarios like the one above when Stats.timing() is needed. Then developers should pass either timedelta object or milliseconds, but it is not possible to validate this input, you can't know if given number is milliseconds or seconds. That's why statsd third party module is not able to validate that also - it is in documentation.

@o-nikolas
Copy link
Contributor

Developers should use Stats.timer() context manager as much as they can, but there are scenarios like the one above when Stats.timing() is needed. Then developers should pass either timedelta object or milliseconds, but it is not possible to validate this input, you can't know if given number is milliseconds or seconds. That's why statsd third party module is not able to validate that also - it is in documentation.

ACK, running the workflow again after the latest merge from main

@tanvn
Copy link
Contributor

tanvn commented May 11, 2024

If we are not going to add any unit tests, I think it is acceptable to merge this PR now?

@ferruzzi
Copy link
Contributor

What test did you have in mind? I can't think of any to accomplish this

If you are trying to make sure the metric is emitted in seconds and not milliseconds, maybe start a timer, sleep(1), and assert that Stats.timer() was called with a value of 1 and not 1000?

@htpawel
Copy link
Contributor Author

htpawel commented May 15, 2024

What test did you have in mind? I can't think of any to accomplish this

If you are trying to make sure the metric is emitted in seconds and not milliseconds, maybe start a timer, sleep(1), and assert that Stats.timer() was called with a value of 1 and not 1000?

I've already made sure that metric is emitted in milliseconds (if you use it correctly, so pass milliseconds or delta) but if someone will pass seconds it will emit seconds obviously (with ms suffix..) and there is no test / mechanism to prevents this automatically in any way.. Developers just must know that Statsd expects milliseconds or delta object only, that's convention.

@ferruzzi
Copy link
Contributor

Developers just must know that Statsd expects milliseconds or delta object only, that's convention.

That's horrible. So we don't have any way of knowing, catching, or preventing this from happening again? Alright. I'll approve it as-is then.

@eladkal
Copy link
Contributor

eladkal commented Jun 3, 2024

Sounds like a breaking bug-fix to me .

If this is the case we probably should add a newsfragmant that warn/explain users
@htpawel can you handle it? it's a simple as adding 37936.significant.rst file to
https://github.com/apache/airflow/tree/main/newsfragments

@htpawel
Copy link
Contributor Author

htpawel commented Jun 5, 2024

@eladkal @ferruzzi done

@htpawel htpawel force-pushed the fix-statsd-metrics branch 5 times, most recently from b5de713 to 0ec39ad Compare June 10, 2024 11:17
@htpawel
Copy link
Contributor Author

htpawel commented Jun 13, 2024

@ferruzzi who has write access and can merge this?

@eladkal eladkal modified the milestones: Airflow 2.10.0, Airflow 2.9.3 Jun 13, 2024
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Jun 13, 2024
@htpawel
Copy link
Contributor Author

htpawel commented Jun 13, 2024

@eladkal seems like once again there is some flaky test failing, not related to my change. Could someone override its result or rerun only it or fix? Otherwise we will never merge such a simple fix :/

@potiuk
Copy link
Member

potiuk commented Jun 14, 2024

Ok. Let's treat it as a bugfix and merge.

@potiuk potiuk merged commit bffb7b0 into apache:main Jun 14, 2024
51 checks passed
Copy link

boring-cyborg bot commented Jun 14, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

jannisko pushed a commit to jannisko/airflow that referenced this pull request Jun 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet