SLA Rewrite - Deprecate Existing SLA Logic #21233
Replies: 3 comments 1 reply
-
Also tagging @jedcunningham @kaxil @ashb @ephraimbuddy and @dstandish from Astronomer to get more thoughts and feedback. Feel free to tag anyone important that I missed.
-
Absolutely agree with the rewrite. I think this is a great start to the discussion, and eventually it should end up as an AIP proposal on the devlist. It would also be good to let people on the devlist know that we are discussing it here (which I think is a much better place for technical details than the devlist, BTW).
-
Hello. Has anyone created a PR to address the issue of the sla_miss_callback not being called as soon as the SLA is missed? If so, which Airflow version has the fix? Thank you.
-
Topic of Discussion
I've heard a lot of buzz regarding rewriting SLA functionality in Airflow. I agree with this change: with the modern version of Airflow, SLAs are in serious need of a rework. There are many open issues with SLAs around the modern architecture of Airflow, like multiple emails being sent when using 2 schedulers, and other bugs that are present in the new version of Airflow. See also this link for another problem involving a DAGFileProcessorProcess timeout while processing a large number of SLAs at the same time.
There are also other feature requests for functionality that is not possible with the current implementation of SLAs. There is even a comment on that one where @eladkal talks about rewriting the SLA feature, which was part of the inspiration for this discussion, and @potiuk seems to agree.
Some of the SLA bugs exist because of features that were added to Airflow after SLAs were first implemented. These include the TaskFlow API and multiple HA schedulers in Airflow 2.
At a high level, I'd like to discuss which features we'd like to see in an SLA rewrite. Once the feature set is decided, we can discuss the flow of SLAs through the code and more concrete solution details. Would this become an AIP? I'm happy to learn about that process if so.
Features
Behavior and Definitions
timedelta
timedelta
Improvements over Existing SLAs
sla_miss_callback is called as soon as the SLA is missed, not when the Task or the DAG is completed.
SLAMisses as they happen and notifications and other external calls
DagFileProcessor process:
manage_slas
sla_miss_callback
sla_miss_callback notifies the user in a different way and emails will just duplicate this process. Issue description
SLAMiss is raised, and the process that eventually handles the callback and adding a new SLAMiss to the database
DagFileProcessor process from timing out and causing more issues with SLAs that are hard to debug.
Thoughts and discussion below please!
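To make the first improvement concrete (calling sla_miss_callback as soon as the SLA is missed rather than when the Task or DAG completes), here is a minimal sketch in plain Python. It is not Airflow code: TaskState, check_sla, and the idea of measuring the SLA from an expected start time are all hypothetical names and assumptions chosen for illustration. The in-memory "already reported" flag only hints at de-duplication; with multiple schedulers that check would presumably need to happen in the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class TaskState:
    """Hypothetical stand-in for a task instance; not an Airflow class."""
    task_id: str
    expected_start: datetime            # assumed reference point for the SLA
    sla: timedelta                      # allowed time, expressed as a timedelta
    finished_at: Optional[datetime] = None
    sla_miss_reported: bool = False     # in-memory guard against duplicate callbacks


def check_sla(
    task: TaskState,
    now: datetime,
    callback: Callable[[str, datetime], None],
) -> bool:
    """Fire the callback once, as soon as the deadline has passed.

    Returns True only on the call that reports the miss.
    """
    deadline = task.expected_start + task.sla

    if task.sla_miss_reported:
        return False  # already reported, do not notify again
    if task.finished_at is not None and task.finished_at <= deadline:
        return False  # task finished within its SLA
    if now <= deadline:
        return False  # deadline not reached yet, nothing to report

    # Deadline passed and the task is either still running or finished late:
    # report immediately instead of waiting for the task or DAG to complete.
    task.sla_miss_reported = True
    callback(task.task_id, deadline)
    return True
```

A periodic loop would call check_sla for every task with an SLA, passing the current time and the user's callback. The point of the sketch is only that the check depends on the clock, not on task completion.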