Stop SLA callbacks gazumping other callbacks and DOS'ing the DagProcessorManager queue #25147

Merged · 3 commits merged into apache:main on Jul 21, 2022
Conversation

@argibbs (Contributor) commented on Jul 19, 2022

Problem

When SLAs are enabled, DAG processing can grind to a halt. This manifests as updates to dag files being ignored: newly added dags do not appear, and changes to existing dags do not take effect.

The behaviour appears random - some dags will update, others not. The reason is that internally the DagProcessorManager maintains a queue of dags to parse and update. Dags get into this queue by a couple of mechanisms. The first and obvious one is the scanning of the file system for dag files.

However, dags can also get into this queue during evaluation of the dag's state by the scheduler (scheduler_job.py). Since these event-based callbacks presumably require more rapid reaction than a regular scan of the file system, they go to the front of the queue.

And this is how SLAs break the system: prior to this PR they are treated the same as other callbacks, i.e. they cause their file to go to the front of the queue. The problem is that SLAs are evaluated per dag file, but a single dag may have many tasks with SLAs. Thus the evaluation of a single DAG may generate many SLA callbacks.

These cause the affected file to go to the front of the queue. It's re-evaluated, and then the SLA events are fired again.

What this means in practice is that you will see the DagProcessorManager process a dag file with SLAs, move on to the next file in the queue, maybe even make it to 2 or 3 more dags ... and then more SLA callbacks arrive from the first dag and reset the queue. The DagProcessorManager never makes it all the way to the end of its queue.
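To make the feedback loop concrete, here is a minimal, self-contained simulation of the pre-fix behaviour. The queue, file names and helper below are purely illustrative - they are not the actual DagProcessorManager internals - but they reproduce the starvation pattern described above:

```python
from collections import deque

# Hypothetical model of the parsing queue before the fix: every callback,
# SLA ones included, promotes its file to the front of the queue.
queue = deque(["dag_with_slas.py", "dag_b.py", "dag_c.py", "dag_d.py"])

def jump_to_front(file_path: str) -> None:
    # Pre-fix behaviour: remove the file from wherever it is and re-add it
    # at the front of the queue.
    if file_path in queue:
        queue.remove(file_path)
    queue.appendleft(file_path)

arriving_callbacks = []
for step in range(6):
    # Deliver SLA callbacks emitted while the previous file was being parsed.
    for file_path in arriving_callbacks:
        jump_to_front(file_path)
    arriving_callbacks.clear()

    current = queue.popleft()
    print(f"step {step}: parsed {current}, queue: {list(queue)}")
    if current == "dag_with_slas.py":
        # Every task with an SLA in this dag produces another callback,
        # which pushes the same file back to the front on the next step.
        arriving_callbacks.append("dag_with_slas.py")
```

Running this shows the same file being parsed on every step while dag_b.py, dag_c.py and dag_d.py never reach the front; in a real deployment the callbacks arrive with some delay, so a few other files may get parsed before the queue resets, but the tail of the queue is still starved.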

Solution

It's pretty simple: the DagProcessorManager queue is altered such that SLA callbacks are only added if an equivalent one doesn't already exist (remember they're processed per-dag, but generated one per task-with-SLA), and when added they do not change the place of the dag file in the queue. If the file is not already in the queue, it is added at the back. A simplified sketch of this gating follows.
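The sketch below is only meant to illustrate the idea; the names and data structures (Callback, file_path_queue, add_callback_to_queue) are assumptions for the example, not the actual DagProcessorManager code:

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Callback:
    file_path: str
    dag_id: str
    is_sla: bool

file_path_queue: deque = deque()
pending_callbacks: dict[str, list[Callback]] = {}

def add_callback_to_queue(cb: Callback) -> None:
    pending = pending_callbacks.setdefault(cb.file_path, [])
    if cb.is_sla:
        # SLA callbacks are processed per dag but emitted per task-with-SLA,
        # so one queued callback per dag is enough: drop duplicates.
        if any(p.is_sla and p.dag_id == cb.dag_id for p in pending):
            return
        pending.append(cb)
        # Never promote the file for an SLA callback; only append it to the
        # back of the queue if it is not queued at all.
        if cb.file_path not in file_path_queue:
            file_path_queue.append(cb.file_path)
    else:
        # Other event-driven callbacks keep their fast path and move the
        # file to the front of the queue.
        pending.append(cb)
        if cb.file_path in file_path_queue:
            file_path_queue.remove(cb.file_path)
        file_path_queue.appendleft(cb.file_path)
```

With this gating in place, a burst of SLA callbacks for one dag collapses to at most one queued callback and one queue entry at the back, so the regular file-system scan and the other callbacks can drain the queue normally.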

Notes

This may feel a bit sticky-tape-and-string; you could argue that the SLACallbacks shouldn't be generated so rapidly. However, the only thing that knows the state of the queue is the DagProcessorManager, and it's unreasonable to expect the DagProcessors to throttle themselves without knowing whether such throttling is necessary.

To put it another way, more optimisations in the DagProcessors are possible, but having the queue gate the callbacks as they're added is necessary and sufficient to stop the SLAs spamming the queue.

This change should fix the problems initially reported in #15596 (although note that since that issue was first reported, subsequent changes have altered how the failure manifests - the DagProcessorManager no longer times out, it simply keeps processing the same few dags over and over).

Works on my machine (TM)

Obviously I found this problem because it was manifesting in my local install of airflow. I've patched my install with an equivalent change, and it's fixed the issue.

@boring-cyborg (bot) added the area:Scheduler (including HA (high availability) scheduler) label on Jul 19, 2022
@argibbs (Contributor, Author) commented on Jul 19, 2022

@potiuk you said to ping you when I had the PR ready :)

@potiuk (Member) left a comment:

I like the change - it does seem to solve the problem and its impact is minimal. @ashb - any comments here?

Small change requested for debug log.

@argibbs requested a review from potiuk on Jul 21, 2022 at 09:15
@potiuk (Member) left a comment:

I'd love others to take a look @ashb @uranusjr @mhenc ?

@argibbs (Contributor, Author) commented on Jul 21, 2022

NW, I'll wait for a second opinion.

(I need to tickle the build checks anyway to get the tests green.) Possibly a n00b question, but is there no way to re-run a failed check other than force-pushing the last commit again?

@potiuk (Member) commented on Jul 21, 2022

> NW, I'll wait for a second opinion.
>
> (I need to tickle the build checks anyway to get the tests green.) Possibly a n00b question, but is there no way to re-run a failed check other than force-pushing the last commit again?

Do you see this: https://github.blog/2022-03-16-save-time-partial-re-runs-github-actions/ - or is it a maintainer-only feature? (Not sure.)

@potiuk (Member) commented on Jul 21, 2022

(BTW. This option appears only after the whole workflow completes)

@argibbs (Contributor, Author) commented on Jul 21, 2022

Ah ty.

I didn't see the drop-down menu or retry button, and I thought I looked fairly thoroughly yesterday (builds failed on my first push too) ... but it's possible I checked before it was all done, or I missed it (the retry button in particular looks like it may require you to mouse over the right spot to find it). I will see how it goes and check again if it fails; I know what I'm looking for now.

@potiuk (Member) commented on Jul 21, 2022

Cool. Let me know - we were so happy when it appeared in March that we did not pay attention to whether it is maintainer-only or not.

BTW, there is another option from the UI - you can close/reopen the PR, but this will re-trigger the whole build again, and there are cases where it won't work - for example, when the image build failed.

@argibbs (Contributor, Author) commented on Jul 21, 2022

Yeah, whatever permissions you need to trigger a re-run, I don't have them. No retry icon on mouse-over in the logs view, and no re-run dropdown either. Not a biggie - I'm content to keep re-pushing the latest commit; this is more of an FYI.

@potiuk (Member) commented on Jul 21, 2022

> Yeah, whatever permissions you need to trigger a re-run, I don't have them. No retry icon on mouse-over in the logs view, and no re-run dropdown either. Not a biggie - I'm content to keep re-pushing the latest commit; this is more of an FYI.

:(

@potiuk (Member) commented on Jul 21, 2022

Merging now - unless someone has objections - it's a simple enough change to be reverted or improved later if needed.

@potiuk merged commit 17ec6db into apache:main on Jul 21, 2022
@eladkal added this to the Airflow 2.3.4 milestone on Jul 22, 2022
@argibbs (Contributor, Author) commented on Jul 22, 2022

Legendary, ty!

@argibbs (Contributor, Author) commented on Jul 22, 2022

Good news / bad news: this fix works as intended; however, it simply clears the way for another bug to manifest.

Just when I thought I was out, they drag me back in.

I'm working on a fix locally (again), will raise an MR when I've tested, etc. etc.

@potiuk (Member) commented on Jul 22, 2022

> Good news / bad news: this fix works as intended; however, it simply clears the way for another bug to manifest.
>
> Just when I thought I was out, they drag me back in.
>
> I'm working on a fix locally (again), will raise an MR when I've tested, etc. etc.

Looking forward to it!

Labels: area:Scheduler (including HA (high availability) scheduler), type:bug-fix (Changelog: Bug Fixes)