Conversation

@repl-chris
Contributor

Celery can lose tasks on worker shutdown, causing Airflow to wait on them indefinitely (possibly related to celery/celery#7266). This PR detects these "hung" tasks and sets them back to the SCHEDULED state so the scheduler can queue them up again.

Closes: #19699

This is essentially a resurrection of PR #19769, which was reverted because no one could reproduce or test it. We can reproduce the problem reliably in an isolated test environment, so we were able to test this change fairly thoroughly.
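
As a hedged sketch only (not the actual diff in this PR), the core of such a check could look roughly like the following, assuming an Airflow 2.x codebase, a SQLAlchemy 1.4 session, and task_adoption_timeout as the cutoff; the helper name is made up:

```python
# Rough sketch only - not the code in this PR. Assumes an Airflow 2.x codebase,
# a SQLAlchemy 1.4 session, and task_adoption_timeout as the "too long in QUEUED" cutoff.
from datetime import timedelta

from sqlalchemy import update
from sqlalchemy.orm import Session

from airflow.models.taskinstance import TaskInstance as TI
from airflow.utils import timezone
from airflow.utils.state import State


def reschedule_stuck_queued_tasks(session: Session, max_queued: timedelta) -> int:
    """Move task instances stuck in QUEUED longer than max_queued back to SCHEDULED."""
    cutoff = timezone.utcnow() - max_queued
    result = session.execute(
        update(TI)
        .where(TI.state == State.QUEUED)
        .where(TI.queued_dttm < cutoff)
        .values(state=State.SCHEDULED, queued_dttm=None)
        .execution_options(synchronize_session=False)
    )
    session.commit()
    return result.rowcount
```

The real change also coordinates with the executor (the stuck task is revoked in Celery, as noted later in this thread) rather than only flipping the database state.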

cc: @ephraimbuddy @kristoffern @kaxil


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst, in newsfragments.

@repl-chris requested review from XD-DENG, ashb and kaxil as code owners · May 2, 2022 22:07
The boring-cyborg bot added the area:Scheduler (including HA (high availability) scheduler) label · May 2, 2022
@boring-cyborg

boring-cyborg bot commented May 2, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@eladkal added this to the Airflow 2.3.1 milestone · May 3, 2022
@repl-chris
Contributor Author

FYI @ephraimbuddy @tanelk @eladkal @kaxil @XD-DENG @ashb this is ready for re-review

Contributor

One issue we had before was load on the scheduler. Is the UI complaining about the scheduler? Also, how often do you get the AirflowTaskTimeout error?

Contributor Author

No, the UI isn't complaining at all. As for the timeout, I've never seen it - I only included that code because you had it in the original PR (32d7060). That said, I did change the query to be a decent amount faster than the previous implementation, so I really doubt it's necessary anymore.

Member

@ashb left a comment

A few code changes, and this needs documenting -- since with this change, if you have too many items in the Celery queue they will get cleared, even if they aren't actually lost.

For that reason, I'd like you to look at whether we can see if the taskmeta is actually lost, to separate "lost" from "just in a queue behind n long-running tasks."

Member

Does this need a new setting, or are the existing "orphan" settings enough here?

Contributor Author

The orphan/adoption logic doesn't really have an appropriate setting to reuse here, as far as I know. The "adopted task timeout" check runs on every heartbeat, which is fine because it's very cheap unless a task has actually timed out. In contrast, this check issues at least one database query every time it runs, which we probably don't want to do on every heartbeat.

Alternatively, instead of issuing the query to find these tasks, we could internally track "sent to Celery time" and "last known state" per task. Then we wouldn't need the query and could run the check on every heartbeat (see the sketch below). I kind of like this option, but if we're talking about separating "lost" from "just in a long queue" it starts to get sketchy (see my separate comment below).

In theory this feature also largely makes _check_for_stalled_adopted_tasks() unnecessary, since "stalled adopted tasks" and "lost tasks" are effectively the same thing. I didn't remove _check_for_stalled_adopted_tasks(), though, as I figured that added unnecessary risk and I was trying to stay true to the design of ephraimbuddy's original PR. A stalled adopted task also has a slight behavioural difference: it fails the task rather than re-scheduling it (which I'm not entirely convinced is really correct).
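
Purely as an illustration of that "track it internally" alternative (the class, attribute names, and hooks below are made up, not actual CeleryExecutor internals):

```python
# Hypothetical illustration of tracking "sent to Celery time" and "last known state"
# inside the executor instead of querying the database; not actual CeleryExecutor code.
from datetime import datetime, timedelta, timezone
from typing import Dict, List


class QueuedTaskTracker:
    """Remember when each task key was handed to Celery and its last known Celery state."""

    def __init__(self, max_queued: timedelta) -> None:
        self.max_queued = max_queued
        self.sent_at: Dict[str, datetime] = {}
        self.last_state: Dict[str, str] = {}

    def record_sent(self, key: str) -> None:
        # Called when the executor hands the task to Celery.
        self.sent_at[key] = datetime.now(timezone.utc)
        self.last_state[key] = "PENDING"

    def record_state(self, key: str, state: str) -> None:
        # Called whenever the executor polls Celery for task state.
        self.last_state[key] = state

    def stuck_keys(self) -> List[str]:
        # Cheap enough to run on every heartbeat: no database round-trip needed.
        now = datetime.now(timezone.utc)
        return [
            key
            for key, sent in self.sent_at.items()
            if self.last_state.get(key) == "PENDING" and now - sent > self.max_queued
        ]
```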

Member

Wrong timer.

Contributor Author

It actually is using the task_adoption_timeout setting for the "max queued time" - it only has its own setting for the run frequency.

@repl-chris force-pushed the RescheduleStuckCeleryTasks branch from 1743e4a to 29f7644 · May 6, 2022 20:23
@repl-chris
Contributor Author

For that reason, I'd like you to look at whether we can see if the taskmeta is actually lost, to separate "lost" from "just in a queue behind n long-running tasks."

@ashb Yes, this would be great, I agree. I'm far from a Celery expert, but as far as I can tell Celery does not provide any capability to do this. Any implementation would need to be broker-specific and access Celery internals directly, reading pending work items straight off the queue. That isn't possible on SQS or RabbitMQ as far as I know, but I've only witnessed this problem on Redis, so maybe the fix only needs to work for Redis. Even on Redis the solution isn't pretty, though...

The Redis queue in Celery is implemented as a Redis "list" data type, which is internally a linked list, so there's no great way to check whether it contains a given task - we have to scan the entire list (O(n)) to check for a single task. Redis does provide an LPOS command that could do this scan for us, but it's still O(n), and it's only available in newer versions of Redis (>= 6.0.6). The list contains serialized JSON documents for the work items, so to use LPOS we'd need to synthetically generate a character-perfect copy of that JSON document, which uses an undocumented format. In theory it may be possible to use Lua scripting to do the scan instead. I'm extremely open to other ideas, but I think the only way we could realistically make this work is to grab our oldest known "PENDING" task_id, send a Lua script to Redis that scans the list doing a string.find(list_element, '"task_id"="our-task-id"') sort of thing... and then hope Celery doesn't change how it works 😛 This whole direction sounds like a pretty bad idea IMHO, but I could pursue it if we actually want to go there...
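
For concreteness, a hedged sketch of that Lua-scan idea (not part of this PR): it assumes the broker is Redis, uses a made-up queue/list name, and matches the raw task id as a substring rather than reconstructing the exact JSON.

```python
# Hedged sketch of the Lua-scan idea described above - not part of this PR.
# Assumes the Celery broker is Redis; the queue/list name and the substring being
# matched (just the raw task id) are assumptions.
import redis

# KEYS[1] = the broker list to scan, ARGV[1] = substring to look for (e.g. the task id)
FIND_TASK_LUA = """
local items = redis.call('LRANGE', KEYS[1], 0, -1)
for i, item in ipairs(items) do
    if string.find(item, ARGV[1], 1, true) then
        return i - 1  -- zero-based position of the matching work item
    end
end
return -1
"""


def task_still_queued(client: redis.Redis, queue: str, task_id: str) -> bool:
    """Return True if a message containing task_id is still sitting in the broker list."""
    # Still O(n) over the queue, so only suitable for an occasional check.
    return client.eval(FIND_TASK_LUA, 1, queue, task_id) >= 0


if __name__ == "__main__":
    r = redis.Redis(host="localhost", port=6379, db=0)
    print(task_still_queued(r, "default", "00000000-0000-0000-0000-000000000000"))
```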

Is rescheduling a "just-in-a-long-queue" task really that bad? Maybe I don't fully grasp the impact, but revoking and re-enqueuing a task that was healthy and waiting seems like a pretty good option compared to the alternative 😄

@repl-chris
Contributor Author

@ashb As another option, I could stop using task_adoption_timeout and instead add an explicit stuck_queued_task_timeout setting, disabled by default. Then there's no behavioural change unless the user explicitly enables that setting...
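
If we went that route, reading the opt-in setting might look something like this; the [celery] section and the stuck_queued_task_timeout option name come from the proposal above and are hypothetical, not a released Airflow option:

```python
# Hypothetical: reads the proposed opt-in setting. The option name and section come
# from the comment above, not from a released Airflow configuration.
from airflow.configuration import conf

stuck_queued_task_timeout = conf.getint("celery", "stuck_queued_task_timeout", fallback=0)

if stuck_queued_task_timeout > 0:
    # Only run the stuck-task check when the user has explicitly enabled it.
    ...
```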

@dima-asana
Contributor

First, thank you so much for working on this. We've run into a very similar issue and are eagerly looking forward to a fix. We also now have an isolated environment where we can reproduce this, so I'd be happy to test some version of this patch if you'd like (turnaround may be a week or so, sorry in advance). I went ahead and tested your current code with our setup and it works perfectly for us - the stuck task is revoked in Celery (so there is no chance of it running twice) and sent back to the scheduler.

I'm far from a celery expert but as far as I can tell celery does not provide any capability to do this

I'm also not a Celery expert, but I believe tasks lost due to a worker shutdown like celery/celery#7266 have already gone from the broker queue to the Celery worker consumer. This means you would be able to find the tasks in Celery rather than on the broker. You might try looking at app.control.inspect().reserved() for tasks that got reserved but didn't make it to a worker process. Unfortunately I don't think it's possible to differentiate these tasks from "legitimately reserved" tasks. That said, the default Airflow configuration gives legitimately reserved tasks a very short lifespan - with prefetch_multiplier set to 1, acks_late set to True, and the fair scheduling strategy, tasks are only reserved for workers that can "immediately" consume them (in practice, tasks are legitimately reserved for ~0.01s for us).
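
A hedged sketch of that inspection idea (the broker URL is an assumption, and comparing against Airflow's queued task ids is only illustrative):

```python
# Hedged sketch of the app.control.inspect().reserved() idea above. The broker URL
# is an assumption; comparing against Airflow's queued external executor ids is
# only illustrative.
from celery import Celery

app = Celery(broker="redis://localhost:6379/0")


def reserved_task_ids() -> set:
    """Collect the ids of tasks currently reserved (prefetched) by any worker."""
    # inspect().reserved() returns {worker_name: [task_info, ...]}, or None if no workers reply.
    reserved = app.control.inspect().reserved() or {}
    return {task["id"] for tasks in reserved.values() for task in tasks}


# A QUEUED task instance whose Celery id is neither active nor reserved anywhere
# could be a candidate for "actually lost" rather than "waiting in a long queue".
```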

We experience this behavior due to a different bug in Celery. You can see a longer description of our experience here.

"adopted task timeout" runs on every heartbeat

I'm either misunderstanding this or it's not accurate. At least, there is an orphaned_tasks_check_interval, and in our experience it gets used. As you mentioned, unifying the orphaned TI behaviour could be a simplification - e.g. let it pick up both TIs orphaned by a dead scheduler (primarily tasks in the SCHEDULED state) and TIs orphaned by a malfunctioning Celery consumer (primarily tasks in the QUEUED state).

As a side note, for tasks lost by shut-down workers as in celery/celery#7266, you might give this config in Celery a shot.

@repl-chris
Contributor Author

I’m either misunderstanding this or it’s not accurate. At least, there is a orphaned_tasks_check_interval and in our experience it gets used.

Oooh yes, you're absolutely correct. I had the blinders on and was only looking at the "adopted task timeout" functionality of the celery executor - I totally forgot about the actual orphaned task adoption functionality of the scheduler 😂

You might try to look at app.control.inspect().reserved for tasks that got reserved but didn’t make it to a worker process.

Thanks for the tip - I'll look into this a little further, but I'm pretty sure they won't be there. My understanding is that reserved tasks will eventually get re-delivered to a different worker after 6 hours, but I've had tasks stuck for far longer than that. I also verified that the stuck tasks did not exist in the unacked Redis list (which I'd expect is the data source for inspect().reserved()). Nevertheless, I will dig into this a little further, thanks! :)

@repl-chris
Contributor Author

FYI I have re-implemented this feature as an expansion of the existing "adopted task timeout" capability in #23690. I believe the new implementation is superior as it is cleaner (IMHO), lighter-weight, and introduces minimal behavioural change unless it's explicitly enabled in the config.

@potiuk force-pushed the RescheduleStuckCeleryTasks branch from e5a587e to fec8bb1 · May 17, 2022 15:48
@potiuk requested a review from ashb · May 17, 2022 15:49
@ashb
Member

ashb commented May 17, 2022

Closing in favour of #23690

@ashb closed this · May 17, 2022
@ephraimbuddy removed this from the Airflow 2.3.1 milestone · May 19, 2022
Labels: area:Scheduler (including HA (high availability) scheduler)

Linked issue: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries