Conversation

@repl-chris
Contributor

Celery can lose tasks on worker shutdown, causing Airflow to wait on them indefinitely (possibly related to celery/celery#7266). This PR detects these "hung" tasks and sets them back to the SCHEDULED state so the scheduler can queue them up again.

Closes: #19699

This is essentially a resurrection of PR #19769, which was reverted because no one could reproduce or test it. We can reproduce the problem reliably in an isolated test environment, so we were able to test this change fairly thoroughly.
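
As a hedged sketch only (not the actual diff in this PR), the core of such a check could look roughly like the following, assuming an Airflow 2.x codebase, a SQLAlchemy 1.4 session, and task_adoption_timeout as the cutoff; the helper name is made up:

```python
# Rough sketch only - not the code in this PR. Assumes an Airflow 2.x codebase,
# a SQLAlchemy 1.4 session, and task_adoption_timeout as the "too long in QUEUED" cutoff.
from datetime import timedelta

from sqlalchemy import update
from sqlalchemy.orm import Session

from airflow.models.taskinstance import TaskInstance as TI
from airflow.utils import timezone
from airflow.utils.state import State


def reschedule_stuck_queued_tasks(session: Session, max_queued: timedelta) -> int:
    """Move task instances stuck in QUEUED longer than max_queued back to SCHEDULED."""
    cutoff = timezone.utcnow() - max_queued
    result = session.execute(
        update(TI)
        .where(TI.state == State.QUEUED)
        .where(TI.queued_dttm < cutoff)
        .values(state=State.SCHEDULED, queued_dttm=None)
        .execution_options(synchronize_session=False)
    )
    session.commit()
    return result.rowcount
```

The real change also coordinates with the executor (the stuck task is revoked in Celery, as noted later in this thread) rather than only flipping the database state.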

cc: @ephraimbuddy @kristoffern @kaxil


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst, in newsfragments.

@repl-chris requested review from XD-DENG, ashb and kaxil as code owners · May 2, 2022 22:07
The boring-cyborg bot added the area:Scheduler (including HA (high availability) scheduler) label · May 2, 2022
@boring-cyborg

boring-cyborg bot commented May 2, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@eladkal added this to the Airflow 2.3.1 milestone · May 3, 2022
@repl-chris
Contributor Author

FYI @ephraimbuddy @tanelk @eladkal @kaxil @XD-DENG @ashb this is ready for re-review

Contributor

One issue we had before was load on the scheduler. Is the UI complaining about the scheduler? Also, how often do you get the AirflowTaskTimeout error?

Contributor Author

No, the UI isn't complaining at all. As for the timeout, I've never seen it - I only included that code because you had it in the original PR (32d7060). That said, I did change the query to be a decent amount faster than the previous implementation, so I really doubt it's necessary anymore.

Member

@ashb left a comment

A few code changes, and this needs documenting -- since with this change, if you have too many items in the Celery queue they will get cleared, even if they aren't actually lost.

For that reason, I'd like you to look at whether we can see if the taskmeta is actually lost, to separate "lost" from "just in a queue behind n long-running tasks."

Member

Does this need a new setting, or are the existing "orphan" settings enough here?

Contributor Author

The orphan/adoption logic doesn't really have an appropriate setting to reuse here, as far as I know. The "adopted task timeout" check runs on every heartbeat, which is fine because it's very cheap unless a task has actually timed out. In contrast, this check issues at least one database query every time it runs, which we probably don't want to do on every heartbeat.

Alternatively, instead of issuing the query to find these tasks, we could internally track "sent to Celery time" and "last known state" per task. Then we wouldn't need the query and could run the check on every heartbeat (see the sketch below). I kind of like this option, but if we're talking about separating "lost" from "just in a long queue" it starts to get sketchy (see my separate comment below).

In theory this feature also largely makes _check_for_stalled_adopted_tasks() unnecessary, since "stalled adopted tasks" and "lost tasks" are effectively the same thing. I didn't remove _check_for_stalled_adopted_tasks(), though, as I figured that added unnecessary risk and I was trying to stay true to the design of ephraimbuddy's original PR. A stalled adopted task also has a slight behavioural difference: it fails the task rather than re-scheduling it (which I'm not entirely convinced is really correct).
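
Purely as an illustration of that "track it internally" alternative (the class, attribute names, and hooks below are made up, not actual CeleryExecutor internals):

```python
# Hypothetical illustration of tracking "sent to Celery time" and "last known state"
# inside the executor instead of querying the database; not actual CeleryExecutor code.
from datetime import datetime, timedelta, timezone
from typing import Dict, List


class QueuedTaskTracker:
    """Remember when each task key was handed to Celery and its last known Celery state."""

    def __init__(self, max_queued: timedelta) -> None:
        self.max_queued = max_queued
        self.sent_at: Dict[str, datetime] = {}
        self.last_state: Dict[str, str] = {}

    def record_sent(self, key: str) -> None:
        # Called when the executor hands the task to Celery.
        self.sent_at[key] = datetime.now(timezone.utc)
        self.last_state[key] = "PENDING"

    def record_state(self, key: str, state: str) -> None:
        # Called whenever the executor polls Celery for task state.
        self.last_state[key] = state

    def stuck_keys(self) -> List[str]:
        # Cheap enough to run on every heartbeat: no database round-trip needed.
        now = datetime.now(timezone.utc)
        return [
            key
            for key, sent in self.sent_at.items()
            if self.last_state.get(key) == "PENDING" and now - sent > self.max_queued
        ]
```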

Member

Wrong timer.

Contributor Author

It actually is using the task_adoption_timeout setting for the "max queued time" - it only has its own setting for the run frequency.

@repl-chris force-pushed the RescheduleStuckCeleryTasks branch from 1743e4a to 29f7644 · May 6, 2022 20:23
@repl-chris
Contributor Author

For that reason, I'd like you to look at whether we can see if the taskmeta is actually lost, to separate "lost" from "just in a queue behind n long-running tasks."

@ashb Yes, this would be great, I agree. I'm far from a Celery expert, but as far as I can tell Celery does not provide any capability to do this. Any implementation would need to be broker-specific and access Celery internals directly, reading pending work items straight off the queue. That isn't possible on SQS or RabbitMQ as far as I know, but I've only witnessed this problem on Redis, so maybe the fix only needs to work for Redis. Even on Redis the solution isn't pretty, though...

The Redis queue in Celery is implemented as a Redis "list" data type, which is internally a linked list, so there's no great way to check whether it contains a given task - we have to scan the entire list (O(n)) to check for a single task. Redis does provide an LPOS command that could do this scan for us, but it's still O(n), and it's only available in newer versions of Redis (>= 6.0.6). The list contains serialized JSON documents for the work items, so to use LPOS we'd need to synthetically generate a character-perfect copy of that JSON document, which uses an undocumented format. In theory it may be possible to use Lua scripting to do the scan instead. I'm extremely open to other ideas, but I think the only way we could realistically make this work is to grab our oldest known "PENDING" task_id, send a Lua script to Redis that scans the list doing a string.find(list_element, '"task_id"="our-task-id"') sort of thing... and then hope Celery doesn't change how it works 😛 This whole direction sounds like a pretty bad idea IMHO, but I could pursue it if we actually want to go there...
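
For concreteness, a hedged sketch of that Lua-scan idea (not part of this PR): it assumes the broker is Redis, uses a made-up queue/list name, and matches the raw task id as a substring rather than reconstructing the exact JSON.

```python
# Hedged sketch of the Lua-scan idea described above - not part of this PR.
# Assumes the Celery broker is Redis; the queue/list name and the substring being
# matched (just the raw task id) are assumptions.
import redis

# KEYS[1] = the broker list to scan, ARGV[1] = substring to look for (e.g. the task id)
FIND_TASK_LUA = """
local items = redis.call('LRANGE', KEYS[1], 0, -1)
for i, item in ipairs(items) do
    if string.find(item, ARGV[1], 1, true) then
        return i - 1  -- zero-based position of the matching work item
    end
end
return -1
"""


def task_still_queued(client: redis.Redis, queue: str, task_id: str) -> bool:
    """Return True if a message containing task_id is still sitting in the broker list."""
    # Still O(n) over the queue, so only suitable for an occasional check.
    return client.eval(FIND_TASK_LUA, 1, queue, task_id) >= 0


if __name__ == "__main__":
    r = redis.Redis(host="localhost", port=6379, db=0)
    print(task_still_queued(r, "default", "00000000-0000-0000-0000-000000000000"))
```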

Is rescheduling a "just-in-a-long-queue" task really that bad? Maybe I don't fully grasp the impact, but revoking and re-enqueuing a task that was healthy and waiting seems like a pretty good option compared to the alternative 😄

@repl-chris
Contributor Author

@ashb As another option, I could stop using task_adoption_timeout and instead add an explicit stuck_queued_task_timeout setting, disabled by default. Then there's no behavioural change unless the user explicitly enables that setting...
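
If we went that route, reading the opt-in setting might look something like this; the [celery] section and the stuck_queued_task_timeout option name come from the proposal above and are hypothetical, not a released Airflow option:

```python
# Hypothetical: reads the proposed opt-in setting. The option name and section come
# from the comment above, not from a released Airflow configuration.
from airflow.configuration import conf

stuck_queued_task_timeout = conf.getint("celery", "stuck_queued_task_timeout", fallback=0)

if stuck_queued_task_timeout > 0:
    # Only run the stuck-task check when the user has explicitly enabled it.
    ...
```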

@dima-asana
Contributor

First, thank you so much for working on this. We've run into a very similar issue and are eagerly looking forward to a fix. We also now have an isolated environment where we can reproduce this, so I'd be happy to test some version of this patch if you'd like (turnaround may be a week or so, sorry in advance). I went ahead and tested your current code with our setup and it works perfectly for us - the stuck task is revoked in Celery (so there is no chance of it running twice) and sent back to the scheduler.

I'm far from a celery expert but as far as I can tell celery does not provide any capability to do this

I'm also not a Celery expert, but I believe tasks lost due to a worker shutdown like celery/celery#7266 have already gone from the broker queue to the Celery worker consumer. This means you would be able to find the tasks in Celery rather than on the broker. You might try looking at app.control.inspect().reserved() for tasks that got reserved but didn't make it to a worker process. Unfortunately I don't think it's possible to differentiate these tasks from "legitimately reserved" tasks. That said, the default Airflow configuration gives legitimately reserved tasks a very short lifespan - with prefetch_multiplier set to 1, acks_late set to True, and the fair scheduling strategy, tasks are only reserved for workers that can "immediately" consume them (in practice, tasks are legitimately reserved for ~0.01s for us).
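
A hedged sketch of that inspection idea (the broker URL is an assumption, and comparing against Airflow's queued task ids is only illustrative):

```python
# Hedged sketch of the app.control.inspect().reserved() idea above. The broker URL
# is an assumption; comparing against Airflow's queued external executor ids is
# only illustrative.
from celery import Celery

app = Celery(broker="redis://localhost:6379/0")


def reserved_task_ids() -> set:
    """Collect the ids of tasks currently reserved (prefetched) by any worker."""
    # inspect().reserved() returns {worker_name: [task_info, ...]}, or None if no workers reply.
    reserved = app.control.inspect().reserved() or {}
    return {task["id"] for tasks in reserved.values() for task in tasks}


# A QUEUED task instance whose Celery id is neither active nor reserved anywhere
# could be a candidate for "actually lost" rather than "waiting in a long queue".
```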

We experience this behavior due to a different bug in Celery. You can see a longer description of our experience here.

"adopted task timeout" runs on every heartbeat

I'm either misunderstanding this or it's not accurate. At least, there is an orphaned_tasks_check_interval, and in our experience it gets used. As you mentioned, unifying the orphaned TI behaviour could be a simplification - e.g. let it pick up both TIs orphaned by a dead scheduler (primarily tasks in the SCHEDULED state) and TIs orphaned by a malfunctioning Celery consumer (primarily tasks in the QUEUED state).

As a side note, for tasks lost by shut-down workers as in celery/celery#7266, you might give this config in Celery a shot.

@repl-chris
Contributor Author

I’m either misunderstanding this or it’s not accurate. At least, there is a orphaned_tasks_check_interval and in our experience it gets used.

Oooh yes, you're absolutely correct. I had the blinders on and was only looking at the "adopted task timeout" functionality of the celery executor - I totally forgot about the actual orphaned task adoption functionality of the scheduler 😂

You might try to look at app.control.inspect().reserved for tasks that got reserved but didn’t make it to a worker process.

Thanks for the tip - I'll look into this a little further, but I'm pretty sure they won't be there. My understanding is that reserved tasks will eventually get re-delivered to a different worker after 6 hours, but I've had tasks stuck for far longer than that. I also verified that the stuck tasks did not exist in the unacked Redis list (which I'd expect is the data source for inspect().reserved()). Nevertheless, I will dig into this a little further, thanks! :)

@repl-chris
Contributor Author

FYI I have re-implemented this feature as an expansion of the existing "adopted task timeout" capability in #23690. I believe the new implementation is superior as it is cleaner (IMHO), lighter-weight, and introduces minimal behavioural change unless it's explicitly enabled in the config.

@potiuk force-pushed the RescheduleStuckCeleryTasks branch from e5a587e to fec8bb1 · May 17, 2022 15:48
@potiuk requested a review from ashb · May 17, 2022 15:49
@ashb
Member

ashb commented May 17, 2022

Closing in favour of #23690

@ashb closed this · May 17, 2022
@ephraimbuddy removed this from the Airflow 2.3.1 milestone · May 19, 2022
Labels: area:Scheduler (including HA (high availability) scheduler)

Linked issue: task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries