The redis lock implementation could use some more thought #47
Replies: 1 comment 6 replies
-
|
Hi @coredumperror, Thanks for your detailed report. I am struggling a bit to keep up. Maybe there's a misunderstanding about the current implementaiton. The lock is not used to lock tasks but to ensure there is only a single scheduler running at every given moment. If you want to lock tasks, you can do so in your application. This isn't really a feature I aim to provide in this package. We currently use a dead man's snitch approach. The scheduler needs to be alive and kicking to maintain the lock. It needs to constantly extend the lock. If it ever fails to do so, the lock is released after a small timeout period ( In your case, the scheduler must have still been running; otherwise, Redis would have released the lock. Let me know if you have questions or other suggestions. Cheers! |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Right now, it seems like the lock system works by having the
crontaskmanagement command attempt to acquire a lock against the specified redis server when it starts, and if it can't do that, it kills itself.This works fine in a very stable multi-server environment, but it breaks down in a system that's designed to be fluid and respond to instance failures and excess traffic by scaling up and scaling down the size of the server cluster.
We ran into a problem last night with my team's AWS Elastic Container Service-based application, where of the four instances that we run, the instance that happened to acquire the crontask lock died. It seems that, after a new instance was spun back up by the autoscaling group to replace it, it couldn't get the lock, since the management command in the dead instance had not been cleanly killed.
This meant that all four of the otherwise healthy instances had a dead crontask scheduler, and as a result we were getting repeated cron monitor failure emails from Sentry with no immediately obvious cause.
For this reason, I believe that changing how the lock system functions would be a good idea.
In my team's apps that still use Celery, we use a decorator like this to ensure that only a single one of our multi-instance applications run each scheduled task:
I've been trying this out for our crontask-scheduled tasks, and it seems to work well:
One instance runs the decorated task, and the others log
cron.task.already_locked, rather than calling the task function.This ensures that even if one of the instances which are running the application code fails, that won't prevent the other instances from taking up the slack, as they are always competing against each other to figure out who gets to run each cron task every time.
Beta Was this translation helpful? Give feedback.
All reactions