The redis lock implementation could use some more thought #47

coredumperror · 2026-05-14T21:15:43Z

Right now, it seems like the lock system works by having the crontask management command attempt to acquire a lock against the specified redis server when it starts, and if it can't do that, it kills itself.

This works fine in a very stable multi-server environment, but it breaks down in a system that's designed to be fluid and respond to instance failures and excess traffic by scaling up and scaling down the size of the server cluster.

We ran into a problem last night with my team's AWS Elastic Container Service-based application: of the four instances that we run, the one that happened to hold the crontask lock died. It seems that the new instance spun up by the autoscaling group to replace it couldn't get the lock, since the management command in the dead instance had not been cleanly killed.

This meant that all four of the otherwise healthy instances had a dead crontask scheduler, and as a result we were getting repeated cron monitor failure emails from Sentry with no immediately obvious cause.

For this reason, I believe that changing how the lock system functions would be a good idea.

In my team's apps that still use Celery, we use a decorator like this to ensure that only one instance of a multi-instance application runs each scheduled task:

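import contextlib
from functools import wraps

# Assumption: the keyword-argument logging calls below match structlog's API.
import structlog
from django.conf import settings
from django_rq.connection_utils import get_connection
from redis.exceptions import LockError

logger = structlog.get_logger()
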
def with_lock(f):
    """
    Acquire a distributed lock from redis before running the cron task. If we
    can't acquire the lock, log that instead of running the task.
    """
    @wraps(f)
    def wrapper(*args, **kwargs):
        lock_name = kwargs.pop("lock_name", settings.SERVER_DOMAIN) + f.__name__
        timeout = kwargs.pop("timeout", 60 * 5)
        have_lock = False

        redis_client = get_connection("default")
        lock = redis_client.lock(lock_name, timeout=timeout)

        try:
            have_lock = lock.acquire(blocking=False)
            if have_lock:
                f(*args, **kwargs)
            else:
                logger.info("cron.task.already_locked", task=f.__name__)
        finally:
            if have_lock:
                # Sometimes we get LockErrors about this lock not being owned, which we don't care about.
                with contextlib.suppress(LockError):
                    lock.release()
    return wrapper

I've been trying this out for our crontask-scheduled tasks, and it seems to work well:

# Run this task every 5 minutes.
@cron(CronTrigger(minute="*/5"))
@task
@with_lock
def publish_scheduled_pages():
    logger.info("publish_scheduled_pages.start")
    ...

One instance runs the decorated task, and the others log cron.task.already_locked, rather than calling the task function.

This ensures that even if one of the instances running the application code fails, the other instances still pick up the slack, since they all compete for the lock every time a cron task fires.
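
The competition described above can be tried out without a Redis server. This is only a minimal sketch under stated assumptions, not production code: `threading.Lock` stands in for the Redis lock, threads stand in for application instances, and `simulate_instances` is a hypothetical helper. A barrier keeps the winner holding the lock until every instance has made its attempt, which mimics a task that is still running when the other schedulers fire.

```python
import threading


def simulate_instances(task, instance_count=4):
    """Let `instance_count` threads compete for one non-blocking lock.

    Returns one outcome string per simulated instance.
    """
    lock = threading.Lock()  # stand-in for redis_client.lock(...)
    all_attempted = threading.Barrier(instance_count)
    outcomes = []
    outcomes_guard = threading.Lock()

    def instance():
        # Same shape as the decorator: try once, never wait for the lock.
        have_lock = lock.acquire(blocking=False)
        try:
            if have_lock:
                task()
                outcome = "ran"
            else:
                outcome = "cron.task.already_locked"
            # Keep the lock held until every instance has tried to acquire it.
            all_attempted.wait(timeout=5)
        finally:
            if have_lock:
                lock.release()
        with outcomes_guard:
            outcomes.append(outcome)

    threads = [threading.Thread(target=instance) for _ in range(instance_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes


runs = []
outcomes = simulate_instances(lambda: runs.append("publish_scheduled_pages"))
print(sorted(outcomes))
```

Whichever thread wins, the task body runs once and the remaining instances only record that the lock was already taken.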

codingjoe · 2026-05-15T09:36:22Z

Maintainer

Hi @coredumperror,

Thanks for your detailed report. I am struggling a bit to keep up. Maybe there's a misunderstanding about the current implementation.

The lock is not used to lock tasks but to ensure there is only a single scheduler running at any given moment. If you want to lock tasks, you can do so in your application. This isn't really a feature I aim to provide in this package.

We currently use a dead man's snitch approach. The scheduler needs to be alive and kicking to maintain the lock. It needs to constantly extend the lock. If it ever fails to do so, the lock is released after a small timeout period (LOCK_REFRESH_INTERVAL), and another instance could acquire it.

In your case, the scheduler must have still been running; otherwise, Redis would have released the lock.
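
To make the idea concrete, here is a rough sketch of a lock that has to be kept alive. It is not the package's actual code: `LeaseLock` and `FakeClock` are hypothetical in-memory stand-ins for a Redis key with a TTL and for wall-clock time, and the 10-second timeout and 5-second refresh are arbitrary illustration values.

```python
class FakeClock:
    """Manually advanced clock, so the example needs no real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class LeaseLock:
    """In-memory stand-in for a Redis lock key with a TTL (hypothetical)."""

    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock
        self.owner = None
        self.expires_at = 0.0

    def _expire_if_stale(self):
        # Redis would drop the key by itself once the TTL has passed.
        if self.owner is not None and self.clock() >= self.expires_at:
            self.owner = None

    def acquire(self, owner):
        self._expire_if_stale()
        if self.owner is None:
            self.owner = owner
            self.expires_at = self.clock() + self.timeout
            return True
        return False

    def extend(self, owner):
        # Only the current holder may push the expiry further out.
        self._expire_if_stale()
        if self.owner != owner:
            return False
        self.expires_at = self.clock() + self.timeout
        return True


clock = FakeClock()
lock = LeaseLock(timeout=10, clock=clock)
first_acquire = lock.acquire("scheduler-a")

# Healthy scheduler: it refreshes in time, so a competitor never gets in.
refreshes = []
competitor_attempts = []
for _ in range(3):
    clock.advance(5)
    refreshes.append(lock.extend("scheduler-a"))
    competitor_attempts.append(lock.acquire("scheduler-b"))

# scheduler-a stops refreshing; once the timeout passes, the lock is free.
clock.advance(10)
took_over = lock.acquire("scheduler-b")
stale_refresh = lock.extend("scheduler-a")
print(first_acquire, refreshes, competitor_attempts, took_over, stale_refresh)
```

As long as the holder keeps refreshing, nobody else can take the lock. Once it stops, the lock expires and another scheduler can take over, and the old holder can no longer extend it.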

Let me know if you have questions or other suggestions.

Cheers!
Joe

6 replies

codingjoe May 19, 2026
Maintainer

Hm… I see. That doesn't look healthy; I understand your frustration. You mention ECS; would you mind sharing your K8S config for the scheduler?

coredumperror May 19, 2026
Author

We don't actually use Kubernetes, so I haven't got such a config to share. What did you want to learn from it? I might be able to share whatever equivalent we use, though I'm not the team's AWS expert.

codingjoe May 26, 2026
Maintainer

@coredumperror I think someone located the issue: #53

coredumperror May 26, 2026
Author

Oh interesting. So the intended behavior is not for the scheduler to kill itself permanently if it can't acquire the lock at container boot time? I'll have to give v1.3.0 a try, to see if it'll work for our needs without some of the customization I've done in our code to get around that problem.

codingjoe May 27, 2026
Maintainer

The scheduler didn't shut down and stayed around as a zombie. Let me know how it goes :)
