Race condition causing celerybeat and celeryworker to fail #2727

Closed
malefice opened this issue Aug 6, 2020 · 8 comments
Labels: bug, docker, wontfix


@malefice

malefice commented Aug 6, 2020

What happened?

There is a race condition where the celerybeat and/or celeryworker services will fail to start if the django-celery-beat migrations have not yet run. Bringing the stack up again once those migrations have run works as expected.

What should've happened instead?

While this can only occur under specific conditions (usually the first run), it is best to guarantee that all services come up 100% out of the box. The celerybeat and celeryworker services should have a health check in their entrypoint scripts prior to running their command scripts. I personally have some code for this if you guys are interested, but I want to hear your thoughts first.

Steps to reproduce

Tested on Ubuntu 18.04.4 LTS, Docker Engine version 19.03.12, docker-compose version 1.17.1

  1. Create a new project from this template as you normally would but with docker and celery support.
  2. Build and bring the whole stack up. The race condition usually shows up for me on the first run.
  3. If it did not trigger the first time, then drop the database volume and bring the stack up again (see the command sketch below). You may need to repeat this step multiple times.
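
For reference, a rough command sketch of those steps, assuming the template's local.yml compose file (as referenced later in this thread):

docker-compose -f local.yml build
docker-compose -f local.yml up

# If the race did not trigger, drop the volumes (including the Postgres data volume) and retry:
docker-compose -f local.yml down -v
docker-compose -f local.yml up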

Some screenshots

[two screenshots of the celerybeat/celeryworker startup failures]

@browniebroke
Member

Thanks for the great explanation in this bug report. It would be nice to fix it, yes. You say you have some code to fix it; I'd be interested in knowing more about your solution.

@malefice
Author

malefice commented Aug 6, 2020

I am currently using manage.py showmigrations app_name to do this. It lists all the migrations for the app, and a migration that has not yet been applied is marked with [ ], like so:

[screenshot of showmigrations output with unapplied migrations marked [ ]]

To ensure that all the migrations for django_celery_beat have run properly, we can grep the output of that command. This is how I did it:

# Succeeds only when no django_celery_beat migration remains unapplied.
django_celery_migrations_complete() {
  count=$(python manage.py showmigrations django_celery_beat | grep '\[ \]' | wc -l)
  if [[ $count -eq 0 ]]; then
    return 0
  fi
  return 1
}

until django_celery_migrations_complete; do
  >&2 echo 'Waiting for Django Celery migrations to complete...'
  sleep 5
done
>&2 echo 'Django Celery migrations have been applied'

The only thing I cannot avoid is a little code duplication. Since this requires database access, the excerpt above must be preceded by the same postgres health check present in the django service's entrypoint script. I am not sure how to keep things DRY, so I implemented it with a separate entrypoint script for the django service and another entrypoint script for the celery services, wired up like so (a sketch of the celery entrypoint itself follows the compose excerpt):

# Inside docker-compose file:

  celerybeat:
    <<: *django
    image: celerybeat
    depends_on:
      - redis
      - postgres
    ports: []
    entrypoint: /celery-entrypoint
    command: /start-celerybeat

  celeryworker:
    <<: *django
    image: celeryworker
    depends_on:
      - redis
      - postgres
    ports: []
    entrypoint: /celery-entrypoint
    command: /start-celeryworker
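
And a rough sketch of what that /celery-entrypoint could look like. The postgres_ready check mirrors the idea of the existing django entrypoint; the POSTGRES_* environment variable names and the use of psycopg2 are assumptions based on how the stack is usually configured:

#!/bin/bash
set -o errexit
set -o pipefail
set -o nounset

# Wait until the database accepts connections (same idea as the django entrypoint).
postgres_ready() {
python << END
import sys
import psycopg2
try:
    psycopg2.connect(
        dbname="${POSTGRES_DB}",
        user="${POSTGRES_USER}",
        password="${POSTGRES_PASSWORD}",
        host="${POSTGRES_HOST}",
        port="${POSTGRES_PORT}",
    )
except psycopg2.OperationalError:
    sys.exit(-1)
sys.exit(0)
END
}

# Succeed only once no django_celery_beat migration is left unapplied.
django_celery_migrations_complete() {
  count=$(python manage.py showmigrations django_celery_beat | grep '\[ \]' | wc -l)
  [[ $count -eq 0 ]]
}

until postgres_ready; do
  >&2 echo 'Waiting for PostgreSQL to become available...'
  sleep 1
done

until django_celery_migrations_complete; do
  >&2 echo 'Waiting for Django Celery migrations to complete...'
  sleep 5
done

exec "$@"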

@browniebroke
Member

Just to clarify one small thing, are you having this issue locally or in production?

As a reminder, here are some notes/ideas on making this work locally:

I was initially thinking of making the celery services depend on django in the docker-compose file. I'm not sure of the drawbacks of such a solution though.

Another idea I've read about was to declare another service just for migrations and make all others depend on it.

The depends_on from docker-compose might not be enough though. I think that might start the celery containers too soon still, in which case we'll need to run something inside the start Celery scripts to wait for migrations, a bit like you do.

One other drawback I foresee with the solution is that it depends on the output of the presentational showmigrations command, which isn't ideal. If the Django project decides to reformat the output, this will break.

I've had some issues in the past when trying to parse that output; it was fine for my toy project, but it's not something I'm too keen to add to a popular open source project.
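
One less brittle option, if the pinned Django version is 3.1 or newer, could be migrate --check, which exits with a non-zero status when unapplied migrations are detected, so no presentational output has to be parsed. Just a sketch, not something I've tried in this template:

until python manage.py migrate --check > /dev/null 2>&1; do
  >&2 echo 'Waiting for migrations to be applied...'
  sleep 5
done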

These are just my thoughts from a quick look, let me know if I've missed anything. Other ideas are welcome of course.

@malefice
Author

malefice commented Aug 7, 2020

Just to clarify one small thing, are you having this issue locally or in production?

I am running this locally, and I am aware that migrations will be run automatically on the django container for local development.

Another idea I've read about was to declare another service just for migrations and make all others depend on it.

The depends_on from docker-compose might not be enough though. I think that might start the celery containers too soon still, in which case we'll need to run something inside the start Celery scripts to wait for migrations, a bit like you do.

I have tried using depends_on, and it does not work: according to the Docker docs, depends_on only controls the sequence in which docker-compose starts the services, and it does not care about the results. On that note, delegating migrations to a separate container that the other Django-based containers depend on will also not work. I tried that approach anyway, just to gather these screenshots:

[two screenshots of the container start order when using a dedicated django-bootstrap migrations service]

In the screenshots, django-bootstrap is the container responsible for running migrations, and the other Django-based containers "depend" on it in the compose file. You can clearly see that the start scripts do not run in the order one would expect; only the initial starting of the containers follows the dependency order.

One other drawback I foresee with the solution is that it depends on the output of the presentational showmigrations command, which isn't ideal. If the Django project decides to reformat the output, this will break.

The project already pins a Django version, so in my opinion that becomes less of an issue. An alternative is to issue raw SQL queries to verify that all the migrations for django-celery-beat have run, but that is much more complicated compared to fixing a simple parsing issue, which could only occur when the pinned Django version is changed, and based on this, that does not happen often, if at all, at least for showmigrations.

Another alternative is to not fix this race condition at all: document it as such, and tell users to first build, then run docker-compose -f local.yml run --rm django python manage.py migrate, then bring the services up as part of the first-time setup.
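
Roughly, the documented first-time setup would then be:

docker-compose -f local.yml build
docker-compose -f local.yml run --rm django python manage.py migrate
docker-compose -f local.yml up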

@Andrew-Chen-Wang
Contributor

@malefice That's pretty much the reason for the Travis implementation for Docker. This initial migration issue, which I wanted to fix (and obviously failed to), could be avoided if celerybeat had its own entrypoint file that we could specify in the compose file.

@xjlin0

xjlin0 commented Jan 16, 2022

If using docker-compose 1.29 or later, will the following condition work?

  depends_on: service_completed_successfully

Also, we can always run migrations before starting the server.
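
As far as I understand the Compose specification, service_completed_successfully is a condition used in the long-form depends_on syntax, pointing at a one-shot service that runs the migrations, something like the sketch below (the django-migrations service name is hypothetical):

  celeryworker:
    depends_on:
      django-migrations:
        condition: service_completed_successfully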

@browniebroke
Member

Someone tried to fix that issue in #4459 and, after taking a further look at this, I'm not convinced it's a bug worth fixing. The error will happen once, the first time you start your server, and will pretty much go away forever. The error being returned is quite clear, and fixing it properly would involve hiding the problem.

I'm marking it as won't fix, unless someone comes up with a simple and foolproof solution...

@browniebroke added and then removed the solved label Jan 20, 2024
@github-actions bot removed the wontfix label Jan 25, 2024
@browniebroke added the wontfix label May 13, 2024
Contributor

As discussed, we won't be implementing this. Automatically closing.

@browniebroke closed this as not planned May 13, 2024