--autoscale causes billiard to raise WorkerLostError #2682

Closed
nihn opened this Issue Jun 30, 2015 · 11 comments

Comments

@nihn

nihn commented Jun 30, 2015

Hello,

I was investigating a strange error we have with one of our tasks. We received regular errors about billiard losing workers:

Task (...)[5832c7bd-fe9d-45cd-865e-b6beb4c09e47] raised unexpected: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)

I searched for other processes that could have killed our workers, but the system logs were empty. Just as I was about to give up, we turned off autoscale due to some other issues and these errors disappeared completely.

Celery 3.1.18
Django 1.7.8
librabbitmq 1.6.1

@thedrow

Member

thedrow commented Jul 5, 2015

If you're using autoscaling together with -Ofair, we found a bug that causes workers to be lost.
Other than that I'll need more info: which parameters you're using to run the worker, what kind of tasks you're running (long running or short running, pure Python or something that might crash due to a C extension), and what your settings look like.
Can you also run Celery in debug mode so we'll be able to see exactly what's going on?
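
For reference, debug mode here just means passing -l debug (or --loglevel=debug) to the worker; a minimal invocation, with the app name as a placeholder, would look like:

celery -A proj worker -l debug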

@nihn

Author

nihn commented Jul 6, 2015

I can't give you all the info because we only run this on production servers, and these events only occur under heavy load (100-200 tasks/s). The worker had these parameters (from the supervisor config):

[program:celery_push]
command=(...)/bin/celery
    -A sync worker
    -Q invalidate_entity
    -n celery_push@%%h
    --autoscale=50,2
    -Ofair
user=celery
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=600
killasgroup=true

celery settings:

BROKER_URL = 'amqp://'
CELERY_ACCEPT_CONTENT = ['pickle']
CELERY_IGNORE_RESULT = True
CELERYD_HIJACK_ROOT_LOGGER = False

task settings:

bind = True
max_retries = 5
default_retry_delay = 60

The task makes two queries to a Cassandra database and then pushes data to another service. A task takes ~7-9 ms.
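
For illustration, the task is shaped roughly like the sketch below; the app name, the query constants, cassandra_session and push_to_service are placeholders rather than our actual code:

from celery import Celery

app = Celery('sync')  # stand-in for the real app; broker/settings omitted

@app.task(bind=True, max_retries=5, default_retry_delay=60)
def push_entity(self, entity_id):
    # Two reads from Cassandra, then push the combined result to another
    # service; the whole task takes ~7-9 ms.
    try:
        entity = cassandra_session.execute(ENTITY_QUERY, [entity_id])    # placeholder
        related = cassandra_session.execute(RELATED_QUERY, [entity_id])  # placeholder
        push_to_service(entity, related)                                 # placeholder HTTP push
    except Exception as exc:
        raise self.retry(exc=exc)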

EDIT:
Ahh, I didn't read carefully; yes, we used -Ofair with --autoscale.

@thedrow

Member

thedrow commented Jul 6, 2015

I'm wondering if this is a duplicate of #2480.
Can you share the traceback so we can compare?
Also, try running without either autoscaling or -Ofair.
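
For example, the same worker with a fixed pool instead of autoscaling would look roughly like this (paths elided as above, and the pool size of 10 is only illustrative):

celery -A sync worker -Q invalidate_entity -n celery_push@%h --concurrency=10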

@nailgun


nailgun commented Jul 14, 2015

I have the same issue after enabling autoscaling on production servers. In the stacktraces I see only this:

Stacktrace (most recent call last):

  File "billiard/pool.py", line 1171, in mark_as_worker_lost
    human_status(exitcode)),

A "Scaling down X processes." message always precedes this exception in the log.

The full log message (collected via systemd-journal and graylog2):

_boot_id: 170753b0fbc8406c889fc2eb1052118f
_cmdline: /easypost/deploy-34/app/bin/python /easypost/deploy-34/app/bin/celery -A mysite worker -Ofair -l debug --autoscale=20,3 --hostname=sensored
_comm: celery
_exe: /easypost/deploy-34/app/bin/python
_gid: 1000
_pid: 760
_systemd_unit: easypost-celery.service
_transport: journal
_uid: 1000
code_file: /easypost/deploy-34/app/lib/python2.7/site-packages/celery/worker/autoscale.py
code_func: _shrink
code_line: 141
facility: python
level: Info [6]
logger: celery.worker.autoscale
message: Scaling down -1 processes.
process_name: MainProcess
source: sensored
syslog_identifier: python
thread_id: 123896179836736
thread_name: MainThread
version: 1.1

@nailgun


nailgun commented Jul 14, 2015

And this is a duplicate of the closed #2587.

@thedrow

Member

thedrow commented Jul 14, 2015

@nailgun I marked #2587 as duplicate of this issue. Thanks.

@nihn

Author

nihn commented Dec 3, 2015

@thedrow I disabled -Ofair and it didn't help. It seems that --autoscale alone is causing this.

@ask

Member

ask commented Jun 24, 2016

Closing this, as we don't have the resources to complete this task.

Autoscale will be undocumented in 4.0

@ask ask closed this Jun 24, 2016

@trevoriancox


trevoriancox commented Dec 6, 2016

Autoscale is still documented in 4.0: http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker--autoscale

We have been affected by this issue with 3.x, but if the documentation had discouraged --autoscale we would have used --concurrency instead without concern.

@thedrow

Member

thedrow commented Dec 8, 2016

@trevoriancox That's a documentation bug.

@jcushman


jcushman commented Dec 7, 2017

I have a minimal reproduction of this error on celery 4.1 in docker, in case there are ever resources available to revisit autoscaling:

https://gist.github.com/jcushman/b9081cf686b0801d481639988c5194fd

The reproduction constantly scales up and down while running tasks that use a lot of CPU (basically sum(1..30000000)). This triggers WorkerLostError for approximately 1 in 1000 tasks on my machine. It seems to specifically depend on the CPU per task -- it stops happening with sum(1..10000000). This makes me wonder if it's some sort of timeout getting hit when billiard asks a process if it's ready to scale down.
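
In case it's useful without clicking through, the failing task is essentially CPU-bound work along these lines (the names and broker URL here are illustrative paraphrases, not copied from the gist):

from celery import Celery

app = Celery('repro', broker='amqp://')  # broker URL is illustrative

@app.task
def burn_cpu(n=30000000):
    # Pure CPU work, roughly sum(1..n); heavy enough per task that scaling
    # down while tasks are running occasionally yields WorkerLostError.
    return sum(range(1, n + 1))

The worker is started with --autoscale so the pool keeps scaling up and down while these tasks are continuously enqueued; see the gist for the actual docker setup.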
