Task loss on retry when using a hybrid/staged Celery 3->4 deployment #4356

freakboy3742 · 2017-11-01T04:42:23Z

If you have a Celery 3.1.25 deployment involving many workers, and you want to upgrade to Celery 4, you may wish to do "canary" testing of a limited subset of workers to validate that the upgrade won't introduce any problems, prior to upgrading your entire worker fleet to Celery4. This "canary" mode involves having both Celery 3.1.25 and Celery 4 workers running at the same time.

However, if you do this, and you have tasks that retry, you experience problems if a task is attempted on a Celery 3.1.25 node, then a Celery 4 node, and then a Celery 3.1.25 node.

When the Celery 3.1.25 task is executed on a Celery 4, the task message is upgraded to Protocol 2. However, the upgrade results in a hybrid message that complies with both formats, and when the task fails and is retried on the Celery 3.1.25 worker, the "hybrid" message is mis-identified as a Protocol 1 message, resulting in a hard crash and message loss.

Checklist

I have included the output of celery -A proj report in the issue.
I have verified that the issue exists against the master branch of Celery.

Steps to reproduce

A full reproduction case can be found in this gist:

https://gist.github.com/ewdurbin/ddf4b0f0c0a4b190251a4a23859dd13c

In local testing, the following two versions were used:

###Celery 3.1.25:

software -> celery:3.1.25 (Cipater) kombu:3.0.37 py:2.7.13
            billiard:3.3.0.23 redis:2.10.6
platform -> system:Darwin arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

BROKER_URL: 'redis://localhost:6379//'
CELERY_ENABLE_UTC: True
CELERY_RESULT_SERIALIZER: 'json'
CELERY_ACCEPT_CONTENT: ['json']
CELERY_TIMEZONE: 'UTC'
CELERY_TASK_SERIALIZER: 'json'

###Celery 4.1.0:

software -> celery:4.1.0 (latentcall) kombu:4.1.0 py:2.7.13
            billiard:3.5.0.3 redis:2.10.6
platform -> system:Darwin arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

task_serializer: 'json'
result_serializer: 'json'
CELERY_ENABLE_UTC: True
accept_content: ['json']
enable_utc: True
timezone: 'UTC'
broker_url: u'redis://localhost:6379//'

Although these test results were obtained on a Mac running Sierra, the problem has also been observed in production on AWS EC2 Linux machines.

Expected behavior

A task should be able to move back and forth between a 3.1.25 worker and a 4.1.0 worker without any problems.

Actual behavior

A task can be executed on a Celery 3.1.25 worker , then on a Celery 4.1.0 worker; but when the task is then run on a Celery 3.1.25 worker, the following error is produced:

Traceback (most recent call last):
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start
    self.blueprint.start(self)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/bootsteps.py", line 374, in start
    return self.obj.start()
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 280, in start
    blueprint.start(self)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 884, in start
    c.loop(*c.loop_args())
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/loops.py", line 76, in asynloop
    next(loop)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/async/hub.py", line 340, in create_loop
    cb(*cbargs)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 1019, in on_readable
    self._callbacks[queue](message)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/transport/virtual/__init__.py", line 535, in _callback
    return callback(message)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/messaging.py", line 598, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/messaging.py", line 564, in receive
    [callback(body, message) for callback in callbacks]
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 462, in on_task_received
    self.app, type_, body, message, headers)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 483, in proto2_to_proto1
    args, kwargs, embed = body
ValueError: too many values to unpack

This kills the worker, and the task message is lost.

The text was updated successfully, but these errors were encountered:

freakboy3742 · 2017-11-01T04:56:23Z

See #4357 for a fix in the forward compatibility code in Celery 3.1.25; #4358 for a matching fix in Celery 4.1.0

auvipy · 2017-11-01T13:53:02Z

thanks for the report and fixes Russel! could you also provide the tests for your patches?

freakboy3742 · 2017-11-01T23:36:52Z

Not sure how to go about an automated test for these, as on both ends, the test involves running a version of Celery that isn't the version being tested. Any tips/suggestions appreciated.

auvipy · 2017-11-02T02:40:06Z

@thedrow any insights?

wcooley · 2017-12-18T20:24:00Z

NB: This can happen also with a 4.1.0 producer and 3.1.25 consumer, where the consumer calls retry(). It seems that when it returns the message to the queue, it has translated the body to the protocol 1 format but it retains the headers for protocol 2, so it is detected as 2.

I manually applied #4357 to my production system (eek!) and it appears to be working.

thedrow · 2017-12-19T06:15:14Z

Great!

…ersions (#4358) * handle "hybrid" messages which have passed through a protocol 1 and protocol 2 consumer in its life. we detected an edgecase which is proofed out in https://gist.github.com/ewdurbin/ddf4b0f0c0a4b190251a4a23859dd13c#file-readme-md which mishandles messages which have been retried by a 3.1.25, then a 4.1.0, then again by a 3.1.25 consumer. as an extension, this patch handles the "next" iteration of these mutant payloads. * explicitly construct proto2 from "hybrid" messages * remove unused kwarg * fix pydocstyle check * flake8 fixes * correct fix for misread pydocstyle error

The issue referenced was [resolved](celery/celery#6374), but it actually does not apply to the bit of code it's pointing to. Fixes #4649 I tried removing the code referenced by the comment, and it causes the test `tests/contrib/celery/test_integration.py::CeleryDistributedTracingIntegrationTask::test_distributed_tracing_propagation_async` to fail. As far as I can tell from about an hour of investigation, the celery [fix](celery/celery#4356) only affects "hybrid messages" (those that blend protocol 1 and 2 syntax as part of a rolling Celery upgrade), which our test suite doesn't use.

The issue referenced was [resolved](celery/celery#6374), but it actually does not apply to the bit of code it's pointing to. Fixes DataDog#4649 I tried removing the code referenced by the comment, and it causes the test `tests/contrib/celery/test_integration.py::CeleryDistributedTracingIntegrationTask::test_distributed_tracing_propagation_async` to fail. As far as I can tell from about an hour of investigation, the celery [fix](celery/celery#4356) only affects "hybrid messages" (those that blend protocol 1 and 2 syntax as part of a rolling Celery upgrade), which our test suite doesn't use.

auvipy added this to the v4.2 milestone Dec 19, 2017

thedrow mentioned this issue Jan 12, 2018

Add test coverage for #4356 #4483

Open

auvipy closed this as completed Jan 13, 2018

This was referenced Jan 30, 2018

Upgrade to Celery 3.1.25 in preparation to Celery 4. getredash/redash#2270

Merged

upgrade to Celery 4.1.0 mozilla/redash#297

Merged

mashrikt mentioned this issue May 11, 2018

upgrade to Celery 4.1.0 getredash/redash#2529

Closed

jezdez mentioned this issue Sep 27, 2018

Upgrade Celery to 4.2.1. getredash/redash#2773

Merged

This was referenced Mar 21, 2023

Upstream issue celery/celery#4875 closed DataDog/dd-trace-py#4649

Closed

fix(celery): remove reference to github issue DataDog/dd-trace-py#5381

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Task loss on retry when using a hybrid/staged Celery 3->4 deployment #4356

Task loss on retry when using a hybrid/staged Celery 3->4 deployment #4356

freakboy3742 commented Nov 1, 2017

freakboy3742 commented Nov 1, 2017

auvipy commented Nov 1, 2017

freakboy3742 commented Nov 1, 2017

auvipy commented Nov 2, 2017

wcooley commented Dec 18, 2017

thedrow commented Dec 19, 2017

Task loss on retry when using a hybrid/staged Celery 3->4 deployment #4356

Task loss on retry when using a hybrid/staged Celery 3->4 deployment #4356

Comments

freakboy3742 commented Nov 1, 2017

Checklist

Steps to reproduce

Expected behavior

Actual behavior

freakboy3742 commented Nov 1, 2017

auvipy commented Nov 1, 2017

freakboy3742 commented Nov 1, 2017

auvipy commented Nov 2, 2017

wcooley commented Dec 18, 2017

thedrow commented Dec 19, 2017