
Task loss on retry when using a hybrid/staged Celery 3->4 deployment #4356

Closed
2 tasks done
freakboy3742 opened this issue Nov 1, 2017 · 8 comments

@freakboy3742
Contributor

If you have a Celery 3.1.25 deployment with many workers and want to upgrade to Celery 4, you may wish to do "canary" testing on a limited subset of workers to validate that the upgrade won't introduce any problems, prior to upgrading your entire worker fleet to Celery 4. This "canary" mode involves having both Celery 3.1.25 and Celery 4 workers running at the same time.

However, if you do this, and you have tasks that retry, you experience problems if a task is attempted on a Celery 3.1.25 node, then a Celery 4 node, and then a Celery 3.1.25 node.

When a Celery 3.1.25 task is retried on a Celery 4 worker, the task message is upgraded to Protocol 2. However, the upgrade produces a "hybrid" message that mixes the two formats, and when the task fails and is retried on a Celery 3.1.25 worker, the hybrid message is mis-identified as a Protocol 2 message. The Protocol 2 compatibility shim then chokes on the Protocol 1 body, resulting in a hard crash and message loss.
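For context, here is a minimal sketch of why the mismatch crashes. This is a hypothetical illustration, not Celery's actual internals; the body shapes follow Celery's documented message protocols, and the task/id values are made up:

```python
# Protocol 2: the task name lives in the headers, and the body is a
# 3-tuple of (args, kwargs, embed).
proto2_body = ((1, 2), {"x": 3}, {"callbacks": None, "chain": None})
args, kwargs, embed = proto2_body  # unpacks cleanly: exactly three items

# Protocol 1: everything, including the task name, lives in one dict.
# A hybrid message carries protocol 2 headers over a body like this, so
# the 3.1.25 compatibility shim tries to unpack it as a triple:
proto1_body = {"task": "proj.add", "id": "42", "args": (1, 2), "kwargs": {"x": 3}}
try:
    args, kwargs, embed = proto1_body  # iterating a 4-key dict yields 4 keys
except ValueError as exc:
    # ValueError: too many values to unpack -- the crash in the traceback below
    print(exc)
```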

Checklist

  • I have included the output of celery -A proj report in the issue.
  • I have verified that the issue exists against the master branch of Celery.

Steps to reproduce

A full reproduction case can be found in this gist:

https://gist.github.com/ewdurbin/ddf4b0f0c0a4b190251a4a23859dd13c

In local testing, the following two versions were used:

### Celery 3.1.25

software -> celery:3.1.25 (Cipater) kombu:3.0.37 py:2.7.13
            billiard:3.3.0.23 redis:2.10.6
platform -> system:Darwin arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

BROKER_URL: 'redis://localhost:6379//'
CELERY_ENABLE_UTC: True
CELERY_RESULT_SERIALIZER: 'json'
CELERY_ACCEPT_CONTENT: ['json']
CELERY_TIMEZONE: 'UTC'
CELERY_TASK_SERIALIZER: 'json'

### Celery 4.1.0

software -> celery:4.1.0 (latentcall) kombu:4.1.0 py:2.7.13
            billiard:3.5.0.3 redis:2.10.6
platform -> system:Darwin arch:64bit imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

task_serializer: 'json'
result_serializer: 'json'
CELERY_ENABLE_UTC: True
accept_content: ['json']
enable_utc: True
timezone: 'UTC'
broker_url: u'redis://localhost:6379//'

Although these test results were obtained on a Mac running Sierra, the problem has also been observed in production on AWS EC2 Linux machines.

Expected behavior

A task should be able to move back and forth between a 3.1.25 worker and a 4.1.0 worker without any problems.

Actual behavior

A task can be executed on a Celery 3.1.25 worker, then on a Celery 4.1.0 worker; but when the task is then retried on a Celery 3.1.25 worker, the following error is produced:

Traceback (most recent call last):
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/__init__.py", line 206, in start
    self.blueprint.start(self)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/bootsteps.py", line 374, in start
    return self.obj.start()
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 280, in start
    blueprint.start(self)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 884, in start
    c.loop(*c.loop_args())
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/loops.py", line 76, in asynloop
    next(loop)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/async/hub.py", line 340, in create_loop
    cb(*cbargs)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/transport/redis.py", line 1019, in on_readable
    self._callbacks[queue](message)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/transport/virtual/__init__.py", line 535, in _callback
    return callback(message)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/messaging.py", line 598, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/kombu/messaging.py", line 564, in receive
    [callback(body, message) for callback in callbacks]
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 462, in on_task_received
    self.app, type_, body, message, headers)
  File "/Users/rkm/zapier/celery/celery3-venv/lib/python2.7/site-packages/celery/worker/consumer.py", line 483, in proto2_to_proto1
    args, kwargs, embed = body
ValueError: too many values to unpack

This kills the worker, and the task message is lost.

@freakboy3742
Contributor Author

See #4357 for a fix in the forward-compatibility code in Celery 3.1.25, and #4358 for a matching fix in Celery 4.1.0.
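A hedged sketch of the kind of body-shape check such a fix needs, with hypothetical helper names (the real patches are in the PRs linked above):

```python
def classify_body(headers, body):
    """Classify a task message body, tolerating "hybrid" messages whose
    headers claim protocol 2 but whose body is still a protocol 1 dict.

    Hypothetical helper for illustration -- not the code from #4357/#4358.
    """
    if isinstance(body, dict) and "task" in body:
        return 1  # protocol 1 bodies are dicts that carry the task name
    if isinstance(body, (list, tuple)) and len(body) == 3:
        return 2  # protocol 2 bodies are (args, kwargs, embed) triples
    # only trust the headers when the body shape is ambiguous
    return 2 if headers.get("task") else 1


def safe_proto2_to_proto1(headers, body):
    """Unpack a protocol 2 body without crashing on hybrid dict bodies."""
    if classify_body(headers, body) == 1:
        return body["args"], body["kwargs"], {}
    args, kwargs, embed = body  # genuine protocol 2: exactly three items
    return args, kwargs, embed
```

The key design point is that the body's own shape, not the headers alone, decides which unpacking path runs, so a mixed headers/body message no longer reaches the bare three-way unpack that kills the worker.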

@auvipy
Member

auvipy commented Nov 1, 2017

Thanks for the report and fixes, Russell! Could you also provide tests for your patches?

@freakboy3742
Contributor Author

Not sure how to go about an automated test for these, as on both ends, the test involves running a version of Celery that isn't the version being tested. Any tips/suggestions appreciated.

@auvipy
Member

auvipy commented Nov 2, 2017

@thedrow any insights?

@wcooley

wcooley commented Dec 18, 2017

NB: This can also happen with a 4.1.0 producer and a 3.1.25 consumer, where the consumer calls retry(). When the consumer returns the message to the queue, it translates the body to the protocol 1 format but retains the protocol 2 headers, so the message is detected as protocol 2.

I manually applied #4357 to my production system (eek!) and it appears to be working.
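Besides patching, a staged deployment may be able to sidestep hybrid messages entirely by pinning the Celery 4 side to the old wire format while 3.1.25 workers remain. `task_protocol` is a documented Celery 4 setting; the app name and broker URL below are placeholders:

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379//")
# Emit protocol 1 messages only, so 3.1.25 workers never receive a
# protocol 2 (or hybrid) payload during the rolling upgrade.
app.conf.task_protocol = 1
```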

@thedrow
Member

thedrow commented Dec 19, 2017

Great!

@auvipy auvipy added this to the v4.2 milestone Dec 19, 2017
thedrow pushed a commit that referenced this issue Jan 12, 2018
…ersions (#4358)

* handle "hybrid" messages which have passed through both a protocol 1 and a protocol 2 consumer in their lifetime.

we detected an edge case, demonstrated in https://gist.github.com/ewdurbin/ddf4b0f0c0a4b190251a4a23859dd13c#file-readme-md, in which messages retried by a 3.1.25 consumer, then a 4.1.0 consumer, then again by a 3.1.25 consumer are mishandled. as an extension, this patch handles the "next" iteration of these mutant payloads.

* explicitly construct proto2 from "hybrid" messages

* remove unused kwarg

* fix pydocstyle check

* flake8 fixes

* correct fix for misread pydocstyle error
@auvipy auvipy closed this as completed Jan 13, 2018
emmettbutler added a commit to DataDog/dd-trace-py that referenced this issue Mar 22, 2023
The issue referenced was
[resolved](celery/celery#6374), but it actually
does not apply to the bit of code it's pointing to.

Fixes #4649

I tried removing the code referenced by the comment, and it causes the
test
`tests/contrib/celery/test_integration.py::CeleryDistributedTracingIntegrationTask::test_distributed_tracing_propagation_async`
to fail. As far as I can tell from about an hour of investigation, the
celery [fix](celery/celery#4356) only affects
"hybrid messages" (those that blend protocol 1 and 2 syntax as part of a
rolling Celery upgrade), which our test suite doesn't use.
IlyaMichlin pushed a commit to IlyaMichlin/dd-trace-py that referenced this issue Mar 30, 2023

5 participants
@freakboy3742 @thedrow @wcooley @auvipy and others