Gracefully handle long-running consumer tasks #130
Yeah, Bodhi's masher (being renamed to composer) does run long tasks triggered by messages, and it also uses threads while doing so. However, we are planning to switch it and many of our other tasks to use Celery instead of fedmsg/fedora_messaging, mostly because they are tasks and that's what Celery is designed for. Bodhi will still have a few other fedora_messaging consumers. For example, we have one to mark builds as signed when robosignatory signs them, and we might add another one to process messages from Greenwave. A workaround for this problem, for projects that have something like Celery, might be to have the message consumer fire off a Celery task to do whatever work is necessary to respond to a message, when that work might take a while.
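The workaround described above can be sketched without Celery itself. Here is a minimal stdlib stand-in (the queue, worker, and message names are illustrative, not fedora-messaging or Celery APIs): the consumer callback only enqueues the work, so it returns immediately and the message can be acknowledged promptly, while a separate worker does the slow processing.

```python
import queue
import threading

# Work handed off by the consumer callback; a real deployment would
# dispatch a Celery task instead of using an in-process queue.
work_queue = queue.Queue()
results = []

def worker():
    """Drain the queue and do the slow processing outside the callback."""
    while True:
        message = work_queue.get()
        if message is None:  # sentinel: stop the worker
            break
        results.append(f"processed {message}")  # long-running work goes here
        work_queue.task_done()

worker_thread = threading.Thread(target=worker)
worker_thread.start()

def callback(message):
    # Returns almost immediately, so the broker connection stays healthy
    # and the message is acknowledged without delay.
    work_queue.put(message)

callback("build-signed")
work_queue.put(None)   # shut the worker down for this demo
worker_thread.join()
print(results)  # → ['processed build-signed']
```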
Another idea that occurred to me while running is we could document this behavior.

There are two downsides off the top of my head: this would add a hard dependency on Twisted (which is currently optional), and it would complicate things if we at some point want to move to plain asyncio and drop the Twisted dependency. However, I think Twisted plans to integrate nicely with asyncio, so that may be something we don't need to worry about. A bit more research is probably warranted.
Yeah, I think that using Twisted for the main API could be a problem, since significant changes to Pika's Twisted plugin have been made recently and probably won't land in EPEL anytime soon.
Well, we're already building pika for the infrastructure EPEL7 repository. We could, in theory, ship pika-1.0.0b2 in the infrastructure repositories for Fedora and EPEL, although I'd rather not for Fedora. You're obviously more familiar with the API changes, but the main one was the lack of publish confirms, right? If we just use it for the consume API, that shouldn't pose a problem.

Relying on the pre-fetch count might work, although I just noticed it is a RabbitMQ extension. I guess we're already using publish confirms, so that's not an issue, but it's probably worth documenting. I have no doubt Twisted has a locking primitive, though if the prefetch count is 1, I can't think why that wouldn't be equivalent.

I'll see about making a proof-of-concept, which should hopefully answer a lot of my questions.
So I've been playing around with this for a bit; here's what I've got so far:
The problem is I really don't like the API. I've done it a few different ways and I'm still not happy. I'm going to work on something else for a day or two and see if coming back with fresh eyes will inspire me. It definitely works, but it's hacky at the moment. In case you're curious, the API I tried to get working neatly is:
It's the same call as the normal consume API, but it returns a Deferred/Future that fires when the consumer either crashes or is requested to stop via a HaltConsumer exception. Things got complicated managing several underlying Deferreds (because this API can result in multiple AMQP consumers if there are multiple queues) and deciding what to do if one queue's consumer fails, and so on. I'm also not certain I can get
The PR has been merged; closing this issue.
At the moment, our consumer API does not handle callbacks that take significantly longer than the AMQP heartbeat interval.
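A minimal example of such a callback (a sketch, not the code from the original report; the 90-second figure is just an illustration longer than the default 60s heartbeat) simply blocks past the heartbeat interval:

```python
import time

def callback(message, work_seconds=90):
    """Simulate a long-running task.

    Sleeping here blocks pika's I/O loop, so no heartbeat frames are
    sent and the broker eventually drops the connection.
    """
    print(f"Received: {message}")
    time.sleep(work_seconds)  # stands in for long CPU- or I/O-bound work
    print("Done processing")
```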
To reproduce this, use a vanilla RabbitMQ broker on Fedora with fedora-messaging (the heartbeat is 60s) and run this callback with fedora-messaging consume. This will exit with:
The problem is this:

1. The long-running consumer callback blocks the pika event loop, so no heartbeating can occur.
2. The connection times out and is killed by the broker.
3. The message(s) the consumer was processing are re-queued by the broker, since they were never acknowledged (messages are acknowledged only once the callback returns).
4. The consumer does not gracefully restart the connection, but that's a minor problem because...
5. When the consumer reconnects, it gets the message it was processing again.
Now, it's less than ideal to have consumer callbacks block for ages, but I suspect there are many such consumers in Fedora (the masher, I think? @bowlofeggs would know), so I think we need to handle this better.
The basic problem is that we need to heartbeat while the message callback is running. That means either (a) consumer callbacks need to do asynchronous I/O (and have scheduling points if they're doing long CPU-intensive tasks), or (b) the consumer callback runs in a separate thread.
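Option (a) can be sketched with asyncio (used here purely for illustration; the actual event loop under discussion is pika's/Twisted's): a CPU-heavy callback periodically yields control so the loop is free to send heartbeat frames.

```python
import asyncio

async def async_callback(message):
    """CPU-bound work with explicit scheduling points."""
    total = 0
    for i in range(10_000):
        total += i  # stands in for CPU-intensive work
        if i % 1_000 == 0:
            # Scheduling point: yield to the event loop so other work,
            # such as sending heartbeat frames, can run.
            await asyncio.sleep(0)
    return total

print(asyncio.run(async_callback("example")))  # → 49995000
```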
The most appealing way forward for me is to back the consumer API with the Twisted client we already have, but call the consumer callback with the deferToThread API (twisted.internet.threads.deferToThread) to run the callback in a thread. This lets the Twisted event loop heartbeat while it waits on the thread to finish, and the callback can use synchronous APIs all it wants.

We need to be careful, since consumer callbacks have not historically needed to be thread-safe, so handling one message at a time seems safest. This means using whatever the Twisted equivalent of https://docs.python.org/3/library/asyncio-sync.html#asyncio.Lock is.
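The same idea can be sketched with asyncio rather than Twisted (run_in_executor playing the role of deferToThread, and asyncio.Lock standing in for Twisted's locking primitive; this is an illustration, not fedora-messaging's implementation):

```python
import asyncio
import time

# One message at a time: callbacks have never needed to be thread-safe,
# so serialize them with a lock even though they run in worker threads.
_lock = asyncio.Lock()

async def handle_message(callback, message):
    async with _lock:
        loop = asyncio.get_running_loop()
        # Run the blocking callback in a thread pool; the event loop
        # stays free to send heartbeats while the thread works.
        return await loop.run_in_executor(None, callback, message)

def blocking_callback(message):
    time.sleep(0.1)  # stands in for a long-running synchronous task
    return f"processed {message}"

print(asyncio.run(handle_message(blocking_callback, "hello")))  # → processed hello
```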
Another consideration is to make sure we still handle OS signals gracefully.
There is one last tricky issue here: Twisted's reactor (AFAIK) can only be started once, which means we're going to have problems if a callback raises a HaltConsumer exception and then the caller tries to call fedora_messaging.api.consume again later. Perhaps https://github.com/itamarst/crochet is the best way to handle that.