A home for dead jobs and retry limit #8

Closed
fritzy opened this issue Apr 27, 2015 · 7 comments

fritzy commented Apr 27, 2015

A common use case we have is creating jobs that call external APIs. Upon repeated failure (failing a certain number of times), the job should be moved to a list of jobs that an admin needs to look at.

For example:

addjob emailnotify "info here" 0 retry 120 retrylimit 5 failure emailnotifydead

So if a job fails to be acknowledged 5 times, either through timeout or being re-enqueued, it should go to the failure queue (if one was provided) for an admin to look at.
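A minimal sketch of how this could look from a client, assuming the proposed RETRYLIMIT / FAILURE options existed (they are only the suggestion above, not current Disque syntax). Since Disque speaks the Redis protocol, a generic Redis client can send the raw commands; the queue names come from the example.

```python
import redis  # Disque speaks the Redis protocol, so a generic Redis client works

disque = redis.Redis(host="127.0.0.1", port=7711)  # 7711 is Disque's default port

# Enqueue the job with the proposed (hypothetical) RETRYLIMIT / FAILURE options:
# retry every 120 seconds, and after 5 failed attempts move it to "emailnotifydead".
disque.execute_command(
    "ADDJOB", "emailnotify", "info here", 0,
    "RETRY", 120, "RETRYLIMIT", 5, "FAILURE", "emailnotifydead",
)

# An admin tool could then drain the failure queue and inspect the stuck jobs.
jobs = disque.execute_command("GETJOB", "FROM", "emailnotifydead")
for queue, job_id, body in jobs or []:
    print("dead job", job_id, "from queue", queue, ":", body)
    disque.execute_command("ACKJOB", job_id)  # acknowledge once the admin has handled it
```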

Awesome work, Antirez!

antirez commented Apr 28, 2015

Hello @fritzy, I understand the use case for this, and there is a similar feature in Amazon SQS for example, but at the same time there is no global retry counter, since Disque is distributed. This problem could be roughly addressed by keeping a non-perfect counter in each node, using QUEUED cluster messages to keep the counter updated. However there is another problem due to the distributed nature of Disque: we also need to coordinate nodes so that we usually get just one dead letter for each job that could not be processed, otherwise all the nodes holding a copy may put the job into the dead letter queue multiple times.

This also seems solvable using some protocol that does not guarantee uniqueness but approximates it. Basically it is not a trivial problem, but with some design work we may find ways to solve it.
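To make the coordination problem above more concrete, here is a tiny conceptual sketch (plain Python with hypothetical names, not Disque internals): each node keeps its own best-effort counter, bumped when it observes a QUEUED message, and without extra coordination every node holding a copy would emit its own dead letter once the limit is reached.

```python
class LocalJobState:
    """Per-node, best-effort view of how many times a job has been re-queued.

    Counters on different nodes can disagree, so any retry limit enforced
    this way is only approximate, as discussed above.
    """

    def __init__(self):
        self.requeues = {}          # job_id -> approximate requeue count seen by this node
        self.dead_lettered = set()  # job_ids this node already turned into dead letters

    def on_queued(self, job_id, retry_limit, emit_dead_letter):
        """Called when this node re-queues the job or observes QUEUED from a peer."""
        count = self.requeues.get(job_id, 0) + 1
        self.requeues[job_id] = count
        if count >= retry_limit and job_id not in self.dead_lettered:
            # Without coordination between nodes, every node holding a copy
            # reaches this point independently, so the same job could be
            # dead-lettered multiple times -- the duplication problem above.
            emit_dead_letter(job_id)
            self.dead_lettered.add(job_id)
```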

A few more thoughts:

  1. This means enlarging the job structure even more; it's already 120 bytes.
  2. Maybe specifying the queue name for dead letters explicitly is not strictly needed? We may just prefix the queue name with something like dead_letter:queue_name. Not sure about that: it saves memory but is less flexible, since some users may want all the dead letters to go into the same queue.
  3. It is possible to address this in a much more memory-efficient way at the cost of other things: we are going to have an iterator to scan jobs matching specific attributes. We could iterate all the jobs inside a node where the approximated retry attempt is greater than N. However this makes it a per-node problem, while all the rest of the Disque API is distributed and not tied to a specific node, so I'm not a big fan of this.
  4. The obvious counter-argument for this feature is that this should be up to workers: when they can't process a job they may add a dead letter themselves for people to inspect. This is how Resque works, I guess? However this means certain kinds of failures are harder to catch: for example, a worker crashing as soon as it starts to process a given message, without having a chance to send a dead letter.

No clear ideas on my side :-) I need to think / work on this some more. Any feedback?

fritzy commented Apr 28, 2015

Three thoughts:

Generated names are fine.

NSQ allows you to listen for requeue events (I believe). If you could do this, a watcher process could manually move the job elsewhere, assuming it could prevent a worker from picking it back up.

The accuracy of the counter isn't terribly important. Really we're just looking for a threshold of pain -- this job is not likely to succeed, so we pass it off somewhere else. So if we said 5 failures and didn't notice until 10, that's a bit wasteful and takes up resources, but the exact count isn't really that important.

sheerun commented Apr 30, 2015

Isn't one global dead-job queue enough? It's up to workers to process dead jobs and report them. Using Disque as a "database" for dead jobs doesn't seem like a good idea, even if it only means sorting them by the originating queue.

I think it could be solved by allowing metadata to be added to each message, just as RabbitMQ does. We need it for other use cases too. For example, for failed jobs we want to add a stack trace and error name to the job and put it in the dead-jobs queue. At the same time we'd like to avoid modifying the body of the original job.
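Until something like per-message metadata exists, a client-side workaround (just a sketch, with an assumed JSON envelope and an example queue name) is to leave the original body untouched and wrap it, together with the error details, when posting the dead letter:

```python
import json
import traceback

import redis

disque = redis.Redis(port=7711)

def _as_str(value):
    return value.decode() if isinstance(value, (bytes, bytearray)) else value

def dead_letter(queue, job_id, body, exc):
    """Post a copy of the failed job to a dead-jobs queue, wrapping (not modifying)
    the original body in a JSON envelope that carries the error name and stack trace."""
    envelope = json.dumps({
        "original_queue": _as_str(queue),
        "original_job_id": _as_str(job_id),
        "body": _as_str(body),
        "error": type(exc).__name__,
        "stacktrace": traceback.format_exc(),
    })
    # "dead_jobs" is just an example name for the shared dead-job queue.
    disque.execute_command("ADDJOB", "dead_jobs", envelope, 0)
```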

antirez commented Apr 30, 2015

@fritzy @sheerun thanks for your comments. A few random arguments to keep the discussion going.

  1. It's interesting to note that the dead letter could be implemented client-side if we offer, via GETJOB, a way to also fetch this approximated "retry counter", so that the worker itself may add the job to the dead letter queue. The only issue I see with this, compared to the other approach of handling it inside the server, is that if we have a failure where the worker immediately crashes once it tries to process the job, it will not be able to add the dead letter.
  2. The Pub/Sub thing would be nice and extremely easy to implement: just a command where you listen for events like "job expired before being processed", "job retry limit reached", and so on. However, like any fire-and-forget mechanism it is not reliable. But maybe it doesn't really need to be reliable if we re-issue the event every time the job is re-queued with a retry count greater than the retry limit; eventually some listener will notice. So this could be an option.

I'm a bit more biased towards the client-side solution, if you think this could work, both for simplicity of the Disque server and for the reason @sheerun gives: usually we want to augment the failed job with more information that the worker provides.

So a feature sketch for you to evaluate:

  1. Track the approximate count of the number of times a job was reissued in some interesting way. Not super reliable during failures, but reasonably reliable otherwise would be cool. We can use the QUEUED or WILLQUEUE cluster messages, I guess, or something like that.
  2. Provide a GETJOB option to also fetch the retry count for the job, so that the worker can act ASAP if needed and post a dead letter.
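A worker-side sketch of how point 2 might be used. WITHCOUNTERS is only a placeholder name for the proposed GETJOB option, and the reply shape (a retry count per job) is an assumption, not settled API:

```python
import redis

disque = redis.Redis(port=7711)
RETRY_LIMIT = 5

def process(body):
    print("processing", body)  # application-specific work would go here

# Hypothetical: GETJOB also returns the approximated retry counter per job.
jobs = disque.execute_command("GETJOB", "WITHCOUNTERS", "FROM", "emailnotify")
for queue, job_id, body, retries in jobs or []:
    if int(retries) >= RETRY_LIMIT:
        # The worker itself posts the dead letter (possibly enriched with more
        # context, as discussed above) and acknowledges the original job.
        disque.execute_command("ADDJOB", b"dead_letter:" + queue, body, 0)
        disque.execute_command("ACKJOB", job_id)
        continue
    process(body)
    disque.execute_command("ACKJOB", job_id)
```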

Sounds good?

sheerun commented Apr 30, 2015

Don't get me wrong. I think putting dead jobs into a dead-job queue (or the job_events queue described below) after a timeout (no ack) is quite important. Without this there's no guarantee that jobs aren't dropped because a worker hangs or doesn't implement dead-job logic.

I think one clean solution would be to:

  1. Not handle retry counts at the Disque level at all.
  2. Introduce an unacknowledged / timed-out state for a job.
  3. Create a special job_events queue that both Disque and clients can push to. Clients can subscribe to job events selectively in a pub/sub manner (e.g. on transition to unacknowledged, Disque adds a job to the job_events queue stating exactly that).
  4. Clients can process events from job_events and re-schedule dead jobs as needed with an increased retry counter (stored in the metadata), as in the sketch below. I still think introducing metadata for jobs is quite important.
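A sketch of what point 4 could look like for a client, assuming the hypothetical job_events queue and an assumed JSON event format (none of this is agreed-upon design):

```python
import json

import redis

disque = redis.Redis(port=7711)
MAX_RETRIES = 5

while True:
    # Consume events from the hypothetical job_events queue (blocking GETJOB).
    for queue, event_id, body in disque.execute_command(
            "GETJOB", "FROM", "job_events") or []:
        event = json.loads(body)
        if event["type"] == "unacknowledged":
            meta = event.setdefault("meta", {})
            retries = meta.get("retries", 0)
            if retries < MAX_RETRIES:
                # Re-schedule the job with an increased retry counter kept in the
                # metadata; here the metadata travels inside the body, since Disque
                # has no separate metadata field yet.
                meta["retries"] = retries + 1
                disque.execute_command("ADDJOB", event["queue"], json.dumps(event), 0)
            # else: leave it for a human to inspect; the retry budget is exhausted
        disque.execute_command("ACKJOB", event_id)
```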

sheerun commented Apr 30, 2015

On the other hand, the body of jobs in the job_events queue could serve as the job metadata. Clients could read it and store it somewhere if needed.

antirez commented May 18, 2015

Closing this in favor of #68 in order to start a fresh discussion around a different design proposal. My top priority was to explore alternative designs that give the same practical features without resorting to an actual dead letter implementation.

antirez closed this as completed May 18, 2015