
NACKJOB (re-queue) and WAITJOB #43

Closed
VojtechVitek opened this issue May 4, 2015 · 14 comments

Comments

@VojtechVitek
Contributor

First of all, let me thank you for this awesome project. It feels fast & stable, even though it's still Alpha. Good job! 👍

I'm trying to solve these two use cases:

  1. NACKJOB (force re-queue)

    Let's assume I'm a consumer and I fail to process the message. It'd be great if I could NACKJOB id, so the job gets re-queued right away (if the RETRY timeout was specified).

    Or should I use ACKJOB id followed by ADDJOB queue data instead? Can I do this atomically?

  2. WAITJOB

    It'd be great to have a blocking operation that would wait for a job to finish (get ACKed):

    WAITJOB id [id ...] timeout
    ...in other words, a command similar to Redis' BLPOP key [key ...] timeout

Feedback welcome, I could be missing something obvious.

@antirez
Owner

antirez commented May 4, 2015

Hello @VojtechVitek, thanks! I'm positive about both features; @dvirsky and I have had private chats about this. Currently, for NACKJOB you could use ENQUEUE. I need to check its exact behavior compared to what we expect from a negative ACK, but it looks like exactly what we want: put the job back on the queue ASAP, and inform the other nodes to postpone their re-queueing operation since we just queued the job.

However, currently the command is completely ignored if sent to a node that does not know about the job at all. This may make sense after all, since broadcasting an ENQUEUE message cluster-wide is non-trivial if we want only a single node to enqueue the job. Moreover, it is not critical that a negative ACK works in Disque, since the system will put the job back on the queue after the retry time anyway.
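For illustration, a minimal sketch of the ENQUEUE workaround described above. It assumes a Disque node on the default port 7711 and uses redis-py's generic execute_command (Disque speaks the Redis protocol); the queue name and the process() function are made up for the example:

```python
import redis

# Disque speaks the Redis protocol, so a generic Redis client can send its
# commands. Port 7711 is Disque's default; the queue name is illustrative.
disque = redis.Redis(host="localhost", port=7711)

# Fetch one job from the example queue (blocks until a job is available).
queue, job_id, body = disque.execute_command("GETJOB", "FROM", "myqueue")[0]

try:
    process(body)                           # hypothetical worker logic
    disque.execute_command("ACKJOB", job_id)
except Exception:
    # "Negative ack" workaround: put the job back on its queue immediately,
    # instead of waiting for the RETRY timeout to re-queue it.
    disque.execute_command("ENQUEUE", job_id)
```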

About WAITJOB: it looks like something I would like to have, and there are very good use cases. I'm just not sure about the variadic nature of your proposed solution: would it return only when all the jobs are finished, or as soon as the first is acked, returning the job ID of the acked job? Both semantics may be useful honestly; maybe it makes sense to support an "ALL" option.

@VojtechVitek
Contributor Author

@antirez Thanks for your feedback!

I'd be OK with WAITJOB id only, since I can call the command sequentially to achieve "ALL". I'll leave it up to you.

Another feature that I have in mind is changing the queue of a job.
Use case: let's say I have "low" and "high" priority queues, and my consumer gets a job from the "low" priority queue (using GETJOB FROM "high" "low") and then fails to process it. I'd like the job to be re-queued to the higher-priority queue now - what would you suggest?
Since there's no MULTI/EXEC for atomic transactions, as in Redis, how would I ACKJOB id and ADDJOB "high" data atomically?

Or ENQUEUE "queue_name" jobid?

@dvirsky

dvirsky commented May 5, 2015

@antirez I tried adding ENQUEUE as a failure mechanism, but the problem was that I couldn't control the number of retries. The only way to do it currently is to copy the entire job, encode the number of retries in the job data, and add the copy as a new job. Otherwise, until the TTL is reached, you could be doing thousands of retries.
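A minimal sketch of the copy-the-job approach described above, assuming the retry count is carried in a small JSON envelope around the payload; the queue name, retry limit, port, and process() are illustrative:

```python
import json
import redis

# Disque speaks the Redis protocol; 7711 is its default port.
disque = redis.Redis(host="localhost", port=7711)
MAX_RETRIES = 5                          # illustrative limit

queue, job_id, body = disque.execute_command("GETJOB", "FROM", "myqueue")[0]
envelope = json.loads(body)              # e.g. {"retries": 0, "payload": {...}}

try:
    process(envelope["payload"])         # hypothetical worker logic
    disque.execute_command("ACKJOB", job_id)
except Exception:
    if envelope["retries"] < MAX_RETRIES:
        # Re-add a copy of the job with the counter bumped, since Disque
        # itself does not count retries triggered this way.
        envelope["retries"] += 1
        disque.execute_command("ADDJOB", "myqueue", json.dumps(envelope), 0)
    # Either way, ack the original so it is not re-delivered on RETRY.
    disque.execute_command("ACKJOB", job_id)
```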

@antirez
Owner

antirez commented May 5, 2015

Hey, about the different topics here:

  1. Yep, maybe we can start with just WAITJOB jobid timeout and have a return value conceived to "scale" to the variadic version. The command is quite straightforward if we have the following semantics:
  • Block waiting for the job to be acknowledged or return ASAP if the job is already acknowledged.
  • Return the number of acked jobs, so 1 if the command succeeds, 0 if the timeout is reached. This potentially scales to a variadic version.
  • If the node does not know anything about such a job, an error is returned.

However note that there are a few things happening here:

  • If you do ADDJOB / WAITJOB one after the other, it is possible that the job is delivered + acknowledged ASAP and WAITJOB is issued when the node no longer knows anything about the job.
  • Probably we should have an ADDJOB option that waits for the job to be acknowledged? That way add & wait is a single operation and there is no race. It is also faster, since we need one operation instead of two.
  2. About ENQUEUE and negative acks: @dvirsky, I believe we need to keep some counters, so an actual NACK could be useful. I think nodes should keep an approximate count of:
  • Re-queue attempts due to timeout.
  • Re-queue attempts due to negative acknowledgement.

When the job is fetched, you could pass GETJOB an option to also retrieve these two numbers. This way your worker is able to spot both failures where the worker can't proceed and failures where the worker in one way or another lost the message. It could be conceptually simpler to keep just a single counter, but we have to consider whether the distinction is actually important.

  3. Moving to a different priority is, I believe, better served by ACKing the old job and re-adding it into a different queue, since jobs are currently pretty much immutable. This is a key property since the system is AP, so we need to merge only the job state: if the job gets acknowledged we converge to that; otherwise, the job is immutable.

Note that there is no race condition here conceptually: if you want to move the job to a different priority, you ACK and re-add. If somebody processed the job in the middle, you can consider that as multiple delivery, which is something you need to face anyway with Disque and other message queues. Moreover, such an atomic primitive would be as futile as trying to achieve exactly-once delivery, which is impossible. Example:

  • Worker fetches the message when it's low priority.
  • Worker processes it, but crashes before the ACK.
  • An atomic ACK-and-re-add-into-a-different-queue primitive succeeds, but the multiple delivery still happened.
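A sketch of the non-atomic ACK-and-re-add pattern for the priority-change case discussed above. Queue names, the escalation check, port, and client setup are illustrative; the point is that a concurrent delivery between the two commands is just ordinary multiple delivery:

```python
import redis

disque = redis.Redis(host="localhost", port=7711)   # Disque default port

# Consumer prefers "high" but also drains "low" (illustrative queue names).
queue, job_id, body = disque.execute_command(
    "GETJOB", "FROM", "high", "low")[0]

if queue == b"low" and should_escalate(body):        # hypothetical policy check
    # Not atomic: ack the old job, then add a copy to the higher-priority
    # queue. A delivery happening in between is multiple delivery, which
    # consumers must tolerate anyway.
    disque.execute_command("ACKJOB", job_id)
    disque.execute_command("ADDJOB", "high", body, 0)
```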

@VojtechVitek
Contributor Author

@antirez
However note that there are a few things happening here:

If you do ADDJOB / WAITJOB one after the other, it is possible that the job is delivered + acknowledged ASAP and WAITJOB is issued when the node no longer knows anything about the job.
Probably we should have an ADDJOB option that waits for the job to be acknowledged? That way add & wait is a single operation and there is no race. It is also faster, since we need one operation instead of two.

In my opinion, WAITJOB non-existing-job should just return (nil) immediately. This would be consistent with SHOW non-existing-job. The semantics would be "block until the job is ACKed (dequeued), or return nil".

EDIT: I'd be OK with a blocking ADDJOB, but WAITJOB makes more sense to me. What if I want to add several jobs and then wait on all of them? That way I could do it easily using native Disque syntax.
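For what it's worth, a sketch of that "add several jobs, then wait on all of them" pattern, assuming a WAITJOB jobid timeout command were added with the semantics proposed above (WAITJOB did not exist at the time of this discussion; queue name, payloads, timeout, and client setup are illustrative):

```python
import redis

disque = redis.Redis(host="localhost", port=7711)   # Disque default port

# Add several jobs first...
job_ids = [
    disque.execute_command("ADDJOB", "myqueue", body, 0)
    for body in ("task-1", "task-2", "task-3")       # illustrative payloads
]

# ...then wait on each one sequentially, which gives "wait for ALL" without
# a variadic WAITJOB. Assumes the proposed WAITJOB jobid timeout command.
for job_id in job_ids:
    disque.execute_command("WAITJOB", job_id, 5000)  # 5 s timeout, illustrative
```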

@VojtechVitek
Contributor Author

Moving to a different priority is, I believe, better served by ACKing the old job and re-adding it into a different queue, since jobs are currently pretty much immutable. This is a key property since the system is AP, so we need to merge only the job state: if the job gets acknowledged we converge to that; otherwise, the job is immutable.

Yeah, that makes sense. I believe this is very similar to #45. Let's leave it as is for now, as we can use ACKJOB and ADDJOB instead. Thanks!

@djanowski
Contributor

Maybe I'm asking something too obvious here, but under what conditions would you want to NACK a job? My first impression is that if you really want to retry as soon as possible, then you might as well retry in the consumer immediately: if you were able to recover from the error and call NACKJOB, then you can just as easily retry the job right there. And if we encourage NACKJOB as the command to run whenever the consumer fails for any reason, then a bug in the code or a connection error can trigger a loop where jobs are consumed and NACKed continuously.

One case where you might want to NACK is when a shutdown of your workers is required. In this case, you could go for a completely graceful shutdown (have the process wait until current jobs are finished and acked, then shut down), or you could stop working immediately, NACK all the current jobs, and expect other consumers to pick them up as soon as possible.

@antirez
Owner

antirez commented May 13, 2015

Damian, my guess is that NACK is intimately related to letting the message queue count the number of failures, in order to implement the "dead letter" feature for jobs that fail to be processed more than N times. Another use case is when the cause of the failed processing is local to the worker, so it wishes for a different worker to try ASAP. Basically, it's hard to evaluate and design NACK without reasoning about the other related features...

@VojtechVitek
Contributor Author

Yeah, as Salvatore said, one of the main motivations for NACK is counting failures.

Also, if a worker wants to shut down gracefully (without waiting for the operation to finish!), it wants to NACK the job so other workers can pick it up ASAP. Think of jobs that take a long time to process (say 2+ minutes) - you can't afford to wait 2+ minutes after the worker receives SIGTERM.
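A sketch of that shutdown idea; since a dedicated NACK command did not yet exist at this point in the discussion, the sketch uses ENQUEUE as the stand-in to hand in-flight jobs back immediately. The signal handling, queue name, port, and process() are illustrative:

```python
import signal
import sys
import redis

disque = redis.Redis(host="localhost", port=7711)   # Disque default port
in_flight = set()                                    # job IDs currently being processed

def handle_sigterm(signum, frame):
    # Hand back every in-flight job right away instead of holding it for
    # minutes; other workers can pick them up without waiting for RETRY.
    for job_id in in_flight:
        disque.execute_command("ENQUEUE", job_id)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

while True:
    queue, job_id, body = disque.execute_command("GETJOB", "FROM", "myqueue")[0]
    in_flight.add(job_id)
    process(body)                                    # hypothetical long-running work
    disque.execute_command("ACKJOB", job_id)
    in_flight.discard(job_id)
```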

@dvirsky

dvirsky commented May 14, 2015

A few use cases I might think of for NACKing a job:

"Type 1": Temporary problems from the worker's perspective. e.g. - outside network is unreachable, disk is full on a specific machine, bad configuration value, etc etc. NACK will let another worker try doing it ASAP instead of in a few seconds/minutes.

"Type 2": Temporary problems with an external resource that might be fixed in a second - timeouts reading from database, an outside REST API returned error 500, etc. We can retry them on the worker side, but we have a job scheduler, so why not use it?

BTW we might not know if a problem is of type 1 or 2 so easily.

@antirez
Owner

antirez commented May 14, 2015

@dvirsky yep, it's hard to tell the type of the problem, but IMHO on exceptions the worker should NACK. Then it's up to the people running the site to perform incremental updates of the workers, rather than all-at-once changes that may lead to every worker having the same issue. This way all the problems may be semantically the same: even bugs may be limited to a small percentage of nodes in the first phase of the rolling update.

@antirez
Owner

antirez commented May 18, 2015

Hello, I summarized and proposed things related to this issue (the NACK part) here: #68.

antirez closed this as completed Nov 28, 2015
@antirez
Owner

antirez commented Nov 28, 2015

Closed because this feature was implemented.

@VojtechVitek
Contributor Author

Thank you, @antirez!
