
NACKJOB (re-queue) and WAITJOB #43

Closed
VojtechVitek opened this issue May 4, 2015 · 14 comments

Comments

@VojtechVitek
Contributor

First of all, let me thank you for this awesome project. It feels fast & stable, even though it's still Alpha. Good job! 👍

I'm trying to solve these two use cases:

  1. NACKJOB (force re-queue)

    Let's assume I'm a consumer and I fail to process the message. It'd be great if I could NACKJOB id, so the job gets re-queued right away (if the RETRY timeout was specified).

    Or should I use ACKJOB id followed by ADDJOB queue data instead? Can I do this atomically?

  2. WAITJOB

    It'd be great to have a blocking operation that would wait for a job to finish (get ACKed):

    WAITJOB id [id ...] timeout
    ...in other words, a command similar to Redis' BLPOP key [key ...] timeout

Feedback welcome, I could be missing something obvious.

@antirez
Owner

antirez commented May 4, 2015

Hello @VojtechVitek, thanks! I'm positive about both features; @dvirsky and I have had private chats about this. Currently, for NACKJOB you could use ENQUEUE. I need to check its exact behavior compared to what we expect from a negative ACK, but it looks like exactly what we want: put the job back on the queue ASAP, and inform the other nodes to postpone their re-queueing operation since we just queued the job.

However, currently the command is completely ignored if sent to a node that does not know about the job at all. This may make sense after all, since broadcasting an ENQUEUE message cluster-wide is non-trivial if we want only a single node to enqueue the job. Moreover, it is not critical that a negative ACK works in Disque, since the system will put the job back on the queue after the retry time anyway.
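For illustration, a minimal sketch of the ENQUEUE workaround described above. It assumes a Disque node on the default port 7711 and uses redis-py's generic execute_command (Disque speaks the Redis protocol); the queue name and the process() function are made up for the example:

```python
import redis

# Disque speaks the Redis protocol, so a generic Redis client can send its
# commands. Port 7711 is Disque's default; the queue name is illustrative.
disque = redis.Redis(host="localhost", port=7711)

# Fetch one job from the example queue (blocks until a job is available).
queue, job_id, body = disque.execute_command("GETJOB", "FROM", "myqueue")[0]

try:
    process(body)                           # hypothetical worker logic
    disque.execute_command("ACKJOB", job_id)
except Exception:
    # "Negative ack" workaround: put the job back on its queue immediately,
    # instead of waiting for the RETRY timeout to re-queue it.
    disque.execute_command("ENQUEUE", job_id)
```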

About WAITJOB: it looks like something I would like to have, and there are very good use cases. I'm just not sure about the variadic nature of your proposed solution: would it return only when all the jobs are finished, or as soon as the first is acked, returning the job ID of the acked job? Both semantics may be useful honestly; maybe it makes sense to support an "ALL" option.

@VojtechVitek
Contributor Author

@antirez Thanks for your feedback!

I'd be OK with WAITJOB id only, since I can call the command sequentially to achieve "ALL". I'll leave it up to you.

Another feature that I have in mind is changing the queue of a job.
Use case: let's say I have "low" and "high" priority queues, and my consumer gets a job from the "low" priority queue (using GETJOB FROM "high" "low") and then fails to process it. I'd like the job to be re-queued to the higher-priority queue now - what would you suggest?
Since there's no MULTI/EXEC for atomic transactions, as in Redis, how would I ACKJOB id and ADDJOB "high" data atomically?

Or ENQUEUE "queue_name" jobid?

@dvirsky

dvirsky commented May 5, 2015

@antirez I tried adding ENQUEUE as a failure mechanism, but the problem was that I couldn't control the number of retries. The only way to do it currently is to copy the entire job, encode the number of retries in the job data, and add the copy as a new job. Otherwise, until the TTL is reached, you could be doing thousands of retries.
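A minimal sketch of the copy-the-job approach described above, assuming the retry count is carried in a small JSON envelope around the payload; the queue name, retry limit, port, and process() are illustrative:

```python
import json
import redis

# Disque speaks the Redis protocol; 7711 is its default port.
disque = redis.Redis(host="localhost", port=7711)
MAX_RETRIES = 5                          # illustrative limit

queue, job_id, body = disque.execute_command("GETJOB", "FROM", "myqueue")[0]
envelope = json.loads(body)              # e.g. {"retries": 0, "payload": {...}}

try:
    process(envelope["payload"])         # hypothetical worker logic
    disque.execute_command("ACKJOB", job_id)
except Exception:
    if envelope["retries"] < MAX_RETRIES:
        # Re-add a copy of the job with the counter bumped, since Disque
        # itself does not count retries triggered this way.
        envelope["retries"] += 1
        disque.execute_command("ADDJOB", "myqueue", json.dumps(envelope), 0)
    # Either way, ack the original so it is not re-delivered on RETRY.
    disque.execute_command("ACKJOB", job_id)
```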

@antirez
Owner

antirez commented May 5, 2015

Hey, about the different topics here:

  1. Yep, maybe we can start with just WAITJOB jobid timeout and have a return value conceived to "scale" to the variadic version. The command is quite straightforward if we have the following semantics:
  • Block waiting for the job to be acknowledged or return ASAP if the job is already acknowledged.
  • Return the number of acked jobs, so 1 if the command succeeds, 0 if the timeout is reached. This potentially scales to a variadic version.
  • If the node does not know anything about such a job, an error is returned.

However note that there are a few things happening here:

  • If you do ADDJOB / WAITJOB one after the other, it is possible that the job is delivered + acknowledged ASAP and WAITJOB is issued when the node no longer knows anything about the job.
  • Probably we should have an ADDJOB option that waits for the job to be acknowledged? That way add & wait is a single operation and there is no race. It is also faster, since we need one operation instead of two.
  2. About ENQUEUE and negative acks: @dvirsky, I believe we need to keep some counters, so an actual NACK could be useful. I think nodes should keep an approximate count of:
  • Re-queue attempts due to timeout.
  • Re-queue attempts due to negative acknowledgement.

When the job is fetched, you could pass GETJOB an option to also retrieve these two numbers. This way your worker is able to spot both failures where the worker can't proceed and failures where the worker in one way or another lost the message. It could be conceptually simpler to keep just a single counter, but we have to consider whether the distinction is actually important.

  3. Moving to a different priority is, I believe, better served by ACKing the old job and re-adding it into a different queue, since jobs are currently pretty much immutable. This is a key property since the system is AP, so we need to merge only the job state: if the job gets acknowledged we converge to that; otherwise, the job is immutable.

Note that there is no race condition here conceptually: if you want to move the job to a different priority, you ACK and re-add. If somebody processed the job in the middle, you can consider that as multiple delivery, which is something you need to face anyway with Disque and other message queues. Moreover, such an atomic primitive would be as futile as trying to achieve exactly-once delivery, which is impossible. Example:

  • Worker fetches the message when it's low priority.
  • Worker processes it, but crashes before the ACK.
  • An atomic ACK-and-re-add-into-a-different-queue primitive succeeds, but the multiple delivery still happened.
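A sketch of the non-atomic ACK-and-re-add pattern for the priority-change case discussed above. Queue names, the escalation check, port, and client setup are illustrative; the point is that a concurrent delivery between the two commands is just ordinary multiple delivery:

```python
import redis

disque = redis.Redis(host="localhost", port=7711)   # Disque default port

# Consumer prefers "high" but also drains "low" (illustrative queue names).
queue, job_id, body = disque.execute_command(
    "GETJOB", "FROM", "high", "low")[0]

if queue == b"low" and should_escalate(body):        # hypothetical policy check
    # Not atomic: ack the old job, then add a copy to the higher-priority
    # queue. A delivery happening in between is multiple delivery, which
    # consumers must tolerate anyway.
    disque.execute_command("ACKJOB", job_id)
    disque.execute_command("ADDJOB", "high", body, 0)
```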

@VojtechVitek
Contributor Author

@antirez
However note that there are a few things happening here:

If you do ADDJOB / WAITJOB one after the other, it is possible that the job is delivered + acknowledged ASAP and WAITJOB is issued when the node no longer knows anything about the job.
Probably we should have an ADDJOB option that waits for the job to be acknowledged? That way add & wait is a single operation and there is no race. It is also faster, since we need one operation instead of two.

In my opinion, WAITJOB non-existing-job should just return (nil) immediately. This would be consistent with SHOW non-existing-job. The semantics would be "block until the job is ACKed (dequeued), or return nil".

EDIT: I'd be OK with a blocking ADDJOB, but WAITJOB makes more sense to me. What if I want to add several jobs and then wait on all of them? That way I could do it easily using native Disque syntax.
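For what it's worth, a sketch of that "add several jobs, then wait on all of them" pattern, assuming a WAITJOB jobid timeout command were added with the semantics proposed above (WAITJOB did not exist at the time of this discussion; queue name, payloads, timeout, and client setup are illustrative):

```python
import redis

disque = redis.Redis(host="localhost", port=7711)   # Disque default port

# Add several jobs first...
job_ids = [
    disque.execute_command("ADDJOB", "myqueue", body, 0)
    for body in ("task-1", "task-2", "task-3")       # illustrative payloads
]

# ...then wait on each one sequentially, which gives "wait for ALL" without
# a variadic WAITJOB. Assumes the proposed WAITJOB jobid timeout command.
for job_id in job_ids:
    disque.execute_command("WAITJOB", job_id, 5000)  # 5 s timeout, illustrative
```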

@VojtechVitek
Contributor Author

Moving to a different priority is, I believe, better served by ACKing the old job and re-adding it into a different queue, since jobs are currently pretty much immutable. This is a key property since the system is AP, so we need to merge only the job state: if the job gets acknowledged we converge to that; otherwise, the job is immutable.

Yeah, that makes sense. I believe this is very similar to #45. Let's leave it as is for now, as we can use ACKJOB and ADDJOB instead. Thanks!

@djanowski
Contributor

Maybe I'm asking something too obvious here, but under what conditions would you want to NACK a job? My first impression is that if you really want to retry as soon as possible, then you might as well retry in the consumer immediately: if you were able to recover from the error and call NACKJOB, then you can just as easily retry the job right there. And if we encourage NACKJOB as the command to run whenever the consumer fails for any reason, then a bug in the code or a connection error can trigger a loop where jobs are consumed and NACKed continuously.

One case where you might want to NACK is when a shutdown of your workers is required. In this case, you could go for a completely graceful shutdown (have the process wait until current jobs are finished and acked, then shut down), or you could stop working immediately, NACK all the current jobs, and expect other consumers to pick them up as soon as possible.

@antirez
Owner

antirez commented May 13, 2015

Damian, my guess is that NACK is intimately related to letting the message queue count the number of failures, in order to implement the "dead letter" feature for jobs that fail to be processed more than N times. Another use case is when the cause of the failed processing is local to the worker, so it wishes for a different worker to try ASAP. Basically, it's hard to evaluate and design NACK without reasoning about the other related features...

@VojtechVitek
Contributor Author

Yeah, as Salvatore said, one of the main motivations for NACK is counting failures.

Also, if a worker wants to shut down gracefully (without waiting for the operation to finish!), it wants to NACK the job so other workers can pick it up ASAP. Think of jobs that take a long time to process (say 2+ minutes) - you can't afford to wait 2+ minutes after the worker receives SIGTERM.
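A sketch of that shutdown idea; since a dedicated NACK command did not yet exist at this point in the discussion, the sketch uses ENQUEUE as the stand-in to hand in-flight jobs back immediately. The signal handling, queue name, port, and process() are illustrative:

```python
import signal
import sys
import redis

disque = redis.Redis(host="localhost", port=7711)   # Disque default port
in_flight = set()                                    # job IDs currently being processed

def handle_sigterm(signum, frame):
    # Hand back every in-flight job right away instead of holding it for
    # minutes; other workers can pick them up without waiting for RETRY.
    for job_id in in_flight:
        disque.execute_command("ENQUEUE", job_id)
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

while True:
    queue, job_id, body = disque.execute_command("GETJOB", "FROM", "myqueue")[0]
    in_flight.add(job_id)
    process(body)                                    # hypothetical long-running work
    disque.execute_command("ACKJOB", job_id)
    in_flight.discard(job_id)
```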

@dvirsky

dvirsky commented May 14, 2015

A few use cases I might think of for NACKing a job:

"Type 1": Temporary problems from the worker's perspective. e.g. - outside network is unreachable, disk is full on a specific machine, bad configuration value, etc etc. NACK will let another worker try doing it ASAP instead of in a few seconds/minutes.

"Type 2": Temporary problems with an external resource that might be fixed in a second - timeouts reading from database, an outside REST API returned error 500, etc. We can retry them on the worker side, but we have a job scheduler, so why not use it?

BTW we might not know if a problem is of type 1 or 2 so easily.

@antirez
Owner

antirez commented May 14, 2015

@dvirsky yep, it's hard to tell the type of the problem, but IMHO on exceptions the worker should NACK. Then it's up to the people running the site to perform incremental updates of the workers, rather than all-at-once changes that may lead to every worker having the same issue. This way all the problems may be semantically the same: even bugs may be limited to a small percentage of nodes in the first phase of the rolling update.

@antirez
Owner

antirez commented May 18, 2015

Hello, I summarized and proposed things related to this issue (the NACK part) here: #68.

antirez closed this as completed Nov 28, 2015
@antirez
Owner

antirez commented Nov 28, 2015

Closed because this feature was implemented.

@VojtechVitek
Contributor Author

Thank you, @antirez!
