NACKJOB (re-queue) and WAITJOB #43
First of all, let me thank you for this awesome project. It feels fast & stable, even though it's still Alpha. Good job! 👍

I'm trying to solve these two use cases:

NACKJOB (force re-queue)

Let's assume I'm a consumer and I fail to process a message. It'd be great if I could send NACKJOB id, so the job gets re-queued right away (if the RETRY timeout was specified). Or should I use ACKJOB id followed by ADDJOB queue data instead? Can I do this atomically?

WAITJOB

It'd be great to have a blocking operation that waits for a job to finish (get ACKed):

WAITJOB id [id ...] timeout

... in other words, a command similar to Redis' BLPOP key [key ...] timeout.

Feedback welcome, I could be missing something obvious.

Comments
Hello @VojtechVitek, thanks! I'm positive about both features; @dvirsky and I have had private chats about this. Currently, re-queueing is possible via the ENQUEUE command; however, the command is completely ignored if sent to a node that does not know about the job at all. This may make sense after all, since broadcasting an ENQUEUE message cluster-wide is non-trivial if we want just a single node to enqueue the job. Moreover, it is not critical that a negative ACK works in Disque, since the system will put the job on the queue again after the retry time anyway.
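For illustration, a minimal sketch of this ENQUEUE-as-negative-ACK workaround. It assumes a Disque node on the default port 7711 and the redis-py client for sending raw commands (Disque speaks the Redis protocol); the queue name and process() function are hypothetical:

```python
import redis

# Disque speaks the Redis protocol, so a generic Redis client can issue
# raw Disque commands. 7711 is Disque's default port.
client = redis.Redis(host="localhost", port=7711)

def process(body):
    # Hypothetical worker logic; raises on failure.
    ...

# GETJOB replies with a list of [queue, id, body] triples.
queue, job_id, body = client.execute_command(
    "GETJOB", "COUNT", 1, "FROM", "myqueue")[0]

try:
    process(body)
    client.execute_command("ACKJOB", job_id)   # success: delete the job
except Exception:
    # Failure: put the job back on its queue right away instead of
    # waiting for the RETRY timeout. Caveat from above: a node that
    # does not know about the job silently ignores this.
    client.execute_command("ENQUEUE", job_id)
```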
@antirez Thanks for your feedback! I'd be OK with that. Another feature that I have in mind is changing the queue of a job.
@antirez I tried adding ENQUEUE as a fail mechanism, but the problem was that I couldn't control the number of retries. The only way to do it currently is to copy the entire job, encode the number of retries in the job data, and add it as a new job. Otherwise you could be doing thousands of retries until the TTL is reached.
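A sketch of that copy-the-job workaround, assuming a JSON envelope around the payload; the retry limit and envelope field names are illustrative:

```python
import json
import redis

client = redis.Redis(host="localhost", port=7711)
MAX_RETRIES = 5  # illustrative retry budget

def retry_or_drop(queue, job_id, body):
    """Re-add a failed job with a retry counter kept in its payload."""
    envelope = json.loads(body)
    envelope["retries"] = envelope.get("retries", 0) + 1
    # Acknowledge the old copy so Disque stops redelivering it...
    client.execute_command("ACKJOB", job_id)
    # ...and add a fresh copy only while we are under the budget.
    if envelope["retries"] <= MAX_RETRIES:
        # 0 is the mandatory ms-timeout argument of ADDJOB (no timeout).
        client.execute_command("ADDJOB", queue, json.dumps(envelope), 0)
```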
Hey, about the different topics here:
However note that there are a few things happening here:
When the job is fetched, you can pass …
Note that there is no race condition here, conceptually: if you want to move the job to a different priority, you ACK it and re-add it. If somebody processed the job in the middle, you can consider that a multiple delivery, which is something you need to face anyway with Disque and other message queues. Moreover, such an atomic primitive would be as futile as trying to achieve exactly-once delivery, which is impossible.
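For illustration, a minimal sketch of the ACK-and-re-add pattern just described. Disque has no priority argument, so here priority is modeled as separate queues; all names are hypothetical:

```python
import redis

client = redis.Redis(host="localhost", port=7711)

def move_job(job_id, body, target_queue):
    """Move a job by ACKing the old copy and adding a new one.

    The two commands are not atomic: if another worker processes the
    job in between, that is an ordinary multiple delivery, which
    consumers must tolerate with Disque anyway.
    """
    client.execute_command("ACKJOB", job_id)
    return client.execute_command("ADDJOB", target_queue, body, 0)

# Example: promote a job from a low- to a high-priority queue.
# move_job(job_id, body, "myqueue:high")
```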
In my opinion, … EDIT: I'd be OK with blocking …
Yeah, that makes sense. I believe this is very similar to #45. Let's leave it as is for now, as we can use …
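As a purely client-side approximation of WAITJOB (an illustration, not necessarily what this comment was about to suggest), one could poll SHOW until the job is no longer known to the node:

```python
import time
import redis

client = redis.Redis(host="localhost", port=7711)

def wait_for_job(job_id, timeout_s, poll_s=0.1):
    """Poll until the job disappears (ACKed, expired, ...) or we time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if client.execute_command("SHOW", job_id) is None:
            return True   # the node no longer knows the job
        time.sleep(poll_s)
    return False
```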
Maybe I'm asking something too obvious here. On what condition would you want to nack a job? My first impression is that if you really want to retry as soon as possible, then you might as well retry in the consumer immediately. If you were able to recover from an error and were able to call …

One case where you might want to nack is when a shutdown of your workers is required. In this case, you could go for a completely graceful shutdown (have the process wait until current jobs are finished and acked, then shut down), or you could stop working immediately, nack all the current jobs, and expect other consumers to pick them up as soon as possible.
Damian, my guess is that NACK is intimately related to letting the message queue count the number of failures, in order to implement the "dead letter" feature for jobs that fail to be processed more than N times. Another use case is when the cause of the failed processing is local to the worker, so it wishes for a different one to try ASAP. Basically, it's hard to evaluate and design NACK without reasoning about the other related features...
Yeah, as Salvatore said, one of the main motivations for NACK is counting the failures. Also, if a worker wants to shut down gracefully (without waiting for the operation to finish!), it wants to NACK the job so other workers can pick it up ASAP. Think of jobs that take more time to process (say, 2+ minutes): you can't afford to wait 2+ minutes after the worker receives SIGTERM.
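A sketch of that shutdown path, assuming the NACK command that was eventually implemented (see the end of this thread) and a worker that tracks its in-flight job IDs; names are illustrative:

```python
import signal
import redis

client = redis.Redis(host="localhost", port=7711)
in_flight = set()        # IDs of jobs currently being processed
shutting_down = False    # checked by the worker loop (not shown)

def handle_sigterm(signum, frame):
    """Hand long-running jobs back to the queue instead of waiting."""
    global shutting_down
    shutting_down = True
    if in_flight:
        # NACK re-queues the jobs immediately and bumps their failure
        # counter, so other workers can pick them up ASAP.
        client.execute_command("NACK", *in_flight)

signal.signal(signal.SIGTERM, handle_sigterm)
```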
A few use cases I can think of for NACKing a job:

"Type 1": Temporary problems from the worker's perspective, e.g. the outside network is unreachable, the disk is full on a specific machine, a bad configuration value, etc. NACK will let another worker try the job ASAP instead of in a few seconds/minutes.

"Type 2": Temporary problems with an external resource that might be fixed in a second: timeouts reading from a database, an outside REST API returned error 500, etc. We could retry these on the worker side, but we have a job scheduler, so why not use it?

BTW, we might not be able to tell so easily whether a problem is of type 1 or 2.
@dvirsky yep, hard to tell the type of the problem, but IMHO: on exceptions the worker should …
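Putting the pieces together, a minimal worker loop under the assumption that exceptions should trigger a NACK; the queue name and handle() function are hypothetical:

```python
import redis

client = redis.Redis(host="localhost", port=7711)

def handle(body):
    # Hypothetical job handler; raises on failure.
    ...

while True:
    # GETJOB blocks until a job is available on the queue.
    queue, job_id, body = client.execute_command(
        "GETJOB", "COUNT", 1, "FROM", "myqueue")[0]
    try:
        handle(body)
        client.execute_command("ACKJOB", job_id)  # done: delete the job
    except Exception:
        # Re-queue immediately and count the failure, so Disque can
        # implement a dead-letter policy for jobs that fail too often.
        client.execute_command("NACK", job_id)
```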
Hello, I summarized and proposed things related to this issue (the NACK part) here: #68.
Closed because this feature was implemented. |
Thank you, @antirez!