Jobs randomly get stuck #228
How are you seeing this? Is it from the UI only, or are tasks actually failing to execute? Does this still happen if concurrency is 1? Or is it just a manifestation of #227?
I can see that no jobs are running, and I don't see anything in the logs. I am unable to test this with concurrency 1.
I no longer have an environment I can use to test this.
We are experiencing this in production with Exq 0.8.3. Occasionally, jobs will get stuck in a perpetual processing state and hoard a process slot. Restarting does not fix it, and I have to manually clear out the process list. Let me know if there's anything you'd like me to do to try to narrow down the issue.
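(For anyone landing here later: one way to inspect and clear the in-progress list programmatically is via the Exq.Api module, a sketch assuming a version of Exq where that module is available and its server process is running under the registered name Exq.Api; this is not necessarily the exact command used above.)

```elixir
# Inspect the list of in-progress jobs, then clear it.
# Assumes the Exq.Api server is running and registered as Exq.Api.
{:ok, processes} = Exq.Api.processes(Exq.Api)
IO.inspect(processes, label: "stuck processes")

:ok = Exq.Api.clear_processes(Exq.Api)
```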
@jbrowning can you try Exq version 0.8.4? I recently added something that could potentially fix this.
Ok, sounds good. Will let you know. Thanks for the quick response @akira.
@akira we've had 0.8.4 in production for a few weeks now and still end up with stuck jobs. It seems to occur most often when a large number of jobs (2000+) are added to the queue at once.
@jbrowning thanks for the update. Will take a look.
@jbrowning I haven't had a chance to narrow this down yet. Can you give me the following info if possible:
Thanks!
Based on past experience I've had with Sidekiq exhibiting this behaviour, it was often a case of blocking IO that never finished and had no timeout code. The classic example is network/HTTP IO, where an HTTP request would just wait forever and never realize it had been disconnected, because Ruby. Could it be similar here? Curious whether you are doing web scraping / HTTP or any other blocking IO.
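For example, if the workers make HTTP calls with HTTPoison, explicit timeouts would bound the request (a sketch assuming HTTPoison; the worker module is hypothetical):

```elixir
defmodule MyApp.FetchWorker do
  # Hypothetical worker: bounds both the connect and receive phases of the
  # request so a dead peer cannot block an Exq process slot forever.
  def perform(url) do
    case HTTPoison.get(url, [], timeout: 5_000, recv_timeout: 10_000) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        {:ok, body}

      {:ok, %HTTPoison.Response{status_code: code}} ->
        {:error, {:unexpected_status, code}}

      {:error, %HTTPoison.Error{reason: reason}} ->
        {:error, reason}
    end
  end
end
```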
@jbrowning I was able to reproduce this in one case. I will try to get a fix out ASAP.
@jbrowning If you can, upgrade to Exq 0.8.6 - there was a situation where Process.exit would have caused #251 - so that is one possibility. However, as @j-mcnally mentioned, the other possibility is that the worker is stuck waiting on something application-specific (an HTTP request, etc.). In that case, the fix would not help. You should be careful to set timeouts for everything. You can also set a timeout in your worker.
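(One way to enforce such a timeout inside a worker is to run the real work in a task and shut it down when it overruns; a sketch, where MyApp.Work.run/1 stands in for your actual job logic and is not an Exq built-in:)

```elixir
defmodule MyApp.TimeoutWorker do
  @moduledoc "Exq worker that bounds its own runtime."
  @timeout_ms 30_000

  # Exq invokes perform/1 with the enqueued args.
  def perform(arg) do
    task = Task.async(fn -> MyApp.Work.run(arg) end)

    case Task.yield(task, @timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> result
      _ -> raise "job timed out after #{@timeout_ms} ms"
    end
  end
end
```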
@akira thanks again. Will deploy 0.8.6 today and see how it goes.
How did it go @jbrowning? :)
@jbrowning FYI I'm going to have some more updates that can potentially help, hoping to get them out this weekend.
@akira does 0.8.7 contain those updates? We had this occur again this weekend.
@jbrowning I know that advertising another project here is not the most gentle thing to do to the author of Exq, but I see you've been struggling for quite a while. When I had the same issues with Exq I ended up writing https://github.com/mspanc/jumbo, and it has worked for us under really high loads for a few months already.
@mspanc Great job on the library. Not having the jobs persisted is a bit of a killer in this case, no? Do you use hot code reload for deploys?
Possibly a good conversation for the Jumbo repo? :)
@lpil definitely, my bad.
@jbrowning Yes, that release has the fixes for running Exq under high load (fixing an issue with memory increasing). With those changes I was able to run Exq for sustained periods with the maximum amount of messages going through. I originally suspected you were having that issue; however, at this point I think this is not due to load, but probably the type of message that is causing it. It would be great to have a reproducible case for this. By any chance, do you see any stack traces for the stuck jobs?
Again, thanks for all your help on this @akira. After deploying 0.9.0 we haven't seen any more "stuck jobs", but I have been seeing enqueue calls timing out:

    {:timeout, {GenServer, :call, [Exq, {:enqueue, "default", Worker, [params], []}, 5000]}}

Were there any changes in 0.9.0 that could have caused this? Happy to open it as a new issue if you'd rather do that.
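(In the meantime, one defensive option is to catch the GenServer call timeout at the call site; a sketch, where safe_enqueue/3 is a hypothetical helper and whether to retry or drop on timeout is application-specific:)

```elixir
# Catch the exit raised when the Exq GenServer call exceeds its timeout,
# so a slow Redis round-trip doesn't crash the calling process.
def safe_enqueue(queue, worker, args) do
  try do
    Exq.enqueue(Exq, queue, worker, args)
  catch
    :exit, {:timeout, {GenServer, :call, _}} ->
      {:error, :enqueue_timeout}
  end
end
```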
Perhaps we should close this issue and open a new one for that timeout? :)
For people who stumble upon this error on production systems: I believe the timeout error above may be happening in production because the params contain something that can't be serialized to JSON. For some reason I get the same timeout when this happens in production, while in dev the error is different. In production I then have only the timeout reported to my exception tracking (AppSignal); the original error is in the logs, but AppSignal doesn't associate it with the exception.
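(A cheap way to catch this class of bug before enqueueing is to verify the args encode to JSON first; a sketch assuming Poison, the JSON library Exq used at the time, with check_and_enqueue/3 as a hypothetical helper:)

```elixir
# Fail fast at the call site if args can't be encoded, instead of letting
# the failure surface later inside Exq's serialization path.
def check_and_enqueue(queue, worker, args) do
  case Poison.encode(args) do
    {:ok, _json} ->
      Exq.enqueue(Exq, queue, worker, args)

    {:error, reason} ->
      {:error, {:unserializable_args, reason}}
  end
end
```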
I have found cases where jobs get stuck.
Here is a sample job (data retrieved via exq_ui):
PID: vault:<0.15483.6>
queue: heavy
module: Vault.Jobs.Data.Processing.AnalysisAudioDurationJob
args: 9b091696-5ce7-4e2c-a2f5-a904f784d964,true
started at: Wed Jan 04 2017 23:00:52 GMT+0100 (CET)
It ended with an exception and never got cleared from the queue, blocking a slot for other jobs.
It also seems that the PID in the log and the PID in exq_ui do not match.