Some jobs never complete, stalling delayed_job #665

Open
calvinl opened this Issue Jun 12, 2014 · 13 comments


calvinl commented Jun 12, 2014

Sometimes a job that normally takes less than a second to run will be executed by a worker but never complete.

I've configured monit to monitor the workers, but it doesn't detect that the worker has crashed or stopped, because it hasn't: the process is still running, but the worker never completes the job. The logs just show the job as RUNNING, and it never completes.

When I restart the worker manually, I get a message saying "process with pid {pid} won't stop, we forcefully kill it..."

Even though I'm running multiple workers, it becomes a problem when all of them stall like this and I'm left to restart them manually.

Has anyone had this happen to them, or does anyone have suggestions on how to debug it?

albus522 (Member) commented Jun 12, 2014

Find out which job was running and check your code and any supporting libraries for anything that could fall into an infinite loop or wait indefinitely for something to complete. Trace the entire code path through any other data objects it may touch. Does it rely on data that may have been deleted before the job could run? Also look for any code that could swallow a Ruby Timeout::Error, because the job should time out after max_run_time has expired. Certain things, like some specific network calls, will prevent a Timeout from being raised until they return.

While I am not absolving DJ of being the culprit, the vast majority of issues like this have involved something other than DJ. Once a job makes it to the running state, there is very little that DJ has a hand in.
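
To make the Timeout point concrete, here is a minimal, hypothetical sketch (the class and RemoteApi names are invented, not taken from this thread) of job code that would defeat max_run_time. delayed_job enforces max_run_time with Ruby's Timeout, raising a Timeout::Error subclass inside the running job, so a blanket rescue in the job or in a library it calls can swallow it and leave the worker stuck:

  # Hypothetical job; RemoteApi is a placeholder for any slow network call.
  class SyncRemoteDataJob
    def perform
      begin
        RemoteApi.push_changes   # blocked or very slow network call
      rescue Exception           # also catches the Timeout::Error raised by
        sleep 5                  # delayed_job when max_run_time expires,
        retry                    # so the job loops forever and the worker
      end                        # never recovers
    end
  end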


calvinl commented Jun 12, 2014

The workers stall on all different jobs; it's never the same one. I've traced through the logic for each of them and there's no sign of infinite loops or anything like that, but they do have one thing in common: they all make network calls. If that's the problem, how would one go about enforcing a timeout, since no exception is raised until the call returns?


calvinl commented Jun 13, 2014

The other thing is that I haven't changed anything in the codebase, and it was working fine until a few days ago. We've seen a traffic increase in that time, so I'm fairly confident that it's being overloaded somehow.

Additionally, monit shows that the stalled delayed_job process sits at 0% CPU usage and stays that way. Could I use that as a trigger for monit to restart the worker?


calvinl commented Jun 13, 2014

Also, do you have any examples of this issue where the cause turned out to be something other than DJ? I can't seem to find any. Maybe they'll jog some ideas. I've been debugging this for three days to no avail and am kind of pulling my hair out :(


albus522 (Member) commented Jun 13, 2014

0% CPU usage would tend to indicate waiting on I/O. What type of network calls/libraries are you using to connect to these services?
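
A process blocked on a socket read with no client-side timeout can wait indefinitely while using 0% CPU. As a generic sketch (using Ruby's standard-library Net::HTTP with a placeholder URL, not necessarily what the Pusher or Intercom gems do internally), setting explicit open/read timeouts makes a stuck call raise an error instead of hanging forever:

  require "net/http"

  uri = URI("https://api.example.com/v1/events")    # placeholder endpoint
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.open_timeout = 5    # seconds allowed to establish the connection
  http.read_timeout = 10   # seconds allowed to wait for each response read
  response = http.request(Net::HTTP::Get.new(uri))  # raises Net::OpenTimeout or
                                                     # Net::ReadTimeout on a hang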


calvinl commented Jun 13, 2014

I ended up changing max_run_time from the default to 5 minutes, and it seems to "resolve" the issue, at least for now. The jobs that stall a worker now end after 300 seconds, and the worker is able to process jobs again.

Example:

I, [2014-06-13T02:35:09.957134 #989]  INFO -- : 2014-06-13T02:35:09+0000: [Worker(delayed_job.2 host:##### pid:989)] Job Request#process_after_update (id=6705) COMPLETED after 300.1507

In cases where max_run_time is exceeded, do jobs fail and get deleted from the delayed_jobs table?

For the services (Pusher, Intercom.io), I'm using their gems specifically to make the network calls.
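
For reference, a minimal sketch of the setting described above, as it would typically go in a Rails initializer (the file path below is just a conventional choice, not quoted from this thread):

  # config/initializers/delayed_job_config.rb
  # Cap how long a single job may run before delayed_job times it out.
  Delayed::Worker.max_run_time = 5.minutes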


SirRawlins commented Aug 3, 2014

@calvinl I'm seeing very similar behaviour to this under much the same circumstances (making network calls to a third-party API).

These calls usually take only a few seconds at most when run synchronously; however, as soon as they're moved into DJ they seem to just hang, stuck in 'running' forevermore.

Did you find a practical solution to the issue in the end? Extending the max_run_time seems like a bit of a band-aid on something more serious.


McGar commented Aug 9, 2016

Same issue here. Any new updates, @calvinl?


calvinl commented Aug 9, 2016

@McGar IIRC I was never able to find a solution to this issue other than configuring Delayed::Worker.max_run_time = 5.minutes. It wasn't ideal, but it did the job for our use case.

I might suggest switching to ActiveJob with Sidekiq if you're having issues with delayed_job.


McGar commented Aug 9, 2016

@calvinl Thanks.


diegoesp commented Apr 7, 2017

+1. I'm experiencing exactly the same issue.


augustosamame commented Apr 21, 2017

+1
Exactly the same issue when workers try to connect to external APIs, in my case Twilio and Rollbar. They hang forever and have to be restarted manually. I'll try Delayed::Worker.max_run_time = 5.minutes as a stopgap solution.


jzornow commented Aug 1, 2017

+1 Same problem. I'm trying to make an outgoing API call from a custom delayed job, but it hangs indefinitely. If I run the same code synchronously, it's near-instantaneous.

Anybody got anything?

