
Make raising signal exceptions a configurable option

Idea credit goes to @nevinera
1 parent a5ac868 commit 90579c3047099b6a58595d4025ab0f4b7f0aa67a @albus522 albus522 committed Jan 24, 2013
lib/delayed/worker.rb: +20 −2
@@ -47,6 +47,15 @@ def self.reset
# (perhaps to inspect the reason for the failure), set this to false.
self.destroy_failed_jobs = true
+ # By default, signals INT and TERM set @exit, and the worker exits upon completion of the current job.
+ # If you would prefer to raise a SignalException and exit immediately, you can use this option.
+ # Be aware that the daemons gem uses TERM to stop and restart.
+ # false - no exceptions will be raised
+ # :term - raise an exception on TERM signals only; INT will wait for the current job to finish
+ # true - raise an exception on both TERM and INT
+ cattr_accessor :raise_signal_exceptions
+ self.raise_signal_exceptions = false
self.logger = if defined?(Rails)
  Rails.logger
elsif defined?(RAILS_DEFAULT_LOGGER)
  RAILS_DEFAULT_LOGGER
end
@@ -122,8 +131,17 @@ def name=(val)
def start
- trap('TERM') { say 'Exiting...'; stop }
- trap('INT') { say 'Exiting...'; stop }
+ trap('TERM') do
+   say 'Exiting...'
+   stop
+   raise SignalException.new('TERM') if self.class.raise_signal_exceptions
+ end
+ trap('INT') do
+   say 'Exiting...'
+   stop
+   raise SignalException.new('INT') if self.class.raise_signal_exceptions && self.class.raise_signal_exceptions != :term
+ end
say "Starting job worker"

14 comments on commit 90579c3

Hey folks. What's the standard operating procedure for DJ on Heroku? Does Delayed::Worker.raise_signal_exceptions = :term cover the SIGKILL that Heroku sends when a dyno restart gets fed up waiting?

Heroku 'restarts' processes for a variety of reasons (using too much memory is the one I was mostly running into). When they issue a restart, they do it by sending a SIGTERM, followed a few seconds later by a SIGKILL.

You can't do anything about SIGKILL, but if you respond to SIGTERM by cleaning up and terminating quickly, the SIGKILL never needs to be issued.

Thanks nevinera,

So you use Delayed::Worker.raise_signal_exceptions = :term?

If I understand DJ correctly, the job will get flagged as failed and is unlocked immediately, so when the process spins back up the workers can pick it up again without having to wait until max_run_time has passed.

My implementation of this solution was subtly different, but yes that should work.

It doesn't have to get 'flagged as failed' - that's a question of whether you re-raise the exception (or some exception).
It will usually make more sense to do so, though, unless your jobs are being started continuously and future jobs will pick up all the work being missed by this one.

The point of it is to allow you to unlock any resources the job was using, instead of just terminating. If you don't have any resources being locked, it's unnecessary.

I tested it with foreman and it works beautifully. Terminating the worker immediately quits the job, flags the DJ record with an error, then gets picked up again when the worker restarts, all thanks to this single config line. Cheers guys.

How does this deal with issues where, for example, a mailer is communicating to an SMTP server, has completely sent the request and the server has received it, then a SignalException is raised before Ruby can close the connection and wrap up the response?

Delayed Job would then run the same job again. If there wasn't an emphasis on making jobs 100% atomic before, with this feature there surely will be. Is that assumption correct?

No, the whole point of getting exceptions is that you can handle them. You should be rescuing from SignalException, performing your cleanup, and then re-raising it.

Generally speaking, you will receive a SIGTERM, and then later a SIGKILL after you've ignored the SIGTERM for a while (times vary - it's a few seconds on Heroku). With the old model, you'd ignore the SIGTERM until you were done with the job, and if the SIGKILL came before you performed cleanup, you'd die in a dirty state. With this behavior though, you can receive the SIGTERM, initiate cleanup, and terminate the job all before the SIGKILL comes. That's the original motivation.
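The receive-TERM, clean-up, and carry-on flow described above can be sketched in plain Ruby, independent of Delayed Job (the handler body and the notion of "cleanup" here are illustrative, not the gem's code):

```ruby
# Re-raise a SignalException from the trap handler so in-flight work
# can rescue it, perform cleanup, and (optionally) re-raise.
cleaned_up = false

trap('TERM') { raise SignalException.new('TERM') }

begin
  Process.kill('TERM', Process.pid) # simulate Heroku's SIGTERM
  sleep 1                           # give the signal time to arrive
rescue SignalException
  cleaned_up = true                 # release locks, delete temp files, etc.
  # re-raise here if the process should still terminate
end

puts "cleanup ran: #{cleaned_up}"   # prints "cleanup ran: true"
```

Without the raise inside the trap handler, the sleep would simply be interrupted and the job's rescue block would never see the signal, which is exactly the old set-@exit-and-wait behavior.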

Can you offer up an example of what you are doing on cleanup? Just re-enqueuing the job?

Unlocking a resource in the database, deleting a file on disk, deleting rows from a database.. basically anything.
If your job has state that is external to its memory, then you usually have the choice of (a) making the job able to resume, (b) making the job detect and clean up prior attempts before running, or (c) making the job clean up after itself in the event of failure.

Hey @uberllama. Just curious about your comment. Are you doing anything special to clean up your task, or did you literally get all of this goodness by just adding this single line to your init file?

I have a long-running (5+ hours) task that gets left in a dirty state. Would I just need to add this one line to have it mark the DelayedJob record as failed, quit the job, and restart the worker properly?


@joshuapinter Sounds like you'd benefit from splitting up your very long job into multiple atomic jobs, so you have less surface area for interruption, and quicker recovery.

@uberllama You're probably right. It's just an export task that loops through lots of records, does a bunch of math and then generates a CSV with a row for each record. Not sure how I could split that up... any ideas?

Hmmm, you could potentially use the initial job as just a kickoff that queues multiple jobs using find_each and small batches. Maybe build your transformed rows that represent the CSV in a table, and have a final task that generates the actual CSV from that table. Plausible?
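One way to sketch that kickoff-plus-batches shape, with the batch size and the enqueue hook invented for illustration (in a real app the hook would be something like Delayed::Job.enqueue):

```ruby
# Hypothetical kickoff: slice record ids into small batches so each
# enqueued job is short-lived and cheap to retry after a SIGTERM.
BATCH_SIZE = 500

def enqueue_export_batches(record_ids, enqueue:)
  record_ids.each_slice(BATCH_SIZE) do |batch|
    enqueue.call(batch) # e.g. Delayed::Job.enqueue(ExportBatchJob.new(batch))
  end
end
```

Each batch job then only has a few hundred records of work to redo if it gets interrupted, instead of five hours' worth.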

Yeah, that's what I was thinking as well, but the overhead seems a bit too heavy. Might be best just to reduce the time required for the task itself via caching or benchmarking.

Thanks for the tip.
