
configurable force_kill_waittime #320

Open
jjb opened this issue Dec 19, 2011 · 11 comments


jjb commented Dec 19, 2011

Would you be open to a patch that makes Daemons' force_kill_waittime configurable? The default in Daemons is 20 seconds, so any job that takes longer than 20 seconds to finish gets killed when workers are restarted during a deploy. Daemons itself allows this value to be configured.

@bryckbost

Patches are always welcome.


jjb commented Mar 9, 2012

Here's my current solution. It lives in script/delayed_job and comes directly before Delayed::Command.new(ARGV).daemonize:

# Monkey-patch the daemons gem so workers get more time to finish the current
# job before being force-killed on restart.
class Daemons::Application
  alias_method :original_stop, :stop

  def stop(no_wait = false)
    @force_kill_waittime = 600 # 10 minutes instead of the 20-second default
    original_stop(no_wait)
  end
end

The default from daemons is 20 seconds. It would make sense to at least increase that default, perhaps to 60 seconds.

Thoughts?

@yfeldblum

I understand the pain. But DelayedJob should not encourage writing jobs that block for a long time even after the process receives TERM.

If a job is blocking for more than 20 seconds - more than 5 seconds even - after the process receives TERM, then the job is wrong.

We must all get better at writing resilient code which tolerates failures gracefully, including such failures as a job process getting killed in the middle of a job.

We must write our jobs so that they can handle such a situation. Our jobs should either complete in very short order, within 5 seconds or less, or work in short chunks of 5 seconds or less, writing frequent savepoints after each chunk, and checking the corresponding savepoint before running each chunk. Our jobs should be using idempotent operations which can be retried repeatedly at any time until the corresponding job finally succeeds.

I would suggest bringing it down to 5 seconds, if anything.
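
A minimal sketch of the chunk-and-savepoint pattern described above, assuming hypothetical ImportRow and JobCheckpoint ActiveRecord models and an illustrative process! method; none of these names come from delayed_job:

# Sketch only: ImportRow, JobCheckpoint, and process! are hypothetical.
class ChunkedImportJob < Struct.new(:import_id)
  CHUNK_SIZE = 100

  def perform
    checkpoint = JobCheckpoint.find_or_create_by(name: "import-#{import_id}")

    loop do
      # Resume after the last committed savepoint; each chunk is small and idempotent.
      rows = ImportRow.where(import_id: import_id)
                      .where("id > ?", checkpoint.last_row_id.to_i)
                      .order(:id)
                      .limit(CHUNK_SIZE)
      break if rows.empty?

      rows.each(&:process!)                         # idempotent per-row work
      checkpoint.update!(last_row_id: rows.last.id) # savepoint after every chunk
    end
  end
end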


jjb commented Apr 5, 2012

Writing jobs to be able to handle signals gracefully, sure.

Writing jobs to split their work into chunks of 5 seconds or less? That removes so much of the convenience of a delayed job system in the first place. What about media encoding and other big file transfers or conversions?

@yfeldblum

Sure. For those types of jobs, tolerate failure by allowing the job runner to kill jobs midstream but start them over from the beginning when the job runner is back up.

Job interruptions should typically happen infrequently enough that the vast majority of long-running jobs do not experience such interruptions.

Even for media encoding and file transfer jobs, it ought to be OK to clean up and terminate the job runner process within 5 seconds of receiving TERM. The way to do that is to ensure that a cancelled media encoding job can be restarted from the beginning, and that jobs are not interrupted frequently.


jjb commented Apr 5, 2012

Okay, we agree then :-D

I still think that applies to most jobs though… I can't see myself chunking things into 5-second pieces. What's a good example of where one might do that?

@yfeldblum

The general point of course is that we should write jobs that can tolerate being killed with no warning. In special cases, when we expect that being killed with no warning is sufficiently likely, we should even optimize jobs to deal with that failure case well.

This checkpointing scheme is an optimization for failure cases: instead of starting from the beginning every time a job fails, we can start from the most recent checkpoint. It's not necessary unless some jobs are so large that we expect them to be interrupted, by timeouts, app deployments, and the like, at a rate (20% of the time for some, 10% for others) that becomes annoying to deal with.

Concretely, consider a single job that runs over millions of objects in an object store such as S3 and fetches a bit of metadata from each. Each individual operation is fast, but there are millions of them.


jjb commented Apr 13, 2012

Upon further investigation, it looks like there is no way for a job to know that a signal has been sent to the worker -- the signal only tells the worker to stop processing jobs. This makes it quite a bit more difficult to have a job gracefully handle being killed.

Thoughts?


jjb commented Apr 13, 2012

Ohhh, just learned that I can save and call the old trap handler:

# Save the previous TERM handler and chain to it. The block can reference
# old_handler because the local is assigned before the signal can fire.
old_handler = trap(:TERM) {
  old_handler.call if old_handler.respond_to?(:call) # previous handler may be a String such as "DEFAULT"
  # --> get job ready to stop
}

@yfeldblum

You can probably use the lifecycle callbacks.

# A lifecycle plugin that exposes the current worker to the running job via a
# thread-local variable.
class CurrentWorkerPlugin < ::Delayed::Plugin
  callbacks do |lifecycle|
    lifecycle.before :execute do |worker|
      Thread.current[:delayed_worker] = worker
    end

    lifecycle.after :execute do |_|
      Thread.current[:delayed_worker] = nil
    end
  end
end

Delayed::Worker.plugins << CurrentWorkerPlugin

class MySuperJob
  def worker
    Thread.current[:delayed_worker]
  end

  # True once the worker has been asked to stop (e.g. after receiving TERM).
  def stop?
    worker && worker.stop?
  end

  def perform
    (2**40).times do |i|
      sleep 1
      break if stop? # bail out as soon as the worker is shutting down
    end
  end
end
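
For reference, a job written this way is enqueued with the usual delayed_job API; a minimal usage sketch:

# Enqueue the job; the worker sets the thread-local before calling perform.
Delayed::Job.enqueue(MySuperJob.new)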
