
configurable force_kill_waittime #320

Open
jjb opened this issue Dec 19, 2011 · 11 comments


jjb commented Dec 19, 2011

Would you be open to a patch that makes Daemons' force_kill_waittime configurable? The default in Daemons is 20 seconds, so any job that takes longer than 20 seconds to finish gets killed when workers are restarted during a deploy. Daemons itself allows this value to be configured.

@bryckbost

Patches are always welcome.


jjb commented Mar 9, 2012

Here's my current solution. It lives in script/delayed_job and comes directly before Delayed::Command.new(ARGV).daemonize:

# Monkey-patch the daemons gem so workers get more time to finish the current
# job before being force-killed on restart.
class Daemons::Application
  alias_method :original_stop, :stop

  def stop(no_wait = false)
    @force_kill_waittime = 600 # 10 minutes instead of the 20-second default
    original_stop(no_wait)
  end
end

The default from daemons is 20 seconds. It would make sense to at least increase that default, perhaps to 60 seconds.

Thoughts?

@yfeldblum

I understand the pain. But DelayedJob should not encourage writing jobs that block for a long time even after the process receives TERM.

If a job is blocking for more than 20 seconds - more than 5 seconds even - after the process receives TERM, then the job is wrong.

We must all get better at writing resilient code which tolerates failures gracefully, including such failures as a job process getting killed in the middle of a job.

We must write our jobs so that they can handle such a situation. Our jobs should either complete in very short order, within 5 seconds or less, or work in short chunks of 5 seconds or less, writing frequent savepoints after each chunk, and checking the corresponding savepoint before running each chunk. Our jobs should be using idempotent operations which can be retried repeatedly at any time until the corresponding job finally succeeds.

I would suggest bringing it down to 5 seconds, if anything.
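
A minimal sketch of the chunk-and-savepoint pattern described above, assuming hypothetical ImportRow and JobCheckpoint ActiveRecord models and an illustrative process! method; none of these names come from delayed_job:

# Sketch only: ImportRow, JobCheckpoint, and process! are hypothetical.
class ChunkedImportJob < Struct.new(:import_id)
  CHUNK_SIZE = 100

  def perform
    checkpoint = JobCheckpoint.find_or_create_by(name: "import-#{import_id}")

    loop do
      # Resume after the last committed savepoint; each chunk is small and idempotent.
      rows = ImportRow.where(import_id: import_id)
                      .where("id > ?", checkpoint.last_row_id.to_i)
                      .order(:id)
                      .limit(CHUNK_SIZE)
      break if rows.empty?

      rows.each(&:process!)                         # idempotent per-row work
      checkpoint.update!(last_row_id: rows.last.id) # savepoint after every chunk
    end
  end
end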


jjb commented Apr 5, 2012

Writing jobs to be able to handle signals gracefully, sure.

Writing jobs to split their work into chunks of 5 seconds or less? That removes so much of the convenience of a delayed job system in the first place. What about media encoding and other big file transfers or conversions?

@yfeldblum

Sure. For those types of jobs, tolerate failure by allowing the job runner to kill jobs midstream but start them over from the beginning when the job runner is back up.

Job interruptions should typically happen infrequently enough that the vast majority of long-running jobs do not experience such interruptions.

Even for media encoding and file transfer jobs, it ought to be OK to clean up and terminate the job runner process within 5 seconds of receiving TERM. The way to do that is to ensure that a cancelled media encoding job can be restarted from the beginning, and that jobs are not interrupted frequently.


jjb commented Apr 5, 2012

Okay, we agree then :-D

I still think that applies to most jobs though… I can't see myself chunking things into 5-second pieces. What's a good example of where one might do that?

@yfeldblum

The general point of course is that we should write jobs that can tolerate being killed with no warning. In special cases, when we expect that being killed with no warning is sufficiently likely, we should even optimize jobs to deal with that failure case well.

This checkpointing scheme is an optimization for failure cases: instead of starting from the beginning every time a job fails, we can start from the most recent checkpoint. It's not necessary unless some jobs are so large that we expect them to be interrupted, by timeouts, app deployments, and the like, at a rate (20% of the time for some, 10% for others) that becomes annoying to deal with.

Concretely, consider a single job that runs over millions of objects in an object store such as S3 and fetches a bit of metadata from each. Each individual operation is fast, but there are millions of them.


jjb commented Apr 13, 2012

Upon further investigation, it looks like there is no way for a job to know that a signal has been sent to the worker -- the signal only tells the worker to stop processing jobs. This makes it quite a bit more difficult to have a job gracefully handle being killed.

Thoughts?


jjb commented Apr 13, 2012

Ohhh, just learned that I can save and call the old trap handler:

# Save the previous TERM handler and chain to it. The block can reference
# old_handler because the local is assigned before the signal can fire.
old_handler = trap(:TERM) {
  old_handler.call if old_handler.respond_to?(:call) # previous handler may be a String such as "DEFAULT"
  # --> get job ready to stop
}

@yfeldblum

You can probably use the lifecycle callbacks.

# A lifecycle plugin that exposes the current worker to the running job via a
# thread-local variable.
class CurrentWorkerPlugin < ::Delayed::Plugin
  callbacks do |lifecycle|
    lifecycle.before :execute do |worker|
      Thread.current[:delayed_worker] = worker
    end

    lifecycle.after :execute do |_|
      Thread.current[:delayed_worker] = nil
    end
  end
end

Delayed::Worker.plugins << CurrentWorkerPlugin

class MySuperJob
  def worker
    Thread.current[:delayed_worker]
  end

  # True once the worker has been asked to stop (e.g. after receiving TERM).
  def stop?
    worker && worker.stop?
  end

  def perform
    (2**40).times do |i|
      sleep 1
      break if stop? # bail out as soon as the worker is shutting down
    end
  end
end
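
For reference, a job written this way is enqueued with the usual delayed_job API; a minimal usage sketch:

# Enqueue the job; the worker sets the thread-local before calling perform.
Delayed::Job.enqueue(MySuperJob.new)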
