configurable force_kill_waittime #320
Patches are always welcome. |
Here's my current solution. It lives in script/delayed_job:

```ruby
class Daemons::Application
  alias_method :original_stop, :stop

  def stop(no_wait = false)
    @force_kill_waittime = 600 # 10 minutes
    original_stop(no_wait)
  end
end
```

The default from daemons is 20 seconds. It would make sense to at least increase this, to perhaps 60 seconds. Thoughts? |
I understand the pain. But DelayedJob should not encourage writing jobs that block for a long time even after the process receives SIGTERM. If a job is still blocking more than 20 seconds (more than 5 seconds, even) after the process receives SIGTERM, the job itself is the problem.

We must all get better at writing resilient code which tolerates failures gracefully, including such failures as a job process getting killed in the middle of a job. We must write our jobs so that they can handle such a situation. Our jobs should either complete in very short order, within 5 seconds or less, or work in short chunks of 5 seconds or less, writing frequent savepoints after each chunk and checking the corresponding savepoint before running each chunk. Our jobs should be using idempotent operations which can be retried repeatedly at any time until the corresponding job finally succeeds.

I would suggest bringing it down to 5 seconds, if anything. |
Writing jobs to be able to handle signals gracefully, sure. Writing jobs to split their work into chunks of 5 seconds or less? That rules out jobs like media encoding or large file transfers. |
Sure. For those types of jobs, tolerate failure by allowing the job runner to kill jobs midstream but start them over from the beginning when the job runner is back up. Job interruptions should typically happen infrequently enough that the vast majority of long-running jobs do not experience such interruptions. Even for media encoding and file transfer jobs, it ought to be OK to clean up and terminate the job runner process within 5 seconds of receiving SIGTERM. |
Okay, we agree then :-D I still think that applies to most jobs though… I can't see myself chunking every long-running job into 5-second pieces. |
The general point of course is that we should write jobs that can tolerate being killed with no warning. In special cases, when we expect that being killed with no warning is sufficiently likely, we should even optimize jobs to deal with that failure case well.

This checkpointing scheme is an optimization for failure cases: instead of starting from the beginning every time a job fails, we can start from the most recent checkpoint. It's not necessary, unless some jobs are so large that we expect them to be interrupted at whatever rate (20% of the time for some, 10% of the time for others) becomes annoying to deal with, by such things as timeouts, app deployments, etc.

Concretely: a single job that runs over millions of objects in an object store such as S3, getting a bit of metadata from each of them. Each operation is fast, but there are millions of such operations. |
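A minimal sketch of that checkpointing scheme: process a large collection in chunks, persist a cursor after each chunk, and resume from the last cursor on restart. `load_checkpoint` and `save_checkpoint` are hypothetical helpers backed here by an in-memory hash; a real job would persist the cursor in a database row or on the job record itself.

```ruby
# In-memory stand-in for durable checkpoint storage (hypothetical).
CHECKPOINTS = {}

def load_checkpoint(job_id)
  CHECKPOINTS.fetch(job_id, 0)
end

def save_checkpoint(job_id, cursor)
  CHECKPOINTS[job_id] = cursor
end

def run_metadata_scan(job_id, keys, chunk_size: 100)
  cursor = load_checkpoint(job_id) # skip work a previous run already did
  processed = []
  keys.drop(cursor).each_slice(chunk_size) do |chunk|
    chunk.each { |key| processed << key } # idempotent per-key work goes here
    cursor += chunk.size
    save_checkpoint(job_id, cursor)       # resume point if we get killed
  end
  processed
end
```

If the worker is killed mid-scan, re-enqueueing the same job picks up at the last saved cursor rather than re-reading millions of objects from the start.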
Upon further investigation, it looks like there is no way for a job to know that a signal has been sent to the worker -- the signal only tells the worker to stop processing jobs. This makes it quite a bit more difficult to have a job gracefully handle being killed. Thoughts? |
Here's the trap code: https://github.com/collectiveidea/delayed_job/blob/master/lib/delayed/worker.rb#L122 |
Ohhh, just learned that I can save and call the old trap code:

```ruby
old_handler = trap(:TERM) do
  old_handler.call
  # --> get job ready to stop
end
```
|
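A runnable sketch of that handler-chaining idea: install our own TERM trap, keep a reference to the previous handler that `trap` returns, and call it through. The first trap below is a stand-in for the worker's existing handler (delayed_job installs its own); the `respond_to?(:call)` guard is there because `trap` can also return a string like `"DEFAULT"` instead of a Proc.

```ruby
$worker_told_to_stop = false
$stop_requested = false

# Stand-in for the worker's existing trap (hypothetical; delayed_job's real
# trap tells the worker to stop picking up new jobs).
trap(:TERM) { $worker_told_to_stop = true }

# Chain: install our handler, saving the previous one that trap returns.
old_handler = trap(:TERM) do
  $stop_requested = true # a long-running job can poll this flag and checkpoint
  old_handler.call if old_handler.respond_to?(:call)
end
```

Both handlers fire on a single TERM: the job sees `$stop_requested` and can wind down, while the worker's original behavior is preserved.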
You can probably use the lifecycle callbacks. |
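For context, delayed_job's lifecycle callbacks wrap job execution in "around" hooks. Here is a toy, self-contained illustration of that around-callback pattern; it is not delayed_job's actual API (the real hooks live in the gem's lifecycle/plugin machinery), just the shape of the idea.

```ruby
events = []

# One "around" callback: runs code before and after the wrapped work.
callbacks = []
callbacks << lambda do |inner|
  events << :before   # e.g. note that a job started, arm a shutdown check
  inner.call
  events << :after    # e.g. clear state, write a final checkpoint
end

job = -> { events << :work }

# Wrap the job in each callback, innermost first.
wrapped = callbacks.reverse.inject(job) do |inner, cb|
  -> { cb.call(inner) }
end
wrapped.call
# events is now [:before, :work, :after]
```

A callback like this is a natural place to expose "the worker was asked to stop" to the running job without monkeypatching `trap` yourself.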
Would you be open to a patch to allow Daemons' force_kill_waittime to be configurable? The default in Daemons is 20 seconds, so any job still running 20 seconds later will be killed when workers are restarted during a deploy. Daemons allows this to be configured.