Permalink
Browse files

Fix issue with daemon hanging on shutdown

  • Loading branch information...
1 parent 282d6f5 commit 4ac9b9e2e07945702d1d0d2af5ae2d949df44ce3 @bkeepers bkeepers committed Sep 18, 2010
Showing with 17 additions and 9 deletions.
  1. +11 −8 lib/delayed/master.rb
  2. +6 −1 spec/master_spec.rb
View
@@ -13,11 +13,11 @@ def start
GC.copy_on_write_friendly = true
end
- # # Ensure new file permissions are set to a standard 0755
- # File.umask 0022
-
abort "Process is already running with pid #{pid}" if running?
- pid_file.delete if pid_file.file?
+ if pid_file.file?
+ warn "Deleting stale pid file at #{pid_file}"
+ pid_file.delete
+ end
require 'config/environment'
@@ -30,17 +30,20 @@ def start
trap sig do
# logger.info "SIG#{sig} received! Shutting down workers."
- # reset trap handlers so we don't get caught in a trap loop
- trap sig, 'DEFAULT'
-
# # kill the children and reap them before terminating
# Process.kill :TERM, *children.keys
# Process.waitall
pid_file.delete if pid_file.file?
+ # reset trap handlers so we don't get caught in a trap loop
+ trap sig, 'DEFAULT'
+
# propagate the signal like a proper process should
Process.kill sig, $$
+
+ # FIXME: I don't understand why, but process will not stop without following
+ Process.wait
@guns

guns Sep 18, 2010

Brandon, when you fork a process the parent has one responsibility: it must reap (collect the pid of) the child process after it dies. This is what the wait family of system calls accomplishes. (try man 2 wait on your system for more info).

If a child process dies, but nobody collects its status, it hangs around in a zombie state until somebody reaps it. If a parent dies without reaping its children, the master process of the system (pid 1 usually) inherits the children and terminates all children processes on shutdown.

Process.wait is a blocking call (see footnote) that reaps one dead child. This is an appropriate way to handle dead children, and a SIGCLD handler (that calls wait) is another. What you probably want here though is Process.waitall, which will reap all children, and not just one.

So this is my suggestion for this section:

Process.kill :TERM, *@children.keys # terminate children explicitly
Process.waitall                     # wait for all children to die
pid_file.delete if pid_file.file?   # remove pidfile
Process.kill sig, $$                # finally terminate self

Note that since the pidfile is essentially a lockfile, it should be removed after waitall, since the children may take some time to actually shut down.

Also, you should remember to be extremely careful to reset trap handlers when you don't need them. Since the child is a fork of the parent, it will inherit all the traps set in the master, which will produce undesirable results. Similarly, if the master has a SIGCLD handler, that should be reset before attempting to shut down children for good, or they will get restarted by the handler.

Sorry if none of this is news to you!

footnote: If there are no dead children, it blocks until one dies; alternately you can pass Process::WNOHANG to make it a non-blocking call

@bkeepers

bkeepers Sep 19, 2010

Collaborator

Thanks for the insights. This is all new territory to me.

I tried uncommenting the lines that kill all the children and call waitall, but that didn't seem to have any effect. Shouldn't sending kill to the current process kill it? Is that because I'm not detaching the daemon process from its parent? What happens in a non-forked process when I kill send kill to $$?

@bkeepers

bkeepers Sep 19, 2010

Collaborator

Hmm, I'm confused. I tried detaching the process and rearranging things like:

      trap :CLD, 'DEFAULT'
      trap sig, 'DEFAULT'
      Process.kill :TERM, *children.keys unless children.empty?
      Process.waitall
      pid_file.delete if pid_file.file?
      Process.kill sig, $$

and it still hangs. I event tried commenting out the code that forks the children, so I know this process has no child processes.

The strange thing is that if I add wait or puts after the kill, it exits. Does that make any sense to you?

@guns

guns Sep 20, 2010

I played around with your gist this morning and I've discovered the issue.

The problem isn't that SIGCLD is faulty, but rather that the handler is invoked asynchronously. We are using a global children hash to store our references to the children via pid => id, but one nasty little secret about Hash that I found out is that it doesn't like concurrent operations. Explicitly, ruby does not allow you to add or delete keys from a hash while the hash is being iterated. See http://redmine.ruby-lang.org/issues/show/1535

I believe what is happening is that since rand(5) returns 0..4, the probability of the children dying at the exact same time is extremely high, and hence multiple threads of the SIGCLD handler are likely to call children#delete and children#[] at the exact same time (not exactly exact of course, but close enough so that the hash is being mutated by one trap handler while another thread attempts the same).

For proof of this, consider this snippet:

#!/usr/bin/ruby

children = {}

trap :CLD do
  pid = Process.wait
  i = children.delete(pid)
  puts "Reaping #{i}: #{pid}"
  children[fork { sleep i }] = i
end

1.upto 3 do |i|
  pid = fork { sleep i }
  puts "Spawned #{i}: #{pid}"
  children[pid] = i
end


loop do
  sleep
end

It is identical to the snippet you posted, except that instead of sleeping by rand(5), it sleeps by a fixed, staggered number. It works consistently on my machine on ruby 1.8.6, 1.8.7 and 1.9.2.

As to why placing Process.wait in the main loop alleviates this problem, I believe that is because Process.wait will handle a dead child before the SIGCLD handler has an opportunity to. If two children die at the same time, the main loop handles one, and then the SIGCLD handler handles the other. This way, the hash is less likely to be mutated concurrently.

Your original snippet works on ruby 1.9, not because the improved Hash implementation does not have the concurrent write problem, but probably because the new implementation is much faster, and hence less susceptible to race conditions (but not immune!)

So the answer to this problem is to either:

a) Use a subclassed Hash that implements a mutation queue
b) Avoid SIGCLD, and just handle dead children synchronously in the main loop.

The first solution is faster, but more complicated, and the second solution is simpler, but if you want to do anything else besides reap children in the main process, you have to use a separate thread; the SIGCLD handler is an easy way to get asynchronous logic without multithreading.

Whew!

@guns

guns Sep 21, 2010

Okay, so I decided to take another look at my analysis this morning and I realized that the zombie issue isn't actually due to Hash errors, but more likely due to race condition with the SIGCLD handler threads. I posted a new snippet here:

https://gist.github.com/d62f5f2c664134700ad3#comments

The core of the new signal handler is:

pids = []
begin
  # Avoid a SIGCLD race condition by collecting all available children
  while p = Process.wait(-1, Process::WNOHANG)
    pids << p
  end
rescue
  retry
end

If multiple trap threads are tripping over each other, we can just have the first handler loop and reap all available children.

Hash misses are still an issue if we are dealing with alot of processes dying in a very short amount of time, so we can account for that as well.

When you have time, tell me your thoughts about all of this.

end
end
@@ -129,7 +132,7 @@ def pid_file
end
def pid
- @pid ||= pid_file.read.to_i if pid_file.file?
+ @pid ||= (pid_file.read.to_i if pid_file.file?)
end
def running?
View
@@ -35,7 +35,7 @@
end
it "should overwrite pid file" do
- @master.start
+ silence_stderr { @master.start }
@master.pid_file.read.to_i.should_not == @pid
end
end
@@ -68,6 +68,10 @@
silence_stderr { Process.wait(fork { @master.stop }) }
end
+ after do
+ @master.pid_file.delete
+ end
+
it "should abort" do
$?.exitstatus.should == 1
end
@@ -84,6 +88,7 @@
@master.stop
@master.should_not be_running
wait_until { @master.pid_file.should_not be_file }
+ @master.pid.should_not be_nil # don't forget pid
end
end
end

0 comments on commit 4ac9b9e

Please sign in to comment.