waitpid blocking on leader processes in ruby 2.6 #151

jdelStrother · 2019-05-30T10:14:16Z

I've run into a problem with waiting for leader-processes to exit, which only surfaces in convoluted circumstances.

If you want the details:

• Our test suite launches a long-lived background process which is supposed to stay running for the duration of the test suite. We call .wait on the process so we can tell if it exited early due to an error, but normally it'll be killed when we send TERM to it in the test suite's at_exit hook.

• We also launch chromedriver (via selenium). Selenium sets leader=true on the child process

• When the test suite is done, selenium calls .stop on the chromedriver process.

• In ruby 2.6.3p62 (but not ruby 2.5.3p105), childprocess.stop blocks indefinitely, stuck in waitpid2 here

childprocess/lib/childprocess/unix/process.rb

Line 53 in 982b1ae

_, status = ::Process.waitpid2 _pid

The problem reduces down to this script:

require "childprocess"

# Start a background thread that launches a long-lived process.
# The child process will be killed once this script finishes (via at_exit)
Thread.new do
  child1 = ChildProcess.build("yes")
  child1.start
  at_exit do
    puts "child1.stop"
    child1.stop
    puts "child1 is cleaned up"
  end
  puts "Waiting for child1 to finish"
  status = child1.wait
  puts "child1 is done (#{status})"
end

# give the bg thread some time to ensure child1.wait has happened
sleep 0.1

# Start a second child process
child2 = ChildProcess.build("sleep", "10")
child2.leader = true
child2.start
puts "child2.stop"
# On ruby 2.6, this blocks until child1 is killed by hand
child2.stop
puts "Child2 is done!"

In theory, this ought to launch two child processes child1 & child2, stop child2, then stop child1 in the at_exit handler. In ruby 2.6 though, it gets stuck at child2.stop.

If we remove the child1.wait line, or if child2 isn't a leader, child2.stop doesn't get blocked.

Is it correct that for leader processes, childprocess passes the negative pid to waitpid2 ? Shouldn't the negative pid only be used for sending signals? Or is this a bug in ruby itself?

The text was updated successfully, but these errors were encountered:

Issue #151 reported a problem where in the update from Ruby 2.5 to Ruby 2.6 some programs would block forever on a `Process.waitpid2` call. Changing the call to pass in a positive PID instead of the negative PID fixes the issue as it will call the `waitpid` system call right away instead of delaying it for later. See #151 for further discussion of the possible reasons for the behavior.

sds · 2019-06-04T17:21:39Z

Hey @jdelStrother, thanks for the report.

To summarize:

You've found a bug
The bug is a race condition that only seems to happen on Ruby 2.6 (can avoid by adding sleep 0.1 before calling ::Process.waitpid2 _pid)
Passing the non-negative signal value to Process.waitpid2 does fix the issue, but it's not clear why as the behavior should be identical based on the documentation for waitpid(2)

I can reproduce your behavior on a Linux system (it raises Errno::EPERM on macOS).

I modified your program to print out the process ID and process group ID (PGID) along with process hierarchy, and changed the commands to make them easier to identify (changing Child 1 and 2 to be sleep 10000 and sleep 20000, respectively).

require "childprocess"

# Start a background thread that launches a long-lived process.
# The child process will be killed once this script finishes (via at_exit)
Thread.new do
  child1 = ChildProcess.build("sleep", "10000")
  child1.start
  at_exit do
    puts "child1.stop"
    child1.stop
    puts "child1 is cleaned up"
  end
  puts "Waiting for child1 to finish"
  status = child1.wait
  puts "child1 is done (#{status})"
end

# give the bg thread some time to ensure child1.wait has happened
sleep 0.1

# Start a second child process
child2 = ChildProcess.build("sleep", "20000")
child2.leader = true
child2.start
puts %x{ps faxo pid,ppid,pgid,command}
puts "child2.stop"
# On ruby 2.6, this blocks until child1 is killed by hand
child2.stop
puts "Child2 is done!"

I then ran this program on both Ruby 2.5 and Ruby 2.6:

Ruby 2.5

Waiting for child1 to finish
  PID  PPID  PGID COMMAND
    1     0     1 bash
   77     1    77 ruby leader.rb
   80    77    77  \_ sleep 10000
   83    77    83  \_ sleep 20000
   86    77    77  \_ ps faxo pid,ppid,pgid,command
child2.stop
Child2 is done!
child1.stop
child1 is done (15)
child1 is cleaned up

Ruby 2.6

Waiting for child1 to finish
  PID  PPID  PGID COMMAND
    1     0     1 bash
   52     1    52 ruby leader.rb
   54    52    52  \_ sleep 10000
   55    52    55  \_ sleep 20000
   56    52    52  \_ ps faxo pid,ppid,pgid,command
child2.stop
# ...does not terminate

Notice that in both cases, Child 2 is in its own process group as expected.

Killing Child 1 results in a SIGCHLD being sent to the parent Ruby process. My guess here is the fact that Process::waitpid2 is not async-signal-safe causes the race condition. Since we send the KILL signal to Child 2 before we wait, it's possible that we'll receive a SIGCHLD in between those two events, resulting in the waitpid2 call blocking waiting on another SIGCHLD. When you terminate Child 1, another SIGCHLD is sent to the parent Ruby process, which is then unblocked since it can now confirm Child 2 is dead.

This is all speculation, however. Ruby's waitpid implementation is complex and would be best examined by Ruby maintainers to understand this behavior, but it does appear to handle the positive PID case differently from the negative PID case. A positive PID results in immediate execution of the waitpid system call, whereas negative results in the PGID being added to a list for later inspection.

The primary question we need to understand before we make the proposed change is whether it is semantically correct for ChildProcess::Unix::Process#wait to call waitpid2 with a negative PID (current implementation) or the original PID. I've opened #152 for further discussion.

Issue #151 reported a problem where in the update from Ruby 2.5 to Ruby 2.6 some programs would block forever on a `Process.waitpid2` call. Changing the call to pass in a positive PID instead of the negative PID fixes the issue as it will call the `waitpid` system call right away instead of delaying it for later. See #151 for further discussion of the possible reasons for the behavior.

sds · 2019-06-11T04:08:54Z

Didn't receive any feedback in #152. Going to close this as it solves the issue as described and seems to achieve the same semantic behavior.

sds mentioned this issue Jun 4, 2019

Fix ChildProcess::Unix::Process#wait method with leader processes #152

Merged

sds added the bug label Jun 11, 2019

sds closed this as completed Jun 11, 2019

jdelStrother mentioned this issue Jul 4, 2019

Fix ChildProcess::Unix::Process#exited? method with leader processes #154

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

waitpid blocking on leader processes in ruby 2.6 #151

waitpid blocking on leader processes in ruby 2.6 #151

jdelStrother commented May 30, 2019

sds commented Jun 4, 2019 •

edited

sds commented Jun 11, 2019

waitpid blocking on leader processes in ruby 2.6 #151

waitpid blocking on leader processes in ruby 2.6 #151

Comments

jdelStrother commented May 30, 2019

sds commented Jun 4, 2019 • edited

sds commented Jun 11, 2019

sds commented Jun 4, 2019 •

edited