Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

waitpid blocking on leader processes in ruby 2.6 #151

Closed
jdelStrother opened this issue May 30, 2019 · 2 comments
Closed

waitpid blocking on leader processes in ruby 2.6 #151

jdelStrother opened this issue May 30, 2019 · 2 comments
Labels

Comments

@jdelStrother
Copy link
Contributor

I've run into a problem with waiting for leader-processes to exit, which only surfaces in convoluted circumstances.

If you want the details:

• Our test suite launches a long-lived background process which is supposed to stay running for the duration of the test suite. We call .wait on the process so we can tell if it exited early due to an error, but normally it'll be killed when we send TERM to it in the test suite's at_exit hook.

• We also launch chromedriver (via selenium). Selenium sets leader=true on the child process

• When the test suite is done, selenium calls .stop on the chromedriver process.

• In ruby 2.6.3p62 (but not ruby 2.5.3p105), childprocess.stop blocks indefinitely, stuck in waitpid2 here

_, status = ::Process.waitpid2 _pid

The problem reduces down to this script:

require "childprocess"

# Start a background thread that launches a long-lived process.
# The child process will be killed once this script finishes (via at_exit)
Thread.new do
  child1 = ChildProcess.build("yes")
  child1.start
  at_exit do
    puts "child1.stop"
    child1.stop
    puts "child1 is cleaned up"
  end
  puts "Waiting for child1 to finish"
  status = child1.wait
  puts "child1 is done (#{status})"
end

# give the bg thread some time to ensure child1.wait has happened
sleep 0.1

# Start a second child process
child2 = ChildProcess.build("sleep", "10")
child2.leader = true
child2.start
puts "child2.stop"
# On ruby 2.6, this blocks until child1 is killed by hand
child2.stop
puts "Child2 is done!"

In theory, this ought to launch two child processes child1 & child2, stop child2, then stop child1 in the at_exit handler. In ruby 2.6 though, it gets stuck at child2.stop.

If we remove the child1.wait line, or if child2 isn't a leader, child2.stop doesn't get blocked.

Is it correct that for leader processes, childprocess passes the negative pid to waitpid2 ? Shouldn't the negative pid only be used for sending signals? Or is this a bug in ruby itself?

sds added a commit that referenced this issue Jun 4, 2019
Issue #151 reported a problem where in the update from Ruby 2.5 to Ruby
2.6 some programs would block forever on a `Process.waitpid2` call.

Changing the call to pass in a positive PID instead of the negative PID
fixes the issue as it will call the `waitpid` system call right away
instead of delaying it for later.

See #151 for further discussion of the possible reasons for the
behavior.
@sds
Copy link
Collaborator

sds commented Jun 4, 2019

Hey @jdelStrother, thanks for the report.

To summarize:

  • You've found a bug
  • The bug is a race condition that only seems to happen on Ruby 2.6 (can avoid by adding sleep 0.1 before calling ::Process.waitpid2 _pid)
  • Passing the non-negative signal value to Process.waitpid2 does fix the issue, but it's not clear why as the behavior should be identical based on the documentation for waitpid(2)

I can reproduce your behavior on a Linux system (it raises Errno::EPERM on macOS).

I modified your program to print out the process ID and process group ID (PGID) along with process hierarchy, and changed the commands to make them easier to identify (changing Child 1 and 2 to be sleep 10000 and sleep 20000, respectively).

require "childprocess"

# Start a background thread that launches a long-lived process.
# The child process will be killed once this script finishes (via at_exit)
Thread.new do
  child1 = ChildProcess.build("sleep", "10000")
  child1.start
  at_exit do
    puts "child1.stop"
    child1.stop
    puts "child1 is cleaned up"
  end
  puts "Waiting for child1 to finish"
  status = child1.wait
  puts "child1 is done (#{status})"
end

# give the bg thread some time to ensure child1.wait has happened
sleep 0.1

# Start a second child process
child2 = ChildProcess.build("sleep", "20000")
child2.leader = true
child2.start
puts %x{ps faxo pid,ppid,pgid,command}
puts "child2.stop"
# On ruby 2.6, this blocks until child1 is killed by hand
child2.stop
puts "Child2 is done!"

I then ran this program on both Ruby 2.5 and Ruby 2.6:

Ruby 2.5

Waiting for child1 to finish
  PID  PPID  PGID COMMAND
    1     0     1 bash
   77     1    77 ruby leader.rb
   80    77    77  \_ sleep 10000
   83    77    83  \_ sleep 20000
   86    77    77  \_ ps faxo pid,ppid,pgid,command
child2.stop
Child2 is done!
child1.stop
child1 is done (15)
child1 is cleaned up

Ruby 2.6

Waiting for child1 to finish
  PID  PPID  PGID COMMAND
    1     0     1 bash
   52     1    52 ruby leader.rb
   54    52    52  \_ sleep 10000
   55    52    55  \_ sleep 20000
   56    52    52  \_ ps faxo pid,ppid,pgid,command
child2.stop
# ...does not terminate

Notice that in both cases, Child 2 is in its own process group as expected.

Killing Child 1 results in a SIGCHLD being sent to the parent Ruby process. My guess here is the fact that Process::waitpid2 is not async-signal-safe causes the race condition. Since we send the KILL signal to Child 2 before we wait, it's possible that we'll receive a SIGCHLD in between those two events, resulting in the waitpid2 call blocking waiting on another SIGCHLD. When you terminate Child 1, another SIGCHLD is sent to the parent Ruby process, which is then unblocked since it can now confirm Child 2 is dead.

This is all speculation, however. Ruby's waitpid implementation is complex and would be best examined by Ruby maintainers to understand this behavior, but it does appear to handle the positive PID case differently from the negative PID case. A positive PID results in immediate execution of the waitpid system call, whereas negative results in the PGID being added to a list for later inspection.

The primary question we need to understand before we make the proposed change is whether it is semantically correct for ChildProcess::Unix::Process#wait to call waitpid2 with a negative PID (current implementation) or the original PID. I've opened #152 for further discussion.

sds added a commit that referenced this issue Jun 4, 2019
Issue #151 reported a problem where in the update from Ruby 2.5 to Ruby
2.6 some programs would block forever on a `Process.waitpid2` call.

Changing the call to pass in a positive PID instead of the negative PID
fixes the issue as it will call the `waitpid` system call right away
instead of delaying it for later.

See #151 for further discussion of the possible reasons for the
behavior.
sds added a commit that referenced this issue Jun 11, 2019
Issue #151 reported a problem where in the update from Ruby 2.5 to Ruby
2.6 some programs would block forever on a `Process.waitpid2` call.

Changing the call to pass in a positive PID instead of the negative PID
fixes the issue as it will call the `waitpid` system call right away
instead of delaying it for later.

See #151 for further discussion of the possible reasons for the
behavior.
@sds sds added the bug label Jun 11, 2019
@sds
Copy link
Collaborator

sds commented Jun 11, 2019

Didn't receive any feedback in #152. Going to close this as it solves the issue as described and seems to achieve the same semantic behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants