os.waitpid(pid) seems racy under libuv on linux #1104

Closed
jamadden opened this Issue Feb 16, 2018 · 2 comments

jamadden commented Feb 16, 2018

Seen on travis:

  Traceback (most recent call last):
    File "/home/travis/build/gevent/gevent/src/greentest/2.7/test_socketserver.py", line 201, in test_ForkingTCPServer
      self.stream_examine)
    File "/home/travis/.runtimes/versions/python2.7.14/lib/python2.7/contextlib.py", line 24, in __exit__
      self.gen.next()
    File "/home/travis/build/gevent/gevent/src/greentest/2.7/test_socketserver.py", line 68, in simple_subprocess
      testcase.assertEqual(pid2, pid)
  AssertionError: 26421 != 26430

where simple_subprocess is a context manager:

import contextlib
import os

@contextlib.contextmanager
def simple_subprocess(testcase):
    pid = os.fork()
    if pid == 0:
        # Don't raise an exception; it would be caught by the test harness.
        os._exit(72)
    yield None
    pid2, status = os.waitpid(pid, 0)
    testcase.assertEqual(pid2, pid)
    testcase.assertEqual(72 << 8, status)

The pid returned by waitpid doesn't match the pid we spawned, even though we explicitly passed exactly that pid to waitpid. It's not clear how this happens. I haven't seen this on macOS (and maybe not on Python 3?), but macOS delivers child signals differently than Linux does.
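For comparison, here is a minimal sketch of the same fork/waitpid pattern using only the standard library, with no gevent involved. `spawn_and_wait` is a hypothetical helper name, not something from the test suite. It shows the contract the test relies on: when a specific positive pid is passed to `os.waitpid`, the pid it returns should be that same pid.

```python
import os


def spawn_and_wait(exit_code=72):
    """Fork a child that exits immediately, then wait for exactly that child.

    Hypothetical helper for illustration; mirrors the shape of
    simple_subprocess without the context manager.
    """
    pid = os.fork()
    if pid == 0:
        # In the child: exit right away without running any cleanup handlers.
        os._exit(exit_code)
    # In the parent: block until this specific child has exited.
    pid2, status = os.waitpid(pid, 0)
    return pid, pid2, status


if __name__ == "__main__":
    pid, pid2, status = spawn_and_wait()
    print(pid == pid2, os.WIFEXITED(status), os.WEXITSTATUS(status))
```

With the plain standard-library `os.waitpid`, `pid2` is expected to equal `pid`, which is why a mismatch points at the libuv-backed replacement rather than at the test.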

jamadden commented Feb 19, 2018

I was unable to reproduce this on Ubuntu 16.04 (?) with kernel 4.4.0-112 in a virtual machine with 2 CPUs and 4GB of memory on CPython 2.7.12; there, test_socketserver.py runs in about 0.7s (half the time it took in the failed run). If I halve the memory, cut the VM back to one processor, and throttle it so that test_socketserver.py takes 1.4s, I can reproduce it about one time in ten, so that's a start.

jamadden commented Feb 19, 2018

I've also been able to reproduce a hang waiting on a process.

The AssertionError above is just an implementation bug, but the hang is an actual race condition: a new child watcher for the pid is started at the same time the old child watcher runs, and the new child watcher is never called. This can be fixed by controlling more carefully when child watchers run: in libev they're batched, whereas here they were called as soon as the signal handler ran, so we can defer them to a batch as well.
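As a rough illustration of the deferral idea, here is a pure-Python simulation. It is not gevent's actual code: `BatchedChildWatchers` and its methods are hypothetical names, and no real signals or processes are involved. The "signal handler" step only records which pid exited, and the callbacks run later in one batch, so a watcher started between the signal and the batch still gets called.

```python
from collections import defaultdict


class BatchedChildWatchers:
    """Hypothetical registry that defers child-watcher callbacks to a batch."""

    def __init__(self):
        self._watchers = defaultdict(list)  # pid -> list of callbacks
        self._pending = {}  # pid -> exit status, recorded but not yet delivered

    def start(self, pid, callback):
        # Register interest in a pid; callback(pid, status) runs in a later batch.
        self._watchers[pid].append(callback)

    def on_sigchld(self, pid, status):
        # Stand-in for the signal handler: only record the event,
        # never run callbacks from here.
        self._pending[pid] = status

    def run_batch(self):
        # Deliver every recorded event to the watchers registered so far.
        pending, self._pending = self._pending, {}
        called = 0
        for pid, status in pending.items():
            for callback in self._watchers.pop(pid, []):
                callback(pid, status)
                called += 1
        return called


if __name__ == "__main__":
    calls = []
    watchers = BatchedChildWatchers()
    watchers.start(100, lambda pid, status: calls.append(("old", pid, status)))
    watchers.on_sigchld(100, 72 << 8)
    # This watcher is started after the signal but before the batch runs.
    watchers.start(100, lambda pid, status: calls.append(("new", pid, status)))
    watchers.run_batch()
    print(calls)
```

If the callbacks were instead invoked directly from `on_sigchld`, the second watcher in this simulation would be registered after delivery had already happened and would never fire, which is the shape of the hang described above.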

jamadden added a commit that referenced this issue Feb 19, 2018
