Skip to content

Commit

Permalink
Prevent stdio deadlock in forked children (#79522)
Browse files Browse the repository at this point in the history
* background threads writing to stdout/stderr can cause children to deadlock if a thread in the parent holds the internal lock on the BufferedWriter wrapper
* prevent writes to std handles during fork by monkeypatching stdout/stderr during display startup to require a mutex lock with fork(); this ensures no background threads can hold the lock during a fork operation
* add integration test that fails reliably on Linux without this fix
  • Loading branch information
nitzmahone authored and anweshadas committed Dec 6, 2022
1 parent 80d2f8d commit 8a77b2f
Show file tree
Hide file tree
Showing 13 changed files with 400 additions and 15 deletions.
2 changes: 2 additions & 0 deletions changelogs/fragments/fork_safe_stdio.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
bugfixes:
- display - reduce risk of post-fork output deadlocks (https://github.com/ansible/ansible/pull/79522)
58 changes: 58 additions & 0 deletions docs/docsite/rst/porting_guides/porting_guide_6.rst
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,58 @@ Networking

No notable changes

Porting Guide for v6.7.0
========================

Known Issues
------------

community.routeros
~~~~~~~~~~~~~~~~~~

- api_modify - when limits for entries in ``queue tree`` are defined as human readable - for example ``25M`` -, the configuration will be correctly set in ROS, but the module will indicate the item is changed on every run even when there was no change done. This is caused by the ROS API which returns the number in bytes - for example ``25000000`` (which is inconsistent with the CLI behavior). In order to mitigate that, the limits have to be defined in bytes (those will still appear as human readable in the ROS CLI) (https://github.com/ansible-collections/community.routeros/pull/131).
- api_modify, api_info - ``routing ospf area``, ``routing ospf area range``, ``routing ospf instance``, ``routing ospf interface-template`` paths are not fully implemeted for ROS6 due to the significat changes between ROS6 and ROS7 (https://github.com/ansible-collections/community.routeros/pull/131).

Major Changes
-------------

cisco.meraki
~~~~~~~~~~~~

- meraki_mr_l7_firewall - New module
- meraki_webhook_payload_template - New module

community.zabbix
~~~~~~~~~~~~~~~~

- all modules are opting away from zabbix-api and using httpapi ansible.netcommon plugin. We will support zabbix-api for backwards compatibility until next major release. See our README.md for more information about how to migrate
- zabbix_agent and zabbix_proxy roles are opting away from zabbix-api and use httpapi ansible.netcommon plugin. We will support zabbix-api for backwards compatibility until next major release. See our README.md for more information about how to migrate

containers.podman
~~~~~~~~~~~~~~~~~

- New become plugin - podman_unshare
- Podman generate systemd module

fortinet.fortimanager
~~~~~~~~~~~~~~~~~~~~~

- Fix compatibility issue for ansible 2.9.x and ansible-base 2.10.x.
- support Ansible changelogs.

fortinet.fortios
~~~~~~~~~~~~~~~~

- Support FortiOS v7.0.6, v7.0.7, v7.0.8, v7.2.1, v7.2.2.

Deprecated Features
-------------------

community.general
~~~~~~~~~~~~~~~~~

- Please note that some tools, like the VScode plugin (https://github.com/ansible/vscode-ansible/issues/573), or ``ansible-doc --list --type module``, suggest to replace the correct FQCNs for modules and actions in community.general with internal names that have more than three components. For example, ``community.general.ufw`` is suggested to be replaced by ``community.general.system.ufw``. While these longer names do work, they are considered **internal names** by the collection and are subject to change and be removed at all time. They **will** be removed in community.general 6.0.0 and result in deprecation messages. Avoid using these internal names, and use general three-component FQCNs (``community.general.<name_of_module>``) instead (https://github.com/ansible-collections/community.general/pull/5373).

Porting Guide for v6.6.0
========================

Expand Down Expand Up @@ -131,6 +183,12 @@ community.general

- newrelic_deployment - removed New Relic v1 API, added support for v2 API (https://github.com/ansible-collections/community.general/pull/5341).

fortinet.fortimanager
~~~~~~~~~~~~~~~~~~~~~

- Many fixes for Ansible sanity test warnings & errors.
- Support FortiManager Schema 7.2.0 , 98 new modules

Deprecated Features
-------------------

Expand Down
10 changes: 6 additions & 4 deletions lib/ansible/executor/process/worker.py
Original file line number Diff line number Diff line change
Expand Up @@ -87,10 +87,12 @@ def start(self):
'''

self._save_stdin()
try:
return super(WorkerProcess, self).start()
finally:
self._new_stdin.close()
# FUTURE: this lock can be removed once a more generalized pre-fork thread pause is in place
with display._lock:
try:
return super(WorkerProcess, self).start()
finally:
self._new_stdin.close()

def _hard_exit(self, e):
'''
Expand Down
29 changes: 29 additions & 0 deletions lib/ansible/utils/display.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@
from ansible.utils.multiprocessing import context as multiprocessing_context
from ansible.utils.singleton import Singleton
from ansible.utils.unsafe_proxy import wrap_var
from functools import wraps


_LIBC = ctypes.cdll.LoadLibrary(ctypes.util.find_library('c'))
Expand Down Expand Up @@ -163,12 +164,33 @@ def filter(self, record):
)


def _synchronize_textiowrapper(tio, lock):
# Ensure that a background thread can't hold the internal buffer lock on a file object
# during a fork, which causes forked children to hang. We're using display's existing lock for
# convenience (and entering the lock before a fork).
def _wrap_with_lock(f, lock):
@wraps(f)
def locking_wrapper(*args, **kwargs):
with lock:
return f(*args, **kwargs)

return locking_wrapper

buffer = tio.buffer

# monkeypatching the underlying file-like object isn't great, but likely safer than subclassing
buffer.write = _wrap_with_lock(buffer.write, lock)
buffer.flush = _wrap_with_lock(buffer.flush, lock)


class Display(metaclass=Singleton):

def __init__(self, verbosity=0):

self._final_q = None

# NB: this lock is used to both prevent intermingled output between threads and to block writes during forks.
# Do not change the type of this lock or upgrade to a shared lock (eg multiprocessing.RLock).
self._lock = threading.RLock()

self.columns = None
Expand Down Expand Up @@ -199,6 +221,13 @@ def __init__(self, verbosity=0):

self._set_column_width()

try:
# NB: we're relying on the display singleton behavior to ensure this only runs once
_synchronize_textiowrapper(sys.stdout, self._lock)
_synchronize_textiowrapper(sys.stderr, self._lock)
except Exception as ex:
self.warning(f"failed to patch stdout/stderr for fork-safety: {ex}")

def set_queue(self, queue):
"""Set the _final_q on Display, so that we know to proxy display over the queue
instead of directly writing to stdout/stderr from forks
Expand Down
3 changes: 3 additions & 0 deletions test/integration/targets/fork_safe_stdio/aliases
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
shippable/posix/group3
context/controller
skip/macos
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
import atexit
import os
import sys

from ansible.plugins.callback import CallbackBase
from ansible.utils.display import Display
from threading import Thread

# This callback plugin reliably triggers the deadlock from https://github.com/ansible/ansible-runner/issues/1164 when
# run on a TTY/PTY. It starts a thread in the controller that spews unprintable characters to stdout as fast as
# possible, while causing forked children to write directly to the inherited stdout immediately post-fork. If a fork
# occurs while the spew thread holds stdout's internal BufferedIOWriter lock, the lock will be orphaned in the child,
# and attempts to write to stdout there will hang forever.

# Any mechanism that ensures non-main threads do not hold locks before forking should allow this test to pass.

# ref: https://docs.python.org/3/library/io.html#multi-threading
# ref: https://github.com/python/cpython/blob/0547a981ae413248b21a6bb0cb62dda7d236fe45/Modules/_io/bufferedio.c#L268


class CallbackModule(CallbackBase):
CALLBACK_VERSION = 2.0
CALLBACK_NAME = 'spewstdio'

def __init__(self):
super().__init__()
self.display = Display()

if os.environ.get('SPEWSTDIO_ENABLED', '0') != '1':
self.display.warning('spewstdio test plugin loaded but disabled; set SPEWSTDIO_ENABLED=1 to enable')
return

self.display = Display()
self._keep_spewing = True

# cause the child to write directly to stdout immediately post-fork
os.register_at_fork(after_in_child=lambda: print(f"hi from forked child pid {os.getpid()}"))

# in passing cases, stop spewing when the controller is exiting to prevent fatal errors on final flush
atexit.register(self.stop_spew)

self._spew_thread = Thread(target=self.spew, daemon=True)
self._spew_thread.start()

def stop_spew(self):
self._keep_spewing = False

def spew(self):
# dump a message so we know the callback thread has started
self.display.warning("spewstdio STARTING NONPRINTING SPEW ON BACKGROUND THREAD")

while self._keep_spewing:
# dump a non-printing control character directly to stdout to avoid junking up the screen while still
# doing lots of writes and flushes.
sys.stdout.write('\x1b[K')
sys.stdout.flush()

self.display.warning("spewstdio STOPPING SPEW")
5 changes: 5 additions & 0 deletions test/integration/targets/fork_safe_stdio/hosts
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
[all]
local-[1:10]

[all:vars]
ansible_connection=local
11 changes: 11 additions & 0 deletions test/integration/targets/fork_safe_stdio/run-with-pty.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#!/usr/bin/env python
"""Run a command using a PTY."""

import sys

if sys.version_info < (3, 10):
import vendored_pty as pty
else:
import pty

sys.exit(1 if pty.spawn(sys.argv[1:]) else 0)
20 changes: 20 additions & 0 deletions test/integration/targets/fork_safe_stdio/runme.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
#!/usr/bin/env bash

set -eu

echo "testing for stdio deadlock on forked workers (10s timeout)..."

# Enable a callback that trips deadlocks on forked-child stdout, time out after 10s; forces running
# in a pty, since that tends to be much slower than raw file I/O and thus more likely to trigger the deadlock.
# Redirect stdout to /dev/null since it's full of non-printable garbage we don't want to display unless it failed
ANSIBLE_CALLBACKS_ENABLED=spewstdio SPEWSTDIO_ENABLED=1 python run-with-pty.py timeout 10s ansible-playbook -i hosts -f 5 test.yml > stdout.txt && RC=$? || RC=$?

if [ $RC != 0 ]; then
echo "failed; likely stdout deadlock. dumping raw output (may be very large)"
cat stdout.txt
exit 1
fi

grep -q -e "spewstdio STARTING NONPRINTING SPEW ON BACKGROUND THREAD" stdout.txt || (echo "spewstdio callback was not enabled"; exit 1)

echo "PASS"
5 changes: 5 additions & 0 deletions test/integration/targets/fork_safe_stdio/test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
- hosts: all
gather_facts: no
tasks:
- debug:
msg: yo

0 comments on commit 8a77b2f

Please sign in to comment.