Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang on integration/module_utils/adjacent_to_playbook.yml #547

Closed
dw opened this issue Feb 18, 2019 · 2 comments

Comments

@dw
Copy link
Owner

commented Feb 18, 2019

Custom modutils, so it's a forked job:

PLAY [localhost] *************************************************************************************************************************************************************

TASK [shell _raw_params=env | grep EVIL_VARS_PLUGIN] *************************************************************************************************************************
changed: [localhost]

PLAY [integration/module_utils/adjacent_to_playbook.yml] *********************************************************************************************************************

TASK [modrole : uses_external3 ] *********************************************************************************************************************************************
ok: [target-debian-1]
ok: [target-debian-2]
ok: [target-debian-4]

Process list (victim is 7155):

 3301 ?        Sl     0:00  |   \_ docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/9a54d963e2761d2c7cde9
 3382 ?        Ss     0:00  |   |   \_ /usr/sbin/sshd -D
 6275 ?        Ss     0:01  |   |       \_ sshd: mitogen__has_sudo_nopw [priv]
 6429 ?        S      0:00  |   |           \_ sshd: mitogen__has_sudo_nopw@notty
 6437 ?        Ssl    0:00  |   |               \_ /usr/bin/python(mitogen:dmw@k3:23175)
 6493 ?        Sl     0:00  |   |                   \_ /usr/bin/python(mitogen:dmw@k3:23175)
 7155 ?        Sl     0:00  |   |                   |   \_ /usr/bin/python(mitogen:dmw@k3:23175)
 6574 pts/0    Ss+    0:00  |   |                   \_ sudo -u root -H -- /usr/bin/python -c import codecs,os,sys;_=codecs.decode;exec(_(_("eNqFkcFPwyAUxu/9K3oDMrbRzSXahESzg/
 6575 pts/0    Sl+    0:00  |   |                   |   \_ /usr/bin/python(mitogen:mitogen__has_sudo_nopw@target-debian-3:9504)
 6618 pts/0    Sl+    0:00  |   |                   |       \_ /usr/bin/python(mitogen:mitogen__has_sudo_nopw@target-debian-3:9504)
 6703 pts/1    Ss+    0:00  |   |                   \_ sudo -u root -H -- /usr/bin/python -c import codecs,os,sys;_=codecs.decode;exec(_(_("eNqFkcFPwyAUxu/9K3oDMrbRzSXahESzg/
 6705 pts/1    Sl+    0:00  |   |                       \_ /usr/bin/python(mitogen:mitogen__has_sudo_nopw@target-debian-3:9504)
 6767 pts/1    Sl+    0:00  |   |                           \_ /usr/bin/python(mitogen:mitogen__has_sudo_nopw@target-debian-3:9504)

One thread waiting in futex:

strace: Process 7155 attached with 19 threads
[pid  7192] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7191] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7190] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7189] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7186] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7184] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7183] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7180] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7188] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7187] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7185] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7178] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7176] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7173] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7170] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7168] futex(0x5560209d7270, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid  7167] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid  7156] select(15, [3 5 7 14], [], [], NULL <unfinished ...>
[pid  7155] restart_syscall(<... resuming interrupted restart_syscall ...>

Seems either the SSH parent or the fork parent was in the importer when fork happened:

[Switching to thread 4 (LWP 7168)]
#0  0x00007f437156d536 in futex_abstimed_wait_cancelable (private=393, abstime=0x0, expected=0, futex_word=0x5560209d7270) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205     ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x00007f437156d536 in futex_abstimed_wait_cancelable (private=393, abstime=0x0, expected=0, futex_word=0x5560209d7270) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=0x5560209d7270, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f437156d5e4 in __new_sem_wait_slow (sem=0x5560209d7270, abstime=0x0) at sem_waitcommon.c:170
#3  0x000055601e8cca04 in PyThread_acquire_lock () at ../Python/thread_pthread.h:324
#4  0x000055601e8498eb in _PyImport_AcquireLock () at ../Python/import.c:309
#5  0x000055601e8dbe81 in PyImport_ImportModuleLevel () at ../Python/import.c:2311
#6  0x000055601e8de388 in builtin___import__ () at ../Python/bltinmodule.c:49
#7  0x000055601e8de0c3 in PyObject_Call () at ../Objects/abstract.c:2547
#8  0x000055601e8fba10 in PyEval_CallObjectWithKeywords () at ../Python/ceval.c:4221
#9  0x000055601e8f47bb in PyEval_EvalFrameEx () at ../Python/ceval.c:2624
#10 0x000055601e8ef9f5 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584
#11 0x000055601e90c778 in function_call.lto_priv () at ../Objects/funcobject.c:523
#12 0x000055601e8de0c3 in PyObject_Call () at ../Objects/abstract.c:2547
#13 0x000055601e8f83a0 in ext_do_call (nk=0, na=1, flags=<optimized out>, pp_stack=0x7f436d3c07f0, func=<function at remote 0x7f43706681b8>) at ../Python/ceval.c:4666
#14 PyEval_EvalFrameEx () at ../Python/ceval.c:3028
#15 0x000055601e8f714f in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7f436d3c0938, func=<function at remote 0x7f437181bde8>)
    at ../Python/ceval.c:4437
#16 call_function (oparg=<optimized out>, pp_stack=0x7f436d3c0938) at ../Python/ceval.c:4372
#17 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#18 0x000055601e8f714f in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7f436d3c0a88, func=<function at remote 0x7f437181bf50>)
    at ../Python/ceval.c:4437
#19 call_function (oparg=<optimized out>, pp_stack=0x7f436d3c0a88) at ../Python/ceval.c:4372
#20 PyEval_EvalFrameEx () at ../Python/ceval.c:2989
#21 0x000055601e8ef9f5 in PyEval_EvalCodeEx () at ../Python/ceval.c:3584
#22 0x000055601e90c5be in function_call.lto_priv () at ../Objects/funcobject.c:523
#23 0x000055601e8de0c3 in PyObject_Call () at ../Objects/abstract.c:2547
#24 0x000055601e922e1e in instancemethod_call.lto_priv () at ../Objects/classobject.c:2602
#25 0x000055601e8de0c3 in PyObject_Call () at ../Objects/abstract.c:2547
#26 0x000055601e8fba10 in PyEval_CallObjectWithKeywords () at ../Python/ceval.c:4221
#27 0x000055601e9c4722 in t_bootstrap () at ../Modules/threadmodule.c:620
#28 0x00007f4371565494 in start_thread (arg=0x7f436d3c1700) at pthread_create.c:456
#29 0x00007f4370982acf in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f436d3c1700, max=<optimized out>) at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#30 0x0000000000000000 in ?? ()
@dw

This comment has been minimized.

Copy link
Owner Author

commented Feb 25, 2019

easy to reproduce:

[18:53:44 eldil!46 ansible] while : ; do ./mitogen_ansible_playbook.py -i soak-inv integration/module_utils/adjacent_to_playbook.yml  ; done

with 4 debian containers as in mitogen_tests CI setup

@dw

This comment has been minimized.

Copy link
Owner Author

commented Aug 8, 2019

  • Custom module utils loader is running on the main thread
  • it is blocked on Latch.get() inside PushFileService
  • because that's what the test is for -- custom module utils test
  • it is holding the PyImport lock

meanwhile

  • a _stub_service_main ??? thread is also trying to acquire PyImport lock, in order to import 'mitogen.service' module

initial idea: _stub_service_main was called to deliver the small file needed to wake up the main thread, but the file will never arrive because the stub cannot take the importer lock

dw added a commit that referenced this issue Aug 8, 2019

issue #547: core/service: race/deadlock-free service pool init
The previous method of spinning up a transient thread to import the
service pool in a child context could deadlock with use of the importer
on the main thread. Therefore wake the main thread to handle import for
us, and use a regular Receiver to buffer messages to the stub, which is
inherited rather than replaced by the real service pool.

dw added a commit that referenced this issue Aug 8, 2019

@dw dw closed this Aug 8, 2019

dw added a commit that referenced this issue Aug 8, 2019

dw added a commit that referenced this issue Aug 8, 2019

Merge remote-tracking branch 'origin/dmw'
* origin/dmw:
  docs: merge signals.rst into internals.rst
  os_fork: do not attempt to cork the active thread.
  parent: fix get_log_level() for split out loggers.
  issue #547: fix service_test failures.
  issue #547: update Changelog.
  issue #547: core/service: race/deadlock-free service pool init
  docs: update Changelog.
  select: make Select.add() handle multiple buffered items.
  core/select: add {Select,Latch,Receiver}.size(), deprecate empty()
  parent: docstring fixes
  core: remove dead Router.on_shutdown() and Router "shutdown" signal
  testlib: use lsof +E for much clearer leaked FD output
  [stream-refactor] stop leaking FD 100 for the life of the child
  core: split preserve_tty_fp() out into a function
  parent: zombie reaping v3
  issue #410: fix test failure due to obsolete parentfp/childfp
  issue #170: replace Timer.cancelled with Timer.active
  core: more descriptive graceful shutdown timeout error
  docs: update changelog
  core: fix Python2.4 crash due to missing Logger.getChild().
  issue #410: automatically work around SELinux braindamage.
  core: cache stream reference in DelimitedProtocol
  parent: docstring formatting
  docs: remove fakessh from home page, it's been broken forever
  docs: add changelog thanks
  Disable Azure pipelines build for docs-master too.
  docs: udpate Changelog.
  docs: tweak Changelog wording
  [linear2] merge fallout: re-enable _send_module_forwards().
  docs: another round of docstring cleanups.
  master: allow filtering forwarded logs using logging package functions.
  docs: many more internals.rst tidyups
  tests: fix error in affinity_test
  service: centralize fetching thread name, and tidy up logs
  [stream-refactor] get caught up on internals.rst updates
  Stop using mitogen root logger in more modules, remove unused loggers
  tests: stop dumping Docker help output in the log.
  parent: move subprocess creation to mux thread too
  Split out and make readable more log messages across both packages
  ansible: log affinity assignments
  ci: log failed command line, and try enabling stdout line buffering
  ansible: improve docstring
  [linear2] simplify _listener_for_name()
  ansible: stop relying on SIGTERM to shut down service pool
  tests: move tty_create_child tests together
  ansible: cleanup various docstrings
  parent: define Connection behaviour during Broker.shutdown()
  issue #549: ansible: reduce risk by capping RLIM_INFINITY

dw added a commit that referenced this issue Aug 18, 2019

Merge remote-tracking branch 'origin/v028' into stable
* origin/v028: (383 commits)
  Bump version for release.
  docs: update Changelog for 0.2.8.
  issue #627: add test and tweak Reaper behaviour.
  docs: lots more changelog concision
  docs: changelog concision
  docs: more changelog tweaks
  docs: reorder chapters
  docs: versionless <title>
  docs: update supported Ansible version, mention unsupported features
  docs: changelog fixes/tweaks
  issue #590: update Changelog.
  issue #621: send ADD_ROUTE earlier and add test for early logging.
  issue #590: whoops, import missing test modules
  issue #590: rework ParentEnumerationMethod to recursively handle bad modules
  issue #627: reduce the default pool size in a child to 2.
  tests: add a few extra service tests.
  docs: some more hyperlink joy
  docs: more hyperlinks
  docs: add domainrefs plugin to make link aliases everywhere \o/
  docs: link IS_DEAD in changelog
  docs: tweaks to better explain changelog race
  issue #533: update routing to account for DEL_ROUTE propagation race
  tests: use defer_sync() Rather than defer() + ancient sync_with_broker()
  tests: one case from doas_test was invoking su
  tests: hide memory-mapped files from lsof output
  issue #615: remove meaningless test
  issue #625: ignore SIGINT within MuxProcess
  issue #625: use exec() instead of subprocess in mitogen_ansible_playbook
  issue #615: regression test
  issue #615: update Changelog.
  issue #615: ensure 4GB max_message_size is configured for task workers.
  issue #615: update Changelog.
  issue #615: route a dead message to recipients when no reply is expected
  issue #615: fetch_file() might be called with AnsibleUnicode.
  issue #615: redirect 'fetch' action to 'mitogen_fetch'.
  issue #615: extricate slurp brainwrong from mitogen_fetch
  issue #615: ansible: import Ansible fetch.py action plug-in
  issue #533: include object identity of Stream in repr()
  docs: lots more changelog
  issue #595: add buildah to docs and changelog.
  docs: a few more internals.rst additions
  ci: update to Ansible 2.8.3
  tests: another random string changed in 2.8.3
  tests: fix sudo_flags_failure for Ansible 2.8.3
  ci: fix procps command line format warning
  Whoops, merge together lgtm.yml and .lgtm.yml
  issue #440: log Python version during bootstrap.
  docs: update changelog
  issue #558: disable test on OSX to cope with boundless mediocrity
  issue #558, #582: preserve remote tmpdir if caller did not supply one
  issue #613: must await 'exit' and 'disconnect' in wait=False test
  Import LGTM config to disable some stuff
  Fix up another handful of LGTM errors.
  tests: work around AnsibleModule.run_command() race.
  docs: mention another __main__ safeguard
  docs: tweaks
  formatting error
  docs: make Sphinx install soft fail on Python 2.
  issue #598: allow disabling preempt in terraform
  issue #598: update Changelog.
  issue #605: update Changelog.
  issue #605: ansible: share a sem_t instead of a pthread_mutex_t
  issue #613: add tests for all the weird shutdown methods
  Add mitogen.core.now() and use it everywhere; closes #614.
  docs: move decorator docs into core.py and use autodecorator
  preamble_size: make it work on Python 3.
  docs: upgrade Sphinx to 2.1.2, require Python 3 to build docs.
  docs: fix Sphinx warnings, add LogHandler, more docstrings
  docs: tidy up some Changelog text
  issue #615: fix up FileService tests for new logic
  issue #615: another Py3x fix.
  issue #615: Py3x fix.
  issue #615: update Changelog.
  issue #615: use FileService for target->controll file transfers
  issue #482: another Py3 fix
  ci: try removing exclude: to make Azure jobs work again
  compat: fix Py2.4 SyntaxError
  issue #482: remove 'ssh' from checked processes
  ci: Py3 fix
  issue #279: add one more test for max_message_size
  issue #482: ci: add stray process checks to all jobs
  tests: fix format string error
  core: MitogenProtocol.is_privileged was not set in children
  issue #482: tests: fail DockerMixin tests if stray processes exist
  docs: update Changelog.
  issue #586: update Changelog.
  docs: update Changelog.
  [security] core: undirectional routing wasn't respected in some cases
  docs: tidy up Select.all()
  issue #612: update Changelog.
  master: fix TypeError
  pkgutil: fix Python3 compatibility
  parent: use protocol for getting remote_id
  docs: merge signals.rst into internals.rst
  os_fork: do not attempt to cork the active thread.
  parent: fix get_log_level() for split out loggers.
  issue #547: fix service_test failures.
  issue #547: update Changelog.
  issue #547: core/service: race/deadlock-free service pool init
  docs: update Changelog.
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.