Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration/{async/runner_one_job,module_utils/*}.yml random failure #605

Closed
dw opened this issue Jul 31, 2019 · 1 comment

Comments

@dw
Copy link
Owner

commented Jul 31, 2019

See also #547, same issue

MODE=ansible VER=2.8.0

FAILED - RETRYING: busy-poll up to 100000 times (99989 retries left).
changed: [target-debian-1]
FAILED - RETRYING: busy-poll up to 100000 times (99989 retries left).
FAILED - RETRYING: busy-poll up to 100000 times (99988 retries left).
changed: [target-centos6-2]
FAILED - RETRYING: busy-poll up to 100000 times (99988 retries left).
changed: [target-centos7-3]

TASK [assert that=[u'result1.ansible_job_id == job1.ansible_job_id', u'result1.attempts <= 100000', u'result1.changed == True', u'(ansible_version.full >= \'2.8\' and\n result1.cmd == "sleep 1;\\necho alldone\\n") or\n(ansible_version.full < \'2.8\' and\n result1.cmd == "sleep 1;\\n echo alldone")\n', u'result1.delta|length == 14', u'result1.start|length == 26', u'result1.finished == 1', u'result1.rc == 0', u'result1.start|length == 26']] ***
ok: [target-debian-1] => {
  changed: False
  msg: All assertions passed
}
ok: [target-centos6-2] => {
  changed: False
  msg: All assertions passed
}
ok: [target-centos7-3] => {
  changed: False
  msg: All assertions passed
}

TASK [assert that=[u'result1.stderr == ""', u'result1.stderr_lines == []', u'result1.stdout == "alldone"', u'result1.stdout_lines == ["alldone"]']] ***
ok: [target-debian-1] => {
  changed: False
  msg: All assertions passed
}
fatal: /home/travis/build/dw/mitogen/tests/ansible/integration/async/runner_one_job.yml:52: [target-centos6-2]: FAILED! => {
  assertion: result1.stdout == "alldone"
  changed: False
  evaluated_to: False
  msg: Assertion failed
}
ok: [target-centos7-3] => {
  changed: False
  msg: All assertions passed
}

NO MORE HOSTS LEFT *************************************************************

@dw dw added bug evergreen labels Jul 31, 2019

@dw

This comment has been minimized.

Copy link
Owner Author

commented Jul 31, 2019

see also #487, #425

custom modutils is also failing, all of these failures point to a problem with forking

@dw dw changed the title integration/async/runner_one_job.yml random failure integration/{async/runner_one_job,module_utils/*}.yml random failure Aug 1, 2019

dw added a commit that referenced this issue Aug 10, 2019

issue #605: ansible: share a sem_t instead of a pthread_mutex_t
The previous version quite reliably causes worker deadlocks within 10
minutes running:

    # 100 times:
    - import_playbook: integration/async/runner_one_job.yml
    # 100 times:
    - import_playbook: integration/module_utils/adjacent_to_playbook.yml

via .ci/soak/mitogen.sh with PLAYBOOK= set to the above playbook.

Attaching to the worker with gdb reveals it in an instruction
immediately following a futex() call, which likely returned EINTR due to
attaching gdb. Examining the pthread_mutex_t state reveals it to be
completely unlocked.

pthread_mutex_t on Linux should have zero trouble living in shmem, so
it's not clear how this deadlock is happening. Meanwhile POSIX
semaphores are explicitly designed for cross-process use and have a
completely different internal implementation, so try those instead. 1
hour of soaking reveals no deadlock.

This is about avoiding managing a lockable temporary file on disk to
contain our counter, and somehow communicating a reference to it into
subprocesses (despite the subprocess module closing inherited fds, etc),
somehow deleting it reliably at exit, and somehow avoiding concurrent
Ansible runs stepping on the same file. For now ctypes is still less
pain.

A final possibility would be to abandon a shared counter and instead
pick a CPU based on the hash of e.g. the new child's process ID. That
would likely balance equally well, and might be worth exploring when
making this code work on BSD.

dw added a commit that referenced this issue Aug 10, 2019

@dw dw closed this Aug 10, 2019

dw added a commit that referenced this issue Aug 11, 2019

Merge remote-tracking branch 'origin/dmw'
* origin/dmw:
  issue #605: update Changelog.
  issue #605: ansible: share a sem_t instead of a pthread_mutex_t
  issue #613: add tests for all the weird shutdown methods
  Add mitogen.core.now() and use it everywhere; closes #614.
  docs: move decorator docs into core.py and use autodecorator
  preamble_size: make it work on Python 3.
  docs: upgrade Sphinx to 2.1.2, require Python 3 to build docs.
  docs: fix Sphinx warnings, add LogHandler, more docstrings
  docs: tidy up some Changelog text

dw added a commit that referenced this issue Aug 18, 2019

Merge remote-tracking branch 'origin/v028' into stable
* origin/v028: (383 commits)
  Bump version for release.
  docs: update Changelog for 0.2.8.
  issue #627: add test and tweak Reaper behaviour.
  docs: lots more changelog concision
  docs: changelog concision
  docs: more changelog tweaks
  docs: reorder chapters
  docs: versionless <title>
  docs: update supported Ansible version, mention unsupported features
  docs: changelog fixes/tweaks
  issue #590: update Changelog.
  issue #621: send ADD_ROUTE earlier and add test for early logging.
  issue #590: whoops, import missing test modules
  issue #590: rework ParentEnumerationMethod to recursively handle bad modules
  issue #627: reduce the default pool size in a child to 2.
  tests: add a few extra service tests.
  docs: some more hyperlink joy
  docs: more hyperlinks
  docs: add domainrefs plugin to make link aliases everywhere \o/
  docs: link IS_DEAD in changelog
  docs: tweaks to better explain changelog race
  issue #533: update routing to account for DEL_ROUTE propagation race
  tests: use defer_sync() Rather than defer() + ancient sync_with_broker()
  tests: one case from doas_test was invoking su
  tests: hide memory-mapped files from lsof output
  issue #615: remove meaningless test
  issue #625: ignore SIGINT within MuxProcess
  issue #625: use exec() instead of subprocess in mitogen_ansible_playbook
  issue #615: regression test
  issue #615: update Changelog.
  issue #615: ensure 4GB max_message_size is configured for task workers.
  issue #615: update Changelog.
  issue #615: route a dead message to recipients when no reply is expected
  issue #615: fetch_file() might be called with AnsibleUnicode.
  issue #615: redirect 'fetch' action to 'mitogen_fetch'.
  issue #615: extricate slurp brainwrong from mitogen_fetch
  issue #615: ansible: import Ansible fetch.py action plug-in
  issue #533: include object identity of Stream in repr()
  docs: lots more changelog
  issue #595: add buildah to docs and changelog.
  docs: a few more internals.rst additions
  ci: update to Ansible 2.8.3
  tests: another random string changed in 2.8.3
  tests: fix sudo_flags_failure for Ansible 2.8.3
  ci: fix procps command line format warning
  Whoops, merge together lgtm.yml and .lgtm.yml
  issue #440: log Python version during bootstrap.
  docs: update changelog
  issue #558: disable test on OSX to cope with boundless mediocrity
  issue #558, #582: preserve remote tmpdir if caller did not supply one
  issue #613: must await 'exit' and 'disconnect' in wait=False test
  Import LGTM config to disable some stuff
  Fix up another handful of LGTM errors.
  tests: work around AnsibleModule.run_command() race.
  docs: mention another __main__ safeguard
  docs: tweaks
  formatting error
  docs: make Sphinx install soft fail on Python 2.
  issue #598: allow disabling preempt in terraform
  issue #598: update Changelog.
  issue #605: update Changelog.
  issue #605: ansible: share a sem_t instead of a pthread_mutex_t
  issue #613: add tests for all the weird shutdown methods
  Add mitogen.core.now() and use it everywhere; closes #614.
  docs: move decorator docs into core.py and use autodecorator
  preamble_size: make it work on Python 3.
  docs: upgrade Sphinx to 2.1.2, require Python 3 to build docs.
  docs: fix Sphinx warnings, add LogHandler, more docstrings
  docs: tidy up some Changelog text
  issue #615: fix up FileService tests for new logic
  issue #615: another Py3x fix.
  issue #615: Py3x fix.
  issue #615: update Changelog.
  issue #615: use FileService for target->controll file transfers
  issue #482: another Py3 fix
  ci: try removing exclude: to make Azure jobs work again
  compat: fix Py2.4 SyntaxError
  issue #482: remove 'ssh' from checked processes
  ci: Py3 fix
  issue #279: add one more test for max_message_size
  issue #482: ci: add stray process checks to all jobs
  tests: fix format string error
  core: MitogenProtocol.is_privileged was not set in children
  issue #482: tests: fail DockerMixin tests if stray processes exist
  docs: update Changelog.
  issue #586: update Changelog.
  docs: update Changelog.
  [security] core: undirectional routing wasn't respected in some cases
  docs: tidy up Select.all()
  issue #612: update Changelog.
  master: fix TypeError
  pkgutil: fix Python3 compatibility
  parent: use protocol for getting remote_id
  docs: merge signals.rst into internals.rst
  os_fork: do not attempt to cork the active thread.
  parent: fix get_log_level() for split out loggers.
  issue #547: fix service_test failures.
  issue #547: update Changelog.
  issue #547: core/service: race/deadlock-free service pool init
  docs: update Changelog.
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.