
plays with many hosts fail with `too many open files` error on macOS #549

Closed
peshay opened this issue Feb 21, 2019 · 5 comments

Comments

@peshay commented Feb 21, 2019

  • Which version of Ansible are you running?
ansible 2.7.7
  ansible python module location = /usr/local/lib/python3.7/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.7.2 (default, Feb 19 2019, 14:40:12) [Clang 10.0.0 (clang-1000.11.45.5)]
  • Is your version of Ansible patched in any way?
    No
  • Are you running with any custom modules, or module_utils loaded?
    No
  • Have you tried the latest master version from Git?
    Yes
  • Do you have some idea of what the underlying problem may be?
    https://mitogen.rtfd.io/en/stable/ansible.html#common-problems has
    instructions to help figure out the likely cause and how to gather relevant
    logs.
    I have no problem when I run the play against a single host. With -f 1 it worked up to host 48, and then every remaining host failed with
fatal: [some-host]: UNREACHABLE! => {"changed": false, "msg": "Child start failed: [Errno 24] Too many open files. Command was: ssh -o \"LogLevel ERROR\" -l root -o \"Compression yes\" -o \"ServerAliveInterval 15\" -o \"ServerAliveCountMax 3\" -o \"BatchMode yes\" -o \"StrictHostKeyChecking yes\" -C -o ControlMaster=auto -o ControlPersist=60s some-host /usr/bin/python -c \"'import codecs,os,sys;_=codecs.decode;exec(_(_(\\\"eNqFkU1Lw0AQhs/Nr8htdunSbmr9CiwoPYgHEYKYgxbJx8YuprvLJmmsv95pIjapBw8L8zDvzDu8G7FYmGpmlZWEeo61A1KFj1AY90Fo6E2wzhu7IJwFnNMjR2xIDrtBz1lpKkmiIbghxENoEdCw2qN9mdTouvWF8CFPXKs0+InOu6b8lFlTJ2kpu/a8qdw8VXpu9/XGaMA7JyeyqegGd9JVyuiX8Gzd2Uq9Uw4ZbqO7Zw5rMR7rNYglGTfYGKdAtqo271KHyaa5wRcul8uLKwrUww2tU7UkAYOH+6dHzvmrBnTOTI4BU28l3sgh4txYqTFYcCnQmZNJToLzy2tOGXwpi5sKK466mEGbwiH1wv4YrLq6T/JE3f6n/ntlML7y948W9BvKaq7Q\\\".encode(),\\\"base64\\\"),\\\"zip\\\"))'\"", "unreachable": true}
  • Mention your host and target OS and versions
    host: macOS 10.14.2
    targets: mostly CentOS Linux release 7.6.1810 (Core)
  • Mention your host and target Python versions
    host: 3.7.2
    targets: 2.7.5
@dw (Owner) commented Feb 21, 2019

This is vaguely related to #470. How many hosts are you targeting?

This isn't mentioned in the docs anywhere, and I simply haven't had a chance to add a (simple) error message for it yet. Quite a lot of file descriptors are currently required.

Increase your ulimit -n to at least 32 + 8 + (2 * number of hosts), and preferably double that again.
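
For illustration, here is a rough way to check whether the current soft limit is high enough before a run -- a sketch using only the standard library resource module; the 100-host figure and the variable names are made up:

```python
import resource

num_hosts = 100                          # hypothetical inventory size
needed = 32 + 8 + (2 * num_hosts)        # minimum suggested above
recommended = needed * 2                 # "preferably double that again"

# RLIMIT_NOFILE is the per-process open file descriptor limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < recommended:
    print("soft limit %d is below %d; run `ulimit -n %d` in this shell "
          "before starting ansible-playbook"
          % (soft, recommended, recommended))
```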

@dw (Owner) commented Feb 21, 2019

This happens for two reasons, the first already mentioned: due to some messy internal handling that is very subtle to fix without breaking stuff, we currently use 2 file descriptors for every connection. It's "simple" to fix this, but anything relating to changing file descriptor lifecycle must be done very carefully :)

The second reason is that, unlike with stock Ansible, there is a single process that holds connections to every machine, and file descriptor limits are per-process, not per-user. In stock Ansible those descriptors are usually spread out across many child processes, but for the same reason Ansible cannot keep the connections it creates open beyond the lifetime of a single task.

@peshay (Author) commented Mar 1, 2019

Yes, thanks! Setting the ulimit -n higher helped.

@peshay peshay closed this Mar 1, 2019

@dw (Owner) commented Mar 1, 2019

Going to keep this open for now as a reminder to add a useful error hint! Thanks for reporting this.

@dw dw reopened this Mar 1, 2019

@dw dw pinned this issue Mar 1, 2019

@dw dw unpinned this issue May 14, 2019

dw added a commit that referenced this issue Jul 31, 2019

issue #549: increase open file limit automatically if possible
While catching every possible "open file limit exceeded" case is not
possible, we can at least increase the soft limit to the available hard
limit without any user effort.

Do this in the Ansible top-level process, even though we probably only
need it in the MuxProcess. There seems to be no reason this could hurt.
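
A minimal sketch of the idea, using only the standard library resource module (illustrative only, not the actual Mitogen code):

```python
import resource

def raise_open_file_limit():
    # Bump the soft RLIMIT_NOFILE up to the hard limit so large runs are
    # less likely to hit "Too many open files".
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except ValueError:
            # The hard limit may be unusable as-is on some platforms
            # (see the OS X commit below); leave the limits untouched.
            pass
```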

dw added a commit that referenced this issue Jul 31, 2019

dw added a commit that referenced this issue Jul 31, 2019

@dw (Owner) commented Jul 31, 2019

I've started adding hints for the easiest-to-reproduce cases, but there is no central place where every possible file limit error can be trapped -- they can happen all over the Python standard library, for example.

I've also added a change to automatically increase the Ansible run's soft file limit to the configured hard limit. At least on distributions like Ubuntu, the default soft limit is 1024 while the default hard limit is closer to 1 million. That should drastically cut down on the number of users who ever bump into this.

Thanks again for reporting!

@dw dw closed this Jul 31, 2019

dw added a commit that referenced this issue Jul 31, 2019

Merge remote-tracking branch 'origin/549-open-files'
* origin/549-open-files:
  issue #603: Revert "ci: update to Ansible 2.8.3"
  Fix unit_Test.ClientTest following 108015a
  service: clean up log messages, especially at shutdown
  remove unused imports flagged by lgtm
  [linear2]: merge fallout flagged by LGTM
  issue #549: docs: update Changelog
  issue #549: increase open file limit automatically if possible
  ansible: improve process.py docs
  docs: remove old list link.
  docs: migrate email list
  docs: changelog tweaks
  parent: decode logged stdout as UTF-8.
  scripts: import affin.sh
  ci: update to Ansible 2.8.3
  tests: terraform tweaks
  unix: include more IO in the try/except for connection failure
  tests: update gcloud.py to match terraform config
  tests: hide ugly error during Ansible tests
  tests/ansible/gcloud: terraform conf for load testing
  ansible: gracefully handle failure to connect to MuxProcess
  ansible: fix affinity tests for 5ae45f6
  ansible: pin per-CPU muxes to their corresponding CPU
  ansible: reap mux processes on shut down

dw added a commit that referenced this issue Aug 3, 2019

issue #549: fix setrlimit() crash and hard-wire OS X default
OS X advertises an unlimited hard limit, but it really means kern.maxfilesperproc.
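
A rough illustration of that workaround (the helper name is made up and this is not the actual commit): when the hard limit reports as unlimited on macOS, fall back to the kern.maxfilesperproc sysctl before calling setrlimit():

```python
import resource
import subprocess
import sys

def usable_hard_limit():
    # Per the commit message above, an "unlimited" hard limit on macOS
    # cannot be passed to setrlimit(); the real ceiling is the
    # kern.maxfilesperproc sysctl.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY and sys.platform == 'darwin':
        out = subprocess.check_output(['sysctl', '-n', 'kern.maxfilesperproc'])
        hard = int(out)
    return hard
```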

dw added a commit that referenced this issue Aug 3, 2019

issue #549: remove Linux-specific assumptions from create_child_test
Some stat fields are implementation-specific; there is little value in
testing them even on Linux.

dw added a commit that referenced this issue Aug 3, 2019

dw added a commit that referenced this issue Aug 3, 2019

dw added a commit that referenced this issue Aug 3, 2019

Merge remote-tracking branch 'origin/osx-ci-fixes'
* origin/osx-ci-fixes:
  issue #573: guard against a forked top-level Ansible process
  [linear2] simplify ClassicWorkerModel and fix repeat initialization
  issue #549 / [stream-refactor]: fix close/poller deregister crash on OSX
  issue #549: skip Docker tests if Docker is unavailable
  issue #549: remove Linux-specific assumptions from create_child_test
  issue #549: fix setrlimit() crash and hard-wire OS X default

dw added a commit that referenced this issue Aug 3, 2019

dw added a commit that referenced this issue Aug 8, 2019

Merge remote-tracking branch 'origin/dmw'
* origin/dmw:
  docs: merge signals.rst into internals.rst
  os_fork: do not attempt to cork the active thread.
  parent: fix get_log_level() for split out loggers.
  issue #547: fix service_test failures.
  issue #547: update Changelog.
  issue #547: core/service: race/deadlock-free service pool init
  docs: update Changelog.
  select: make Select.add() handle multiple buffered items.
  core/select: add {Select,Latch,Receiver}.size(), deprecate empty()
  parent: docstring fixes
  core: remove dead Router.on_shutdown() and Router "shutdown" signal
  testlib: use lsof +E for much clearer leaked FD output
  [stream-refactor] stop leaking FD 100 for the life of the child
  core: split preserve_tty_fp() out into a function
  parent: zombie reaping v3
  issue #410: fix test failure due to obsolete parentfp/childfp
  issue #170: replace Timer.cancelled with Timer.active
  core: more descriptive graceful shutdown timeout error
  docs: update changelog
  core: fix Python2.4 crash due to missing Logger.getChild().
  issue #410: automatically work around SELinux braindamage.
  core: cache stream reference in DelimitedProtocol
  parent: docstring formatting
  docs: remove fakessh from home page, it's been broken forever
  docs: add changelog thanks
  Disable Azure pipelines build for docs-master too.
  docs: update Changelog.
  docs: tweak Changelog wording
  [linear2] merge fallout: re-enable _send_module_forwards().
  docs: another round of docstring cleanups.
  master: allow filtering forwarded logs using logging package functions.
  docs: many more internals.rst tidyups
  tests: fix error in affinity_test
  service: centralize fetching thread name, and tidy up logs
  [stream-refactor] get caught up on internals.rst updates
  Stop using mitogen root logger in more modules, remove unused loggers
  tests: stop dumping Docker help output in the log.
  parent: move subprocess creation to mux thread too
  Split out and make readable more log messages across both packages
  ansible: log affinity assignments
  ci: log failed command line, and try enabling stdout line buffering
  ansible: improve docstring
  [linear2] simplify _listener_for_name()
  ansible: stop relying on SIGTERM to shut down service pool
  tests: move tty_create_child tests together
  ansible: cleanup various docstrings
  parent: define Connection behaviour during Broker.shutdown()
  issue #549: ansible: reduce risk by capping RLIM_INFINITY