
Ansible tasks slow down as host facts grow #73654

Open
mwhahaha opened this issue Feb 18, 2021 · 30 comments
Labels
affects_2.15 affects_2.16 bug This issue/PR relates to a bug. has_pr This issue has an associated PR. on_hold python3 support:core This issue/PR relates to code supported by the Ansible Engineering Team.

Comments

@mwhahaha
Contributor

SUMMARY

As the size of host facts increases, there is a serious slowdown in the number of tasks that can be processed per second. This is visible at smaller scale, but becomes dramatic at ~150 nodes. It can be triggered by network facts (hosts with 100+ interfaces), service facts, or package facts, and can also be easily reproduced by loading a large static fact structure (>100KB) via ansible_local.

I was able to pinpoint one of the issues to the get_vars call in host task processing:
https://github.com/ansible/ansible/blob/stable-2.9/lib/ansible/plugins/strategy/linear.py#L282-L283

This call was observed to take <10ms with no facts and ~200ms with facts. Since host task processing is performed serially, the result is that at most 5 tasks can be processed per second at this scale.

There is another issue with the amount of memory and the workers being spawned, but the task execution slowdown seems to be more consistent.
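The ">100KB static fact structure" reproduction mentioned above can be sketched as follows. This is a hypothetical helper, not from the issue; the filename and key names are illustrative. On a target host, the generated JSON file would be dropped into /etc/ansible/facts.d/ so its content shows up under ansible_local after a setup run.

```python
# Hypothetical generator (illustrative names) for a large local-facts file.
import json
import os

# ~2000 entries of ~80 bytes each lands well past the 100KB mark.
fact_blob = {"entry_%04d" % i: "x" * 64 for i in range(2000)}

out = "big.fact"  # on a real host: /etc/ansible/facts.d/big.fact
with open(out, "w") as f:
    json.dump(fact_blob, f)

# Sanity check: the file exceeds the 100KB threshold from the report.
assert os.path.getsize(out) > 100 * 1024
```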

ISSUE TYPE
  • Bug Report
COMPONENT NAME

lib/ansible/vars/manager.py
lib/ansible/plugins/strategy/linear.py

ANSIBLE VERSION

ansible-2.9.17-1.el8ae.noarch

$ ansible --version
ansible 2.9.17
  config file = /home/stack/scale-ssh-ansible/ansible.cfg
  configured module search path = ['/home/stack/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.6/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.6.8 (default, Oct 11 2019, 15:04:54) [GCC 8.3.1 20190507 (Red Hat 8.3.1-4)]
CONFIGURATION
ANSIBLE_PIPELINING(/home/stack/scale-ssh-ansible/ansible.cfg) = True
ANSIBLE_SSH_ARGS(/home/stack/scale-ssh-ansible/ansible.cfg) = -o UserKnownHostsFile=/dev/null -o StrictHostKey>
ANSIBLE_SSH_CONTROL_PATH_DIR(/home/stack/scale-ssh-ansible/ansible.cfg) = /tmp/scale-ansible-ssh
ANSIBLE_SSH_RETRIES(/home/stack/scale-ssh-ansible/ansible.cfg) = 8
DEFAULT_CALLBACK_WHITELIST(/home/stack/scale-ssh-ansible/ansible.cfg) = ['profile_tasks']
DEFAULT_FORKS(/home/stack/scale-ssh-ansible/ansible.cfg) = 32
DEFAULT_INTERNAL_POLL_INTERVAL(/home/stack/scale-ssh-ansible/ansible.cfg) = 0.005
DEFAULT_TIMEOUT(/home/stack/scale-ssh-ansible/ansible.cfg) = 30
INTERPRETER_PYTHON(/home/stack/scale-ssh-ansible/ansible.cfg) = auto
OS / ENVIRONMENT

RHEL8.2

This has been seen on OpenStack compute nodes with an average of 200 network interfaces. This same effect can be reproduced without network facts and using service facts or package facts.

STEPS TO REPRODUCE
# baseline.yml
- hosts: all
  gather_facts: false
  tasks:
    - meta: clear_facts
    - name: Gather facts
      setup:
        gather_subset:
          - '!all'
          - min
    - name: Sleep 1
      shell: sleep 1
# package-facts.yml
- hosts: all
  gather_facts: false
  tasks:
    - meta: clear_facts
    - name: Gather facts
      setup:
        gather_subset:
          - '!all'
          - min
    - name: Gather package facts
      package_facts:
        manager: auto
    - name: Sleep 1
      shell: sleep 1
# service-facts.yml
- hosts: all
  gather_facts: false
  tasks:
    - meta: clear_facts
    - name: Gather facts
      setup:
        gather_subset:
          - '!all'
          - min
    - name: Gather service facts
      service_facts:
    - name: Sleep 1
      shell: sleep 1
EXPECTED RESULTS

Task execution should not slow down when the number of Ansible host facts increase.

ACTUAL RESULTS

Execution against 179 hosts

$ ansible-playbook -i inventory.ini baseline.yml
...SNIP...
Thursday 18 February 2021  15:32:19 +0000 (0:00:04.784)       0:00:08.455 ***** 
=============================================================================== 
Sleep 1 ----------------------------------------------------------------- 4.78s
Gather facts ------------------------------------------------------------ 3.58s
$ ansible-playbook -i inventory.ini package-facts.yml
...SNIP...
Thursday 18 February 2021  15:34:26 +0000 (0:00:26.821)       0:00:52.222 ***** 
=============================================================================== 
Sleep 1 ---------------------------------------------------------------- 26.82s
Gather package facts --------------------------------------------------- 21.53s
Gather facts ------------------------------------------------------------ 3.78s
$ ansible-playbook -i inventory.ini service-facts.yml
...SNIP...
Thursday 18 February 2021  15:36:00 +0000 (0:00:08.965)       0:00:26.258 ***** 
=============================================================================== 
Gather service facts --------------------------------------------------- 12.44s
Sleep 1 ----------------------------------------------------------------- 8.97s
Gather facts ------------------------------------------------------------ 3.94s

@ansibot ansibot added affects_2.9 This issue/PR affects Ansible v2.9 bug This issue/PR relates to a bug. needs_triage Needs a first human triage before being processed. python3 support:core This issue/PR relates to code supported by the Ansible Engineering Team. labels Feb 18, 2021
@jborean93 jborean93 removed the needs_triage Needs a first human triage before being processed. label Feb 18, 2021
@samdoran
Contributor

samdoran commented Feb 18, 2021

This is most likely due to INJECT_FACTS_AS_VARS, which is currently set to True by default. Since all the facts are reevaluated when injecting them as vars, this can be quite slow if there is a large amount of data to process. Eliminating this extra step may speed things up considerably.

Another thing that could help is changing the stage at which variables are evaluated for the host_group_vars inventory plugin. By default, host_group_vars is run for every task ("demand"). Try setting ANSIBLE_VARS_PLUGIN_STAGE to "inventory" to prevent vars from being reevaluated for every task.

@mwhahaha
Contributor Author

Looks like the plugin stage configuration per https://docs.ansible.com/ansible/latest/plugins/vars.html#using-vars-plugins is:

[vars_host_group_vars]
stage = inventory

This didn't seem to improve anything.

$ grep vars_host_group_vars ansible.cfg -A1
[vars_host_group_vars]
stage = inventory
$ time ansible-playbook -i tripleo-ansible-inventory.yaml baseline.yml
...SNIP...
Thursday 18 February 2021  20:17:21 +0000 (0:00:05.508)       0:00:08.781 ***** 
=============================================================================== 
Sleep 1 ----------------------------------------------------------------- 5.51s
Gather facts ------------------------------------------------------------ 3.22s

real	0m9.964s
user	0m11.818s
sys	0m10.913s

$ time ansible-playbook -i tripleo-ansible-inventory.yaml package-facts.yml
...SNIP...
Thursday 18 February 2021  20:18:21 +0000 (0:00:25.774)       0:00:47.194 ***** 
=============================================================================== 
Sleep 1 ---------------------------------------------------------------- 25.77s
Gather package facts --------------------------------------------------- 18.13s
Gather facts ------------------------------------------------------------ 3.24s

real	0m48.969s
user	1m32.909s
sys	0m22.167s

ANSIBLE_INJECT_FACT_VARS does improve things a bit, but there is still a significant slowdown.

$ export ANSIBLE_INJECT_FACT_VARS=False
$ time ansible-playbook -i tripleo-ansible-inventory.yaml baseline.yml
...SNIP...
Thursday 18 February 2021  20:11:48 +0000 (0:00:05.635)       0:00:08.956 ***** 
=============================================================================== 
Sleep 1 ----------------------------------------------------------------- 5.64s
Gather facts ------------------------------------------------------------ 3.27s

real	0m10.098s
user	0m11.356s
sys	0m10.951s

$ time ansible-playbook -i tripleo-ansible-inventory.yaml package-facts.yml
...SNIP...
Thursday 18 February 2021  20:12:41 +0000 (0:00:13.446)       0:00:35.216 ***** 
=============================================================================== 
Gather package facts --------------------------------------------------- 18.65s
Sleep 1 ---------------------------------------------------------------- 13.45s
Gather facts ------------------------------------------------------------ 3.06s

real	0m37.053s
user	1m11.846s
sys	0m22.419s

@samdoran
Contributor

samdoran commented Feb 18, 2021

It's good that disabling fact injection made a difference.

Ultimately, I'm not sure there is much we can do without significantly rewriting how we process the data. More data takes more time to process, and we process it a lot.

I am working on some things for the 2.11 release right now. I will dig into this again once that work is complete.

@mwhahaha
Contributor Author

Sure. Ping me when you have time and I can share what I've found. Package facts bring the fact size (at least in cache) up from 5.5k to ~313k, so we're dealing with 1.6M vs 58M of facts. Network facts ended up being around 123M.

@egmar
Contributor

egmar commented Feb 24, 2021

I'm seeing similar behavior, and it's mostly the memory growth that worries us. After doing some investigation with the help of memory_profiler inside the VariableManager.get_vars method

from memory_profiler import memory_usage  # pip install memory_profiler

mem_usage = memory_usage(-1, interval=.2, timeout=1)  # sample this process's RSS (in MiB)
display.warning(mem_usage)

the memory growth is quite noticeable

[WARNING]: [48.19921875, 48.19921875, 48.19921875, 48.19921875, 48.19921875]
[WARNING]: [48.2265625, 48.2265625, 48.2265625, 48.2265625, 48.2265625]

PLAY [test] ********************************************************************
[WARNING]: [48.49609375, 48.49609375, 48.49609375, 48.49609375, 48.49609375]
[WARNING]: [61.25390625, 61.25390625, 61.25390625, 61.25390625, 61.25390625]
[WARNING]: [71.4296875, 71.484375, 71.484375, 71.484375, 71.484375]
[WARNING]: [82.0390625, 82.0390625, 82.0390625, 82.0390625, 82.0390625]
[WARNING]: [92.5546875, 92.5546875, 92.5546875, 92.5546875, 92.5546875]
[WARNING]: [103.05078125, 103.05078125, 103.05078125, 103.05078125,
103.05078125]
[WARNING]: [113.7734375, 113.7734375, 113.7734375, 113.7734375, 113.7734375]
[WARNING]: [124.0078125, 124.0078125, 124.0078125, 124.0078125, 124.0078125]
[WARNING]: [134.8359375, 134.8359375, 134.8359375, 134.8359375, 134.8359375]
[WARNING]: [145.2578125, 145.2578125, 145.2578125, 145.2578125, 145.2578125]
[WARNING]: [156.21484375, 156.21484375, 156.21484375, 156.21484375,

I also noticed that a lot of memory is consumed by AnsibleUnsafe objects (it seems Ansible converts all facts, both keys and values, to AnsibleUnsafe types). I used pympler in the same place to capture a summary of memory used by all objects, with the following results (this is a snapshot; it generally increases over time as the playbook runs):

                                         types |   # objects |   total size
============================================== | =========== | ============
  ansible.utils.unsafe_proxy.AnsibleUnsafeText |      343564 |     35.34 MB
                                          dict |      141330 |     33.67 MB
                                           str |       35563 |      3.64 MB
                                          code |       11488 |      1.96 MB
                                          type |        1727 |      1.57 MB
                                          list |        5090 |      1.45 MB
                                         tuple |        9473 |    523.06 KB
                                   abc.ABCMeta |         349 |    359.41 KB
                                           set |         348 |    234.16 KB
   ansible.parsing.yaml.objects.AnsibleUnicode |        1980 |    218.72 KB
                                       weakref |        2775 |    195.12 KB
                    builtin_function_or_method |        2482 |    174.52 KB
                                     frozenset |         447 |    152.04 KB
                                           int |        4223 |    127.19 KB
   ansible.parsing.yaml.objects.AnsibleMapping |         459 |    117.28 KB

This is with each host in the inventory having ~1MB of facts and strategy: free.
I'm also attaching the call graph (perhaps it will be helpful), which I collected by running the playbook with the PyCharm profiler and yappi.
(attachment: ansible3 call graph)

openstack-mirroring pushed a commit to openstack/openstack-ansible-apt_package_pinning that referenced this issue Feb 24, 2021
See ansible/ansible#73654

Change-Id: Ib7e1fa820767fe419eadd6309311204433e9bf20
openstack-mirroring pushed a commit to openstack/openstack-ansible-haproxy_server that referenced this issue Feb 24, 2021
See ansible/ansible#73654

Change-Id: Id98f052df73587f209b9621da0874ce506899f9b
openstack-mirroring pushed a commit to openstack/ansible-role-systemd_mount that referenced this issue Feb 24, 2021
See ansible/ansible#73654

Change-Id: I5a83a7f237e4b7d86e8d7ee08b0e8c6b4f8a44f8
@egmar
Contributor

egmar commented Feb 25, 2021

Small update: I've run some more tests, and it appears that the wrapping of fact content (both keys and values) into AnsibleUnsafeText is the main cause of the increased memory consumption.

elif isinstance(v, text_type):
    v = AnsibleUnsafeText(v)
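A minimal sketch of why this wrapping is expensive. The class and function names below (UnsafeText, wrap_var) are simplified stand-ins for Ansible's ansible.utils.unsafe_proxy types, not the real implementation: the point is that every string key and value in the fact tree becomes a fresh object on each wrap.

```python
# Stand-in for AnsibleUnsafeText: like the real class, a text subclass,
# so wrapping allocates a new string object per key and per value.
class UnsafeText(str):
    pass

def wrap_var(v):
    # Simplified version of Ansible's recursive wrap_var: walk the whole
    # structure and wrap every string, keys included.
    if isinstance(v, dict):
        return {wrap_var(k): wrap_var(val) for k, val in v.items()}
    if isinstance(v, list):
        return [wrap_var(x) for x in v]
    if isinstance(v, str):
        return UnsafeText(v)
    return v

# A modest synthetic fact tree: 1000 interfaces, 2 keys/2 values each.
facts = {"iface%d" % i: {"mac": "aa:bb:cc:dd:ee:ff", "state": "up"}
         for i in range(1000)}
wrapped = wrap_var(facts)

# 1000 top-level keys + 1000 * (2 keys + 2 values) = 5000 new string
# objects for this small tree; real network facts multiply this further.
assert all(isinstance(k, UnsafeText) for k in wrapped)
assert isinstance(wrapped["iface0"]["mac"], UnsafeText)
```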

openstack-mirroring pushed a commit to openstack/openstack-ansible-lxc_container_create that referenced this issue Feb 25, 2021
See ansible/ansible#73654

Change-Id: If6b30e40bf71f60ee72fdc1459ce5e5a9f4f7d93
openstack-mirroring pushed a commit to openstack/ansible-role-uwsgi that referenced this issue Feb 25, 2021
See ansible/ansible#73654

Change-Id: Ibadca3bdce6d7adf51ddce9591a0d404e38d5e71
openstack-mirroring pushed a commit to openstack/openstack-ansible-rabbitmq_server that referenced this issue Feb 25, 2021
See ansible/ansible#73654

Change-Id: Ia6dac13c7e2206f4a86ef8e21c5b1cd80eb16e0e
openstack-mirroring pushed a commit to openstack/openstack-ansible-memcached_server that referenced this issue Feb 25, 2021
See ansible/ansible#73654

Change-Id: I8c5d46a9d612b6800b6dde37a324e2357cf99217
@samdoran
Contributor

I have been moved to another team and will unfortunately not be able to look into this. 😞

openstack-mirroring pushed a commit to openstack/openstack-ansible-os_murano that referenced this issue Nov 26, 2021
See ansible/ansible#73654

Change-Id: Icea956af90672dae1d50e84f727b2772fd35e26f
@ryanmerolle
Contributor

ryanmerolle commented Apr 9, 2022

Hopefully not too off topic, but I've been on a recent kick to optimize some plays and workflows I have, and as hostvars increase I do see a correlation with runtime increases, especially when running against large host inventories. I'm going to test this, but if I use set_fact just before I need variables to be defined, and clear_facts after I no longer need those variables in a play, would that help with resource usage and runtime?

This should not be too far off topic, given I'm trying to work around the same root cause of slowdowns. I was just not clear whether clear_facts clears all facts or just gathered facts. The plays in question are not using gather_facts at all. Thanks!

@ryanmerolle
Contributor

clear_facts seems to only clear gathered facts, not all facts. If it cleared all facts, or if clear_facts gained an option to select which type of facts to clear, or a whole new meta option were introduced to clear all facts, this could give users a new set of tools to keep resource usage and play runtime down.

@bcoca
Member

bcoca commented Apr 11, 2022

clear_facts clears 'facts'. set_fact does NOT create 'facts' by default (only when using cacheable=true), and even then it actually creates two different variable scopes with the same value: a 'fact' and a hostvar (which is what set_fact creates by default). So you are safe using it this way.

If you were using vars it would not be safe, since 'lazy evaluation' requires the original variables to be available for templating on each use.

In the end, 'clear all facts' would clear all hostvars in your case, which might be a lot more than intended (once set_fact sets a hostvar there is no distinction).
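The scoping rules described above can be sketched as a playbook fragment (untested; task names and variable names are illustrative, not from the issue):

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Plain set_fact creates only a hostvar, not a 'fact'
      set_fact:
        build_dir: /tmp/build

    - name: With cacheable, set_fact creates both a 'fact' and a hostvar
      set_fact:
        release: prod
        cacheable: true

    - name: Clears gathered/cacheable 'facts'; plain hostvars survive
      meta: clear_facts
```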

@ryanmerolle
Contributor

Good insight @bcoca .

Thanks


@Vladimir-csp

Vladimir-csp commented May 31, 2022

After completing role_1, Ansible seemingly hangs. Debugging reveals that it starts, very slowly, to run iterations of get_vars.
At the end of role_1, a META: role_complete for hostname message was printed to the log for each host; after that, a large number of empty META: messages started pouring in.

I found what that silent var crunching is: Ansible silently processes vars for tasks that will not run because they are not matched by --tags (multiplied by the undeduplicated task list). Either this did not happen in 2.9 at all, or 2.9 handled excluded tasks much faster.

@bcoca
Member

bcoca commented May 31, 2022

Since 'tags' can be dynamic (set to a variable), we don't know whether a task is excluded until its tags are processed and templated. It could be different if tags were 'static'.

@fungusakafungus
Contributor

I noticed that in my setup (fact_caching = json), a lot of time is spent deserializing JSON, so I changed it to fact_caching = pickle and got an almost 2x improvement.
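The deserialization difference is easy to see in isolation. A rough sketch with a synthetic fact cache entry (numbers will vary with fact size and Python version; this is not Ansible's cache code, just the underlying stdlib calls):

```python
import json
import pickle
import time

# Synthetic cache entry, roughly shaped like a host's network facts.
facts = {"iface%d" % i: {"mac": "aa:bb:cc:dd:ee:%02x" % (i % 256),
                         "mtu": 1500, "active": True}
         for i in range(5000)}

json_blob = json.dumps(facts)
pickle_blob = pickle.dumps(facts, protocol=pickle.HIGHEST_PROTOCOL)

t0 = time.perf_counter()
from_json = json.loads(json_blob)
t_json = time.perf_counter() - t0

t0 = time.perf_counter()
from_pickle = pickle.loads(pickle_blob)
t_pickle = time.perf_counter() - t0

# Both round-trip identically; pickle is typically faster to deserialize,
# but a pickle-backed cache is only safe with trusted cache contents.
assert from_json == facts and from_pickle == facts
print("json: %.4fs  pickle: %.4fs" % (t_json, t_pickle))
```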


@bcoca
Member

bcoca commented Jun 8, 2022

@Vladimir-csp you are basically asking us to code two different parsing engines and switch between them on a toggle; that would also require opening a specific ticket (if not a proposal, since this is far from trivial).

fungusakafungus added a commit to fungusakafungus/kubespray that referenced this issue Jun 11, 2022
to prepare for turning off inject_facts_as_vars in order to speed up ansible.

for reasoning see ansible/ansible#73654
@jborean93 jborean93 added affects_2.15 and removed affects_2.9 This issue/PR affects Ansible v2.9 labels Sep 28, 2022
rayrobert398 added a commit to rayrobert398/swift-openstack-ansible-os_developer that referenced this issue Oct 3, 2022
See ansible/ansible#73654

Change-Id: Ia3c80d5c0e63ee9c38868c41c9235c51a2498971
@bcoca bcoca added the on_hold label Apr 5, 2023
@bcoca
Member

bcoca commented Apr 5, 2023

waiting on data tagging

@JakeCooper

We run hundreds of hosts and the timing blow-up is incredibly painful.

Restarting a systemd process accounts for ~3 minutes of run time.

(screenshot: CleanShot 2023-12-04, task timing breakdown)
