Memory Allocated for Included Tasks is Never Recovered #31673
Comments
We seem to be having memory leak issues with 2.4. I will try to get more details. From what I have been told, on the clients we upgraded to 2.4, after the memory leak appeared we downgraded to 2.3.something and the problem remains. Apparently we are using lots of includes. Perhaps the issue is Python? (That was a thought my coworker had.) |
Besides 2.4, we're seeing the issue in 2.3.1. |
I can sort of reproduce this. I put the repro test case info at https://github.com/alikins/ansible-bug-repro/tree/master/mem_usage_a24_31673. Reproducing with f27/python2.7, the memory profile output grew from 55MB to 60MB by the last measurement.
|
I've been hit by this same issue. It rendered our main playbook inoperable (versus Ansible 2.1, on which it ran fine). We run this playbook on an admin host that has 4G of RAM, and normally we're never even vaguely close to any limits, yet now the process fails with OOM. I think it's because we have a lot of tasks that are included in loops. We need them because we set up hundreds of services on dozens of servers. (So I can't use import_tasks, which I suspect doesn't have such a memory problem.) On a test system with a simpler run of the same playbook (fewer services and hosts) I validated that Ansible 2.4 is using many times the RAM of Ansible 2.1. I used /usr/bin/time -v on the whole ansible-playbook invocation, which gives a very clear report of the max RAM (RSS) used. Is there anything I could do to help here? I don't know anything about developing/debugging Ansible internals, so probably not, but if there is anything I could do to help move this forward or make it easier to debug and resolve, I'm motivated. |
I did a bunch of test runs against various ansible versions. (Everything that pip would install for me.) It seems like there were two jumps in RSS memory usage, one in 2.2 and one in 2.4:
I don't think that anyone should read anything into the magnitude of the jumps. It's just when they happened that I imagine is most relevant. Next stop is looking at commits, seeing if I can narrow down the commit that caused each jump. I imagine that I will "compile" from source to do that. |
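(For reference, a harness for this kind of per-version RSS comparison might look like the sketch below. The script actually used isn't shown in this thread, so the virtualenv paths, playbook name, and version list are assumptions; it simply wraps each run in /usr/bin/time -v, as the later comments do by hand.)

#!/usr/bin/env python
"""Illustrative sketch: run one playbook under several Ansible installs and record peak RSS."""
import re
import subprocess

# Hypothetical virtualenvs, one per Ansible version under test.
VERSIONS = {
    "2.1": "/opt/venvs/ansible-2.1/bin/ansible-playbook",
    "2.3": "/opt/venvs/ansible-2.3/bin/ansible-playbook",
    "2.4": "/opt/venvs/ansible-2.4/bin/ansible-playbook",
}

for version, binary in sorted(VERSIONS.items()):
    # /usr/bin/time -v reports "Maximum resident set size (kbytes): N" on stderr.
    proc = subprocess.Popen(["/usr/bin/time", "-v", binary, "memory_leak.yml"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, err = proc.communicate()
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", err.decode())
    rss = match.group(1) if match else "unknown"
    # A non-zero exit code means the run crashed, so its RSS number should not
    # be trusted (cf. the confusion around commit 6ba6819 later in this thread).
    print("ansible %s: exit=%d, max RSS %s kB" % (version, proc.returncode, rss))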
Can we test again now that this has been merged? #34461 |
Ah, I had hoped that this was going to solve the problem, but alas, it seems it did not.
I used a clean repo, which means I’m operating off of the ‘devel’ branch.
[ec2-user@ip-10-32-2-176 ansible]$ ansible-playbook --version
ansible-playbook 2.5.0 (devel 46263d7) last updated 2018/01/08 16:15:23
(GMT +000)
config file = None
configured module search path =
[u'/home/ec2-user/.ansible/plugins/modules',
u'/usr/share/ansible/plugins/modules']
ansible python module location = /bigdisk/ansible/lib/ansible
executable location = /bigdisk/ansible/bin/ansible-playbook
python version = 2.7.11 (default, Feb 1 2017, 18:42:03) [GCC 4.1.2
20080704 (Red Hat 4.1.2-54)]
[ec2-user@ip-10-32-2-176 ansible_memory_leak]$ /usr/bin/time -v
ansible-playbook memory_leak.yml |& grep 'Maximum resident set size'
Maximum resident set size (kbytes): 56992
[ec2-user@ip-10-32-2-176 ansible_memory_leak]$ /usr/bin/time -v
ansible-playbook memory_leak.yml |& grep 'Maximum resident set size'
Maximum resident set size (kbytes): 56656
[ec2-user@ip-10-32-2-176 ansible_memory_leak]$ /usr/bin/time -v
ansible-playbook memory_leak.yml |& grep 'Maximum resident set size'
Maximum resident set size (kbytes): 56772
This is similar to the output of testing against other 2.4 releases.
Sean
On January 8, 2018 at 8:08:48 AM, Adam Miller (notifications@github.com) wrote:
Can we test again now that this has been merged? #34461
|
@ssuchter thanks for checking |
This is against various branches in a checkout:
|
Ok, I definitely don't understand why, but I have found one change that significantly increases the RAM footprint of ansible-playbook. The thing that mystifies me is which change it is:
Here's my evidence that this commit causes significant memory usage. I checked out various revisions and narrowed down a range until I got to the exact commit that caused the memory usage increase. I tried many commits immediately before and immediately after this one, and it was exactly this one that caused a significant change, it seems. In the table below, rev_idx is how many commits back from the head (currently 46263d7) the revision is. I used this to narrow down which revisions to look at.
There seem to be a couple newer jumps too, I'm going to look for them. |
I found a smaller, but significant jump in memory due to this commit:
That per-uuid change seems like it could cause memory management changes, that might be something to look at. Evidence:
|
There's another jump (although small) in this commit:
|
Another jump (small) in this commit:
|
Here's a commit that made a small but meaningful RAM improvement (uses less RAM):
|
This commit increases RAM usage somewhat:
|
This (very recent) commit improves (lowers) RAM usage somewhat:
|
Ok, I re-looked through all of these commits and ran a more extensive test case (5x the number of loops for playbook includes). This was designed to exacerbate the memory "leak", as opposed to things that just increase memory usage by a more fixed amount. I'm convinced that we should be looking at this commit:
I have no idea why such a change causes such a huge difference, but it's clearly gigantic. This change increased the memory for my test case (which does 500 includes inside 3 nested loops of a trivial task list that just calls debug once and prints the loop index numbers) by over 4x (88MB). These memory numbers are not comparable to the earlier ones I was posting: those were all with 100 includes, this is with 500 includes:
I also checked that there aren't any weird cross-branch things going on. E.g. there is a direct parent commit chain going on: 1c9a58a -> fa5386c -> 6ba6819 -> 0b47c90 -> 250b0df -> cb18881 -> 5628e26 -> 573078f |
Attached is the trivial playbook setup that I used to reproduce this issue. memory_leak.yml does 100 includes, memory_leak2.yml does 600 includes (not 500 as I noted above, oops) |
@ssuchter - thank you for your time in putting together these tests! Hope that this will make fixing the issue faster (since I can't upgrade Ansible until it's fixed). |
Ok, some of those previous results (in particular, the one pointing at commit 6ba6819) were wrong. That commit was showing up as a big memory user because ansible was actually crashing without it, but my test harness didn't catch that and so concluded the crashed ansible didn't take much RAM. I re-ran everything, and found (with 600 includes) some commits that appear to be very costly: commit e238ae9 hurts to the tune of 6%. Not huge. This is commit 47acf55:
Here's the memory usage before and after that commit:
The obvious thing to look at in that commit is whether anything gets cached in self._task_uuid_cache that didn't get added to self._blocks. There are two bits of code I'm suspicious of. These lines at 192-193 seem to cache something from self._play.handlers that isn't in self._blocks:
These lines in add_tasks, at 573-574; it doesn't seem like these get added to self._blocks either:
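(The code being quoted above isn't reproduced in this thread. As a rough illustration only, and not Ansible's actual implementation, the pattern under suspicion is a uuid-keyed cache that keeps a strong reference to every task it ever sees and is never pruned, so included tasks stay reachable long after their plays finish:)

# Illustrative sketch only; names are modeled on the discussion above,
# not copied from Ansible source.
class StrategySketch(object):
    def __init__(self):
        # Maps task uuid -> task object so results can be matched back later.
        self._task_uuid_cache = {}
        self._blocks = []

    def add_tasks(self, tasks):
        for task in tasks:
            # Every task that is ever queued gets cached here. If nothing
            # removes entries when an include or play finishes, the dict keeps
            # a strong reference to each task (and everything the task holds),
            # so that memory is never recovered for the life of the run.
            self._task_uuid_cache[task.uuid] = task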
|
This also moves the task caching from the PlayIterator to the StrategyBase class, where it makes more sense (and makes it easier to not have to change the strategy class methods leading to an API change). Fixes ansible#31673
* Cache tasks as they are queued instead of en masse This also moves the task caching from the PlayIterator to the StrategyBase class, where it makes more sense (and makes it easier to not have to change the strategy class methods leading to an API change). Fixes #31673 * Cleaning up unit tests due to 502ca78
* Cache tasks as they are queued instead of en masse This also moves the task caching from the PlayIterator to the StrategyBase class, where it makes more sense (and makes it easier to not have to change the strategy class methods leading to an API change). Fixes ansible#31673 * Cleaning up unit tests due to 502ca78 (cherry picked from commit b107e39)
My 2 cents, an educated guess: given that Ansible uses a multiprocess, fork-based model, when another fork is created the memory is COW'ed (copy-on-write). That is effectively a cheap operation, in terms of both CPU and the memory actually mapped/allocated to the process. What happens next is that every fork uses its "cloned" memory, which leads to actual memory copies, which is still not bad. But then GC triggers and entirely reshuffles the Python process's managed memory (in each fork, obviously at random points in time, depending on the particular fork's memory usage patterns), which leads to a huge spike, since a lot of pages are touched and modified. |
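(A tiny Linux-only demo of the effect described in that comment, purely illustrative and not Ansible code: a forked child that merely runs a GC pass over objects inherited from its parent ends up with private copies of the pages holding those objects, because the collector writes into every tracked object's GC header. The object graph built here is an arbitrary stand-in.)

#!/usr/bin/env python
"""Demonstrate copy-on-write pages being dirtied by a GC pass in a forked child."""
import gc
import os


def private_dirty_kb():
    # Sum of Private_Dirty across all mappings of this process (Linux only).
    total = 0
    with open("/proc/self/smaps") as f:
        for line in f:
            if line.startswith("Private_Dirty:"):
                total += int(line.split()[1])
    return total


# Build a large object graph in the parent before forking.
data = [{"task": i, "args": list(range(20))} for i in range(100000)]

pid = os.fork()
if pid == 0:
    before = private_dirty_kb()
    gc.collect()  # touches every tracked object, forcing the kernel to copy pages
    after = private_dirty_kb()
    print("child Private_Dirty: %d kB -> %d kB" % (before, after))
    os._exit(0)
os.waitpid(pid, 0)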
The inventory file has a channel variable set for each host, and then we have different playbooks that we include depending on the host channel.

playbook.yml:

- hosts: panels
  serial: 10
  tasks:
    - include_role:
        name: panels

roles/panels/main.yaml:

# Run playbook specific to the channel
- include_tasks: "{{ channel }}.yaml"

production.yaml:

- include_tasks: fstrim-cron.yaml
- include_tasks: sd-info.yaml

fstrim-cron.yaml:

- name: copy fstrim timer unit
  copy: src=etc/systemd/system/fstrim.timer dest=/etc/systemd/system/fstrim.timer owner=root group=root mode=0644
  become: true
- name: copy fstrim service unit
  copy: src=etc/systemd/system/fstrim.service dest=/etc/systemd/system/fstrim.service owner=root group=root mode=0644
  become: true
- name: enable the fstrim timer
  systemd:
    name: fstrim.timer
    state: started
    enabled: true
  register: enable_timer
  become: true
- name: run first fstrim
  systemd:
    name: fstrim.service
    state: started
  when: enable_timer.changed
  become: true

sd-info.yaml:

- name: Get SD card oemid
  shell: cat /sys/class/mmc_host/mmc0/mmc0\:*/oemid
  register: sd_oemid
- name: Get SD card name
  shell: cat /sys/class/mmc_host/mmc0/mmc0\:*/name
  register: sd_name
- name: Save SD card info in panel meta
  shell: "python scripts/save_sd_info.py {{panel_id}} \"{{sd_oemid.stdout_lines[0]}}\" \"{{sd_name.stdout_lines[0]}}\""
  delegate_to: 127.0.0.1
|
@zerkms One thing I did neglect to mention is that the spikes correlated with the |
@joshperry my personal research shows that |
@zerkms - That's certainly a good guess. I think Instagram battled a similar issue with their Django deployment, which ended up adding a new API to CPython: https://engineering.instagram.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf (unfortunately 3.7 only). |
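(The API that article refers to is gc.freeze(), added in Python 3.7; Ansible at the time of this thread still ran on Python 2.7, so the following is only a sketch of the mitigation pattern, not something this codebase could use directly. The helper functions and worker count are made-up stand-ins.)

import gc
import os


def load_everything():
    # Hypothetical stand-in for parsing plays, tasks, and inventory in the parent.
    return [{"task": i} for i in range(100000)]


def run_worker(state):
    # Hypothetical stand-in for a worker processing its share of the queue.
    return len(state)


gc.disable()                 # avoid collections while building shared state
shared_state = load_everything()
gc.collect()                 # clear garbage created during startup
gc.freeze()                  # Python 3.7+: park survivors in the permanent
                             # generation so child GCs never touch those pages

for _ in range(4):           # hypothetical worker count
    if os.fork() == 0:
        gc.enable()          # children only collect what they allocate themselves
        run_worker(shared_state)
        os._exit(0)

for _ in range(4):
    os.wait()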
@zerkms Interesting, I'm going to try to put together a statically imported version of my playbooks and see if that helps with the GC. I'm assuming that even though these includes only distill down to 7 tasks, each dynamic task is considered unique since it is included, causing it to be more like 600 hosts * 7 tasks that GC has to grind through. |
@joshperry there was another issue fixed by c30ee42 which should address this and is being included in 2.4.3 rc3. You can test it there, or give devel a try if possible. |
Just FYI those who are running out of memory as a result of Ansible 2.4.3 on Raspberry Pi ("OSError: [Errno 12] Cannot allocate memory") should revert to Ansible 2.4.2 as Internet-in-a-Box has done successfully. |
My impression is also that the memory issues pop up when using include_role. Preparing for what's coming, I took today's latest devel (9018c63) and I see the following: while on CentOS 7 ansible 2.4.2.0 manages most of the time (apart from a rare "A worker was found in a dead state") to execute the same playbook over and over again (against 2 hosts) using 2GB of RAM, the latest ansible devel only passes without "Cannot allocate memory" on a 6GB ansible host. Can anyone confirm/disprove that "Cannot allocate memory" has worsened in devel? |
@sivel this issue (or #35796 or similar?) has returned on Ansible 2.5.0 on Raspbian 2018-03-13 on Raspberry Pi 3. As documented at iiab/iiab#669 (comment) :
|
ISSUE TYPE
COMPONENT NAME
include (but probably any task as well)
ANSIBLE VERSION
CONFIGURATION
For this test playbook, added the following to ansible.cfg:
OS / ENVIRONMENT
Control host: Red Hat Enterprise Linux Server release 7.3 (Maipo)
Targets: NA
SUMMARY
On the control machine, Ansible process memory increases with the number of tasks in the playbook. The most noticeable jumps are when an include task is processed and several tasks are read in. The problem is that the memory is never released. For our production playbook, if the server does not have enough memory, Ansible crashes with this error:
In our production playbook, each include task increases the memory by about 100MB.
STEPS TO REPRODUCE
memory_leak.yml
foo/some_tasks.yml
callback_plugins/profile_memory.py
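(The contents of the reporter's reproduction files are not included above. As an illustration only, a memory-profiling callback plugin of the kind referenced by callback_plugins/profile_memory.py could look roughly like the sketch below, assuming a Linux control host; this is not the reporter's actual plugin.)

# Hypothetical callback_plugins/profile_memory.py sketch: print the
# controller's resident set size after each task starts, so growth across
# includes is visible in the playbook output.
from __future__ import (absolute_import, division, print_function)
__metaclass__ = type

import resource

from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'profile_memory'

    def _report(self, label):
        # ru_maxrss is reported in kilobytes on Linux.
        rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        self._display.display("profile_memory: %s - max RSS %d kB" % (label, rss_kb))

    def v2_playbook_on_task_start(self, task, is_conditional):
        self._report("task start: %s" % task.get_name())

    def v2_playbook_on_stats(self, stats):
        self._report("playbook end")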
EXPECTED RESULTS
My expectation is that once the life cycle of a task is complete (task has run and results returned), across all applicable hosts, all memory allocated for that task and supporting objects should be released, such that memory use is mostly flat, instead of ever-increasing, for any number of tasks.
ACTUAL RESULTS