ec2_asg: fix #28087 and #35993 #36679
Conversation
@msven Since you've been able to reproduce the problems, do you mind adding a test case to https://github.com/ansible/ansible/blob/devel/test/integration/targets/ec2_asg/tasks/main.yml so these don't get broken again in the future?
@s-hertel - Sure, I'll take a look
Fixes ansible#35993 - Changes to update_size in commit eb4cc31 made it so the group dict passed into update_size was not modified. As a result, the 'replace' call does not see an updated min_size like it previously did and doesn't pause to wait for any new instances to spin up. Instead, it moves straight into terminating old instances. Fix is to add batch_size to min_size when calling wait_for_new_inst.

Fixes ansible#28087 - Make replace_all_instances and replace_instances behave exactly the same by setting replace_instances = current list of instances when replace_all_instances is used. Root cause of the issue was that without lc_check, terminate_batch will terminate all instances passed to it, and after updating the asg size we were querying the asg again for the list of instances - so terminate_batch saw the list including new ones just spun up. When creating a new asg with replace_all_instances: yes and lc_check: false, the instances that are initially created are then subsequently replaced. This change makes it so replace only occurs if the asg already existed.

Add integration tests for ansible#28087 and ansible#35993.
(cherry picked from commit a2b3120)
Backported to 2.5 in ae4c246. Thanks @msven (and @willthames for bringing it up)!
Thanks for backporting @s-hertel
SUMMARY
Fixed issue "Have the ec2_asg module stop overshooting instances. #28087"
This was caused by the combination of lc_check: false and replace_all_instances: true.
Since lc_check is false, terminate_batch considers every instance passed into it available
for termination (expected behavior). However, in the replace method, just prior to the loop

for i in get_chunks(instances, batch_size):

we fetched the as_group again and updated the list of instances. That list now included the additional instances we had just spun up earlier in the replace call. The list was then passed to terminate_batch, and with lc_check false we ended up thinking we needed to terminate all of them. There are a couple of routes that could be taken to address this. I chose to set replace_instances equal to the current set of instances attached to the asg at the start of the replace call. This makes the replace_instances and replace_all_instances logic behave the same from that point on.
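The fix described above can be sketched in simplified Python. The function and parameter names here are hypothetical stand-ins; the real module code in ec2_asg.py is more involved.

```python
# Simplified sketch of the #28087 fix: snapshot the ASG's instances
# *before* any new capacity is added, so that replace_all_instances
# behaves exactly like an explicit replace_instances list.

def replace(get_current_instances, replace_instances,
            replace_all_instances, terminate_batch, batch_size):
    if replace_all_instances:
        # Treat the current instance list as if the user had passed it
        # via replace_instances. Instances spun up later in the replace
        # are never in this snapshot, so terminate_batch cannot see
        # them even when lc_check is false.
        replace_instances = get_current_instances()

    # Terminate only the snapshotted (old) instances, batch by batch.
    for i in range(0, len(replace_instances), batch_size):
        terminate_batch(replace_instances[i:i + batch_size])
    return replace_instances
```

Because the snapshot is taken once at the start, re-querying the ASG later in the call no longer changes which instances are eligible for termination.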
Fixed issue "ec2_asg terminates an instance before creating a replacement. #35993"
Commit eb4cc31 made changes to the update_size method which introduced this bug:
update_size no longer modifies the group dict passed to it, but the replace
method expected group to be updated. The fix is to add batch_size to min_size
when calling wait_for_new_inst, so we wait for the new instances to spin up.
Otherwise we see the original min_size, which is already met, and we proceed to immediately terminate instances.
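A minimal sketch of why waiting on the original min_size returns immediately, and why min_size + batch_size is the right target. The helper names here are illustrative, not the module's actual signatures.

```python
# Hypothetical sketch of the #35993 fix: since update_size no longer
# mutates the group dict, 'replace' must compute the raised target
# itself when waiting for replacement capacity.

def wait_for_new_inst(props, desired_size):
    # Stand-in for the real polling loop: just report whether the
    # group already satisfies the desired size.
    return props['viable_instances'] >= desired_size

def batch_replace_step(props, min_size, batch_size):
    # Buggy behavior: waiting on min_size alone succeeds immediately,
    # because the group already meets its original minimum, so old
    # instances get terminated before replacements exist.
    premature = wait_for_new_inst(props, min_size)
    # Fixed behavior: wait for min_size + batch_size, i.e. for the
    # batch of replacement instances to actually become viable.
    ready = wait_for_new_inst(props, min_size + batch_size)
    return premature, ready
```

With 2 viable instances, min_size of 2, and a batch of 1, the buggy check passes immediately while the fixed check correctly keeps waiting.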
ISSUE TYPE
Bugfix Pull Request
COMPONENT NAME
module ec2_asg
ANSIBLE VERSION
ADDITIONAL INFORMATION
The following playbook can be used to reproduce both issues (assuming you have the launch configuration set up).
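The original playbook was not captured in this page. A minimal sketch that exercises the same code path might look like the following; the ASG and launch configuration names are placeholders, and AWS credentials/region are assumed to be configured in the environment.

```yaml
# Hypothetical reproduction: replace_all_instances with lc_check
# disabled, which triggered both #28087 and #35993.
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: create/replace ASG instances with lc_check disabled
      ec2_asg:
        name: test-asg                 # placeholder ASG name
        launch_config_name: my-lc      # assumes this LC already exists
        min_size: 2
        max_size: 4
        desired_capacity: 2
        replace_all_instances: yes
        replace_batch_size: 1
        lc_check: false
        wait_for_instances: yes
        state: present
```

Before the fix, the first run would replace even the freshly created instances, and subsequent replaces would terminate old instances without waiting for the new batch to spin up.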