
ec2_asg: fix #28087 and #35993 #36679

Merged
merged 2 commits into from Mar 5, 2018

Conversation

@msven (Contributor) commented Feb 24, 2018

Fixes #35993 - The changes to update_size in commit eb4cc31 meant the group
dict passed into update_size was no longer modified. As a result, the
'replace' call does not see an updated min_size as it previously did, so
instead of pausing to wait for the new instances to spin up, it moves straight
into terminating old instances. The fix is to add batch_size to min_size when
calling wait_for_new_inst.

Fixes #28087 - Make replace_all_instances and replace_instances behave
identically by setting replace_instances to the current list of instances when
replace_all_instances is used. The root cause was that without lc_check,
terminate_batch terminates every instance passed to it, and because we
re-queried the ASG for its instance list after updating the ASG size,
terminate_batch saw a list that included the new instances that had just spun up.

SUMMARY

Fixed issue "Have the ec2_asg module stop overshooting instances. #28087"
This was caused by the combination of lc_check: false and replace_all_instances: true.
Since lc_check is false, terminate_batch considers every instance passed into it
available for termination (expected behavior). However, in the replace method, just
prior to the loop for i in get_chunks(instances, batch_size): we fetch the as_group
again and update the list of instances. That list now includes the additional
instances we spun up earlier in the replace call, so when it is passed to
terminate_batch with lc_check false, we end up thinking we need to terminate all of
them. There are a couple of routes that could be taken to address this. I chose to
set replace_instances equal to the current set of instances attached to the ASG at
the start of the replace call, which makes the replace_instances and
replace_all_instances logic behave the same from that point on.
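
As a self-contained toy model of the root cause and the fix (plain lists stand in
for the ASG; none of these names are the module's real API):

def replace_buggy(asg, batch_size):
    # Buggy flow: scale up first, then re-query the ASG for the list of
    # instances to terminate. The re-queried list includes the brand-new
    # instances, so with lc_check false they get terminated too.
    asg.extend('new-%d' % i for i in range(batch_size))
    return list(asg)


def replace_fixed(asg, batch_size):
    # Fixed flow: snapshot the instances attached at the start of the
    # replace call and only ever terminate from that snapshot.
    replace_instances = list(asg)
    asg.extend('new-%d' % i for i in range(batch_size))
    return replace_instances


print(replace_buggy(['old-1', 'old-2'], 2))  # ['old-1', 'old-2', 'new-0', 'new-1']
print(replace_fixed(['old-1', 'old-2'], 2))  # ['old-1', 'old-2']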

Fixed issue "ec2_asg terminates an instance before creating a replacement. #35993"
Commit eb4cc31 made changes to the update_size method which introduced this bug.
eb4cc31 made it so update_size didn't modify the group dict passed to it, but the replace
method expected group to be updated. The fix was to simply add the batch_size to
wait_for_new_inst so we wait for the new instances to spin up. Otherwise, we see
the original min which is already met and we proceed to immediately terminate instances.
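
A minimal sketch of the waiting logic under an assumed, simplified
wait_for_new_inst signature (the real helper in ec2_asg.py takes different
arguments):

import time


def wait_for_new_inst(get_viable_count, desired_size, wait_timeout=300):
    # Poll until at least desired_size instances are viable, or time out.
    deadline = time.time() + wait_timeout
    while time.time() < deadline:
        if get_viable_count() >= desired_size:
            return
        time.sleep(10)
    raise Exception('Timed out waiting for new instances to become viable')

# Buggy call: as_group['min_size'] is no longer bumped by update_size, so the
# old minimum is already met and the wait returns immediately:
#     wait_for_new_inst(viable_count, as_group['min_size'])
# Fixed call: wait for the old minimum plus the batch just launched:
#     wait_for_new_inst(viable_count, as_group['min_size'] + batch_size)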

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME

module ec2_asg

ANSIBLE VERSION
ansible 2.4.0.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/usr/share/ansible']
  ansible python module location = /home/msven/github/test/env/local/lib/python2.7/site-packages/ansible
  executable location = /home/msven/github/test/env/bin/ansible
  python version = 2.7.6 (default, Nov 23 2017, 15:49:48) [GCC 4.8.4]
ADDITIONAL INFORMATION

The following playbook can be used to reproduce both issues (assuming the test-lc launch configuration is already set up):

---

- hosts: localhost
  gather_facts: false
  tasks:
    - ec2_asg:
        name: test
        launch_config_name: test-lc
        replace_all_instances: true
        wait_for_instances: true
        lc_check: false
        min_size: 2
        max_size: 2
        desired_capacity: 2
        replace_batch_size: 2

@ansibot (Contributor) commented Feb 24, 2018

@ansibot added the aws, bugfix_pull_request, cloud, committer_review, module, needs_triage, new_contributor, and support:certified labels Feb 24, 2018
@s-hertel removed the needs_triage label Feb 26, 2018
@s-hertel (Contributor) commented Feb 26, 2018

@msven Since you've been able to reproduce the problems do you mind adding a test case to https://github.com/ansible/ansible/blob/devel/test/integration/targets/ec2_asg/tasks/main.yml so these don't get broken again in the future?

@msven (Contributor, Author) commented Feb 27, 2018

@s-hertel - Sure, I'll take a look

@ansibot added the bug label and removed the bugfix_pull_request label Mar 2, 2018
@ansibot (Contributor) commented Mar 4, 2018

The test ansible-test sanity --test pep8 failed with 2 errors:

lib/ansible/modules/cloud/amazon/ec2_asg.py:1421:1: E302 expected 2 blank lines, found 1
lib/ansible/modules/cloud/amazon/ec2_asg.py:1430:1: E302 expected 2 blank lines, found 1
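
For reference, E302 is pycodestyle's rule that top-level definitions be separated
by two blank lines; the names below are hypothetical stand-ins for the definitions
at the flagged lines:

def helper_one():
    pass

def helper_two():  # only one blank line above: pycodestyle flags E302 here
    pass


def helper_three():  # two blank lines above: clean
    pass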


@ansibot added the ci_verified, needs_revision, and test labels and removed the committer_review label Mar 4, 2018
Fixes ansible#35993 - The changes to update_size in commit eb4cc31 meant the
group dict passed into update_size was no longer modified. As a result, the
'replace' call does not see an updated min_size as it previously did, so
instead of pausing to wait for the new instances to spin up, it moves straight
into terminating old instances. The fix is to add batch_size to min_size when
calling wait_for_new_inst.

Fixes ansible#28087 - Make replace_all_instances and replace_instances behave
identically by setting replace_instances to the current list of instances when
replace_all_instances is used. The root cause was that without lc_check,
terminate_batch terminates every instance passed to it, and because we
re-queried the ASG for its instance list after updating the ASG size,
terminate_batch saw a list that included the new instances that had just spun up.

When creating a new ASG with replace_all_instances: yes and lc_check: false,
the instances that are initially created are then immediately replaced. This
change makes it so the replace only occurs if the ASG already existed.

Add integration tests for ansible#28087 and ansible#35993.
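
The create-path guard described above, as a toy sketch (the names are
illustrative, not the module's actual variables):

def should_replace(asg_existed, replace_all_instances):
    # Replacing only makes sense for a pre-existing group: a freshly created
    # ASG already runs the desired launch configuration, so replacing its
    # instances would just churn them.
    return asg_existed and replace_all_instances


print(should_replace(False, True))  # False -- brand-new ASG, nothing to replace
print(should_replace(True, True))   # True  -- existing ASG gets replaced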
@ansibot added the committer_review label and removed the ci_verified and needs_revision labels Mar 4, 2018
@msven (Contributor, Author) commented Mar 4, 2018

@s-hertel Added some test cases, along with another small change to the module to fix #28087 for the case where the ASG is just being created.

@ryansb merged commit a2b3120 into ansible:devel Mar 5, 2018
willthames pushed a commit to willthames/ansible that referenced this pull request Mar 6, 2018
itccompliance pushed a commit to itccompliance/ansible that referenced this pull request Mar 19, 2018
s-hertel pushed a commit to s-hertel/ansible that referenced this pull request Mar 28, 2018
(cherry picked from commit a2b3120)
@s-hertel (Contributor) commented Mar 28, 2018

Backported to 2.5 in ae4c246. Thanks @msven (and @willthames for bringing it up)!

@willthames (Contributor)
Thanks for backporting @s-hertel

@ansible locked and limited conversation to collaborators Apr 27, 2019