ec2_asg: fix #28087 and #35993 #36679
Conversation
@msven Since you've been able to reproduce the problems, do you mind adding a test case to https://github.com/ansible/ansible/blob/devel/test/integration/targets/ec2_asg/tasks/main.yml so these don't get broken again in the future?
@s-hertel - Sure, I'll take a look
Fixes ansible#35993 - Changes to update_size in commit eb4cc31 made it so the group dict passed into update_size was not modified. As a result, the 'replace' call does not see an updated min_size like it previously did and doesn't pause to wait for any new instances to spin up. Instead, it moves straight into terminating old instances. Fix is to add batch_size to min_size when calling wait_for_new_inst.

Fixes ansible#28087 - Make replace_all_instances and replace_instances behave exactly the same by setting replace_instances = current list of instances when replace_all_instances is used. Root cause of the issue was that without lc_check, terminate_batch will terminate all instances passed to it, and after updating the asg size we were querying the asg again for the list of instances - so terminate_batch saw the list including new ones just spun up. When creating a new asg with replace_all_instances: yes and lc_check: false, the instances that are initially created are then subsequently replaced. This change makes it so replace only occurs if the asg already existed.

Add integration tests for ansible#28087 and ansible#35993.
(cherry picked from commit a2b3120)
Backported to 2.5 in ae4c246. Thanks @msven (and @willthames for bringing it up)!
Thanks for backporting @s-hertel
SUMMARY
Fixed issue "Have the ec2_asg module stop overshooting instances. #28087"
This was caused by the combination of lc_check: false and replace_all_instances: true.
Since lc_check is false, terminate_batch considers every instance passed into it available
for termination (expected behavior). However, in the replace method, just prior to the loop

for i in get_chunks(instances, batch_size):

we fetched the as_group again and updated the list of instances. That list now included the additional instances we had just spun up earlier in the replace call. The list was then passed to terminate_batch, and with lc_check false we ended up thinking we needed to terminate all of them. There are a couple of routes that could be taken to address this. I chose to set replace_instances equal to the current set of instances attached to the asg at the start of the replace call. This makes the replace_instances and replace_all_instances logic behave the same from that point on.
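The fix described above can be sketched in simplified Python. The function and parameter names here are hypothetical stand-ins; the real module code in ec2_asg.py is more involved.

```python
# Simplified sketch of the #28087 fix: snapshot the ASG's instances
# *before* any new capacity is added, so that replace_all_instances
# behaves exactly like an explicit replace_instances list.

def replace(get_current_instances, replace_instances,
            replace_all_instances, terminate_batch, batch_size):
    if replace_all_instances:
        # Treat the current instance list as if the user had passed it
        # via replace_instances. Instances spun up later in the replace
        # are never in this snapshot, so terminate_batch cannot see
        # them even when lc_check is false.
        replace_instances = get_current_instances()

    # Terminate only the snapshotted (old) instances, batch by batch.
    for i in range(0, len(replace_instances), batch_size):
        terminate_batch(replace_instances[i:i + batch_size])
    return replace_instances
```

Because the snapshot is taken once at the start, re-querying the ASG later in the call no longer changes which instances are eligible for termination.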
Fixed issue "ec2_asg terminates an instance before creating a replacement. #35993"
Commit eb4cc31 made changes to the update_size method which introduced this bug:
update_size no longer modifies the group dict passed to it, but the replace
method expected group to be updated. The fix is to add batch_size to min_size
when calling wait_for_new_inst, so we wait for the new instances to spin up.
Otherwise we see the original min_size, which is already met, and we proceed to immediately terminate instances.
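A minimal sketch of why waiting on the original min_size returns immediately, and why min_size + batch_size is the right target. The helper names here are illustrative, not the module's actual signatures.

```python
# Hypothetical sketch of the #35993 fix: since update_size no longer
# mutates the group dict, 'replace' must compute the raised target
# itself when waiting for replacement capacity.

def wait_for_new_inst(props, desired_size):
    # Stand-in for the real polling loop: just report whether the
    # group already satisfies the desired size.
    return props['viable_instances'] >= desired_size

def batch_replace_step(props, min_size, batch_size):
    # Buggy behavior: waiting on min_size alone succeeds immediately,
    # because the group already meets its original minimum, so old
    # instances get terminated before replacements exist.
    premature = wait_for_new_inst(props, min_size)
    # Fixed behavior: wait for min_size + batch_size, i.e. for the
    # batch of replacement instances to actually become viable.
    ready = wait_for_new_inst(props, min_size + batch_size)
    return premature, ready
```

With 2 viable instances, min_size of 2, and a batch of 1, the buggy check passes immediately while the fixed check correctly keeps waiting.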
ISSUE TYPE
Bugfix Pull Request
COMPONENT NAME
module ec2_asg
ANSIBLE VERSION
ADDITIONAL INFORMATION
The following playbook can be used to reproduce both issues (assuming you have the launch configuration set up).
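The original playbook was not captured in this page. A minimal sketch that exercises the same code path might look like the following; the ASG and launch configuration names are placeholders, and AWS credentials/region are assumed to be configured in the environment.

```yaml
# Hypothetical reproduction: replace_all_instances with lc_check
# disabled, which triggered both #28087 and #35993.
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: create/replace ASG instances with lc_check disabled
      ec2_asg:
        name: test-asg                 # placeholder ASG name
        launch_config_name: my-lc      # assumes this LC already exists
        min_size: 2
        max_size: 4
        desired_capacity: 2
        replace_all_instances: yes
        replace_batch_size: 1
        lc_check: false
        wait_for_instances: yes
        state: present
```

Before the fix, the first run would replace even the freshly created instances, and subsequent replaces would terminate old instances without waiting for the new batch to spin up.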