(17.32.0) DEVOPS-9370: use old capacity value when scaling down #41


@nz285 nz285 commented Jul 19, 2017

Ticket: https://dunandb.jira.com/browse/DEVOPS-9370
I left a comment in the ticket detailing the reason for this change.

Tested by deploying dnbi-cache_qa from my local machine, and it works:

USAU809914:License2Deploy zhangni (DEVOPS-9370)$ python rolling_deploy.py -f false -e qa -p dnbi-cache -b 13 -a ami-ffe9c59f -s dnbi-backend-qa
2017-07-19 16:03:11,391: INFO: Begin Logging...
2017-07-19 16:03:11,398: INFO: Found credentials in shared credentials file: ~/.aws/credentials
2017-07-19 16:03:12,238: INFO: Starting new HTTPS connection (1): cloudformation.us-west-1.amazonaws.com
2017-07-19 16:03:13,748: INFO: AMI ami-ffe9c59f is ready
2017-07-19 16:03:13,748: INFO: Build #: 13 ::: Autoscale Group: dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA
2017-07-19 16:03:14,753: INFO: List of all Instance ID's and IP addresses in dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA: {u'i-08297af941b736991': u'10.30.10.126', u'i-0d358ac45d88d09b0': u'10.30.10.167'}
2017-07-19 16:03:15,830: INFO: Disabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEDOWNALARMqa-1RMZBUOKBCJ2
2017-07-19 16:03:15,996: INFO: Disabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEUPALARMqa-MPRMFU4DVJP7
2017-07-19 16:03:16,219: INFO: Current desired count was changed from 2 to 4
2017-07-19 16:03:16,219: INFO: Set autoscale capacity for dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA to 4
2017-07-19 16:03:16,394: INFO: Trying for maximum 10 minutes to allow for instances to be created.
2017-07-19 16:03:16,553: INFO: Instance ID List: [u'i-08297af941b736991', u'i-0d358ac45d88d09b0']
2017-07-19 16:03:16,768: WARNING: Not all new instances with build number "13" are in the group, retrying in 60 seconds...
2017-07-19 16:04:17,449: INFO: Instance ID List: [u'i-05b98bd6d2f51b782', u'i-08297af941b736991', u'i-0a4b2133213687788', u'i-0d358ac45d88d09b0']
2017-07-19 16:04:18,235: WARNING: Not all new instances with build number "13" are in the group, retrying in 60 seconds...
2017-07-19 16:05:18,874: INFO: Instance ID List: [u'i-05b98bd6d2f51b782', u'i-08297af941b736991', u'i-0a4b2133213687788', u'i-0d358ac45d88d09b0']
2017-07-19 16:05:20,023: INFO: New Instance List with IP Addresses: {u'i-0a4b2133213687788': u'10.30.10.17', u'i-05b98bd6d2f51b782': u'10.30.10.8'}
2017-07-19 16:05:20,023: INFO: Waiting maximum 5 minutes for instances to be ready.
2017-07-19 16:05:20,158: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:05:51,569: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:06:22,567: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:06:53,357: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:07:24,651: INFO: i-05b98bd6d2f51b782 is in a healthy state. Moving on...
2017-07-19 16:07:24,788: INFO: i-0a4b2133213687788 is in a healthy state. Moving on...
2017-07-19 16:07:24,788: INFO: Trying for maximum 5 minutes to health-check all instances.
2017-07-19 16:07:25,455: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:07:56,237: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:08:27,667: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:08:58,368: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:09:29,448: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:10:00,629: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService)], retrying in 30 seconds...
2017-07-19 16:10:32,184: INFO: ELB healthcheck OK
2017-07-19 16:10:32,184: INFO: Current desired count was changed from 4 to 2
2017-07-19 16:10:32,184: INFO: Set autoscale capacity for dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA to 2
2017-07-19 16:10:32,761: INFO: Waiting maximum 5 minutes to terminate old instances.
2017-07-19 16:10:34,139: INFO: Deployed instances [InstanceState:(i-05b98bd6d2f51b782,InService), InstanceState:(i-08297af941b736991,InService), InstanceState:(i-0a4b2133213687788,InService), InstanceState:(i-0d358ac45d88d09b0,InService)] to ELB: dnbicacheELBqa
2017-07-19 16:10:34,285: INFO: No tagging necessary, already tagged with env: qa
2017-07-19 16:10:34,286: INFO: Found an alarm. dnbi-backend-qa-dnbicacheSCALEDOWNALARMqa-1RMZBUOKBCJ2
2017-07-19 16:10:34,929: INFO: Enabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEDOWNALARMqa-1RMZBUOKBCJ2
2017-07-19 16:10:34,930: INFO: Found an alarm. dnbi-backend-qa-dnbicacheSCALEUPALARMqa-MPRMFU4DVJP7
2017-07-19 16:10:35,086: INFO: Enabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEUPALARMqa-MPRMFU4DVJP7
2017-07-19 16:10:35,086: INFO: Deployment Complete!

Also tested with cos-service_qa:

USAU809914:License2Deploy zhangni (DEVOPS-9370)$ python rolling_deploy.py -f false -e qa -p cos-service -b 203 -a ami-244f6644 -s cos-service-qa
2017-07-20 14:15:18,632: INFO: Begin Logging...
2017-07-20 14:15:18,641: INFO: Found credentials in shared credentials file: ~/.aws/credentials
2017-07-20 14:15:19,413: INFO: Starting new HTTPS connection (1): cloudformation.us-west-1.amazonaws.com
2017-07-20 14:15:20,977: INFO: AMI ami-244f6644 is ready
2017-07-20 14:15:20,977: INFO: Build #: 203 ::: Autoscale Group: cos-service-qa-cosserviceASGqa-1HP11V36JSGQG
2017-07-20 14:15:21,939: INFO: List of all Instance ID's and IP addresses in cos-service-qa-cosserviceASGqa-1HP11V36JSGQG: {u'i-07e081e6e544897c0': u'10.36.10.230', u'i-0886ed808dbd65ba9': u'10.36.10.94'}
2017-07-20 14:15:22,804: INFO: Disabled cloud-watch alarm. cos-service-qa-cosserviceSCALEDOWNALARMqa-1W9NQYIO5U0OO
2017-07-20 14:15:22,968: INFO: Disabled cloud-watch alarm. cos-service-qa-cosserviceSCALEUPALARMqa-1R1ZXV9VIOY2U
2017-07-20 14:15:23,092: INFO: Current desired count was changed from 2 to 4
2017-07-20 14:15:23,092: INFO: Set autoscale capacity for cos-service-qa-cosserviceASGqa-1HP11V36JSGQG to 4
2017-07-20 14:15:23,229: INFO: Trying for maximum 10 minutes to allow for instances to be created.
2017-07-20 14:15:23,357: INFO: Instance ID List: [u'i-07e081e6e544897c0', u'i-0886ed808dbd65ba9']
2017-07-20 14:15:23,571: WARNING: Not all new instances with build number "203" are in the group, retrying in 60 seconds...
2017-07-20 14:16:24,205: INFO: Instance ID List: [u'i-008024d31868349df', u'i-07e081e6e544897c0', u'i-0886ed808dbd65ba9', u'i-0bd1edbcbbc869e1d']
2017-07-20 14:16:25,362: INFO: New Instance List with IP Addresses: {u'i-008024d31868349df': u'10.36.10.102', u'i-0bd1edbcbbc869e1d': u'10.36.10.80'}
2017-07-20 14:16:25,362: INFO: Waiting maximum 5 minutes for instances to be ready.
2017-07-20 14:16:25,513: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:16:56,482: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:17:27,385: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:17:58,629: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:18:29,472: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:19:00,837: INFO: i-0bd1edbcbbc869e1d is in a healthy state. Moving on...
2017-07-20 14:19:00,954: INFO: i-008024d31868349df is in a healthy state. Moving on...
2017-07-20 14:19:00,954: INFO: Trying for maximum 5 minutes to health-check all instances.
2017-07-20 14:19:01,553: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:19:32,918: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:20:04,430: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:20:35,043: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:21:06,528: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:21:37,328: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:22:08,505: INFO: ELB healthcheck OK
2017-07-20 14:22:08,505: INFO: Current desired count was changed from 4 to 2
2017-07-20 14:22:08,505: INFO: Set autoscale capacity for cos-service-qa-cosserviceASGqa-1HP11V36JSGQG to 2
2017-07-20 14:22:09,129: INFO: Waiting maximum 5 minutes to terminate old instances.
2017-07-20 14:22:10,566: INFO: Deployed instances [InstanceState:(i-008024d31868349df,InService), InstanceState:(i-07e081e6e544897c0,InService), InstanceState:(i-0886ed808dbd65ba9,InService), InstanceState:(i-0bd1edbcbbc869e1d,InService)] to ELB: cosserviceELBqa
2017-07-20 14:22:10,727: INFO: No tagging necessary, already tagged with env: qa
2017-07-20 14:22:10,727: INFO: Found an alarm. cos-service-qa-cosserviceSCALEDOWNALARMqa-1W9NQYIO5U0OO
2017-07-20 14:22:11,367: INFO: Enabled cloud-watch alarm. cos-service-qa-cosserviceSCALEDOWNALARMqa-1W9NQYIO5U0OO
2017-07-20 14:22:11,367: INFO: Found an alarm. cos-service-qa-cosserviceSCALEUPALARMqa-1R1ZXV9VIOY2U
2017-07-20 14:22:11,518: INFO: Enabled cloud-watch alarm. cos-service-qa-cosserviceSCALEUPALARMqa-1R1ZXV9VIOY2U
2017-07-20 14:22:11,518: INFO: Deployment Complete!

@nz285 nz285 changed the title Devops 9370 (17.31.0) DEVOPS-9370: use old capacity value when scaling down Jul 19, 2017
Contributor

@mprince mprince left a comment

+1 after cleaning up the commits

Contributor

@banderon1 banderon1 left a comment

Clean up the commits; otherwise looks good!

Contributor

@mayn mayn left a comment

Although less verbose, these code changes make it harder to understand what is actually happening.

Is all this refactoring necessary?

Also, isn't the issue about trying to scale down below the min capacity of the ASG? Can't we just set the desired count to max(new_count, group_min_count)?
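
For illustration, the clamping idea in this comment could look like the sketch below. The function name `clamp_desired_count` is hypothetical and does not exist in rolling_deploy.py; `new_count` and `group_min_count` are taken from the comment and are assumed to be plain integers.

```python
def clamp_desired_count(new_count, group_min_count):
    """Return new_count, raised to the group's minimum size if it would fall below it."""
    return max(new_count, group_min_count)


# Halving a capacity of 2 gives 1, which is below an assumed group minimum of 2
print(clamp_desired_count(2 // 2, 2))
```

This only guards against dropping below the minimum; it does not restore whatever capacity the group had before the deployment started.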

Contributor Author

nz285 commented Jul 21, 2017

@mayn It's not really refactoring. It changes the value the script looks at when it scales down. That is, it no longer looks at the ASG capacity at that moment (which could have changed since the deployment started, due to CPU utilization). It looks at the old ASG capacity that we retrieved when the deployment started.

It won't dip below min capacity, because the old desired capacity is equal to or above the min capacity when the deployment starts. When scaling down, we just revert to what the ASG desired capacity was at that point. This new logic guarantees it won't dip below min capacity, whereas the old logic could, as we have seen.
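
A minimal sketch of the difference described above, under stated assumptions: `FakeGroup`, `Deployer`, and `simulate` are hypothetical names, not code from this PR, and the "live value" scale-down is assumed to halve whatever the group reports at that moment.

```python
class FakeGroup:
    """Hypothetical stand-in for an autoscaling group (not the object the real script uses)."""
    def __init__(self, desired_capacity, min_size):
        self.desired_capacity = desired_capacity
        self.min_size = min_size


class Deployer:
    def __init__(self, group):
        self.group = group
        self.old_desired_capacity = None

    def scale_up(self):
        # Record the capacity once, at the start of the deployment, then double it
        self.old_desired_capacity = self.group.desired_capacity
        self.group.desired_capacity = self.old_desired_capacity * 2

    def scale_down_from_live_value(self):
        # Assumed earlier behaviour: halve whatever the group reports right now
        self.group.desired_capacity = self.group.desired_capacity // 2

    def scale_down_from_cached_value(self):
        # Behaviour described above: revert to the value recorded in scale_up()
        self.group.desired_capacity = self.old_desired_capacity


def simulate(scale_down_name):
    group = FakeGroup(desired_capacity=2, min_size=2)
    deployer = Deployer(group)
    deployer.scale_up()            # desired capacity is now 4
    group.desired_capacity = 2     # something else lowers it mid-deployment
    getattr(deployer, scale_down_name)()
    return group.desired_capacity


print(simulate("scale_down_from_live_value"))    # 1, below min_size
print(simulate("scale_down_from_cached_value"))  # 2, the starting capacity
```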

@banderon1 banderon1 changed the title (17.31.0) DEVOPS-9370: use old capacity value when scaling down (WIP) DEVOPS-9370: use old capacity value when scaling down Jul 21, 2017
coveralls commented Jul 21, 2017

Coverage Status

Coverage decreased (-0.03%) to 96.959% when pulling 6716640 on nz285:DEVOPS-9370 into 9b0a4a2 on dandb:master.

@dandb dandb deleted a comment from coveralls Jul 21, 2017
@nz285 nz285 changed the title (WIP) DEVOPS-9370: use old capacity value when scaling down (17.32.0) DEVOPS-9370: use old capacity value when scaling down Jul 21, 2017
@@ -52,6 +52,7 @@ def __init__(self,
self.health_wait = health_wait
self.only_new_wait = only_new_wait
self.existing_instance_ids = []
self.old_desired_capacity = 2
Contributor

Where is this default of 2 coming from, as opposed to None?

Contributor Author

The default can be anything, because that field is assigned at the beginning of the deployment. However, a test case fails if it is set to None.

if desired_state == 'increase':
new_count = self.double_autoscale_instance_count(cur_count)
self.old_desired_capacity = int(self.get_group_info(group_name)[0].desired_capacity)
Contributor

This should be moved back outside of the try block.
If you want to cache the value, do an if None: set the value. The reason is that if this method is called to decrease without first being called to increase, it will currently set itself to 2, correct?
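
The "if None: set the value" suggestion could be sketched roughly as follows. `CapacityCache` and `fetch_desired_capacity` are hypothetical names used only for illustration; in the real script the value would presumably come from `get_group_info(group_name)[0].desired_capacity`.

```python
class CapacityCache:
    def __init__(self, fetch_desired_capacity):
        # fetch_desired_capacity: zero-argument callable returning the group's
        # current desired capacity (hypothetical, stands in for the AWS lookup)
        self._fetch = fetch_desired_capacity
        self.old_desired_capacity = None

    def get_old_desired_capacity(self):
        # Look the value up only on first use; later calls reuse the cached number
        if self.old_desired_capacity is None:
            self.old_desired_capacity = int(self._fetch())
        return self.old_desired_capacity


readings = iter([3, 7])
cache = CapacityCache(lambda: next(readings))
print(cache.get_old_desired_capacity())  # 3
print(cache.get_old_desired_capacity())  # 3 again, the second reading is never fetched
```

With this shape, a decrease that is not preceded by an increase would read the live value once instead of falling back to a hard-coded default.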

Contributor Author

No. As mentioned above, 2 was set there just to satisfy a test case. It could be set to 1000, as long as it is a number greater than 2. It won't matter, because this line right here sets it to the original desired capacity of the ASG. That is 2 for dnbi, perhaps 3 for owl, or 5 for something else. It is the ASG capacity that we later want to revert to.

logging.info("Current desired count was changed from {0} to {1}".format(self.new_desired_capacity, self.old_desired_capacity))
self.new_desired_capacity = self.old_desired_capacity
else:
raise Exception("Please make sure the desired_state is set to either increase or decrease")
Contributor

Put the value that was passed in back into the error message.

Contributor

This should also be a ValueError.
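
Taken together, the two review comments above might translate into something like this sketch. `validate_desired_state` is a hypothetical helper and the message wording is an assumption, not the original error text.

```python
def validate_desired_state(desired_state):
    """Return desired_state unchanged, or raise ValueError naming the rejected value."""
    allowed = ('increase', 'decrease')
    if desired_state not in allowed:
        raise ValueError(
            "desired_state must be 'increase' or 'decrease', got {0!r}".format(desired_state))
    return desired_state


try:
    validate_desired_state('sideways')
except ValueError as e:
    print(e)
```

Because ValueError is a subclass of Exception, an existing `except Exception` handler would still catch it.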

Contributor Author

Again, this is to satisfy a test case. We can change the test but it is really not related to the issue we are attacking here.

@@ -52,6 +52,7 @@ def __init__(self,
self.health_wait = health_wait
self.only_new_wait = only_new_wait
self.existing_instance_ids = []
self.old_desired_capacity = 2
self.new_desired_capacity = None
Contributor

This is a dead global variable, as you've refactored it to be used only within method scope.

Contributor Author

It is not dead. new_desired_capacity is used by code (line 202). We need this field to hold the value the capacity is temporarily increased to.

Contributor

Since it's only used in the one function, calculate_autoscale_desired_instance_count, it doesn't need to be declared in the init. We can just declare it in the function without self.

Contributor Author

nz285 commented Jul 26, 2017

@taoistmath We need to store that state when it's initially calculated and use that value later, rather than calculate on the fly when it's needed. That's the whole point. Those 2 fields (old and new) are the states we need to store when deployment begins and we need to reference these 2 states later. We do not want to calculate them when the same function is invoked again later, because the ASG capacity could have changed by then.

@taoistmath
Contributor

@nz285 Sorry for not understanding. When I look at the code base, the references to new_desired_capacity are all being removed except in that one function. Can you please point me to where that state is being stored? You reference line 202 in your comment to mayn, but you're removing that line in your PR, so I'm not sure where I should be looking.

Contributor Author

nz285 commented Jul 26, 2017

@taoistmath That variable is directly used only on line 202, yes. But if you follow the call chain up from the function it is in, all the way to gather_instance_info, you'll see that gather_instance_info is used at 1) launch and 2) revert. Perhaps I should have been clearer: the field is referenced more than once, on different occasions, but it does not appear in multiple places in the code.

@@ -123,24 +124,19 @@ def get_lb(self):
def calculate_autoscale_desired_instance_count(self, group_name, desired_state):
Contributor

Can't this be done as such and avoid the class variable:

  def calculate_autoscale_desired_instance_count(self, group_name, desired_state):
    ''' Search via specific autoscale group name to return modified desired instance count '''
    try:
      if desired_state == 'increase':
        # Refresh the cached capacity first, so the doubled value is not derived from a stale default
        self.old_desired_capacity = int(self.get_group_info(group_name)[0].desired_capacity)
        new_desired_capacity = self.old_desired_capacity * 2
        logging.info("Current desired count was changed from {0} to {1}".format(self.old_desired_capacity, new_desired_capacity))
        return new_desired_capacity
      elif desired_state == 'decrease':
        # Revert to the capacity cached during the earlier 'increase' call
        new_desired_capacity = self.old_desired_capacity * 2
        logging.info("Current desired count was changed from {0} to {1}".format(new_desired_capacity, self.old_desired_capacity))
        return self.old_desired_capacity
      else:
        raise Exception("Please make sure the desired_state is set to either increase or decrease")
      # No trailing return needed: every branch above returns or raises
    except Exception as e:
      logging.error(e)
      exit(self.exit_error_code)

@taoistmath taoistmath dismissed their stale review July 26, 2017 23:38

understand direction committer is taking

@vmadura vmadura left a comment

Sorry for joining the party late, but CloudWatch Alarms are supposed to be disabled before and re-enabled after deployment. This issue should never occur since the desired capacity should not change.

self.disable_project_cloudwatch_alarms()
self.new_desired_capacity = self.calculate_autoscale_desired_instance_count(group_name, 'increase')
self.set_autoscale_instance_desired_count(self.new_desired_capacity, group_name)
self.launch_new_instances(group_name)
self.set_autoscale_instance_desired_count(self.calculate_autoscale_desired_instance_count(group_name, 'decrease'), group_name)
self.confirm_lb_has_only_new_instances()
self.tag_ami(self.ami_id, self.env)
self.enable_project_cloudwatch_alarms()

I'm guessing there is some other bug during disabling of cloudwatch alarms.
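
As a side illustration of the "disabled before, re-enabled after" ordering, one way to make the pairing hard to break is a context manager. This is a sketch only: `alarms_disabled` and its two callables are hypothetical, it does not use the script's real `disable_project_cloudwatch_alarms` / `enable_project_cloudwatch_alarms` methods, and it does not diagnose the suspected bug.

```python
from contextlib import contextmanager


@contextmanager
def alarms_disabled(disable, enable):
    """Run the wrapped block with alarms off; always turn them back on afterwards."""
    disable()
    try:
        yield
    finally:
        enable()


calls = []
try:
    with alarms_disabled(lambda: calls.append("disable"), lambda: calls.append("enable")):
        calls.append("deploy")
        raise RuntimeError("simulated deployment failure")
except RuntimeError:
    pass

print(calls)  # ['disable', 'deploy', 'enable']
```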

7 participants