(17.32.0) DEVOPS-9370: use old capacity value when scaling down #41

nz285 · 2017-07-19T18:28:37Z

Ticket here -
https://dunandb.jira.com/browse/DEVOPS-9370
I left a comment in the ticket detailing the reason for this change.

Tested by deploying dnbi-cache_qa from my local machine, and it works.

USAU809914:License2Deploy zhangni (DEVOPS-9370)$ python rolling_deploy.py -f false -e qa -p dnbi-cache -b 13 -a ami-ffe9c59f -s dnbi-backend-qa
2017-07-19 16:03:11,391: INFO: Begin Logging...
2017-07-19 16:03:11,398: INFO: Found credentials in shared credentials file: ~/.aws/credentials
2017-07-19 16:03:12,238: INFO: Starting new HTTPS connection (1): cloudformation.us-west-1.amazonaws.com
2017-07-19 16:03:13,748: INFO: AMI ami-ffe9c59f is ready
2017-07-19 16:03:13,748: INFO: Build #: 13 ::: Autoscale Group: dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA
2017-07-19 16:03:14,753: INFO: List of all Instance ID's and IP addresses in dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA: {u'i-08297af941b736991': u'10.30.10.126', u'i-0d358ac45d88d09b0': u'10.30.10.167'}
2017-07-19 16:03:15,830: INFO: Disabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEDOWNALARMqa-1RMZBUOKBCJ2
2017-07-19 16:03:15,996: INFO: Disabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEUPALARMqa-MPRMFU4DVJP7
2017-07-19 16:03:16,219: INFO: Current desired count was changed from 2 to 4
2017-07-19 16:03:16,219: INFO: Set autoscale capacity for dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA to 4
2017-07-19 16:03:16,394: INFO: Trying for maximum 10 minutes to allow for instances to be created.
2017-07-19 16:03:16,553: INFO: Instance ID List: [u'i-08297af941b736991', u'i-0d358ac45d88d09b0']
2017-07-19 16:03:16,768: WARNING: Not all new instances with build number "13" are in the group, retrying in 60 seconds...
2017-07-19 16:04:17,449: INFO: Instance ID List: [u'i-05b98bd6d2f51b782', u'i-08297af941b736991', u'i-0a4b2133213687788', u'i-0d358ac45d88d09b0']
2017-07-19 16:04:18,235: WARNING: Not all new instances with build number "13" are in the group, retrying in 60 seconds...
2017-07-19 16:05:18,874: INFO: Instance ID List: [u'i-05b98bd6d2f51b782', u'i-08297af941b736991', u'i-0a4b2133213687788', u'i-0d358ac45d88d09b0']
2017-07-19 16:05:20,023: INFO: New Instance List with IP Addresses: {u'i-0a4b2133213687788': u'10.30.10.17', u'i-05b98bd6d2f51b782': u'10.30.10.8'}
2017-07-19 16:05:20,023: INFO: Waiting maximum 5 minutes for instances to be ready.
2017-07-19 16:05:20,158: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:05:51,569: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:06:22,567: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:06:53,357: WARNING: i-05b98bd6d2f51b782 is not in a fully working state yet
2017-07-19 16:07:24,651: INFO: i-05b98bd6d2f51b782 is in a healthy state. Moving on...
2017-07-19 16:07:24,788: INFO: i-0a4b2133213687788 is in a healthy state. Moving on...
2017-07-19 16:07:24,788: INFO: Trying for maximum 5 minutes to health-check all instances.
2017-07-19 16:07:25,455: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:07:56,237: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:08:27,667: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:08:58,368: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:09:29,448: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService), InstanceState:(i-0a4b2133213687788,OutOfService)], retrying in 30 seconds...
2017-07-19 16:10:00,629: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-05b98bd6d2f51b782,OutOfService)], retrying in 30 seconds...
2017-07-19 16:10:32,184: INFO: ELB healthcheck OK
2017-07-19 16:10:32,184: INFO: Current desired count was changed from 4 to 2
2017-07-19 16:10:32,184: INFO: Set autoscale capacity for dnbi-backend-qa-dnbicacheASGqa-1AECPTB16Y0KA to 2
2017-07-19 16:10:32,761: INFO: Waiting maximum 5 minutes to terminate old instances.
2017-07-19 16:10:34,139: INFO: Deployed instances [InstanceState:(i-05b98bd6d2f51b782,InService), InstanceState:(i-08297af941b736991,InService), InstanceState:(i-0a4b2133213687788,InService), InstanceState:(i-0d358ac45d88d09b0,InService)] to ELB: dnbicacheELBqa
2017-07-19 16:10:34,285: INFO: No tagging necessary, already tagged with env: qa
2017-07-19 16:10:34,286: INFO: Found an alarm. dnbi-backend-qa-dnbicacheSCALEDOWNALARMqa-1RMZBUOKBCJ2
2017-07-19 16:10:34,929: INFO: Enabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEDOWNALARMqa-1RMZBUOKBCJ2
2017-07-19 16:10:34,930: INFO: Found an alarm. dnbi-backend-qa-dnbicacheSCALEUPALARMqa-MPRMFU4DVJP7
2017-07-19 16:10:35,086: INFO: Enabled cloud-watch alarm. dnbi-backend-qa-dnbicacheSCALEUPALARMqa-MPRMFU4DVJP7
2017-07-19 16:10:35,086: INFO: Deployment Complete!

Also tested with cos-service_qa -

USAU809914:License2Deploy zhangni (DEVOPS-9370)$ python rolling_deploy.py -f false -e qa -p cos-service -b 203 -a ami-244f6644 -s cos-service-qa
2017-07-20 14:15:18,632: INFO: Begin Logging...
2017-07-20 14:15:18,641: INFO: Found credentials in shared credentials file: ~/.aws/credentials
2017-07-20 14:15:19,413: INFO: Starting new HTTPS connection (1): cloudformation.us-west-1.amazonaws.com
2017-07-20 14:15:20,977: INFO: AMI ami-244f6644 is ready
2017-07-20 14:15:20,977: INFO: Build #: 203 ::: Autoscale Group: cos-service-qa-cosserviceASGqa-1HP11V36JSGQG
2017-07-20 14:15:21,939: INFO: List of all Instance ID's and IP addresses in cos-service-qa-cosserviceASGqa-1HP11V36JSGQG: {u'i-07e081e6e544897c0': u'10.36.10.230', u'i-0886ed808dbd65ba9': u'10.36.10.94'}
2017-07-20 14:15:22,804: INFO: Disabled cloud-watch alarm. cos-service-qa-cosserviceSCALEDOWNALARMqa-1W9NQYIO5U0OO
2017-07-20 14:15:22,968: INFO: Disabled cloud-watch alarm. cos-service-qa-cosserviceSCALEUPALARMqa-1R1ZXV9VIOY2U
2017-07-20 14:15:23,092: INFO: Current desired count was changed from 2 to 4
2017-07-20 14:15:23,092: INFO: Set autoscale capacity for cos-service-qa-cosserviceASGqa-1HP11V36JSGQG to 4
2017-07-20 14:15:23,229: INFO: Trying for maximum 10 minutes to allow for instances to be created.
2017-07-20 14:15:23,357: INFO: Instance ID List: [u'i-07e081e6e544897c0', u'i-0886ed808dbd65ba9']
2017-07-20 14:15:23,571: WARNING: Not all new instances with build number "203" are in the group, retrying in 60 seconds...
2017-07-20 14:16:24,205: INFO: Instance ID List: [u'i-008024d31868349df', u'i-07e081e6e544897c0', u'i-0886ed808dbd65ba9', u'i-0bd1edbcbbc869e1d']
2017-07-20 14:16:25,362: INFO: New Instance List with IP Addresses: {u'i-008024d31868349df': u'10.36.10.102', u'i-0bd1edbcbbc869e1d': u'10.36.10.80'}
2017-07-20 14:16:25,362: INFO: Waiting maximum 5 minutes for instances to be ready.
2017-07-20 14:16:25,513: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:16:56,482: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:17:27,385: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:17:58,629: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:18:29,472: WARNING: i-0bd1edbcbbc869e1d is not in a fully working state yet
2017-07-20 14:19:00,837: INFO: i-0bd1edbcbbc869e1d is in a healthy state. Moving on...
2017-07-20 14:19:00,954: INFO: i-008024d31868349df is in a healthy state. Moving on...
2017-07-20 14:19:00,954: INFO: Trying for maximum 5 minutes to health-check all instances.
2017-07-20 14:19:01,553: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:19:32,918: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:20:04,430: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:20:35,043: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:21:06,528: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:21:37,328: WARNING: Must check load balancer again. Following instance(s) are not "InService": [InstanceState:(i-0bd1edbcbbc869e1d,OutOfService), InstanceState:(i-008024d31868349df,OutOfService)], retrying in 30 seconds...
2017-07-20 14:22:08,505: INFO: ELB healthcheck OK
2017-07-20 14:22:08,505: INFO: Current desired count was changed from 4 to 2
2017-07-20 14:22:08,505: INFO: Set autoscale capacity for cos-service-qa-cosserviceASGqa-1HP11V36JSGQG to 2
2017-07-20 14:22:09,129: INFO: Waiting maximum 5 minutes to terminate old instances.
2017-07-20 14:22:10,566: INFO: Deployed instances [InstanceState:(i-008024d31868349df,InService), InstanceState:(i-07e081e6e544897c0,InService), InstanceState:(i-0886ed808dbd65ba9,InService), InstanceState:(i-0bd1edbcbbc869e1d,InService)] to ELB: cosserviceELBqa
2017-07-20 14:22:10,727: INFO: No tagging necessary, already tagged with env: qa
2017-07-20 14:22:10,727: INFO: Found an alarm. cos-service-qa-cosserviceSCALEDOWNALARMqa-1W9NQYIO5U0OO
2017-07-20 14:22:11,367: INFO: Enabled cloud-watch alarm. cos-service-qa-cosserviceSCALEDOWNALARMqa-1W9NQYIO5U0OO
2017-07-20 14:22:11,367: INFO: Found an alarm. cos-service-qa-cosserviceSCALEUPALARMqa-1R1ZXV9VIOY2U
2017-07-20 14:22:11,518: INFO: Enabled cloud-watch alarm. cos-service-qa-cosserviceSCALEUPALARMqa-1R1ZXV9VIOY2U
2017-07-20 14:22:11,518: INFO: Deployment Complete!

mprince

+1 after cleaning up the commits

banderon1

cleanup commits, otherwise looks good!

mayn

Although less verbose, these code changes make it harder to understand what is actually happening.

Is all this refactoring necessary?

Also, isn't the issue that we try to scale down below the min capacity of the ASG? Can't we just set the desired count to max(new_count, group_min_count)?
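
The clamp suggested above can be sketched as follows. This is a minimal illustration, not code from the PR: clamp_to_group_min is a hypothetical helper, and new_count and group_min_count are assumed to be plain integers.

```python
def clamp_to_group_min(new_count, group_min_count):
    """Return new_count, raised to the group's minimum size if it is lower.

    Hypothetical helper: neither the name nor the signature comes from
    rolling_deploy.py.
    """
    return max(new_count, group_min_count)


# A computed count of 1 against a group minimum of 2 is raised to the minimum.
print(clamp_to_group_min(1, 2))
# A count already above the minimum is left unchanged.
print(clamp_to_group_min(3, 2))
```

A clamp like this keeps the request at or above the group minimum, but if the group started the deployment above its minimum, the clamped value may still differ from the pre-deployment capacity.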

nz285 · 2017-07-21T14:56:04Z

@mayn It's not really refactoring. It changes the value the script looks at when it scales down. That is, it does not look at the ASG capacity at that moment (which could have changed since the deployment started, due to CPU utilization). It looks at the old ASG capacity that we retrieved when the deployment started.

It won't dip below min capacity, as the old desired capacity is equal to or above the min capacity when the deployment starts. When scaling down, we just revert to what the ASG desired capacity was at that point. This new logic guarantees it won't dip below min capacity, but the old logic could, as we have seen.
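
The difference between the two approaches can be sketched as below. This is a simplified model under stated assumptions, not the PR's code: CapacityTracker and read_desired_capacity are hypothetical names (the real script reads the value through get_group_info), and the halving in scale_down_from_live_value is only an assumed stand-in for the previous scale-down behaviour.

```python
class CapacityTracker(object):
    """Hypothetical model of the capacity bookkeeping discussed above."""

    def __init__(self, read_desired_capacity):
        # Assumed callable returning the ASG's current desired capacity.
        self._read = read_desired_capacity
        self.old_desired_capacity = None
        self.new_desired_capacity = None

    def desired_count(self, desired_state):
        if desired_state == 'increase':
            # Remember the capacity seen when the deployment starts.
            self.old_desired_capacity = int(self._read())
            self.new_desired_capacity = self.old_desired_capacity * 2
            return self.new_desired_capacity
        if desired_state == 'decrease':
            if self.old_desired_capacity is None:
                raise ValueError("'decrease' requested before 'increase'")
            # Revert to the remembered value; the live value is not consulted.
            return self.old_desired_capacity
        raise ValueError("unexpected desired_state: {0!r}".format(desired_state))


def scale_down_from_live_value(live_capacity):
    # Assumed stand-in for the previous behaviour: derive the target
    # from whatever the ASG reports at scale-down time.
    return live_capacity // 2


live = {'desired': 2}
tracker = CapacityTracker(lambda: live['desired'])
tracker.desired_count('increase')   # remembers 2, requests 4
live['desired'] = 2                 # capacity drifts during the deployment
print(scale_down_from_live_value(live['desired']))  # halves the drifted value
print(tracker.desired_count('decrease'))            # returns the remembered value
```

In this model the remembered value is independent of any drift in the live capacity, which is the property the change relies on.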

coveralls · 2017-07-21T21:27:20Z

Coverage decreased (-0.03%) to 96.959% when pulling 6716640 on nz285:DEVOPS-9370 into 9b0a4a2 on dandb:master.

mayn · 2017-07-26T12:52:44Z

License2Deploy/rolling_deploy.py

@@ -52,6 +52,7 @@ def __init__(self,
    self.health_wait = health_wait
    self.only_new_wait = only_new_wait
    self.existing_instance_ids = []
+    self.old_desired_capacity = 2


Where is this default of 2 coming from, as opposed to None?

The default can be anything because that field will be assigned at the beginning of deployment. However, a test case is going to fail if it is set to None.

mayn · 2017-07-26T13:07:34Z

License2Deploy/rolling_deploy.py

      if desired_state == 'increase':
-        new_count = self.double_autoscale_instance_count(cur_count)
+        self.old_desired_capacity = int(self.get_group_info(group_name)[0].desired_capacity)


This should be moved back outside of the try block.
If you want to cache the value, do an "if None" check and set the value. The reason is that if this method is called to decrease without being called to increase first, it will currently set itself to 2, correct?

No. As mentioned above, 2 was set there just to satisfy a test case. It could be set to 1000, as long as it is a number greater than 2. It won't matter, because this line right here will set it to the original desired capacity of the ASG. It is 2 for dnbi, perhaps 3 for owl, or 5 for something else. It is the ASG capacity that we later want to revert to.

mayn · 2017-07-26T13:07:57Z

License2Deploy/rolling_deploy.py

+        logging.info("Current desired count was changed from {0} to {1}".format(self.new_desired_capacity, self.old_desired_capacity))
+        self.new_desired_capacity = self.old_desired_capacity
+      else:
+        raise Exception("Please make sure the desired_state is set to either increase or decrease")


Put back the value that was passed in the error message.

This should also be a ValueError.

Again, this is to satisfy a test case. We can change the test, but it is really not related to the issue we are attacking here.
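
The reviewer's request above (include the passed value and raise ValueError) could look like the following sketch. check_desired_state is a hypothetical helper, and the message wording is an assumption, not text from the repository.

```python
ALLOWED_STATES = ('increase', 'decrease')


def check_desired_state(desired_state):
    """Return desired_state if valid, else raise ValueError naming the bad value."""
    if desired_state not in ALLOWED_STATES:
        raise ValueError(
            "desired_state must be 'increase' or 'decrease', "
            "got {0!r}".format(desired_state))
    return desired_state


try:
    check_desired_state('sideways')
except ValueError as err:
    # The message names the rejected value, which helps when reading logs.
    print(err)
```

Note that an existing test asserting on the current exception type or message would need to change along with it, which is the coupling mentioned in the reply above.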

mayn · 2017-07-26T13:12:35Z

License2Deploy/rolling_deploy.py

@@ -52,6 +52,7 @@ def __init__(self,
    self.health_wait = health_wait
    self.only_new_wait = only_new_wait
    self.existing_instance_ids = []
+    self.old_desired_capacity = 2
    self.new_desired_capacity = None


This is a dead global variable, as you've refactored it to only be used within method scope.

It is not dead. new_desired_capacity is used by the code (line 202). We need this field to hold the value the capacity is temporarily increased to.


taoistmath · 2017-07-26T17:40:40Z

License2Deploy/rolling_deploy.py

@@ -52,6 +52,7 @@ def __init__(self,
    self.health_wait = health_wait
    self.only_new_wait = only_new_wait
    self.existing_instance_ids = []
+    self.old_desired_capacity = 2
    self.new_desired_capacity = None


Since it's only used in the one function, calculate_autoscale_desired_instance_count, it doesn't need to be declared in the init. We can just declare it in the function without self.

nz285 · 2017-07-26T18:42:48Z

@taoistmath We need to store that state when it's initially calculated and use that value later, rather than calculate on the fly when it's needed. That's the whole point. Those 2 fields (old and new) are the states we need to store when deployment begins and we need to reference these 2 states later. We do not want to calculate them when the same function is invoked again later, because the ASG capacity could have changed by then.

taoistmath · 2017-07-26T20:41:46Z

@nz285 sorry for not understanding, when I look at the code base, the references to new_desired_capacity are all being removed except for in that one function. Can you please point me to where that state is being stored? You reference line 202 in your comment to mayn, but you're removing that line in your PR, so I'm not sure where I should be looking.

nz285 · 2017-07-26T20:47:37Z

@taoistmath That variable is directly used only on line 202, yes. But if you follow the call chain of the function it is in all the way up to gather_instance_info, gather_instance_info is used at 1) launch and 2) revert. I should perhaps be clearer: the field is referenced more than once, on different occasions, but it does not appear in multiple places in the code.

taoistmath · 2017-07-26T23:37:35Z

License2Deploy/rolling_deploy.py

@@ -123,24 +124,19 @@ def get_lb(self):
  def calculate_autoscale_desired_instance_count(self, group_name, desired_state):


Can't this be done as such and avoid the class variable:

    def calculate_autoscale_desired_instance_count(self, group_name, desired_state):
      ''' Search via specific autoscale group name to return modified desired instance count '''
      try:
        new_desired_capacity = self.old_desired_capacity * 2
        if desired_state == 'increase':
          self.old_desired_capacity = int(self.get_group_info(group_name)[0].desired_capacity)
          logging.info("Current desired count was changed from {0} to {1}".format(self.old_desired_capacity, new_desired_capacity))
          return new_desired_capacity
        elif desired_state == 'decrease':
          logging.info("Current desired count was changed from {0} to {1}".format(new_desired_capacity, self.old_desired_capacity))
          return self.old_desired_capacity
        else:
          raise Exception("Please make sure the desired_state is set to either increase or decrease")
        return None #not sure this is required
      except Exception as e:
        logging.error(e)
        exit(self.exit_error_code)

Understand the direction the committer is taking.

vmadura

Sorry for joining the party late, but CloudWatch Alarms are supposed to be disabled before and re-enabled after deployment. This issue should never occur since the desired capacity should not change.

License2Deploy/License2Deploy/rolling_deploy.py

Lines 365 to 372 in adecf1d

    
    self.disable_project_cloudwatch_alarms()
    self.new_desired_capacity = self.calculate_autoscale_desired_instance_count(group_name, 'increase')
    self.set_autoscale_instance_desired_count(self.new_desired_capacity, group_name)
    self.launch_new_instances(group_name)
    self.set_autoscale_instance_desired_count(self.calculate_autoscale_desired_instance_count(group_name, 'decrease'), group_name)
    self.confirm_lb_has_only_new_instances()
    self.tag_ami(self.ami_id, self.env)
    self.enable_project_cloudwatch_alarms()

I'm guessing there is some other bug during disabling of cloudwatch alarms.
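
One way to make the disable/re-enable pairing described above harder to break is a context manager that always re-enables the alarms, even when a step fails. This is only a sketch: alarms_disabled is a hypothetical helper, and the two callables stand in for disable_project_cloudwatch_alarms and enable_project_cloudwatch_alarms. It does not address whether the alarms were actually disabled in the deployment that hit this issue.

```python
from contextlib import contextmanager


@contextmanager
def alarms_disabled(disable, enable):
    """Run the body with alarms disabled; re-enable them on the way out."""
    disable()
    try:
        yield
    finally:
        # Runs on normal completion and when the body raises.
        enable()


calls = []
with alarms_disabled(lambda: calls.append('disable'),
                     lambda: calls.append('enable')):
    calls.append('deploy')
print(calls)
```

Because the re-enable step sits in a finally clause, it also runs when the body exits through an exception, so a failed deployment would not leave the alarms off.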

nz285 changed the title ~~Devops 9370~~ (17.31.0) DEVOPS-9370: use old capacity value when scaling down Jul 19, 2017

mprince approved these changes Jul 20, 2017

banderon1 suggested changes Jul 20, 2017

mayn suggested changes Jul 21, 2017

banderon1 changed the title ~~(17.31.0) DEVOPS-9370: use old capacity value when scaling down~~ (WIP) DEVOPS-9370: use old capacity value when scaling down Jul 21, 2017

DEVOPS-9370: use old capacity value when scaling down

6716640

dandb deleted a comment from coveralls Jul 21, 2017

nz285 changed the title ~~(WIP) DEVOPS-9370: use old capacity value when scaling down~~ (17.32.0) DEVOPS-9370: use old capacity value when scaling down Jul 21, 2017

banderon1 approved these changes Jul 21, 2017

mayn reviewed Jul 26, 2017

mayn suggested changes Jul 26, 2017

taoistmath previously requested changes Jul 26, 2017

taoistmath reviewed Jul 26, 2017

vmadura suggested changes Jul 27, 2017
