Musings on AutoScaling #856

rix0rrr · 2018-10-05T15:00:51Z

Note: this issue is not really an issue per se. It's a place for me to write down my thoughts and solicit feedback in a place where people can easily access the document and comment.

I imagine when people want to set up autoscaling, they want to set up something like this:

 Scaling        -3       -1                                     +1       +3        
 activity   │        │        │                            │         │         │   
            ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
            │        │        │                            │         │         │   
                                                                                   
CPU usage   0%      10%      30%                          70%        90%      100%

Disregarding TargetTracking scaling for a moment, the way to set this up is with StepScaling. You'll make a StepScaling policy which is activated by a CloudWatch alarm. The StepScaling policy has different scaling tiers depending on the distance of the current metric value to its Alarm threshold. Configuration for a step scaling policy looks like this:

    Scaling           -2             -1               +1              +2           
    activity                  │               │               │                    
               ◀──────────────┼───────────────┼───────────────┼───────────────────▶
                              │               │               │                    
 Distance from                                                                     
alarm threshold              -10              0              +10

Normally, CloudWatch Alarm Actions are edge-triggered (that is, an Action occurs only when the alarm transitions from OK to ALARM or vice versa). However, if the Alarm Action is an AutoScaling policy, the Alarm keeps on triggering the AutoScaling policy periodically, so that if the alarm goes further out of spec, higher scaling tiers can be activated. For example, the CPU usage goes to 75% and an instance is added. However, that doesn't make the load go down enough yet, and after an (undefined) while the policy is activated again and another instance is added.
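To make the tier selection concrete, here is a minimal sketch of how a step scaling policy could pick an adjustment from the distance between the metric and the alarm threshold. The `Step` type, the `pickAdjustment` function and the threshold of 50 are assumptions for illustration only, not AWS or CDK names, and the handling of values exactly on a boundary is simplified.

```typescript
// Hypothetical shape of one step; bounds are offsets from the alarm threshold.
interface Step {
  lower?: number;     // inclusive lower offset (undefined = unbounded below)
  upper?: number;     // exclusive upper offset (undefined = unbounded above)
  adjustment: number; // change in capacity when this step matches
}

// Return the adjustment of the first step whose interval contains the
// distance between the current metric value and the alarm threshold.
function pickAdjustment(metric: number, threshold: number, steps: Step[]): number {
  const distance = metric - threshold;
  for (const step of steps) {
    const aboveLower = step.lower === undefined || distance >= step.lower;
    const belowUpper = step.upper === undefined || distance < step.upper;
    if (aboveLower && belowUpper) {
      return step.adjustment;
    }
  }
  return 0; // no step covers this distance, so do nothing
}

// The step configuration from the diagram: boundaries at -10, 0 and +10.
const exampleSteps: Step[] = [
  { upper: -10, adjustment: -2 },
  { lower: -10, upper: 0, adjustment: -1 },
  { lower: 0, upper: 10, adjustment: +1 },
  { lower: 10, adjustment: +2 },
];

// With an assumed threshold of 50, a metric of 75 is 25 above the threshold.
console.log(pickAdjustment(75, 50, exampleSteps));
```

Because the alarm re-triggers the policy periodically, the same lookup would be repeated with the latest metric value, which is how a higher tier can be reached later.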

The question is, how to set up step scaling policies to achieve the scaling that the user wants?

Solution 1: Two alarms, two policies

This is what I see most on the internet--a separate alarm and a separate scaling policy for scaling out and scaling in. Seems inefficient, resource-wise.

Also, AutoScaling policies seem to imply they can both scale in and scale out in a single policy, but you'd never take advantage of that in this way.

               -3       -1                                     +1       +3        
           │        │        │                            │         │         │   
  Metric   ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
           │        │        │                            │         │         │   
                                                                                  
           0%      10%      30%                          70%        90%      100% 
                                                                                  
                                                          ║                       
   Alarm1                                     > 70%       ║                       
                                                          ║                       
                                                               +1        +3       
                                                          │         │             
  Policy1                                                 ├─────────┼─────────▶   
                                                          │         │             
                                                                                  
                                                          0        +20            
                                                                                  
                             ║                                                    
   Alarm2           < 30%    ║                                                    
                             ║                                                    
               -3       -1                                                        
                    │        │                                                    
  Policy2 ◀─────────┼────────┤                                                    
                    │        │                                                    
                                                                                  
                  -20        0
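The translation from absolute metric tiers to two alarm/policy pairs can be sketched as follows. This is only an illustration under assumptions: `AbsoluteTier`, `RelativeStep`, `AlarmAndPolicy` and `splitTiers` are hypothetical names, and each policy's offsets are computed relative to its own alarm threshold.

```typescript
// Hypothetical input: a tier expressed in absolute metric values.
interface AbsoluteTier { lower: number; upper: number; adjustment: number; }

// Hypothetical output: a step expressed as offsets from an alarm threshold.
interface RelativeStep { lower?: number; upper?: number; adjustment: number; }

interface AlarmAndPolicy {
  threshold: number;
  comparison: '>' | '<';
  steps: RelativeStep[];
}

function splitTiers(tiers: AbsoluteTier[]): { scaleOut: AlarmAndPolicy; scaleIn: AlarmAndPolicy } {
  const byLower = (a: AbsoluteTier, b: AbsoluteTier) => a.lower - b.lower;
  const out = tiers.filter(t => t.adjustment > 0).sort(byLower);
  const inn = tiers.filter(t => t.adjustment < 0).sort(byLower);
  if (out.length === 0 || inn.length === 0) {
    throw new Error('need at least one scale-out and one scale-in tier');
  }
  const outThreshold = out[0].lower;             // lowest edge of the scale-out tiers
  const inThreshold = inn[inn.length - 1].upper; // highest edge of the scale-in tiers
  return {
    scaleOut: {
      threshold: outThreshold,
      comparison: '>',
      // The outermost tier is left unbounded so the policy covers all higher values.
      steps: out.map((t, i) => ({
        lower: t.lower - outThreshold,
        upper: i === out.length - 1 ? undefined : t.upper - outThreshold,
        adjustment: t.adjustment,
      })),
    },
    scaleIn: {
      threshold: inThreshold,
      comparison: '<',
      steps: inn.map((t, i) => ({
        lower: i === 0 ? undefined : t.lower - inThreshold,
        upper: t.upper - inThreshold,
        adjustment: t.adjustment,
      })),
    },
  };
}

// The CPU tiers from the first diagram in this issue.
const cpuTiers: AbsoluteTier[] = [
  { lower: 0, upper: 10, adjustment: -3 },
  { lower: 10, upper: 30, adjustment: -1 },
  { lower: 70, upper: 90, adjustment: +1 },
  { lower: 90, upper: 100, adjustment: +3 },
];

const split = splitTiers(cpuTiers);
console.log(JSON.stringify(split, null, 2));
```

The user only supplies the absolute tiers; the thresholds and offsets fall out of the calculation, which is the kind of detail the API could hide.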

Solution 2: One alarm, one policy

Ideally, it seems like I would want just a single alarm/metric/scaling policy configuration. But I don't know if this would even work: it would require that the scaling policy is activated on both sides of the CloudWatch alarm threshold, and it might not do that.

               -3       -1                                     +1       +3        
           │        │        │                            │         │         │   
  Metric   ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
           │        │        │                            │         │         │   
                                                                                  
           0%      10%      30%                          70%        90%      100% 
                                                                                  
                                                          ║                       
   Alarm                                  Alarm at 70%    ║                       
                                                          ║                       
                                                                                  
               -3       -1                0                    +1        +3       
                    │        │                            │         │             
  Policy  ◀─────────┼────────┼────────────────────────────┼─────────┼─────────▶   
                    │        │                            │         │             
                                                                                  
                  -60       -40                           0        +20

Solution 3: Two alarms, one policy

Another potential way to go would be to have two alarms trigger the same scaling policy (one lower-than-threshold and one greater-than-threshold) and just describe the scaling behavior with respect to the alarm thresholds on either side.

We can save on a ScalingPolicy in this way (with respect to solution 1). It's not as obvious what's going on though, and would rely on the fact that 2 different alarms give the deltas to 2 different thresholds to the same scaling policy ONLY when they're in alarm.

               -3       -1                                     +1       +3        
           │        │        │                            │         │         │   
  Metric   ├────────┼────────┼────────────────────────────┼─────────┼─────────┤   
           │        │        │                            │         │         │   
                                                                                  
           0%      10%      30%                          70%        90%      100% 
                                                                                  
                                                          ║                       
   Alarm1                                     > 70%       ║                       
                                                          ║                       
                             ║                                                    
   Alarm2           < 30%    ║                            ─                       
                             ║                         ┌ ┘                        
                             ─                        ─                           
                              └ ┐                  ┌ ┘                            
                                 ─                ─                               
                                  └ ┐          ┌ ┘                                
                                     ─        ─                                   
                                      └ ┐  ┌ ┘                                    
                                         ─┌                                       
                         -3        -1     ▼     +1           +3                   
                              │           │           │                           
  Policy  ◀───────────────────┼───────────┼───────────┼───────────────────────▶   
                              │           │           │                           
                                                                                  
                            -20           0          +20

Questions

~~Can an AutoScaling Policy be in the OKActions of a CloudWatch Alarm? Does that work?~~ => yes
~~If it does work, will it also continue evaluating and triggering the Scaling Policy periodically if it's on the OKActions, just like it does in AlarmActions?~~ => yes
Can a single AutoScaling policy even be the target of two CloudWatch alarms? If so, will it respect both threshold deltas as I'm expecting it to in scenario 3?
Is there a difference between Application AutoScaling and EC2 Instance AutoScaling?
Can I have multiple StepScalingPolicies on the same target that both scale the target at the same time? Are they going to fight? What if one has a ChangeInCapacity = 0 ?


rix0rrr · 2018-10-05T15:34:24Z

By the way, this is all predicated on the assumption that users would rather think in terms of the very first thing I showed: absolute metric values and the scaling behavior based on them. All the drilling down into scaling policies feels to me like implementation details and calculations that can and should be hidden.

I don't have data for that, it's just a gut feeling. Am I sorely wrong on that?

allisaurus · 2018-10-05T16:01:25Z

Can I ask why you're not considering TargetTracking here? From my (admittedly limited) understanding, this would be a way to scale in and out based on one policy/metric. Are there specifics re: how such a setup would behave that you're concerned about?

rix0rrr · 2018-10-05T16:07:16Z

I love the idea of target tracking, and I will definitely implement that as well.

But do you think I should not implement step scaling at all? Just leave it out of the API and force all customers to use target tracking if they want to autoscale?

rix0rrr · 2018-10-05T16:10:03Z

@allisaurus, by the way, do custom metrics work for target tracking? Because the CloudFormation docs seem to imply they don't.

allisaurus · 2018-10-05T16:36:32Z

It looks like custom metrics related to EC2 instance utilization are permissible with target tracking. Looks like there may be a gap in what's currently supported in the service vs. via CFN.

To your second point: accommodating use cases that require significantly different scale in/out behavior, or those that work off metrics incompatible w/ target tracking, would be a good argument for implementing step scaling as well (though I'm ill equipped to speak to how prevalent they are).

rix0rrr · 2018-10-06T11:41:06Z

I don't think we can afford to implement only part of the feature set. So that means we have to build an API for step scaling anyway, and I'd like it to be as good as possible.

jungseoklee · 2018-10-07T23:11:18Z

> @allisaurus, by the way, do custom metrics work for target tracking? Because the CloudFormation docs seem to imply they don't.

The link is about Application Auto Scaling, not EC2 Auto Scaling.

rix0rrr · 2018-10-08T08:31:43Z

@jungseoklee where did you get the impression this topic was about EC2 autoscaling? I don't think I've mentioned it anywhere, and in fact I did come at this by just looking at App AutoScaling so far (although I do believe the instance autoscaling API is very similar).

rix0rrr · 2018-10-08T12:23:18Z

Here's another question, feel free to provide input:

How are we going to model the API to represent thresholds and scaling actions?

Let's say for a fictitious CPU usage/scaling example:

Option 1: fluent API

scaling
    .at(0).scale(-2)
    .at(10).scale(-1)
    .at(20).scale(0)
    .at(80).scale(+1)
    .at(90).scale(+2)
    .end();

Option 2: allow omitting bounds of bordering intervals

scaling.addTier({ upperBound: 10, adjustment: -2 });
scaling.addTier({ upperBound: 20, adjustment: -1 });
scaling.addTier({ lowerBound: 80, upperBound: 90, adjustment: +1 });
scaling.addTier({ adjustment: +2 });

or:

scaling.addTier({ upperBound: 10, adjustment: -2 });
scaling.addTier({ upperBound: 20, adjustment: -1 });
scaling.addTier({ lowerBound: 80, adjustment: +1 });
scaling.addTier({ lowerBound: 90, adjustment: +2 });

Option 3: mixing thresholds and scales in a single array:

scale([ 0, -2, 10, -1, 20, 0, 80, +1, 90, +2, 100 ])

Option 4: separate thresholds and scales

.thresholds([0, 10, 20, 80, 90, 100])
.scales([-2, -1, 0, 1, 2])
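As a rough sketch of what could sit behind Option 1, a small builder can record `at(...)`/`scale(...)` pairs and turn them into intervals when `end()` is called. The `StepBuilder` class and the `Interval` type are hypothetical and only illustrate the shape of the fluent API; they are not the API that was eventually implemented.

```typescript
// Hypothetical result type: one interval of absolute metric values.
interface Interval { lower: number; upper?: number; change: number; }

class StepBuilder {
  private readonly points: { threshold: number; change: number }[] = [];
  private pending: number | undefined = undefined;

  // Start a new tier at the given absolute metric value.
  at(threshold: number): this {
    if (this.pending !== undefined) {
      throw new Error('at() must be followed by scale()');
    }
    const last = this.points[this.points.length - 1];
    if (last !== undefined && threshold <= last.threshold) {
      throw new Error('thresholds must be strictly increasing');
    }
    this.pending = threshold;
    return this;
  }

  // Set the capacity change for the tier opened by the preceding at().
  scale(change: number): this {
    if (this.pending === undefined) {
      throw new Error('scale() must follow at()');
    }
    this.points.push({ threshold: this.pending, change });
    this.pending = undefined;
    return this;
  }

  // Close the chain; each tier runs up to the next tier's threshold,
  // and the last tier is left unbounded above.
  end(): Interval[] {
    if (this.pending !== undefined) {
      throw new Error('last at() has no scale()');
    }
    return this.points.map((p, i) => ({
      lower: p.threshold,
      upper: this.points[i + 1]?.threshold,
      change: p.change,
    }));
  }
}

const intervals = new StepBuilder()
  .at(0).scale(-2)
  .at(10).scale(-1)
  .at(20).scale(0)
  .at(80).scale(+1)
  .at(90).scale(+2)
  .end();
console.log(intervals);
```

One design consequence of this shape is that ordering mistakes can be rejected as soon as the offending `at()` call is made, rather than at deployment time.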

jungseoklee · 2018-10-08T16:06:25Z

> @jungseoklee where did you get the impression this topic was about EC2 autoscaling? I don't think I've mentioned it anywhere, and in fact I did come at this by just looking at App AutoScaling so far (although I do believe the instance autoscaling API is very similar).

True. You only mentioned two related terms, "CPU usage" and "instance is added". Nevertheless, I unconsciously combined those terms with 1) the link in the comment stream about custom metrics related to EC2 instance utilization, and 2) my understanding that a step scaling policy is not applicable to DynamoDB [1], which is covered by Application Auto Scaling. That is how I got the impression.

I would like to understand this topic correctly. There are no other intentions.

[1] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-applicationautoscaling-scalingpolicy.html#cfn-applicationautoscaling-scalingpolicy-policytype

jungseoklee · 2018-10-08T16:19:24Z

> Here's another question, feel free to provide input:
>
> How are we going to model the API to represent thresholds and scaling actions?

I would vote for option 1 which is a great abstraction and modeling to me.

Regarding option 2, users need to understand what upperBound, lowerBound, and adjustment are, and upperBound and lowerBound seem to be optional members of the property object passed to the addTier method.

Regarding option 3, it seems error-prone compared to other options. For example, what if I switch 0 scale with 20 threshold like scale([ 0, -2, 10, -1, 0, 20, 80, +1, 90, +2, 100 ])?

In the case of option 4, it looks better than options 2 and 3, but we probably need to check that the size of the scales array equals the size of the thresholds array minus 1.
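The length check mentioned for option 4 could look roughly like the sketch below. `zipTiers` and `Tier` are hypothetical names used only for illustration; the function also rejects thresholds that are not strictly increasing, which is an added assumption about what a valid input should be.

```typescript
// Hypothetical result type: one tier between two adjacent thresholds.
interface Tier { lower: number; upper: number; change: number; }

// Combine N+1 thresholds with N scales into N tiers, validating both arrays.
function zipTiers(thresholds: number[], scales: number[]): Tier[] {
  if (scales.length !== thresholds.length - 1) {
    throw new Error(
      `expected ${thresholds.length - 1} scales for ${thresholds.length} thresholds, got ${scales.length}`);
  }
  for (let i = 1; i < thresholds.length; i++) {
    if (thresholds[i] <= thresholds[i - 1]) {
      throw new Error(`thresholds must be strictly increasing (index ${i})`);
    }
  }
  return scales.map((change, i) => ({
    lower: thresholds[i],
    upper: thresholds[i + 1],
    change,
  }));
}

// The arrays from option 4: six thresholds delimit five tiers.
const zipped = zipTiers([0, 10, 20, 80, 90, 100], [-2, -1, 0, 1, 2]);
console.log(zipped);
```

The same ordering check, applied to the even positions of the mixed array in option 3, would also flag the swapped example above, since its thresholds would read 0, 10, 0, 80, 90, 100.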

allisaurus · 2018-10-08T16:21:56Z

@rix0rrr 👍 agreed re: not implementing just part of the feature set (and wanting to make step scaling the best it can be). I only meant that since use cases exist which won't be well accommodated by target tracking, I think we should implement step scaling too vs. just target tracking. Thanks, btw, for creating this issue for discussion!

Fixes #856, #861, #640, #644.

Adds a construct library for Application AutoScaling. The DynamoDB construct library has been updated to use the new AutoScaling mechanism, which allows more configuration and uses a Service Linked Role instead of a role per table.

BREAKING CHANGE: instead of `addReadAutoScaling()`, call `autoScaleReadCapacity()`, and similar for write scaling.

__IMPORTANT NOTE__: when upgrading to this version of the CDK framework, you must also upgrade your installation of the CDK Toolkit to the matching version:

```shell
$ npm i -g aws-cdk
$ cdk --version
0.14.0 (build ...)
```

Bug Fixes
=========

* remove CloudFormation property renames ([#973](#973)) ([3f86603](3f86603)), closes [#852](#852)
* **aws-ec2:** fix retention of all egress traffic rule ([#998](#998)) ([b9d5b43](b9d5b43)), closes [#987](#987)
* **aws-s3-deployment:** avoid deletion during update using physical ids ([#1006](#1006)) ([bca99c6](bca99c6)), closes [#981](#981) [#981](#981)
* **cloudformation-diff:** ignore changes to DependsOn ([#1005](#1005)) ([3605f9c](3605f9c)), closes [#274](#274)
* **cloudformation-diff:** track replacements ([#1003](#1003)) ([a83ac5f](a83ac5f)), closes [#1001](#1001)
* **docs:** fix EC2 readme for "natgatway" configuration ([#994](#994)) ([0b1e7cc](0b1e7cc))
* **docs:** updates to contribution guide ([#997](#997)) ([b42e742](b42e742))
* **iam:** Merge multiple principals correctly ([#983](#983)) ([3fc5c8c](3fc5c8c)), closes [#924](#924) [#916](#916) [#958](#958)

Features
=========

* add construct library for Application AutoScaling ([#933](#933)) ([7861c6f](7861c6f)), closes [#856](#856) [#861](#861) [#640](#640) [#644](#644)
* add HostedZone context provider ([#823](#823)) ([1626c37](1626c37))
* **assert:** haveResource lists failing properties ([#1016](#1016)) ([7f6f3fd](7f6f3fd))
* **aws-cdk:** add CDK app version negotiation ([#988](#988)) ([db4e718](db4e718)), closes [#891](#891)
* **aws-codebuild:** Introduce a CodePipeline test Action. ([#873](#873)) ([770f9aa](770f9aa))
* **aws-sqs:** Add grantXxx() methods ([#1004](#1004)) ([8c90350](8c90350))
* **core:** Pre-concatenate Fn::Join ([#967](#967)) ([33c32a8](33c32a8)), closes [#916](#916) [#958](#958)

BREAKING CHANGES
=========

* DynamoDB AutoScaling: Instead of `addReadAutoScaling()`, call `autoScaleReadCapacity()`, and similar for write scaling.
* CloudFormation resource usage: If you use L1s, you may need to change some `XxxName` properties back into `Name`. These will match the CloudFormation property names.
* You must use the matching `aws-cdk` toolkit when upgrading to this version, or context providers will cease to work. All existing cached context values in `cdk.json` will be invalidated and refreshed.


jungseoklee mentioned this issue Oct 7, 2018

aws-dynamodb: refactor to use Application AutoScaling library #861

Closed

rix0rrr pushed a commit that referenced this issue Oct 15, 2018

feat: add construct library for Application AutoScaling

08b8a1c

Fixes #856, #861, #640, #644.

rix0rrr mentioned this issue Oct 15, 2018

feat: add construct library for Application AutoScaling #933

Merged

rix0rrr closed this as completed in #933 Oct 25, 2018

rix0rrr mentioned this issue Oct 26, 2018

v0.14.0 #1021

Merged
