
feat: support dynamic scaling of stable ReplicaSet as inverse of canary weight #1430

Merged
merged 4 commits into argoproj:master from feat-dynamicStableScale on Sep 21, 2021

Conversation

jessesuen
Member

@jessesuen jessesuen commented Aug 19, 2021

Fixes #1029

  1. Introduces a canary.dynamicStableScale flag which allows the stable ReplicaSet to dynamically scale down/up depending on the canary traffic weight. This feature is needed to run fewer pods during an update, but at the cost of slower aborts (since the stable will need to scale back up). A spec sketch appears after this list.

  2. The feature introduces a new weights field into the rollout status (status.canary.weights), e.g.:

status:
  canary:
    weights:
      canary:
        podTemplateHash: 68978bfc96
        serviceName: dynamic-stable-scale-canary
        weight: 0
      stable:
        podTemplateHash: 5cc54cc5f4
        serviceName: dynamic-stable-scale-stable
        weight: 100

Tracking canary weights in status is needed for us to understand the weight that was set in the traffic router. This allows us to calculate the number of replicas by which we can safely reduce the scale of the stable (or of the canary during an abort). For example, it is not safe to scale down the stable until we know traffic has shifted away from it. It also avoids requiring all traffic routers to implement a GetCurrentWeight() function, because we instead remember what we set last.

We could also use this information in the future to display in the Rollout dashboard or the CLI. Currently, the actual weight field in the CLI is not very accurate.

  3. During an abort, we scale the stable to 100% immediately. As the stable pods become available, we start shifting traffic to the stable ReplicaSet (whatever is available) and start scaling down the canary. Users may opt out of scaling down the canary upon abort by explicitly setting a value for canary.abortScaleDownDelaySeconds.
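
For illustration, here is a minimal Rollout spec sketch with the new flag enabled. This is a sketch only: the metadata name, ingress name, replica count, abortScaleDownDelaySeconds value, and steps are hypothetical; the service names are taken from the status example above.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: dynamic-stable-scale-demo         # hypothetical name
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: dynamic-stable-scale-canary
      stableService: dynamic-stable-scale-stable
      trafficRouting:
        nginx:
          stableIngress: dynamic-stable-scale-ingress   # hypothetical ingress
      dynamicStableScale: true          # new flag: stable scales inversely to the canary weight
      abortScaleDownDelaySeconds: 600   # optional: explicitly set to opt out of scaling down the canary on abort (item 3)
      steps:
        - setWeight: 20
        - pause: {}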

Signed-off-by: Jesse Suen jessesuen@gmail.com

@jessesuen jessesuen force-pushed the feat-dynamicStableScale branch 2 times, most recently from c522218 to b2beef5 on August 28, 2021 09:00
@jessesuen jessesuen marked this pull request as ready for review September 3, 2021 06:06
@harikrongali harikrongali added this to the v1.1 milestone Sep 13, 2021
@codecov

codecov bot commented Sep 16, 2021

Codecov Report

Merging #1430 (0e01d6b) into master (d9ba36a) will increase coverage by 0.06%.
The diff coverage is 76.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1430      +/-   ##
==========================================
+ Coverage   81.67%   81.73%   +0.06%     
==========================================
  Files         110      112       +2     
  Lines       14798    15070     +272     
==========================================
+ Hits        12086    12318     +232     
- Misses       2078     2107      +29     
- Partials      634      645      +11     
Impacted Files                           Coverage Δ
utils/conditions/conditions.go           80.76% <ø> (ø)
utils/replicaset/replicaset.go           90.04% <12.50%> (-0.83%) ⬇️
rollout/trafficrouting/istio/istio.go    80.96% <33.33%> (ø)
rollout/trafficrouting/nginx/nginx.go    85.81% <33.33%> (ø)
rollout/replicaset.go                    67.59% <59.37%> (-3.09%) ⬇️
utils/defaults/defaults.go               88.02% <62.50%> (-3.64%) ⬇️
rollout/trafficrouting/alb/alb.go        84.86% <63.63%> (ø)
utils/replicaset/canary.go               89.42% <69.56%> (+4.23%) ⬆️
rollout/trafficrouting/smi/smi.go        94.90% <71.42%> (ø)
rollout/canary.go                        76.52% <73.33%> (+0.06%) ⬆️
... and 12 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7c54090...0e01d6b.

@harikrongali
Contributor

@perenesenko, can you please review and validate?

Member

@perenesenko perenesenko left a comment


I tested different promotion and abort scenarios with Istio and Nginx. Everything works fine.
I just have two questions...

@perenesenko
Member

Q1: What about using setCanaryScale? As previously discussed with @harikrongali, we should probably prevent using dynamicStableScale and setCanaryScale together.

For example, with these steps:

      steps:
        - setCanaryScale:
            weight: 20
        - setWeight: 80
        - pause: {}

After setCanaryScale, the canary is no longer autoscaled by the dynamicStableScale flag.
Also, we will have a situation where 20% of canary pods serve 80% of traffic.

@perenesenko
Member

Q2: scaleDownDelaySeconds does not work with dynamicStableScale, since once we scale the canary to 100 the stable will have 0 pods, so scaleDownDelaySeconds will be ignored. Should we mention this somewhere?

@harikrongali
Contributor

For Q1, it is confirmed in #1382 (comment) that setCanaryScale and dynamicStableScale will work together. I would like to see the expected behavior when both values are set. @jessesuen, can you please describe the behavior when both are configured? @perenesenko, when you validated, were 20% of pods taking 80% of traffic? And can you provide the stable ReplicaSet count at that time?

@perenesenko
Member

@harikrongali
Yes, my configuration has 5 pods. During the update the canary was set to 20%, which is 1 pod, and after sending 80% of traffic to the canary it still had 1 pod.

@jessesuen
Member Author

jessesuen commented Sep 18, 2021

> Q1: What about using setCanaryScale? As previously discussed with @harikrongali, we should probably prevent using dynamicStableScale and setCanaryScale together.
>
> For example, with these steps:
>
>       steps:
>         - setCanaryScale:
>             weight: 20
>         - setWeight: 80
>         - pause: {}
>
> After setCanaryScale, the canary is no longer autoscaled by the dynamicStableScale flag.
> Also, we will have a situation where 20% of canary pods serve 80% of traffic.

Yes, this is a risk with the setCanaryScale feature. But that risk is unrelated to dynamicStableScale and existed before this feature. I'm adding some documentation to highlight the concern.

> I would like to see the expected behavior when both values are set. @jessesuen, can you please describe the behavior when both are configured?

They are independent choices. Both can be configured at the same time. dynamicStableScale can be simply understood as: whatever the current canary traffic weight is, the stable will be scaled to:

    (inverse of canary weight) * rollout.spec.replicas

So if the canary weight is set to 80 and there are 10 spec.replicas, then the stable scale will be 2, irrespective of how the canary was scaled.
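
As a rough sketch of that arithmetic in Go (the helper name and the round-up behavior are assumptions for illustration, not the actual controller code):

    package main

    import "fmt"

    // stableReplicasFor is a hypothetical helper illustrating the rule above:
    // stable scale = (inverse of canary weight) * spec.replicas.
    // Rounding up is an assumption here; the controller may round differently.
    func stableReplicasFor(specReplicas, canaryWeight int32) int32 {
        inverse := 100 - canaryWeight            // inverse canary weight, in percent
        return (inverse*specReplicas + 99) / 100 // ceil(inverse% * specReplicas)
    }

    func main() {
        // canary weight 80 with 10 spec.replicas => stable scaled to 2
        fmt.Println(stableReplicasFor(10, 80))
    }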

> Q2: scaleDownDelaySeconds does not work with dynamicStableScale, since once we scale the canary to 100 the stable will have 0 pods, so scaleDownDelaySeconds will be ignored. Should we mention this somewhere?

Correct. scaleDownDelaySeconds is not expected to work with dynamicStableScale and is ignored.

@jessesuen
Member Author

jessesuen commented Sep 20, 2021

> As previously discussed with @harikrongali, we should probably prevent using dynamicStableScale and setCanaryScale together.

Just to clarify: there's no harm in using setCanaryScale and dynamicStableScale together. Setting the canary scale does not affect traffic weight; that is what setWeight is for. dynamicStableScale always follows the change in canary weight, and so is only affected by setWeight.

> Correct. scaleDownDelaySeconds is not expected to work with dynamicStableScale and is ignored.

I added validation to make sure these two options are not used together.
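
For illustration, here is a minimal sketch of what such a mutual-exclusion check could look like (the struct and function names are stand-ins, not the actual validation code added in this PR):

    package main

    import (
        "errors"
        "fmt"
    )

    // canaryOptions is a stand-in for the Rollout's canary strategy fields
    // relevant to this check.
    type canaryOptions struct {
        DynamicStableScale    bool
        ScaleDownDelaySeconds *int32
    }

    // validateDynamicStableScale rejects combining dynamicStableScale with
    // scaleDownDelaySeconds, since the latter is ignored when the former is set.
    func validateDynamicStableScale(c canaryOptions) error {
        if c.DynamicStableScale && c.ScaleDownDelaySeconds != nil {
            return errors.New("scaleDownDelaySeconds cannot be used with dynamicStableScale")
        }
        return nil
    }

    func main() {
        delay := int32(30)
        err := validateDynamicStableScale(canaryOptions{DynamicStableScale: true, ScaleDownDelaySeconds: &delay})
        fmt.Println(err) // scaleDownDelaySeconds cannot be used with dynamicStableScale
    }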

@sonarcloud

sonarcloud bot commented Sep 20, 2021

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
8 Code Smells (rating A)

No coverage information
5.7% duplication

Contributor

@alexmt alexmt left a comment


LGTM

@jessesuen jessesuen merged commit c1353e4 into argoproj:master Sep 21, 2021
@jessesuen jessesuen deleted the feat-dynamicStableScale branch September 21, 2021 08:48
Development

Successfully merging this pull request may close these issues.

Ability to control stable scale when using traffic routing