-
Notifications
You must be signed in to change notification settings - Fork 733
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add finalizing webhook #1194
base: main
Are you sure you want to change the base?
add finalizing webhook #1194
Conversation
Signed-off-by: Ying Liu <ying.liu.lying@gmail.com>
pkg/controller/scheduler.go
Outdated
if cd.SkipAnalysis() { | ||
c.recordEventInfof(cd, "Promotion completed! Canary analysis was skipped for %s.%s", cd.Spec.TargetRef.Name, cd.Namespace) | ||
c.alert(cd, "Canary analysis was skipped, promotion finished.", | ||
false, flaggerv1.SeverityInfo) | ||
} else { | ||
c.recordEventInfof(cd, "Promotion completed! Scaling down %s.%s", cd.Spec.TargetRef.Name, cd.Namespace) | ||
c.alert(cd, "Canary analysis completed successfully, promotion finished.", | ||
false, flaggerv1.SeverityInfo) | ||
} | ||
return | ||
} | ||
|
||
// check canary status | ||
var retriable = true | ||
retriable, err = canaryController.IsCanaryReady(cd) | ||
if err != nil && retriable { | ||
c.recordEventWarningf(cd, "%v", err) | ||
return | ||
} | ||
|
||
// check if analysis should be skipped | ||
if skip := c.shouldSkipAnalysis(cd, canaryController, meshRouter, err, retriable); skip { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason behind this? IMO, we should check and if required skip analysis, before promoting and/or finalizing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted this change
pkg/controller/scheduler.go
Outdated
@@ -868,6 +868,10 @@ func (c *Controller) rollback(canary *flaggerv1.Canary, canaryController canary. | |||
|
|||
c.recorder.SetWeight(canary, primaryWeight, canaryWeight) | |||
|
|||
if ok := c.runConfirmFinalizingHook(canary, flaggerv1.CanaryPhaseFailed); !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rollback()
is meant to roll a Canary analysis back, why would it call the the webhooks meant to confirm canary finalization, which is supposed to indicate that the canary has been promoted and thus being finalized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the reason in the description.
Signed-off-by: Ying Liu <ying.liu.lying@gmail.com>
@aryan9600 @stefanprodan can you help to review again and provide your feedback to see whether this idea to add a finalizing webhook is acceptable? Thanks! |
Signed-off-by: Ying Liu <ying.liu.lying@gmail.com>
return | ||
} | ||
if (cd.Status.Phase == flaggerv1.CanaryPhaseFinalising || | ||
cd.Status.Phase == flaggerv1.CanaryPhaseWaitingFinalising) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we checking for CanaryPhaseFinalising
? From what I understand runFinalizing
, runs the finalizing webhooks, sets the status to Finalizing
, scales the canary deployment down to zero and then sets the status to Succeeded
. If runFinalizing
fails to scale down the canary deployment, then the canary would be in Finalizing
status, and runFinalizing()
would be called on the next run, which would subsequently call the finalizing webhooks as well.
// check if the number of failed checks reached the threshold for rollback | ||
if (cd.Status.Phase == flaggerv1.CanaryPhaseProgressing || | ||
cd.Status.Phase == flaggerv1.CanaryPhaseWaitingPromotion || | ||
cd.Status.Phase == flaggerv1.CanaryPhaseWaitingFinalising) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WaitingFinalizing
comes between Promoting
and Finalizing
, we would not want to rollback after the canary has been promoted.
if err := canaryController.IsPrimaryReady(cd); err != nil { | ||
c.recordEventWarningf(cd, "%v", err) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already check for that in the beginning:
flagger/pkg/controller/scheduler.go
Lines 250 to 256 in 560f884
// check primary status | |
if !cd.SkipAnalysis() { | |
if err := canaryController.IsPrimaryReady(cd); err != nil { | |
c.recordEventWarningf(cd, "%v", err) | |
return | |
} | |
} |
if err := canaryController.Promote(canary); err != nil { | ||
c.recordEventWarningf(canary, "%v", err) | ||
return true | ||
if canary.Status.Phase != flaggerv1.CanaryPhasePromoting { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When skipAnalysis
is enabled, it means that we halt the entire state machine, promote the canary, scale down the canary deployment and mark it as Succeeded
, this changes that and relies on the state machine to continue and mark it as succeeded, which is not we want to do.
If the intention is to have a webhook that runs before a canary is finalized/succeeds OR a canary is rolled back, then |
Fix of issue #1193
With the changes, we can see below "canary status"
The reason to have a 'WaitingFinalising" is to have a time to handle something left-over outside k8s.
i.e. we are using an external traffic management providers on azure (azure ATM). When canary starts to process, we changed the ATM policy to route some traffic to canary in waiting stage. Then before the canary scale down to zero in "Finalizing' stage, we need a time to route the traffic back to primary, so this need to apply to both succeed and failed cases.
If the idea is acceptable, I will continue to fix the unit test failures (mostly about the expected status changes)