
Skip Rollback #1186

Open

paichinger opened this issue May 5, 2022 · 9 comments

@paichinger commented May 5, 2022

Hey!
We're currently looking into Flagger to drive our staged rollouts, and the following question popped up:
Can Flagger be configured to not do a rollback even when the analysis failed?
Our use case:
We have applications which might perform non-backwards-compatible database migrations when they're updated. Then it might be more desirable to leave an app in a failed state instead of rolling it back. The consequences of a rollback might be even worse.
So we were wondering if Flagger could be configured in a way so that it would just kind of "freeze" as soon as the analysis fails. Of course it should not promote further either. Alerts and notifications should be sent out so that the responsible team can look at the problem and then deploy a new version with a fix, which would replace the "frozen" canary, and the whole cycle starts again.
I hope that was clear.
Thanks in advance for any hints!

@aryan9600 (Member)

Flagger by default does not roll back the canary release; it just routes all traffic to the primary if an analysis fails. Ref: https://docs.flagger.app/tutorials/istio-progressive-delivery#automated-rollback
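For reference, a minimal Canary spec along the lines of that tutorial (values illustrative, not from this issue): once the number of failed checks reaches `threshold`, Flagger routes all traffic back to podinfo-primary and scales the podinfo deployment to zero.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m
    threshold: 5      # failed checks before all traffic goes back to primary
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```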

@stefanprodan (Member) commented May 6, 2022

> Then it might be more desirable to leave an app in a failed state instead of rolling it back.

For Flagger, a rollback means scaling the canary deployment to zero and routing all traffic back to the primary. Currently there is no way to tell Flagger to freeze on failures. I guess you would want the traffic to keep going to the canary even if it's in a failed state, which contradicts what Flagger was designed to do: route all traffic to the previous version as soon as errors are detected.

@paichinger (Author)

Thanks for your responses @aryan9600 and @stefanprodan!

I suspected that Flagger wouldn't support this use case, but it's good to have it confirmed.
However, what would you think about adding that feature to Flagger?
I'm envisioning the following flow:
Let's assume we're using the podinfo app and apply Flagger:

- Flagger creates the podinfo-primary deployment and routes all traffic to it
- on a change, it deploys the new version using the podinfo deployment and routes traffic according to the analysis configuration
- in case of a failed analysis, Flagger would create a podinfo-dirty deployment and copy the current podinfo deployment there, alongside its routes, weight, etc.
- a new change would be deployed using podinfo again. podinfo would get the exact same state as podinfo-dirty and all the traffic podinfo-dirty has at that moment. Then podinfo-dirty could be removed (or scaled to zero) and the analysis could start again.
- once the analysis is successful, podinfo would get promoted to podinfo-primary (as it already happens today)

That way it would be guaranteed that once a client hits a new version, it will never end up at an older version again. I think this behavior could be configured by a flag in the analysis (like freezeDirtyCanaries or similar); see the sketch below.
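For illustration, the flag could sit next to the existing analysis fields (freezeDirtyCanaries is just a placeholder name from this proposal, not an existing Flagger option):

```yaml
analysis:
  interval: 1m
  threshold: 5
  # placeholder from this proposal: on a failed analysis, keep the
  # failed canary serving its traffic (via podinfo-dirty) instead of
  # rolling it back
  freezeDirtyCanaries: true
```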

I had a peek at the code already (props btw, it's really clean and easy to dive into) and I think it should be doable. But of course I'd like to check this feature with the community first, before I work on a PR.

Any thoughts?

@stefanprodan (Member)

> That way it would be guaranteed that once a client hits a new version, it will never end up at an older version again.

As far as I know, sticky sessions do not work with weighted traffic split no matter which service mesh or ingress controller you're using. How are you doing this type of routing today without Flagger?

@paichinger (Author)

Sorry, I phrased it incorrectly. We're not using weighted traffic; we're using A/B testing with Istio, leveraging header-based routing. Our services and databases are multi-tenant, and every customer is identified by a header. We intend to use a group of "progressive" customers and expose them to the canaries. For the aforementioned reasons, we don't want them to fall back to previous versions in case an analysis fails.
IMHO this would be a common use case for many mature systems with multitenancy and databases involved.
Hope that clarifies it!
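For context, this is how Flagger's A/B-testing mode expresses that kind of routing; a minimal sketch, where the x-tenant-group header name and its value are illustrative stand-ins for our tenant header:

```yaml
analysis:
  interval: 1m
  threshold: 5
  iterations: 10
  # only requests from the "progressive" tenant group hit the canary;
  # header name and value are placeholders for illustration
  match:
    - headers:
        x-tenant-group:
          exact: "progressive"
```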

@stefanprodan (Member)

I see no need for a podinfo-dirty deployment clone. We could allow users to disable the rollback (scale to zero, route all traffic to primary) with a new optional field in the analysis spec. But this requires a lot of changes to Flagger’s state machine and status conditions.

@paichinger (Author) commented May 9, 2022

That was my first idea as well, @stefanprodan. But when I hacked that in yesterday, I noticed: when there is a frozen podinfo and a new version is deployed, all traffic would go to the primary for a short while until the new version is ready. That's why I was thinking of podinfo-dirty as a bridge for this gap. But if it's possible without it, even better!

@paichinger (Author)

I created a prototype for the feature; you can look at the draft PR here:
#1196
Basically, you just need to add the flag freezeOnFailure: true to an analysis config and it will not roll back (scale to zero) the canary after a failed analysis.
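A sketch of how that looks in a canary's analysis spec (freezeOnFailure is the flag proposed in the draft PR, not part of released Flagger):

```yaml
analysis:
  interval: 1m
  threshold: 5
  iterations: 10
  # proposed flag: on a failed analysis, leave the canary running
  # ("frozen") instead of scaling it to zero and rerouting traffic
  freezeOnFailure: true
```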
As a company we would be greatly interested in this feature, and with your approval we could dedicate a bunch of developers to it for a few days to get it finished and polished. Before that we probably need to discuss some details (e.g. the naming, whether it should be an option only for A/B testing, maybe hooks?).
Looking forward to your feedback!

@paichinger (Author)

Ah yeah, and you were right: a dirty version doesn't seem to be needed!
