
Skip Rollback #1186

Open

paichinger opened this issue May 5, 2022 · 9 comments

@paichinger commented May 5, 2022

Hey!
We're currently looking into Flagger to drive our staged rollouts, and the following question popped up:
Can Flagger be configured to not do a rollback even when the analysis failed?
Our use case:
We have applications which might perform non-backwards-compatible database migrations when they're updated. Then it might be more desirable to leave an app in a failed state instead of rolling it back. The consequences of a rollback might be even worse.
So we were wondering if Flagger could be configured in a way so that it would just kind of "freeze" as soon as the analysis fails. Of course it should not promote further either. Alerts and notifications should be sent out so that the responsible team can look at the problem and then deploy a new version with a fix, which would replace the "frozen" canary, and the whole cycle starts again.
I hope that was clear.
Thanks in advance for any hints!

@aryan9600 (Member)

Flagger by default does not roll back the canary release; it just routes all traffic to the primary if an analysis fails. Ref: https://docs.flagger.app/tutorials/istio-progressive-delivery#automated-rollback
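For reference, a minimal Canary spec along the lines of that tutorial (values illustrative, not from this issue): once the number of failed checks reaches `threshold`, Flagger routes all traffic back to podinfo-primary and scales the podinfo deployment to zero.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m
    threshold: 5      # failed checks before all traffic goes back to primary
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```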

@stefanprodan (Member) commented May 6, 2022

> Then it might be more desirable to leave an app in a failed state instead of rolling it back.

For Flagger, a rollback means scaling the canary deployment to zero and routing all traffic back to the primary. Currently there is no way to tell Flagger to freeze on failures. I guess you would want the traffic to keep going to the canary even if it's in a failed state, which contradicts what Flagger was designed to do: route all traffic to the previous version as soon as errors are detected.

@paichinger (Author)

Thanks for your responses @aryan9600 and @stefanprodan!

I suspected that Flagger wouldn't support this use case, but it's good to have it confirmed.
However, what would you think about adding that feature to Flagger?
I'm envisioning the following flow:
Let's assume we're using the podinfo app and apply Flagger:

- Flagger creates the podinfo-primary deployment and routes all traffic to it
- on a change, it deploys the new version using the podinfo deployment and routes traffic according to the analysis configuration
- in case of a failed analysis, Flagger would create a podinfo-dirty deployment and copy the current podinfo deployment there, alongside its routes, weight, etc.
- a new change would be deployed using podinfo again. podinfo would get the exact same state as podinfo-dirty and all the traffic podinfo-dirty has at that moment. Then podinfo-dirty could be removed (or scaled to zero) and the analysis could start again.
- once the analysis is successful, podinfo would get promoted to podinfo-primary (as it already happens today)

That way it would be guaranteed that once a client hits a new version, it will never end up at an older version again. I think this behavior could be configured by a flag in the analysis (like freezeDirtyCanaries or similar); see the sketch below.
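For illustration, the flag could sit next to the existing analysis fields (freezeDirtyCanaries is just a placeholder name from this proposal, not an existing Flagger option):

```yaml
analysis:
  interval: 1m
  threshold: 5
  # placeholder from this proposal: on a failed analysis, keep the
  # failed canary serving its traffic (via podinfo-dirty) instead of
  # rolling it back
  freezeDirtyCanaries: true
```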

I had a peek at the code already (props btw, it's really clean and easy to dive into) and I think it should be doable. But of course I'd like to check this feature with the community first, before I work on a PR.

Any thoughts?

@stefanprodan (Member)

> That way it would be guaranteed that once a client hits a new version, it will never end up at an older version again.

As far as I know, sticky sessions do not work with weighted traffic split no matter which service mesh or ingress controller you're using. How are you doing this type of routing today without Flagger?

@paichinger (Author)

Sorry, I phrased it incorrectly. We're not using weighted traffic; we're using A/B testing with Istio, leveraging header-based routing. Our services and databases are multi-tenant, and every customer is identified by a header. We intend to use a group of "progressive" customers and expose them to the canaries. For the aforementioned reasons, we don't want them to fall back to previous versions in case an analysis fails.
IMHO this would be a common use case for many mature systems with multitenancy and databases involved.
Hope that clarifies it!
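For context, this is how Flagger's A/B-testing mode expresses that kind of routing; a minimal sketch, where the x-tenant-group header name and its value are illustrative stand-ins for our tenant header:

```yaml
analysis:
  interval: 1m
  threshold: 5
  iterations: 10
  # only requests from the "progressive" tenant group hit the canary;
  # header name and value are placeholders for illustration
  match:
    - headers:
        x-tenant-group:
          exact: "progressive"
```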

@stefanprodan (Member)

I see no need for a podinfo-dirty deployment clone. We could allow users to disable the rollback (scale to zero, route all traffic to primary) with a new optional field in the analysis spec. But this requires a lot of changes to Flagger’s state machine and status conditions.

@paichinger (Author) commented May 9, 2022

That was my first idea as well, @stefanprodan. But when I hacked that in yesterday, I noticed: when there is a frozen podinfo and a new version is deployed, all traffic would go to the primary for a short while until the new version is ready. That's why I was thinking of podinfo-dirty as a bridge for this gap. But if it's possible without it, even better!

@paichinger (Author)

I created a prototype for the feature; you can look at the draft PR here:
#1196
Basically, you just need to add the flag freezeOnFailure: true to an analysis config and it will not roll back (scale to zero) the canary after a failed analysis.
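A sketch of how that looks in a canary's analysis spec (freezeOnFailure is the flag proposed in the draft PR, not part of released Flagger):

```yaml
analysis:
  interval: 1m
  threshold: 5
  iterations: 10
  # proposed flag: on a failed analysis, leave the canary running
  # ("frozen") instead of scaling it to zero and rerouting traffic
  freezeOnFailure: true
```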
As a company we would be greatly interested in this feature, and with your approval we could dedicate a bunch of developers to it for a few days to get it finished and polished. Before that we probably need to discuss some details (e.g. the naming, whether it should be an option only for A/B testing, maybe hooks?).
Looking forward to your feedback!

@paichinger (Author)

Ah yeah, and you were right: a dirty version doesn't seem to be needed!
