This repository has been archived by the owner on May 3, 2022. It is now read-only.

Allow rollouts to continue even if the incumbent is unhealthy #99

Closed
parhamdoustdar opened this issue Jun 3, 2019 · 4 comments · Fixed by #212
Labels: enhancement, question
Milestone: release-0.7

Comments

@parhamdoustdar
Contributor

In cases where the pods of the incumbent release are crashing, our efforts to modify the number of available pods fail, because the pods never come up. This means that a new release is impossible to roll out, because the incumbent will be stuck waiting for capacity forever.
We should provide a way to ignore the incumbent, or maybe come up with a way of heuristically detecting this situation and continuing even though modifications to the incumbent would fail.

osdrv added the enhancement and question labels on Jul 15, 2019
@osdrv
Contributor

osdrv commented Jul 15, 2019

I wonder if a failing incumbent is something we might consider an emergency case, where we intentionally put a human in the loop so they can consciously "force-kick" the rollout and make it ignore the incumbent.

@parhamdoustdar
Contributor Author

Good point. If I understand you correctly, that could mean providing an annotation that a user would set to tell Shipper to continue on without checking the incumbent.
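
For illustration only, that opt-out might look something like the sketch below. The annotation key, and the idea that it would be set on the contender Release object, are hypothetical; nothing like this is implemented in Shipper yet. Only the kubectl annotate invocation itself is standard.

```
# Hypothetical sketch: this annotation key is not an implemented Shipper feature.
# Assuming the override would be set on the contender Release object:
kubectl annotate release <contender-release-name> \
  shipper.booking.com/ignore-incumbent="true"
```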

@osdrv
Contributor

osdrv commented Jul 29, 2019

@parhamdoustdar very much yes. The sentiment is to make sure that Shipper's contract of transitioning an application between stable states still holds, and that we're not moving a client into a "cross-my-fingers" state. We should only let a client take this action if they fully understand the potential consequences.

@osdrv osdrv added this to the release-0.7 milestone Aug 1, 2019
juliogreff added a commit that referenced this issue Aug 9, 2019
Our error messages are now (hopefully) a bit clearer during rollouts.
The bigger pain points this aims to solve:

- Users would see "clusters pending capacity adjustments", and have no
idea what to do next on their own. Now we gently reveal the existence of
InstallationTarget, CapacityTarget and TrafficTarget objects by showing
the exact kubectl command users can run to know more.

- Whenever a rollout would get stuck because the incumbent is unhealthy
(#99), it wasn't immediately obvious that that was the case, as one
would need to check the condition name (IncumbentAchievedCapacity,
IncumbentAchievedTraffic), and those all blend together, so users
would think that the problem was with the contender (it's the most
obvious impulse, after all, since we're rolling out a new thing). Now,
we specifically say that the incumbent is unhealthy.
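
As an illustration of the kind of command the new messages point users at (the exact resource names here are assumptions, based on Shipper naming the per-release target objects after the release):

```
# Illustrative only; assuming the target objects share the release's name:
kubectl describe installationtarget <release-name>
kubectl describe capacitytarget <release-name>
kubectl describe traffictarget <release-name>
```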
@juliogreff
Contributor

To my knowledge, we had two main issues that caused unhealthy incumbents. The most annoying one was missing deployments in the target clusters, and that couldn't be fixed without deleting past releases until we found one that would work. Once merged, #205 will take care of that.

The remaining issue is just that the app itself is having issues that cause pods to be unhealthy. In that case, there's not much of a solution besides moving forward and hoping that the next release will take care of it. Since Shipper doesn't check whether a previous step in the strategy was achieved before applying the current one, the way to "force" a rollout to continue with an unhealthy incumbent is simply to continue the rollout: keep progressing to the next step in the strategy until the incumbent is either healthy enough or, once you reach the last step, gone entirely. I opened #212 with some short documentation on this. The whole "Typical failure scenarios" section should probably be refreshed now that we know more about actual typical failure scenarios, but I'm hoping this goes in the right direction, at least.
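
For concreteness, "continuing the rollout" here just means bumping the step the contender Release is targeting. Assuming the strategy step is driven by the Release's spec.targetStep field, that is roughly:

```
# Sketch only: assumes spec.targetStep is what drives the strategy step,
# and that 2 happens to be the next step in this particular strategy.
kubectl patch release <contender-release-name> --type=merge \
  -p '{"spec": {"targetStep": 2}}'
```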
