Allow rollouts to continue even if the incumbent is unhealthy #99
Comments
I wonder if a failing incumbent is something we should treat as an emergency case, and intentionally put a human in the loop so they can consciously "force-kick" the rollout and make it ignore the incumbent.
Good point. If I understand you correctly, that could mean providing an annotation that a user would set to tell Shipper to continue on without checking the incumbent.
@parhamdoustdar very much yes. The sentiment is to make sure that Shipper's core contract (transitioning an application between stable states) still holds, and that we're not moving a client into a "cross-my-fingers" state. We should only let a client do this if they fully understand the potential consequences.
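For illustration, such an escape hatch could look something like the sketch below. Both the annotation key and the release name are hypothetical placeholders; this thread is proposing the idea, not describing a shipped feature:

```sh
# Hypothetical escape hatch: annotate the Release to tell Shipper to skip
# incumbent health checks. Neither this annotation key nor the behaviour
# exists in Shipper as of this thread; it is only the shape of the proposal.
kubectl annotate release my-app-4 \
    shipper.booking.com/force-ignore-incumbent=true
```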
Our error messages are now (hopefully) a bit clearer during rollouts. The bigger pain points this aims to solve:

- Users would see "clusters pending capacity adjustments" and have no idea what to do next on their own. Now we gently reveal the existence of the InstallationTarget, CapacityTarget and TrafficTarget objects by showing the exact kubectl command users can run to find out more (see the sketch after this comment).
- Whenever a rollout got stuck because the incumbent was unhealthy (#99), it wasn't immediately obvious that this was the case: one would need to check the condition name (IncumbentAchievedCapacity, IncumbentAchievedTraffic), and those all sort of blend together, so users would assume the problem was with the contender (the most obvious impulse, after all, since we're rolling out a new thing). Now we specifically say that the incumbent is unhealthy.
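As a rough sketch of the kind of command the new messages point users at (the object name is a placeholder; Shipper creates one of each target object per release):

```sh
# Inspect the per-release target objects to see which one is unhappy.
# "my-app-4" stands in for the actual release name from your cluster.
kubectl describe installationtarget my-app-4
kubectl describe capacitytarget my-app-4
kubectl describe traffictarget my-app-4
```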
To my knowledge, we had two main issues that caused unhealthy incumbents. The most annoying one was missing deployments in the target clusters, which couldn't be fixed without deleting past releases until we found one that worked. Once merged, #205 will take care of that. The remaining issue is that the app itself is having problems that cause pods to be sad. In that case, there's not much of a solution besides moving forward and hoping that the next release will take care of it. Since Shipper doesn't care whether a previous step in the strategy was achieved before trying to apply the current one, the way to "force" a rollout to continue with an unhealthy incumbent is just to... continue the rollout: keep progressing to the next step in the strategy until the incumbent is healthy enough, or simply gone once you reach the last step (see the sketch below). I opened #212 with some short documentation on it. The whole "Typical failure scenarios" section should probably be refreshed now that we know more about actual typical failure scenarios, but I'm hoping this goes in the right direction, at least.
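A minimal sketch of what "just continue the rollout" can look like, assuming the Release exposes the usual `spec.targetStep` field (the step index and release name here are placeholders):

```sh
# Move the rollout forward one strategy step. Shipper applies the new step's
# capacity/traffic targets without requiring the previous step (including the
# incumbent's side of it) to have been achieved first.
kubectl patch release my-app-4 \
    --type=merge -p '{"spec":{"targetStep":2}}'
```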
In cases where the pods of the incumbent release are crashing, our efforts to modify the number of available pods fail, because the pods never come up. This means that a new release is impossible to roll out, because the incumbent will be stuck at "waiting for capacity" forever. We should allow a way to ignore the incumbent, or maybe come up with a way of heuristically detecting this and continuing even though modifications to the incumbent would fail.
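A quick way to confirm this situation, sketched under the assumption that Shipper labels pods with the release they belong to (the `shipper-release` label key and the release name are assumptions to verify against your Shipper version):

```sh
# Check whether it is the incumbent's pods that are failing to come up.
kubectl get pods -l shipper-release=my-app-3
```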