This repository has been archived by the owner on May 3, 2022. It is now read-only.

Allow rollouts to continue even if the incumbent is unhealthy #99

Closed
parhamdoustdar opened this issue Jun 3, 2019 · 4 comments · Fixed by #212
Labels: enhancement, question
Milestone: release-0.7

Comments

@parhamdoustdar
Contributor

In cases where the pods of the incumbent release are crashing, our efforts to modify the number of available pods fail, because the pods never come up. This means that a new release is impossible to roll out, because the incumbent will be stuck waiting for capacity forever.
We should provide a way to ignore the incumbent, or maybe come up with a way of heuristically detecting this situation and continuing even though modifications to the incumbent would fail.

osdrv added the enhancement and question labels on Jul 15, 2019
@osdrv
Contributor

osdrv commented Jul 15, 2019

I wonder if a failing incumbent is something we might consider an emergency case, where we intentionally put a human in the loop so they can consciously "force-kick" the rollout and make it ignore the incumbent.

@parhamdoustdar
Contributor Author

Good point. If I understand you correctly, that could mean providing an annotation that a user would set to tell Shipper to continue on without checking the incumbent.
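
For illustration only, that opt-out might look something like the sketch below. The annotation key, and the idea that it would be set on the contender Release object, are hypothetical; nothing like this is implemented in Shipper yet. Only the kubectl annotate invocation itself is standard.

```
# Hypothetical sketch: this annotation key is not an implemented Shipper feature.
# Assuming the override would be set on the contender Release object:
kubectl annotate release <contender-release-name> \
  shipper.booking.com/ignore-incumbent="true"
```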

@osdrv
Contributor

osdrv commented Jul 29, 2019

@parhamdoustdar very much yes. The sentiment is to make sure that Shipper's contract of transitioning an application between stable states still holds, and that we're not moving a client into a "cross-my-fingers" state. We should only let a client take this action if they fully understand the potential consequences.

@osdrv osdrv added this to the release-0.7 milestone Aug 1, 2019
juliogreff added a commit that referenced this issue Aug 9, 2019
Our error messages are now (hopefully) a bit clearer during rollouts.
The bigger pain points this aims to solve:

- Users would see "clusters pending capacity adjustments", and have no
idea what to do next on their own. Now we gently reveal the existence of
InstallationTarget, CapacityTarget and TrafficTarget objects by showing
the exact kubectl command users can run to know more.

- Whenever a rollout would get stuck because the incumbent is unhealthy
(#99), it wasn't immediately obvious that that was the case, as one
would need to check the condition name (IncumbentAchievedCapacity,
IncumbentAchievedTraffic), and those all blend together, so users
would think that the problem was with the contender (it's the most
obvious impulse, after all, since we're rolling out a new thing). Now,
we specifically say that the incumbent is unhealthy.
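
As an illustration of the kind of command the new messages point users at (the exact resource names here are assumptions, based on Shipper naming the per-release target objects after the release):

```
# Illustrative only; assuming the target objects share the release's name:
kubectl describe installationtarget <release-name>
kubectl describe capacitytarget <release-name>
kubectl describe traffictarget <release-name>
```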
@juliogreff
Contributor

To my knowledge, we had two main issues that caused unhealthy incumbents. The most annoying one was missing deployments in the target clusters, and that couldn't be fixed without deleting past releases until we found one that would work. Once merged, #205 will take care of that.

The remaining issue is just that the app itself is having issues that cause pods to be unhealthy. In that case, there's not much of a solution besides moving forward and hoping that the next release will take care of it. Since Shipper doesn't check whether a previous step in the strategy was achieved before applying the current one, the way to "force" a rollout to continue with an unhealthy incumbent is simply to continue the rollout: keep progressing to the next step in the strategy until the incumbent is either healthy enough or, once you reach the last step, gone entirely. I opened #212 with some short documentation on this. The whole "Typical failure scenarios" section should probably be refreshed now that we know more about actual typical failure scenarios, but I'm hoping this goes in the right direction, at least.
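
For concreteness, "continuing the rollout" here just means bumping the step the contender Release is targeting. Assuming the strategy step is driven by the Release's spec.targetStep field, that is roughly:

```
# Sketch only: assumes spec.targetStep is what drives the strategy step,
# and that 2 happens to be the next step in this particular strategy.
kubectl patch release <contender-release-name> --type=merge \
  -p '{"spec": {"targetStep": 2}}'
```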
