Fixes some of the known issues with disabled resyncs (#88)
* clusterclientstore: take resync period as a parameter

We are currently investigating whether Shipper works correctly when resyncs are disabled, as part of #77. Before we can carry on with that investigation, though, we need to ensure that Shipper does in fact disable all resyncs. As far as the logs show, this is the only piece that currently doesn't honor `cmd/shipper -resync`.
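Below is a minimal sketch of the shape of this change, assuming client-go style shared informer factories; the `Store` type and the `NewStore` and `informerFactoryFor` names are illustrative, not Shipper's actual API. The point is only that the resync period is injected from `cmd/shipper -resync` instead of being hard-coded.

```go
// Illustrative sketch only; the real clusterclientstore API differs.
package clusterclientstore

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// Store caches clients and informers for target clusters.
type Store struct {
	resyncPeriod time.Duration // now injected by cmd/shipper
}

// NewStore takes the resync period as a parameter so that disabling
// resyncs (resyncPeriod == 0) is honored by per-cluster informers too.
func NewStore(resyncPeriod time.Duration) *Store {
	return &Store{resyncPeriod: resyncPeriod}
}

// informerFactoryFor builds a factory for one target cluster, passing
// the injected resync period through instead of a hard-coded value.
func (s *Store) informerFactoryFor(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactory(client, s.resyncPeriod)
}
```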
* e2e: improve log output for tests

While investigating some test breakage, I noticed that several functions in the e2e tests were either not outputting enough information to be useful, or flat out lying about what they were doing:

- waitForRelease, waitForComplete, and other related code did not say which release they were operating on, which made it difficult to relate the output to what is running on Kubernetes. The release name is now output as well.
- waitForReleaseStrategyState always claimed it was waiting for `command`, regardless of which `waitFor` argument was passed. It now correctly reports which state it waited for.

* release controller: listen to deletion events

When running the e2e tests with `cmd/shipper` started with `-resync=0`, they fail with the following error:

```
--- FAIL: TestRolloutAbort (73.78s)
e2e_test.go:751: waiting for release "my-test-app-77cf3473-0" took 1.597339649s
e2e_test.go:840: waiting for completion of "my-test-app-77cf3473-0" took 18.399953573s
e2e_test.go:617: waiting for contender release to appear after editing app "my-test-app"
e2e_test.go:751: waiting for release "my-test-app-3037a9fd-0" took 1.598389769s
e2e_test.go:624: setting contender release "my-test-app-3037a9fd-0" targetStep to 0
e2e_test.go:627: waiting for contender release "my-test-app-3037a9fd-0" to achieve waitingForCommand for targetStep 0
e2e_test.go:761: release strategy state transition: "" -> "release \"my-test-app-3037a9fd-0\" has no achievedStep reported yet"
e2e_test.go:761: release strategy state transition: "release \"my-test-app-3037a9fd-0\" has no achievedStep reported yet" -> "{installation: False, capacity: False, traffic: False, command: True}"
e2e_test.go:819: waiting for command took 10.398270118s
e2e_test.go:633: checking that incumbent "my-test-app-77cf3473-0" has 4 pods and contender "my-test-app-3037a9fd-0" has 1 pods (strategy step 0 -- 100/1)
e2e_test.go:624: setting contender release "my-test-app-3037a9fd-0" targetStep to 1
e2e_test.go:627: waiting for contender release "my-test-app-3037a9fd-0" to achieve waitingForCommand for targetStep 1
e2e_test.go:761: release strategy state transition: "" -> "{installation: False, capacity: True, traffic: False, command: False}"
e2e_test.go:761: release strategy state transition: "{installation: False, capacity: True, traffic: False, command: False}" -> "{installation: False, capacity: False, traffic: True, command: False}"
e2e_test.go:761: release strategy state transition: "{installation: False, capacity: False, traffic: True, command: False}" -> "{installation: False, capacity: False, traffic: False, command: True}"
e2e_test.go:819: waiting for command took 6.397543921s
e2e_test.go:633: checking that incumbent "my-test-app-77cf3473-0" has 2 pods and contender "my-test-app-3037a9fd-0" has 2 pods (strategy step 1 -- 50/50)
e2e_test.go:761: release strategy state transition: "" -> "{installation: False, capacity: False, traffic: False, command: False}"
e2e_test.go:814: timed out waiting for release to be waiting for capacity: waited 30s.
final state: {installation: False, capacity: False, traffic: False, command: False}
FAIL
```

This happens when we delete the contender release mid-rollout to perform a rollback. No event handler gets called in that situation, so the capacity of the incumbent is never adjusted, and the rollback is "stuck". Triggering a sync on the Application solves the problem: the strategy gets re-evaluated, Shipper notices that the incumbent has lower capacity than intended, and scales it back up.

Although it didn't show up in the tests, the release controller suffers from the same issue when listening to all of the target objects. This commit addresses those cases as well (a sketch of the deletion-aware handler wiring follows at the end of this message). This is a small step towards getting #77 done.

* capacity controller: retry partially synced capacity targets

We're trying to make Shipper work without resyncs (see #77). After our Continuous Integration pipeline was configured to actually run without resyncs in b62a574, we started to see a lot more flakiness in the e2e tests, and it was impossible to reproduce in my local environment. After adding some instrumentation in the CI environment, I found this curious log line:

```
Event(v1.ObjectReference{Kind:"CapacityTarget", Namespace:"test-new-application-moving-strategy-backwards", Name:"my-test-app-ce62d141-0", UID:"5165927d-5abf-11e9-8071-42010a1e0283", APIVersion:"shipper.booking.com", ResourceVersion:"471", FieldPath:""}): type: 'Warning' reason: 'FailedCapacityChange' cluster not ready yet for use; cluster client is being initialized
```

By itself, that's not a problem, since the sync will be retried. Except it wasn't being retried :) While iterating over the list of clusters, any failures were just logged and then ignored, presumably on the assumption that a resync would eventually fix the problem. Now, if the capacity controller has issues with any cluster, it explicitly retries the work on that capacity target (see the retry sketch at the end of this message).

* installation controller: fix log message

This was probably just a bad copy-paste, but it took me for a ride because it was lying.

* installation controller: check the contender only once

For reasons I couldn't exactly ascertain, we were checking twice whether the InstallationTarget's Release is the current contender: once by asking the lister, and once by checking the Application's history. The first check works fine, but the second suffers from a race condition: the application controller creates a release, hands off work to the release controller to create the installation target, and only then updates the Application resource's history. The second check would therefore fail whenever it ran before the Application was updated, since the history would still be out of date. This only worked before because resyncs papered over the race, and testing Shipper with resyncs disabled (see #77) eventually exposed it. Since we had the check twice, the fix was easy: remove the broken one :)
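For the release controller change, here is a hedged sketch of the deletion-aware event handler wiring, using the standard client-go pattern; the `enqueue` callback and the function name are assumptions, not Shipper's exact identifiers.

```go
// Illustrative sketch; Shipper's real handler registration differs.
package release

import "k8s.io/client-go/tools/cache"

// addDeletionAwareHandlers wires Add, Update, and (the previously
// missing) Delete events to the controller's work queue. enqueue is
// assumed to resolve the object to a queue key.
func addDeletionAwareHandlers(informer cache.SharedIndexInformer, enqueue func(obj interface{})) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(old, new interface{}) { enqueue(new) },
		DeleteFunc: func(obj interface{}) {
			// The watch can deliver a tombstone if the informer's cache
			// missed the actual delete event; unwrap it before enqueueing.
			if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = tombstone.Obj
			}
			enqueue(obj)
		},
	})
}
```

Without the `DeleteFunc`, deleting the contender mid-rollout triggered no sync at all, which is exactly the "stuck" rollback seen in the test failure above.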
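For the capacity controller change, a sketch of the retry behavior under the same caveat: the helper and its parameters are illustrative, but `AddRateLimited` on a client-go rate-limiting work queue is the standard way to re-enqueue with backoff.

```go
// Illustrative sketch; names and signatures are not Shipper's real ones.
package capacity

import (
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog"
)

// syncClusters applies a capacity target's spec to every target cluster.
// Previously, per-cluster failures were logged and dropped, leaving the
// fix to an eventual resync; with -resync=0 that retry never happened.
func syncClusters(queue workqueue.RateLimitingInterface, key string, clusters []string, apply func(cluster string) error) {
	anyFailed := false
	for _, cluster := range clusters {
		if err := apply(cluster); err != nil {
			// Keep going so healthy clusters still make progress on
			// this pass, but remember that a retry is needed.
			klog.Errorf("capacity target %q on cluster %q: %v", key, cluster, err)
			anyFailed = true
		}
	}
	if anyFailed {
		// Explicitly re-enqueue with backoff: without resyncs, nothing
		// else will bring this capacity target back into the queue.
		queue.AddRateLimited(key)
	}
}
```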
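Finally, for the installation controller change, a hypothetical sketch of the check that remains. The types here are stand-ins for Shipper's real Release objects; the point is that the lister-based view is already consistent with what exists in the cluster, whereas the Application's `.status.history` lags behind release creation and therefore races with the release controller.

```go
// Illustrative types and logic only.
package installation

type release struct {
	Name       string
	Generation int
}

// contenderFromLister picks the highest-generation release, i.e. the
// state the lister already knows about, rather than consulting the
// Application's history, which is updated only after the release and
// its installation target already exist.
func contenderFromLister(releases []release) (release, bool) {
	if len(releases) == 0 {
		return release{}, false
	}
	contender := releases[0]
	for _, r := range releases[1:] {
		if r.Generation > contender.Generation {
			contender = r
		}
	}
	return contender, true
}
```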