
Fixes some of the known issues with disabled resyncs #88

Merged

6 commits merged into master on Apr 24, 2019

@juliogreff
Contributor

commented Apr 8, 2019

This tries very hard to get to #77, but beware: even though all tests now pass, this is not a complete solution. I'll update that ticket with what still needs to be done going forward, but I'd also like to get these small fixes merged, because they look like good things to have on their own.

As a bonus, the E2E tests now run with resyncs disabled, so hopefully we won't be able to introduce any new bugs like these ^_^

@juliogreff juliogreff force-pushed the jgreff/delete-event-handlers branch from fb017a1 to b62a574 Apr 8, 2019

juliogreff added some commits Apr 4, 2019

**clusterclientstore: take resync period as a parameter**

We are currently investigating whether Shipper works correctly when resyncs
are disabled, for #77. Before we can carry on with that investigation,
though, we need to ensure that Shipper does in fact disable all resyncs.
As far as the logs show, this is the only piece that currently doesn't
honor `cmd/shipper -resync`.
**e2e: improve log output for tests**

While investigating some test breakage, I noticed that several functions
in e2e tests were either not outputting enough information to be useful,
or just flat out lying about what they were doing:

- waitForRelease, waitForComplete, and other related code did not say
which release they were operating on, which made it difficult to relate
the output to what's actually running on Kubernetes. The release name is
now output as well.

- waitForReleaseStrategyState always said it was waiting for command,
regardless of which waitFor argument was passed. It now correctly
reports which state it is waiting for.
**release controller: listen to deletion events**
When running e2e tests with `cmd/shipper` started with `-resync=0`, they
fail with the following error:

```
--- FAIL: TestRolloutAbort (73.78s)
    e2e_test.go:751: waiting for release "my-test-app-77cf3473-0" took 1.597339649s
    e2e_test.go:840: waiting for completion of "my-test-app-77cf3473-0" took 18.399953573s
    e2e_test.go:617: waiting for contender release to appear after editing app "my-test-app"
    e2e_test.go:751: waiting for release "my-test-app-3037a9fd-0" took 1.598389769s
    e2e_test.go:624: setting contender release "my-test-app-3037a9fd-0" targetStep to 0
    e2e_test.go:627: waiting for contender release "my-test-app-3037a9fd-0" to achieve waitingForCommand for targetStep 0
    e2e_test.go:761: release strategy state transition: "" -> "release \"my-test-app-3037a9fd-0\" has no achievedStep reported yet"
    e2e_test.go:761: release strategy state transition: "release \"my-test-app-3037a9fd-0\" has no achievedStep reported yet" -> "{installation: False, capacity: False, traffic: False, command: True}"
    e2e_test.go:819: waiting for command took 10.398270118s
    e2e_test.go:633: checking that incumbent "my-test-app-77cf3473-0" has 4 pods and contender "my-test-app-3037a9fd-0" has 1 pods (strategy step 0 -- 100/1)
    e2e_test.go:624: setting contender release "my-test-app-3037a9fd-0" targetStep to 1
    e2e_test.go:627: waiting for contender release "my-test-app-3037a9fd-0" to achieve waitingForCommand for targetStep 1
    e2e_test.go:761: release strategy state transition: "" -> "{installation: False, capacity: True, traffic: False, command: False}"
    e2e_test.go:761: release strategy state transition: "{installation: False, capacity: True, traffic: False, command: False}" -> "{installation: False, capacity: False, traffic: True, command: False}"
    e2e_test.go:761: release strategy state transition: "{installation: False, capacity: False, traffic: True, command: False}" -> "{installation: False, capacity: False, traffic: False, command: True}"
    e2e_test.go:819: waiting for command took 6.397543921s
    e2e_test.go:633: checking that incumbent "my-test-app-77cf3473-0" has 2 pods and contender "my-test-app-3037a9fd-0" has 2 pods (strategy step 1 -- 50/50)
    e2e_test.go:761: release strategy state transition: "" -> "{installation: False, capacity: False, traffic: False, command: False}"
    e2e_test.go:814: timed out waiting for release to be waiting for capacity: waited 30s. final state: {installation: False, capacity: False, traffic: False, command: False}
FAIL
```

This happens when we delete the contender release mid-rollout to perform
a rollback. No event handler gets called in that situation, so the
capacity in the incumbent is never adjusted and the rollback gets
"stuck". Triggering a sync on the Application solves the problem: the
strategy is re-evaluated, Shipper notices that the incumbent has lower
capacity than intended, and scales it up.

Although this didn't show up in the tests, the release controller
suffers from the same issue when listening to all of the target objects.
This commit also addresses those.

This is a small step in the direction of having #77 done.
**capacity controller: retry partially synced capacity targets**

We're trying to make Shipper work without resyncs (see #77). After our
Continuous Integration pipeline was configured to actually run without
resyncs in b62a574, we started to see a lot more flakiness in the E2E
tests, and it was impossible to reproduce in my local environment.

After some added instrumentation in the CI environment, I found this
curious log line:

```
Event(v1.ObjectReference{Kind:"CapacityTarget",
Namespace:"test-new-application-moving-strategy-backwards",
Name:"my-test-app-ce62d141-0",
UID:"5165927d-5abf-11e9-8071-42010a1e0283",
APIVersion:"shipper.booking.com", ResourceVersion:"471", FieldPath:""}):
type: 'Warning' reason: 'FailedCapacityChange' cluster not ready yet for
use; cluster client is being initialized
```

In itself, that's not a problem, since the sync should just be retried.

Except it wasn't being retried :) While iterating over the list of
clusters, any failures were just logged and then ignored, presumably on
the assumption that a resync would eventually fix the problem. Now, if
the capacity controller hits an error on any cluster, it explicitly
re-enqueues that capacity target for a retry.
**installation controller: fix log message**
This was probably just bad copy paste, but it took me for a ride because
it was lying.
**installation controller: check contender only once**

For reasons I couldn't exactly ascertain, we were checking twice whether
the InstallationTarget's Release is the current contender release: once
by asking the lister, and once by checking the Application's history.

The first check works fine, but the second one suffers from a race
condition: the application controller creates a release, hands off work
to the release controller to create the installation target, and only
then updates the Application resource's history. The second check in
the installation controller would therefore fail if it ran before the
Application was updated, since the history would still be out of date.
This only ever worked because of resyncs, and in the context of testing
Shipper with resyncs disabled (see #77) we eventually hit this race
condition.

As we have this check twice, this was easy: let's just remove the broken
one :)

@juliogreff juliogreff force-pushed the jgreff/delete-event-handlers branch from 2fee381 to c280d23 Apr 11, 2019

@juliogreff juliogreff changed the base branch from olegs/release-controller-2 to master Apr 11, 2019

@juliogreff juliogreff changed the title Enables shipper to work with disabled resyncs Fixes some of the known issues with disabled resyncs Apr 11, 2019

@juliogreff juliogreff marked this pull request as ready for review Apr 11, 2019

@icanhazbroccoli icanhazbroccoli merged commit ab15361 into master Apr 24, 2019

2 checks passed:

- Travis CI - Branch: Build Passed
- Travis CI - Pull Request: Build Passed