Commits on May 22, 2019
  1. Introduce structured errors (#91)

    juliogreff authored and icanhazbroccoli committed May 22, 2019
    * errors: introduce structured errors
    
    * capacity controller: use structured errors
    
    * driveby: fix for statement with range clause and pointers
    
    I found it very odd when seeing this over and over in our logs:
    
    ```
     Found 2 pod conditions with the status set to `false`. The first has a
     type of ContainersReady, and the second has a type of ContainersReady.
    ```
    
    It didn't sound right that we'd have two `ContainersReady` conditions.
    It turns out that we don't: we actually have one `ContainersReady`
    condition and one `Ready` condition, but since a for statement with a
    range clause reuses the `condition` variable declared in the for
    statement [1], every pointer we took pointed at the same piece of
    memory, which was overwritten on each iteration.
    
    Now we create a copy of the struct and take a pointer to *that*, so Go
    can't sweep values out from under us anymore (see the sketch below).
    
    [1] https://golang.org/ref/spec#For_range
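    
    As a minimal sketch of the gotcha (a hypothetical, self-contained
    example, not the actual controller code):
    
    ```
    package main

    import "fmt"

    // PodCondition stands in for the real condition struct; illustrative only.
    type PodCondition struct {
        Type   string
        Status string
    }

    func main() {
        conditions := []PodCondition{
            {Type: "ContainersReady", Status: "False"},
            {Type: "Ready", Status: "False"},
        }

        var ptrs []*PodCondition
        for _, condition := range conditions {
            // Buggy (under pre-1.22 Go semantics): the range clause reuses
            // `condition`, so &condition is the same address every time and
            // all elements of ptrs end up describing the last condition:
            //   ptrs = append(ptrs, &condition)

            // Fix: copy the struct and take the address of the copy.
            condition := condition
            ptrs = append(ptrs, &condition)
        }

        for _, c := range ptrs {
            fmt.Println(c.Type) // ContainersReady, then Ready
        }
    }
    ```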
    
    * janitor controller: rename files
    
    This makes logs slightly easier to read, since glog reports the name of
    the file, not the name of the package. A log line that reports itself
    as coming from "controller.go" doesn't exactly tell us which controller
    it is, so renaming the janitor controller files solves that, while also
    falling in line with our other controllers.
    
    * janitor controller: use structured errors
    
    * janitor controller: don't fire and forget
    
    * clusterclientstore: use structured errors
    
    * release controller: use structured errors
    
    * controller: move errors to pkg errors
    
    * driveby: remove duplicated func
    
    getAssociatedApplicationName already existed in a slightly different
    form in pkg releaseutils as ApplicationNameForRelease. It didn't return
    a more specialized error type, but this commit addresses that as well.
    
    * errors: better interface for KubeclientError
    
    * application controller: use structured errors
    
    * installation controller: use structured errors
    
    * installation controller: rename files
    
    This just renames controller_test.go to installation_controller_test.go,
    to match the code file installation_controller.go. This also makes
    vim-go happier when using :GoAlternate to navigate between code and test
    file ^_^
    
    * clustersecret controller: use structured errors
    
    * traffic controller: use structured errors
    
    * errors: introduce MultiError.Flatten()
    
    * errors: improve documentation
Commits on May 17, 2019
  1. Improve our Makefile (#95)

    juliogreff authored and icanhazbroccoli committed May 17, 2019
    * Makefile: fix usage of variables
    
    Using variables like $SHIPPER_IMAGE instead of $(SHIPPER_IMAGE) causes
    errors like this:
    
    	/bin/sh: HIPPER_IMAGE: command not found
    
    It turns out make only interpolates variables written as $(THIS): with
    $SHIPPER_IMAGE, make expands the single-character variable $S (which is
    empty) and hands the leftover HIPPER_IMAGE to the shell, hence the
    error above.
    
    * Makefile: track dependencies of shipper binaries
    
    First of all, "shipper" is not a phony target [1]: it actually generates
    a binary called "shipper". I suppose it was declared as phony so `make
    shipper` would always be executed, instead of bailing out with "make:
    `shipper' is up to date."
    
    If we declare the target's dependencies correctly, though, it will be
    rebuilt whenever those dependencies change, which is most likely what
    we actually want, so let's do that.
    
    [1] https://www.gnu.org/software/make/manual/html_node/Phony-Targets.html
    
    * Makefile: allow variables to be overridden by the environment
    
    It's very useful to be able to override names of images and commands
    executed by the Makefile during testing. I can have my own image
    registry, put shipper in any namespace I want, and I can also teach it
    to work with microk8s.kubectl without having to resort to symlinks :)
    
    * Makefile: depend on nested dirs inside pkg/ vendor/
    
    Silly me: pkg/* doesn't actually go deeper than one level; pkg/**/* is
    needed for that. Without it, modifying files in nested directories
    under pkg/ or vendor/ and then calling `make shipper` wouldn't actually
    cause the task to be executed.
Commits on May 3, 2019
  1. Merge pull request #60 from bookingcom/isutton/admission-webhooks

    juliogreff committed May 3, 2019
    Validating webhook
Commits on Apr 24, 2019
  1. clusterclientstore: keep clients per controller per cluster (#83)

    juliogreff authored and icanhazbroccoli committed Apr 24, 2019
    * clusterclientstore: keep clients per controller per cluster
    
    clusterclientstore used to keep one client per target cluster. It makes
    sense to change that to one client per target cluster per controller,
    so that client throughput is not shared across multiple controllers
    talking to the same target cluster (think: the traffic controller
    getting throttled because the installation controller is actively
    retrying something).
    
    This also gives us extra visibility into how many times each controller
    is calling the api-server, by grouping them into their own user agent,
    instead of using the default user agent that's shared by all
    controllers.
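    
    A minimal sketch of the idea (hypothetical names and wiring, not
    Shipper's actual clusterclientstore): key the client cache by both
    cluster and controller, and give each client a per-controller user
    agent when it's built.
    
    ```
    package clientcache

    import (
        "sync"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // clusterClientKey and clientCache are illustrative names.
    type clusterClientKey struct {
        cluster    string // target cluster name
        controller string // e.g. "traffic-controller"
    }

    type clientCache struct {
        mu      sync.Mutex
        clients map[clusterClientKey]kubernetes.Interface
    }

    func newClientCache() *clientCache {
        return &clientCache{clients: make(map[clusterClientKey]kubernetes.Interface)}
    }

    // GetClient returns a cached client for the (cluster, controller) pair,
    // building one with a controller-specific user agent on first use so
    // API throughput and server-side attribution are per controller.
    func (c *clientCache) GetClient(cluster, controller string, config *rest.Config) (kubernetes.Interface, error) {
        c.mu.Lock()
        defer c.mu.Unlock()

        key := clusterClientKey{cluster: cluster, controller: controller}
        if client, ok := c.clients[key]; ok {
            return client, nil
        }

        cfg := rest.CopyConfig(config)
        cfg.UserAgent = controller // attribute API calls to this controller

        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            return nil, err
        }
        c.clients[key] = client
        return client, nil
    }
    ```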
    
    * clusterclientstore: clarify GetClient parameters
    
    * clusterclientstore: protect clients map behind sync.Mutex
    
    This replaces the usage of sync.RWMutex with the simpler sync.Mutex.
    The former only pays off when multiple readers need to hold the lock
    concurrently; in our case we always have a single reader/writer at a
    time, so RWMutex's RLocker is neither used nor necessary.
    
    * clusterclientstore: remove clientBuilderFunc type
    
    clientBuilderFunc was never exported anyway, so let's use the func
    signature inline where it's needed. This removes the need for an
    anonymous func inside package store.
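    
    A minimal sketch of that refactoring (hypothetical field and signature,
    not the actual store):
    
    ```
    package store

    import (
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // Before (hypothetical), a named type used in exactly one place:
    //
    //   type clientBuilderFunc func(config *rest.Config) (kubernetes.Interface, error)
    //
    // After: the signature is spelled out inline on the only field that needs it.
    type Store struct {
        buildClient func(config *rest.Config) (kubernetes.Interface, error)
    }
    ```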
  2. Fixes some of the known issues with disabled resyncs (#88)

    juliogreff authored and icanhazbroccoli committed Apr 24, 2019
    * clusterclientstore: take resync period as a parameter
    
    We are currently investigating whether Shipper works correctly when
    resyncs are disabled, as part of #77. Before we can carry on with that
    investigation, though, we need to ensure that Shipper does in fact
    disable all resyncs. As far as the logs show, this is the only piece
    that currently doesn't honor cmd/shipper's -resync flag.
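    
    A hedged sketch of the shape of the change (illustrative names, not
    Shipper's actual API): the store takes the resync period as a
    constructor parameter and forwards it to the informer factories it
    builds, so -resync=0 is honored here too.
    
    ```
    package clusterclientstore

    import (
        "time"

        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
    )

    type Store struct {
        resyncPeriod time.Duration
    }

    func NewStore(resyncPeriod time.Duration) *Store {
        return &Store{resyncPeriod: resyncPeriod}
    }

    // informerFactoryFor builds a factory for a target cluster client using
    // the configured resync period; a period of 0 disables periodic resyncs.
    func (s *Store) informerFactoryFor(client kubernetes.Interface) informers.SharedInformerFactory {
        return informers.NewSharedInformerFactory(client, s.resyncPeriod)
    }
    ```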
    
    * e2e: improve log output for tests
    
    While investigating some test breakage, I noticed that several functions
    in e2e tests were either not outputting enough information to be useful,
    or just flat out lying about what they were doing:
    
    - waitForRelease, waitForComplete, and other related code did not say
    which release they were operating on, which makes it difficult to
    relate the output to what's running on Kubernetes. Now the release name
    is output as well.
    
    - waitForReleaseStrategyState always said it was waiting for command,
    regardless of which waitFor argument was passed. Now it correctly
    reports which state it is waiting for.
    
    * release controller: listen to deletion events
    
    When running e2e tests with `cmd/shipper` started with `-resync=0`, they
    fail with the following error:
    
    ```
    --- FAIL: TestRolloutAbort (73.78s)
        e2e_test.go:751: waiting for release "my-test-app-77cf3473-0" took 1.597339649s
        e2e_test.go:840: waiting for completion of "my-test-app-77cf3473-0" took 18.399953573s
        e2e_test.go:617: waiting for contender release to appear after editing app "my-test-app"
        e2e_test.go:751: waiting for release "my-test-app-3037a9fd-0" took 1.598389769s
        e2e_test.go:624: setting contender release "my-test-app-3037a9fd-0" targetStep to 0
        e2e_test.go:627: waiting for contender release "my-test-app-3037a9fd-0" to achieve waitingForCommand for targetStep 0
        e2e_test.go:761: release strategy state transition: "" -> "release \"my-test-app-3037a9fd-0\" has no achievedStep reported yet"
        e2e_test.go:761: release strategy state transition: "release \"my-test-app-3037a9fd-0\" has no achievedStep reported yet" -> "{installation: False, capacity: False, traffic: False, command: True}"
        e2e_test.go:819: waiting for command took 10.398270118s
        e2e_test.go:633: checking that incumbent "my-test-app-77cf3473-0" has 4 pods and contender "my-test-app-3037a9fd-0" has 1 pods (strategy step 0 -- 100/1)
        e2e_test.go:624: setting contender release "my-test-app-3037a9fd-0" targetStep to 1
        e2e_test.go:627: waiting for contender release "my-test-app-3037a9fd-0" to achieve waitingForCommand for targetStep 1
        e2e_test.go:761: release strategy state transition: "" -> "{installation: False, capacity: True, traffic: False, command: False}"
        e2e_test.go:761: release strategy state transition: "{installation: False, capacity: True, traffic: False, command: False}" -> "{installation: False, capacity: False, traffic: True, command: False}"
        e2e_test.go:761: release strategy state transition: "{installation: False, capacity: False, traffic: True, command: False}" -> "{installation: False, capacity: False, traffic: False, command: True}"
        e2e_test.go:819: waiting for command took 6.397543921s
        e2e_test.go:633: checking that incumbent "my-test-app-77cf3473-0" has 2 pods and contender "my-test-app-3037a9fd-0" has 2 pods (strategy step 1 -- 50/50)
        e2e_test.go:761: release strategy state transition: "" -> "{installation: False, capacity: False, traffic: False, command: False}"
        e2e_test.go:814: timed out waiting for release to be waiting for capacity: waited 30s. final state: {installation: False, capacity: False, traffic: False, command: False}
    FAIL
    ```
    
    This happens when we delete the contender release mid-rollout to perform
    a rollback. No event handler gets called in that situation, so the
    incumbent's capacity is never adjusted and the rollback gets "stuck".
    Triggering a sync on the Application solves the problem: the strategy
    gets re-evaluated, Shipper notices that the incumbent has lower capacity
    than intended, and scales it up.
    
    Although this didn't show up in the tests, the release controller
    suffers from the same issue when listening to all of the target objects.
    This commit addresses those cases as well.
    
    This is a small step in the direction of having #77 done.
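    
    A hedged sketch of the handler wiring described above (hypothetical
    helper names, not the actual release controller code); the important
    part is the DeleteFunc:
    
    ```
    package release

    import (
        "k8s.io/client-go/tools/cache"
    )

    // registerHandlers is illustrative: enqueue is an assumed helper that
    // queues the owning Application (or Release key) for a sync.
    func registerHandlers(informer cache.SharedIndexInformer, enqueue func(obj interface{})) {
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc:    enqueue,
            UpdateFunc: func(old, new interface{}) { enqueue(new) },
            DeleteFunc: func(obj interface{}) {
                // Deletions can arrive as a tombstone when the object's
                // final state is unknown; unwrap it before enqueueing.
                if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
                    obj = tombstone.Obj
                }
                enqueue(obj)
            },
        })
    }
    ```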
    
    * capacity controller: retry partially synced capacity targets
    
    We're trying to make Shipper work without resyncs (see #77). After our
    Continuous Integration pipeline got configured to actually run without
    resyncs in b62a574, we started to see a
    lot more flakiness in the E2E tests, and it was impossible to reproduce
    on my local environment.
    
    After some added instrumentation in the CI environment, I found this
    curious log line:
    
    ```
    Event(v1.ObjectReference{Kind:"CapacityTarget",
    Namespace:"test-new-application-moving-strategy-backwards",
    Name:"my-test-app-ce62d141-0",
    UID:"5165927d-5abf-11e9-8071-42010a1e0283",
    APIVersion:"shipper.booking.com", ResourceVersion:"471", FieldPath:""}):
    type: 'Warning' reason: 'FailedCapacityChange' cluster not ready yet for
    use; cluster client is being initialized
    ```
    
    In itself, that's not a problem, since the sync will be retried.
    
    Except it wasn't being retried :) While iterating over the list of
    clusters, any failures were just logged and then ignored, presumably in
    the expectation that a resync would eventually fix the problem. Now, if
    the capacity controller has issues with any cluster, it explicitly
    retries the work on that capacity target.
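    
    A hedged sketch of the retry pattern (hypothetical names, not the
    capacity controller's actual code): a per-cluster failure still gets
    logged, but the capacity target is also re-enqueued so the work is
    retried even with resyncs disabled.
    
    ```
    package capacity

    import (
        "k8s.io/apimachinery/pkg/util/runtime"
        "k8s.io/client-go/util/workqueue"
    )

    // syncClusters is illustrative: key identifies the capacity target and
    // syncCluster is an assumed per-cluster sync func.
    func syncClusters(queue workqueue.RateLimitingInterface, key string, clusters []string,
        syncCluster func(cluster string) error) {

        retry := false
        for _, cluster := range clusters {
            if err := syncCluster(cluster); err != nil {
                runtime.HandleError(err) // log it, as before...
                retry = true             // ...but don't stop there
            }
        }
        if retry {
            // Put the capacity target back on the queue so the clusters
            // that failed are retried with rate limiting.
            queue.AddRateLimited(key)
        }
    }
    ```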
    
    * installation controller: fix log message
    
    This was probably just bad copy paste, but it took me for a ride because
    it was lying.
    
    * installation controller: check contender only once
    
    For reasons I couldn't exactly ascertain, we were checking twice whether
    the InstallationTarget's Release is the current contender release: once
    by asking the lister, and once by checking the Application's history.
    
    The first check works fine, but the second one suffers from a race
    condition: the application controller would create a release, hand off
    work to the release controller to create the installation target, and
    only then update the Application resource's history. The second check in
    the installation controller would then fail if it executed before the
    Application was updated, since the history would be out of date. This
    only worked before because of resyncs, and in the context of testing
    Shipper with resyncs disabled (see #77) we eventually hit this race
    condition.
    
    Since we had the check twice, the fix was easy: just remove the broken
    one :)
Commits on Apr 10, 2019
  1. Merge pull request #70 from bookingcom/olegs/release-controller-2

    juliogreff committed Apr 10, 2019
    Implement Release controller as a sub for strategy and schedule controllers
Commits on Apr 9, 2019
  1. Merge pull request #82 from u5surf/issue-81

    juliogreff committed Apr 9, 2019
    Bugfix installation controller reports object values #81
Commits on Apr 3, 2019
  1. installer: add test case for invalid deployment name

    juliogreff committed Apr 3, 2019
    Adds a test case for the feature introduced by
    3a52154.
Commits on Apr 2, 2019
  1. Print additional columns for Release objects (#80)

    juliogreff authored and icanhazbroccoli committed Apr 2, 2019
    When doing a `kubectl get releases`, we can now show additional columns
    with more useful information about the Release, namely the clusters it
    was deployed to, and which step the release is currently in.
    
    Output will now look like this:
    
    ```
    NAME                         CLUSTERS                STEP      AGE
    frontend-3bb976cc-0          eu-north-n2             full on   14h
    backend-3c678546-0           eu-north-n1             staging   4d
    ```
    
    This is really handy with the --watch flag, and it also saves users
    some scripting.
    
    Fixes #67.
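    
    A hedged sketch of how such columns are typically declared on a CRD
    using the apiextensions v1beta1 Go types; the JSONPaths below are
    placeholders for illustration, not Shipper's actual Release schema.
    
    ```
    package crd

    import (
        apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    )

    // releaseColumns is a hypothetical example; the JSONPaths are assumed.
    var releaseColumns = []apiextensionsv1beta1.CustomResourceColumnDefinition{
        {Name: "Clusters", Type: "string", JSONPath: ".status.clusters"},
        {Name: "Step", Type: "string", JSONPath: ".spec.targetStep"},
        {Name: "Age", Type: "date", JSONPath: ".metadata.creationTimestamp"},
    }
    ```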
Commits on Mar 19, 2019
  1. Less verbose description of Deployment object

    juliogreff committed Mar 19, 2019
    Logging the entire Deployment object makes the error quite hard to read.
    Just logging the name of the Deployment that has an issue should be more
    than enough.
  2. Validate Deployment name when installing a chart

    juliogreff committed Mar 19, 2019
    We need the Deployment in the chart to have a unique name, meaning that
    two different Releases need to generate two different Deployments.
    Otherwise, we try to overwrite a previous Deployment, and that fails
    with a "field is immutable" error.
    
    Closes #72.
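    
    A hedged sketch of the kind of check this implies (hypothetical
    function, not Shipper's actual installer code): reject a chart whose
    rendered Deployment name doesn't incorporate the release name, since
    such a name can't be unique across Releases.
    
    ```
    package installer

    import (
        "fmt"
        "strings"

        appsv1 "k8s.io/api/apps/v1"
    )

    // validateDeploymentName is illustrative only.
    func validateDeploymentName(releaseName string, deployment *appsv1.Deployment) error {
        if !strings.Contains(deployment.Name, releaseName) {
            return fmt.Errorf(
                "deployment %q does not contain the release name %q; the chart must template the Deployment name so each Release installs a distinct Deployment",
                deployment.Name, releaseName)
        }
        return nil
    }
    ```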