This repository has been archived by the owner on May 3, 2022. It is now read-only.

Diagnosing rollout progress: fleet summary in the Capacity Target object #21

Closed
kanatohodets opened this issue Sep 27, 2018 · 0 comments
Labels: enhancement
kanatohodets commented Sep 27, 2018

(ported from the internal repo design docs)

This document describes part of our plan for helping users diagnose how their rollout is going. That plan has two components: a low-level, high-detail view in the status of the CapacityTarget object, and a high-level, low-detail view in the status of the Release object. This document discusses the low-level, high-detail CapacityTarget view: we think it'll be easier to start with a domain where we don't need to invent summarization/prioritization schemes.

Reporting Progress

Previously, we introduced the concept of sad pods, which allowed the
user to see the pods that were not ready. There were a few problems
with this approach:

  • It was hard to read: we were dumping the whole status of the pod
    into the capacity target, for every single pod that was not working
  • We used a hard limit of 5 to keep the objects small, which meant
    that distinct problems wouldn't necessarily all be surfaced.
  • The user wouldn't see the positive things (i.e. pods running), so
    it'd be hard to see if the release was progressing, or for tooling
    to show the status of the whole release across multiple clusters.

So, we decided to summarize the status of all the pods per cluster.

Criteria For Summarizing

Owner

The first level of the summary is the owner of the pod.

Multiple Kubernetes objects can lead to one or more pods being
created. DaemonSets, Deployments, Jobs, ReplicaSets, and StatefulSets
can all create pods, which means that further down the hierarchy we
might have container names that clash. To prevent that, we use the
owner of a pod as the top-level category of the summary report.
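This top-level grouping can be sketched as follows. This is a minimal sketch using plain dicts; Shipper itself is written in Go, and the field names here (`owner`, `name`) are illustrative, not Shipper's actual types:

```python
from collections import defaultdict

def group_pods_by_owner(pods):
    """Group pods by their owning object, so container names coming
    from different owners cannot clash later in the report."""
    by_owner = defaultdict(list)
    for pod in pods:
        # e.g. "replicaset/reviewsapi-abc"; in Kubernetes this would
        # come from the pod's ownerReferences.
        by_owner[pod["owner"]].append(pod)
    return dict(by_owner)

pods = [
    {"name": "app-1", "owner": "replicaset/app-abc"},
    {"name": "app-2", "owner": "replicaset/app-abc"},
    {"name": "job-1", "owner": "job/migrate"},
]
report = group_pods_by_owner(pods)
# Two top-level report entries: one per owner.
```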

Pod Condition: Type, Reason and Status

Under each owner, there is a pod status breakdown. This breakdown is
grouped by the following fields, in order:

  1. Pod condition type (e.g. Ready)
  2. Pod condition reason (e.g. ContainersNotReady)
  3. Pod condition status (True, False, or Unknown)

Apart from categorizing pods by their conditions, we also sort the
results with the same criteria to keep the ordering consistent across
multiple updates.

To aid humans in deciding which problem to look into, we also maintain
a count of the pods in each type + reason + status combination.
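The grouping, counting, and stable sorting described above can be sketched like this (again with plain dicts standing in for the real pod condition types):

```python
from collections import Counter

def summarize_conditions(pods):
    """Count pods per (condition type, reason, status) triple, then
    sort by that same triple so the ordering stays consistent across
    multiple updates."""
    counts = Counter(
        (cond.get("type"), cond.get("reason", ""), cond.get("status"))
        for pod in pods
        for cond in pod["conditions"]
    )
    return [
        {"type": t, "reason": r, "status": s, "count": n}
        for (t, r, s), n in sorted(counts.items())
    ]

pods = [
    {"conditions": [{"type": "Ready", "status": "True"}]},
    {"conditions": [{"type": "Ready", "status": "True"}]},
    {"conditions": [{"type": "Ready", "reason": "ContainersNotReady",
                     "status": "False"}]},
]
breakdown = summarize_conditions(pods)
```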

Container Name

In the containers field of each type + reason + status combination,
the pods are grouped once more, this time by container name.

This means that we have a report per container name, and that report
follows a similar structure to the report for pods.

Container State: Type and Reason

Just as the pod status breakdown works with conditions, the container
state breakdown works with container states. The only difference is
that unlike pod conditions, container states are not as transparent,
and we need to infer type and reason with logic of our own.

Type

Each container state has three nullable fields. They are called
Waiting, Running, and Terminated.

We use these to derive the container state Type. The type of a
container state is whichever field is not null.
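That derivation is simple enough to sketch directly (a dict-based sketch; the lowercase field names follow the Kubernetes pod status JSON, which is an assumption here):

```python
def container_state_type(state):
    """The container state type is the name of whichever of the three
    nullable fields (waiting, running, terminated) is set."""
    for field in ("waiting", "running", "terminated"):
        if state.get(field) is not None:
            return field.capitalize()
    return "Unknown"
```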

Reason

Containers keep two states, not one. The field called State is their
current state, and the one called LastTerminationState contains the
previous state: the state of the container before it was last
restarted.

Reason is tricky mostly because it is not always informative. Based on
our experience so far, what users usually want to see is the Reason
of the current state, if the current state is Waiting.

Here are the steps we go through to come up with the Reason for a
container state:

  • If the State (i.e. the container's current state) is Waiting, we
    use the reason.
  • If it's not, the Reason is empty.
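The two steps above amount to the following sketch (same dict-based conventions as before):

```python
def container_state_reason(state):
    """Use the Waiting reason when the current state is Waiting;
    otherwise the reason is empty."""
    waiting = state.get("waiting")
    if waiting is not None:
        return waiting.get("reason", "")
    return ""
```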

Constructing Examples

In each pod status breakdown, we have an example that contains a pod
name and a message. At best, this message helps the user know what is
wrong without having to switch to the target cluster. At worst, the
user can use the pod name to look through logs or events after
switching to the application cluster.

This example is picked from the list of pods which fall into that
breakdown. However, to keep this example pod consistent, pods are
sorted, and then the first pod is picked as the example.
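Picking the example deterministically can be sketched as a sort by pod name followed by taking the first entry (the sort key is an assumption; any stable, consistent key would do):

```python
def pick_example(pods):
    """Sort pods by name and take the first, so the example pod stays
    stable across reconciliations instead of flapping between pods."""
    return sorted(pods, key=lambda pod: pod["name"])[0]
```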

The example contains only two fields, the pod name and a message.

Pod Name

This is copied, verbatim, from the name of the example pod.

Message

We are trying to show some useful information to the user through the
Message of the example. Here is where we get the message from:

  • If LastTerminationState.Terminated.Message is set, meaning that
    the user has written to the termination message path, we choose
    it.
  • If it's not set, we construct a message ourselves. The initial
    proposal is a string like Terminated with exit code <exitcode>,
    or Terminated with signal <signal> if there is a signal instead
    of an exit code.
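The message selection above can be sketched as follows. The lowercase field names (`lastState`, `terminated`, `message`, `signal`, `exitCode`) follow the Kubernetes pod status JSON; the exact message strings are the proposal from this document, not an existing API:

```python
def example_message(container_status):
    """Prefer the user-written termination message; otherwise
    construct one from the signal or exit code of the last
    terminated state."""
    term = (container_status.get("lastState") or {}).get("terminated")
    if term is None:
        return ""
    if term.get("message"):
        # The user wrote to the termination message path.
        return term["message"]
    if term.get("signal"):
        return "Terminated with signal %d" % term["signal"]
    return "Terminated with exit code %d" % term.get("exitCode", 0)
```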

Example

To bring it all together, here is an example of what a capacity target would
look like with a replica set maintaining 20 pods with 2 containers (app and
envoy):

status:
  clusters:
  - name: us-west1
    report:
    - owner: 
        name: replicaset/reviewsapi-$hash-0-$hash
      breakdown:
      - type: Ready
        status: True
        count: 12
        containers:
        - name: app
          states:
          - type: Running
            count: 12
            example:
              pod: reviewsapi-$hash-0-$hash-1234
        - name: envoy
          states:
          - type: Running
            count: 12
            example:
              pod: reviewsapi-$hash-0-$hash-1234
      - type: Ready
        status: False
        reason: ContainersNotReady
        count: 8
        containers:
        - name: app
          states:
          - type: Waiting
            reason: ImagePullBackOff
            count: 6
            example:
              pod: reviewsapi-$hash-0-$hash-4567
              message: "failed to pull reviewsapi:abcd123"
          - type: Waiting
            reason: ContainerCreating
            count: 1
            example:
              pod: reviewsapi-$hash-0-$hash-4567
          - type: Waiting
            reason: CrashLoopBackOff
            count: 2
            example:
              pod: reviewsapi-$hash-0-$hash-4567
              message: 'Terminated with exit code 1' # constructed by Shipper from `LastState.Terminated.ExitCode`
        - name: envoy
          states:
          - type: Waiting
            reason: CrashLoopBackOff
            count: 8
            example:
              pod: reviewsapi-$hash-0-$hash-4567
              message: 'cannot fetch service mesh config. argh!' # Read from terminationMessagePath

Caveats

Memory impact of pod informer for each cluster

This scheme is predicated on maintaining a pod informer for each cluster. For very large clusters with hundreds of thousands of pods, this may add up to a significant memory impact. Taking an extreme case, consider a management cluster orchestrating 10 Kubernetes clusters, each with 5000 nodes and 100 pods per node: that's 5 million pods, or about 50GB of heap if each pod object occupies roughly 10KB in memory.

Update rate for informers subscribing to a very large number of pod changes

We're not sure how client-go will handle a very high churn subscription on big clusters.

CPU impact of doing crunchy summarization work

The summarization scheme we're proposing involves a lot of aggregation over the set of pods and their containers. This might add up to significant CPU load across multiple very large clusters.

API call throttling updating capacity target objects for a high-churn pod fleet

We're likely to run into the client-go throttling limits when attempting to keep a CapacityTarget object up-to-date with a very large pod fleet. In this case, it should be safe to drop updates and re-process at the next resync period, or retry after a certain amount of time. None of the state depends on catching each update.
