
kubeaware-cloudpool-proxy

The kubeaware-cloudpool-proxy is a proxy that is placed between a cloudpool and its clients (for example, an autoscaler). In essence, the kubeaware-cloudpool-proxy adds Kubernetes-awareness to an existing cloudpool implementation. This Kubernetes-awareness allows worker node scale-downs to be handled with less disruption: instead of brutally killing a worker node that appears "random" from the Kubernetes perspective, the proxy takes the current Kubernetes cluster state into account, carefully selects a node, and evacuates its pods prior to terminating the cloud machine.

The kubeaware-cloudpool-proxy delegates all cloud-specific actions to its backend cloudpool. In fact, most REST API operations are forwarded directly to the backend cloudpool as-is. There are two notable exceptions that require the proxy to take action, both of which can lead to a scale-down (a sketch of the decision logic follows the list):

  • set desired size: If a scale-down is suggested (desiredSize lower than the current pool size), victims need to be carefully selected and gracefully shut down (see below).
  • terminate machine: Only allowed if the machine is a viable scale-down victim; if so, the machine is gracefully shut down (see below).
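
The branching on a set-desired-size request can be summarized in a few lines of Go. The following is a minimal sketch only: the interface and function names (cloudPool, nodeScaler, setDesiredSize) are made up for illustration and do not reflect the proxy's actual code.

// Illustrative sketch of the proxy's handling of a set-desired-size request.
// The interfaces below are stand-ins; the real proxy defines its own types.
type cloudPool interface {
    GetPoolSize() (int, error) // current pool size as reported by the backend
    SetDesiredSize(size int) error
}

type nodeScaler interface {
    // Gracefully select, evacuate and terminate the given number of nodes.
    ScaleDownGracefully(nodesToRemove int) error
}

func setDesiredSize(backend cloudPool, scaler nodeScaler, desiredSize int) error {
    currentSize, err := backend.GetPoolSize()
    if err != nil {
        return err
    }
    if desiredSize >= currentSize {
        // Scale-up or no change: forward the request to the backend as-is.
        return backend.SetDesiredSize(desiredSize)
    }
    // Scale-down: handled by the proxy (victim selection + graceful shutdown).
    return scaler.ScaleDownGracefully(currentSize - desiredSize)
}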

When a node needs to be removed, the kubeaware-cloudpool-proxy communicates with the Kubernetes API server to determine the current cluster state. These interactions are illustrated in the image below.

architecture

When asked to scale down, the kubeaware-cloudpool-proxy takes down nodes in a controlled manner (the whole flow is sketched in code after this list) by:

  • Carefully determining which (if any) nodes are candidates for being removed. A node qualifies as a scale-down candidate if it satisfies all of the following conditions:

    • the node must not be protected with a cluster-autoscaler.kubernetes.io/scale-down-disabled annotation.
    • the node must not be a master node (as indicated by it running a pod in namespace kube-system named kube-apiserver-<host> or having a component label with value kube-apiserver)
    • there must be other remaining non-master nodes that are Ready and Schedulable
    • it must be possible to evacuate the node's pods to the remaining nodes:
      • the sum of pod-requested CPU/memory on the node must not exceed free space on remaining nodes
      • the node must not have any pods without a controller (such as a Deployment or ReplicationController), since such pods would not be recreated on another node when evicted.
      • the node must not have any pods with (node-)local storage
      • the node must not have pods with a pod disruption budget that would be violated
      • taints on the remaining nodes must not prevent the node's pods from being evacuated (the pods must have matching tolerations for such cases)
      • the node's pods must not have node selectors that prevent them from being moved
      • the node's pods must not have node-affinity constraints that prevent them from being moved
  • Selecting the "best" victim node to kill (if at least one candidate was found in the prior step). In this context, the "best" node is typically the least loaded node -- the node with the fewest pods that need to be evacuated to another node.

  • If a victim node is found, it needs to be evacuated before it can be killed. This happens as follows:

    • The node is marked unschedulable via a node taint (to avoid new pods being scheduled onto the node).
    • The node is drained: all non-system pods are evicted (and will be rescheduled to the remaining nodes).
    • The node is deleted from the Kubernetes cluster.
    • Finally, the node is terminated in the cloud through the terminate machine call to the backend cloudpool.
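
Put together, the scale-down path roughly follows the sketch below. This is illustrative Go only, written against assumed interfaces (loosely modeled on the KubeClient/CloudPoolClient interfaces mentioned under Developer notes); every name in it is hypothetical.

import "errors"

// Hypothetical stand-ins for the proxy's Kubernetes and cloudpool clients.
type kubeClient interface {
    ListWorkerNodes() ([]string, error)
    IsScaleDownCandidate(node string) (bool, error) // applies the rules listed above
    Cordon(node string) error                       // mark unschedulable
    Drain(node string) error                        // evict all non-system pods
    DeleteNode(node string) error                   // remove from the cluster
}

type cloudPoolClient interface {
    TerminateMachine(machine string) error
}

// scaleDownOne removes a single node in the controlled
// cordon -> drain -> delete -> terminate order described above.
func scaleDownOne(kube kubeClient, pool cloudPoolClient) error {
    nodes, err := kube.ListWorkerNodes()
    if err != nil {
        return err
    }
    var candidates []string
    for _, node := range nodes {
        ok, err := kube.IsScaleDownCandidate(node)
        if err != nil {
            return err
        }
        if ok {
            candidates = append(candidates, node)
        }
    }
    if len(candidates) == 0 {
        return errors.New("no viable scale-down candidate found")
    }
    // Pick the "best" victim: typically the least loaded candidate
    // (the one with the fewest pods that need to be evacuated).
    victim := candidates[0]

    if err := kube.Cordon(victim); err != nil {
        return err
    }
    if err := kube.Drain(victim); err != nil {
        return err
    }
    if err := kube.DeleteNode(victim); err != nil {
        return err
    }
    return pool.TerminateMachine(victim)
}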

Building

build.sh builds the binary and runs all tests (build.sh --help for build options).

The built binary is placed under bin. The main binary is kubeaware-cloudpool-proxy.

Test coverage output is placed under build/coverage/ and can be viewed as HTML via:

go tool cover -html build/coverage/<package>.out

Configuring

The kubeaware-cloudpool-proxy requires a JSON-formatted configuration file. It has the following structure:

{
  "server": {
      "timeout": "60s"
  },

  "apiServer": {
      "url": "https://<host>:<port>",
      "auth": {
        ... authentication mechanism ...
      },
      "timeout": "10s"
  },

  "backend": {
      "url": "http://<host>:<port>",
      "timeout": "300s"
  }
}

The authentication part can be specified either with a concrete client certificate/key pair and a CA cert, or via a kubeconfig file.

With a kubeconfig file, the auth is specified as follows:

...
  "apiServer": {
      "url": "https://<host>:<port>",
      "auth": {
        "kubeConfigPath": "/home/me/.kube/config"
      }
  },
...

With a specific client cert/key the auth configuration looks as follows:

...
  "apiServer": {
      "url": "https://<host>:<port>",
      "auth": {
        "clientCertPath": "/path/to/admin.pem",
        "clientKeyPath": "/path/to/admin-key.pem",
        "caCertPath": "/path/to/ca.pem"
      }
  },
...

The fields carry the following semantics:

  • server: proxy server settings
    • timeout: read timeout on client requests. Default: 60s
  • apiServer: settings for the Kubernetes API server
    • url: the base URL used to contact the API server. For example, https://master:6443.
    • auth: client authentication credentials
      • kubeConfigPath: a file system path to a kubeconfig file, the type of configuration file that is used by kubectl. When specified, any other auth fields are ignored (as they are all included in the kubeconfig). The kubeconfig must contain cluster credentials for a cluster with an API server with the specified url.
      • clientCertPath: a file system path to a pem-encoded API server client/admin cert. Ignored if kubeConfigPath is specified.
      • clientKeyPath: a file system path to a pem-encoded API server client/admin key. Ignored if kubeConfigPath is specified.
      • caCertPath: a file system path to a pem-encoded CA cert for the API server. Ignored if kubeConfigPath is specified.
    • timeout: request timeout used when communicating with the API server. Default: 60s.
  • backend: settings for communicating with the backend cloudpool that the proxy sits in front of.
    • url: the base URL where the cloudpool REST API can be reached. For example, http://cloudpool:9010.
    • timeout: the connection timeout to use when contacting the backend. Default: 300s. Note: you may need a fairly generous timeout for the backend, since some cloud provider operations can be quite time-consuming (e.g., terminating a machine in Azure). A sketch of how these settings could map onto Go types follows below.
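
As a point of reference, the configuration file maps naturally onto a small set of Go structs. The sketch below shows one possible way to parse it; the struct and field names are assumptions made for illustration, not the proxy's actual types.

import (
    "encoding/json"
    "os"
)

// Hypothetical types mirroring the JSON structure described above.
type AuthConfig struct {
    KubeConfigPath string `json:"kubeConfigPath,omitempty"`
    ClientCertPath string `json:"clientCertPath,omitempty"`
    ClientKeyPath  string `json:"clientKeyPath,omitempty"`
    CACertPath     string `json:"caCertPath,omitempty"`
}

type APIServerConfig struct {
    URL     string     `json:"url"`
    Auth    AuthConfig `json:"auth"`
    Timeout string     `json:"timeout"` // e.g. "10s"; convert with time.ParseDuration
}

type BackendConfig struct {
    URL     string `json:"url"`
    Timeout string `json:"timeout"`
}

type ServerConfig struct {
    Timeout string `json:"timeout"`
}

type Config struct {
    Server    ServerConfig    `json:"server"`
    APIServer APIServerConfig `json:"apiServer"`
    Backend   BackendConfig   `json:"backend"`
}

// loadConfig reads and parses a JSON configuration file from disk.
func loadConfig(path string) (*Config, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg Config
    if err := json.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}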

Running

After building, run the proxy via:

./bin/kubeaware-cloudpool-proxy --config-file=<path>

To enable a different glog log level, use something like:

./bin/kubeaware-cloudpool-proxy --config-file=<path> --v=4

Docker

To build a Docker image, run:

./build.sh --docker

To run the Docker image, run something similar to:

docker run --rm -p 8080:8080 \
   -v <config-dir>:/etc/elastisys \
   -v <kubessl-dir>:/etc/kubessl \
   elastisys/kubeaware-cloudpool-proxy:1.0.0 \
   --config-file=/etc/elastisys/config.json --port 8080

In this example, <config-dir> is a host directory that contains a config.json file for the kubeaware-cloudpool-proxy. Furthermore, <kubessl-dir> must contain the pem-encoded certificate/key/CA files required to talk to the Kubernetes API server. These cert files are referenced from the config.json which, in this case, could look something like:

{
    "apiServer": {
        "url": "https://<hostname>",
        "auth": {
            "clientCertPath": "/etc/kubessl/admin.pem",
            "clientKeyPath": "/etc/kubessl/admin-key.pem",
            "caCertPath": "/etc/kubessl/ca.pem"
        }
    },
    "backend": {
        "url": "http://<hostname>:9010",
        "timeout": "10s"
    }
}

Developer notes

Dependencies

dep is used for dependency management. Make sure it is installed.

To introduce a new dependency, add it to Gopkg.toml, edit some piece of code to import a package from the dependency, and then run:

dep ensure

to get the right version into the vendor folder.

Testing

The regular go test command can be used for testing.

To test a certain package and see logs (at a certain glog v-level), run something like:

go test -v ./pkg/kube -args -v=4 -logtostderr=true

For some tests, mock clients are used to fake interactions with "backend services"; specifically, the KubeClient, CloudPoolClient, and NodeScaler interfaces are mocked. Should any of these interfaces change, the mocks need to be regenerated (before editing the test code to modify expectations, etc.). This can be achieved via the mockery tool.

  1. Installing mockery: go get github.com/vektra/mockery/...

  2. Generating the mocks

    mockery -dir pkg/kube/ -name KubeClient -output pkg/kube/mocks
    
    mockery -dir pkg/kube/ -name NodeScaler -output pkg/proxy/mocks
    mockery -dir pkg/cloudpool/ -name CloudPoolClient -output pkg/proxy/mocks
    

    The generated mocks end up under the respective output directories (pkg/kube/mocks and pkg/proxy/mocks). An example of how such a mock is used in a test is sketched below.
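
As an illustration of how such a mock is used, the sketch below follows the standard mockery/testify pattern. The import path and the GetNodes method are assumptions for illustration only; substitute a method that actually exists on the KubeClient interface.

import (
    "testing"

    "github.com/elastisys/kubeaware-cloudpool-proxy/pkg/kube/mocks"
)

// Illustrative only: "GetNodes" stands in for whatever methods the real
// KubeClient interface actually declares.
func TestWithKubeClientMock(t *testing.T) {
    kubeMock := new(mocks.KubeClient)
    // Program the expected call and its canned return values.
    kubeMock.On("GetNodes").Return(nil, nil)

    // ... exercise the code under test with kubeMock ...

    // Verify that all programmed expectations were met.
    kubeMock.AssertExpectations(t)
}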

Useful references

Ideas for future work

In some cases, we would like to see more rapid utilization of newly introduced worker nodes, to make sure that they immediately start accepting a share of the workload. Typically, what we have seen so far is that a new node gets started, but once it is up it is very lightly loaded (if loaded at all). It would be nice to see some pods being pushed over to the new node. Furthermore, it would be useful to make sure that all required Docker images are pulled to new nodes as early as possible, to avoid unnecessary delays later when pods are scheduled onto the node.
