cops-vigilante

There are some problems in our day-to-day operations at Conplement AG which are simply hair pulling :)

CoreOps Vigilante project is an attempt to take matters in our own hands, and try to implement some automation or self-healing capabilities to such problems. "Tasks" which are performed via cops-vigilante are explained below in the Tasks section.

Installation

Installation is supported via Helm.

helm repo add cops-vigilante https://conplementag.github.io/cops-vigilante/charts
helm repo update

helm install cops-vigilante cops-vigilante --namespace <<your-namespace>> --create-namespace --version v1.2.1

Tasks

AKS Windows nodes SNAT issue with complex networking setups (VPN, hub-spoke networks, etc.)

There is a SNAT issue regarding Windows nodes in AKS clusters, which is not fixed for a very long time now. To summarize, the issue is that in AzureCNI AKS networking configuration, Windows nodes always perform SNAT, meaning the pods sending packages will have their originating IPs replaced by the node IP. This should normally not be an issue (to have NAT-ing in a network), but the problem is that this does not work correctly (packages are dropped "randomly", performance issues due to port exhaustion etc.).

The only fix currently is to exclude certain CIDR ranges from this functionality. To do this, a certain 10-azure.conflist has to be modified and loaded, which can be done either via VM Extensions or host-process windows containers.

The problem is however, to apply this config, a new pod has to be scheduled on a node and there is no process which runs on the node to read this config in the background. Config is applied once kubelet is activated on pod scheduling, which in turn calls azurecni, which then loads the config. As the config creation / update does not occur atomically with node creation, we need a process which will schedule pod creations for some time, so that it is made sure the config is loaded at some point.

This SNAT task in cops-vigilante keeps a track of ready windows nodes, and schedules windows containers on them for ca. 30 minutes, after which the nodes are marked as "healed". Windows container used is as small as possible, currently we use the mcr.microsoft.com/oss/kubernetes/pause image, which is under 1 MB size!

If the node is considered healed, an annotation

cops-vigilante-snat-node-healed: "true"

will be added. Setting this annotation to "false", or removing it completely, will restart the healing process for this node.

General features

TLS via CertManager

To enable TLS for internal HTTP endpoint, you can use the Helm flag "create_certificates". Also make sure to set the "tls" option in the config section to true to load these certificates.

Prometheus integration via ServiceMonitor

Use the Helm flag "create_service_monitor" to enable the deplyoment of Prometheus Operator's ServiceMonitor resource. Metrics provided are prefixed with "cops_vigilante_".

Development

Check CONTRIBUTING.md

License

Check LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.github/workflows		.github/workflows
.idea		.idea
charts		charts
cmd		cmd
config		config
internal/vigilante		internal/vigilante
pkg/vigilante/helpers		pkg/vigilante/helpers
.dockerignore		.dockerignore
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
go.mod		go.mod
go.sum		go.sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cops-vigilante

Installation

Tasks

AKS Windows nodes SNAT issue with complex networking setups (VPN, hub-spoke networks, etc.)

General features

TLS via CertManager

Prometheus integration via ServiceMonitor

Development

License

About

Releases

Packages

Languages

License

conplementAG/cops-vigilante

Folders and files

Latest commit

History

Repository files navigation

cops-vigilante

Installation

Tasks

AKS Windows nodes SNAT issue with complex networking setups (VPN, hub-spoke networks, etc.)

General features

TLS via CertManager

Prometheus integration via ServiceMonitor

Development

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages