
Handle updates to certs or CA info #34

Merged (9 commits into master on Jul 16, 2019)

Conversation

@johnsca johnsca commented Jun 27, 2019

If the certificates or CA info is updated (re-issued, transferred to a different CA, etc), the charm needs to handle that by updating the files on disk (already done) and restarting the services (added here).

Cynerva commented Jun 28, 2019

tls_client.certs.changed is currently handled by kick_api_server. The new update_certs handler will conflict with it.

I do think that kick_api_server doesn't do enough. Presumably we'd want to restart kube-controller-manager, too, given that it takes in the CA and cert/key as options.

We probably don't need to restart kube-scheduler, but I also don't see any real harm in doing it.

All that said, I recommend removing the kick_api_server handler in favor of the new update_certs one.
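
For illustration, a minimal sketch (not the PR's actual code; the flag name comes from the discussion above, while the snap service names are assumed) of a single update_certs handler taking over from kick_api_server:

from charms.reactive import when, clear_flag
from charmhelpers.core import host

@when('tls_client.certs.changed')
def update_certs():
    # The certificates layer has already written the new cert/CA files to
    # disk; restart the services that load them so they pick up the change.
    for service in ('snap.kube-apiserver.daemon',
                    'snap.kube-controller-manager.daemon',
                    'snap.kube-scheduler.daemon'):
        host.service_restart(service)
    clear_flag('tls_client.certs.changed')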

johnsca commented Jul 1, 2019

Updated per review. As per the other PRs (copying here for completeness), this was tested manually on AWS by doing the following:

  1. Deploy CDK with the four patched charms & docker subordinate, wait to settle
  2. Run some test pod in the cluster (I used busybox and left processes running on the pod)
  3. Add Vault to the cluster, wait to settle
  4. Unseal Vault, wait to settle
  5. Remove EasyRSA, wait to settle
  6. Update kubectl config
  7. Confirm pod is still functioning

Cynerva commented Jul 1, 2019

> Unseal Vault, wait to settle

In testing, during this step, I found that there is a race that can leave kubernetes-master in a bad state where kube-apiserver is no longer able to connect to etcd.

The following sequence occurred:

  1. The certificates-relation-changed hook ran, resulting in new certs being written and kube-apiserver being restarted as expected. However, the cert information used to connect to etcd was *not* updated at that point, since it comes from the etcd relation (handled in handle_etcd_relation), not the certificates relation.
  2. The etcd-relation-changed hook ran, in which the etcd relation provided new client cert and CA info. However, start_master did not run again, so handle_etcd_relation was not called again, and the new certs were never written.

I confirmed with relation-get that the new cert info was available on the etcd relation; it was just never written again.

In this state, all kubectl calls fail. Hooks on kubernetes-master and kubernetes-worker are affected (staying in "executing" state for a long time, as they retry kubectl commands), and the cluster has not recovered on its own. Removing easyrsa did not help.
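
A hedged sketch of the kind of change detection that addresses this race (flag name assumed; get_client_credentials comes from the etcd interface layer this charm uses): notice when the etcd relation hands out new client cert/CA data and force the etcd-related config to be rewritten and the services restarted.

from charms.reactive import when, set_flag
from charms.reactive.helpers import data_changed

@when('etcd.available')
def check_etcd_client_certs(etcd):
    # If the etcd relation provides a new client cert/key/CA, record it so
    # that handle_etcd_relation runs again and kube-apiserver is restarted.
    if data_changed('etcd-client-certs', etcd.get_client_credentials()):
        set_flag('etcd.certs.changed')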

Cynerva commented Jul 1, 2019

Flannel also failed to switch to the new etcd client cert info. From journalctl -o cat -u flannel:

E0701 19:17:33.771678   29435 watch.go:171] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:34.511649   29435 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:34.787105   29435 watch.go:171] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:35.527465   29435 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:35.805219   29435 watch.go:171] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority

I suspect flannel, calico, canal, and tigera-secure-ee all need similar work to account for this transition, as all of these services connect directly to etcd.

johnsca commented Jul 1, 2019

Cynerva commented Jul 10, 2019

Ready for review. Tested the latest commit as follows:

  1. Deployed charmed-kubernetes with modified charms (bundle.yaml).
  2. Deployed vault in auto-unlock mode:
     juju deploy cs:percona-cluster
     juju deploy cs:~openstack-charmers-next/vault
     juju config vault auto-generate-root-ca-cert=true totally-unsecure-auto-unlock=true
     juju relate vault percona-cluster
  3. Waited for cluster to settle.
  4. Related vault to etcd:
     juju relate vault:certificates etcd
  5. Waited for cluster to settle (took ~5 minutes).
  6. Related vault to kubeapi-load-balancer, kubernetes-master, kubernetes-worker:
     juju relate vault:certificates kubeapi-load-balancer
     juju relate vault:certificates kubernetes-master
     juju relate vault:certificates kubernetes-worker
  7. Waited for cluster to settle (took another ~5 minutes).
  8. Removed easyrsa:
     juju remove-application easyrsa

Some comments:

  1. Addons that needed restarting: coredns, metrics-server, heapster, nginx-ingress-controller. I suspect others need it too, e.g. csi-rbdplugin. To be safe, I made the charm restart all addons that are managed by CDK. We will need to document for users that they may need to manually restart any addons they have added themselves.
  2. DNS impact is small, as CoreDNS still responds to DNS lookups while it is unable to reach k8s. Since the restart of CoreDNS is a rolling restart, there is no downtime.
  3. Ingress impact is moderate, as ingress is temporarily unavailable while nginx-ingress-controller is restarted (because it is a DaemonSet that binds to host ports).
  4. Doing the migration in two steps (etcd first, then kubernetes) reduces overall downtime, and in particular, seems to cause much less downtime for ingress.

Cynerva commented Jul 11, 2019

This now depends on charmed-kubernetes/cdk-addons#133

I updated the addon restart code to select on a new cdk-restart-on-ca-change label.

Tested again as described in #34 (comment), went well. Needs review.
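
As a rough illustration (label value and function name assumed, not the charm's actual restart_addons_for_ca code), selecting the opted-in addons could look like this, with the labeled pods deleted so their controllers recreate them against the new CA:

def restart_labeled_addons():
    # kubectl is the charms.layer.kubernetes_common.kubectl wrapper, which
    # supplies the binary and kubeconfig; only addons carrying the
    # cdk-restart-on-ca-change label are touched.
    kubectl('delete', 'pods', '--all-namespaces',
            '-l', 'cdk-restart-on-ca-change=true')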

Cynerva commented Jul 16, 2019

Do not merge, needs retest.

Updated kubectl calls in this PR to use charms.layer.kubernetes_common.kubectl, and changed the new remove_state calls to clear_flag. Needs review.
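
A small illustration of the two changes (the wrapper's exact behavior is assumed; its call convention matches the diff below):

from charms.reactive import clear_flag
from charms.layer.kubernetes_common import kubectl

# clear_flag is the current charms.reactive spelling of the old remove_state:
clear_flag('tls_client.certs.changed')

# the shared kubectl wrapper already invokes the kubectl binary with the
# charm's kubeconfig, so callers pass only the subcommand arguments:
output = kubectl('get', 'ServiceAccount', 'default', '-o', 'json')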

Cynerva commented Jul 16, 2019

Done testing, awaiting approval.

@@ -2466,7 +2466,7 @@ def restart_addons_for_ca():
     service_accounts = []
     for namespace, name in service_account_names:
         output = kubectl(
-            'kubectl', 'get', 'ServiceAccount', name,
+            'get', 'ServiceAccount', name,
totally read over that, good catch.
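
The exchange above refers to the one-word bug fixed in this diff: a wrapper along these lines (signature and kubeconfig path assumed) already supplies the binary, so the stray 'kubectl' argument would have produced a call like "kubectl kubectl get ServiceAccount ..." and failed.

from subprocess import check_output

def kubectl(*args):
    # assumed sketch of what charms.layer.kubernetes_common.kubectl does
    cmd = ['kubectl', '--kubeconfig=/root/.kube/config'] + list(args)
    return check_output(cmd)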

@Cynerva Cynerva merged commit eef3a3c into master Jul 16, 2019
@Cynerva Cynerva deleted the johnsca/bug/cert-ca-updates branch July 16, 2019 21:05
Cynerva pushed a commit that referenced this pull request Jul 30, 2019
* Handle updates to certs or CA info

If the certificates or CA info is updated (re-issued, transferred to a
different CA, etc), the charm needs to handle that by updating the files
on disk (already done) and restarting the services (added here).

* Remove kick_api_server in favor of update_certs

* Detect changes to etcd client cert and ensure restart

* Lower timeouts for kube-system pod status check

* Restart addons after CA changes

* Ensure ServiceAccount secrets are updated before restarting addons

* Use cdk-restart-on-ca-change label to select addons for restart

* Use kubernetes_common.kubectl and clear_flag

* Fix broken call to kubectl('kubectl', ...)