
Handle updates to certs or CA info #34

Merged (9 commits into master on Jul 16, 2019)

Conversation

@johnsca johnsca commented Jun 27, 2019

If the certificates or CA info is updated (re-issued, transferred to a different CA, etc), the charm needs to handle that by updating the files on disk (already done) and restarting the services (added here).

Cynerva commented Jun 28, 2019

tls_client.certs.changed is currently handled by kick_api_server. The new update_certs handler will conflict with it.

I do think that kick_api_server doesn't do enough. Presumably we'd want to restart kube-controller-manager, too, given that it takes in the CA and cert/key as options.

We probably don't need to restart kube-scheduler, but I also don't see any real harm in doing it.

All that said, I recommend removing the kick_api_server handler in favor of the new update_certs one.
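
For illustration, a minimal sketch (not the PR's actual code; the flag name comes from the discussion above, while the snap service names are assumed) of a single update_certs handler taking over from kick_api_server:

from charms.reactive import when, clear_flag
from charmhelpers.core import host

@when('tls_client.certs.changed')
def update_certs():
    # The certificates layer has already written the new cert/CA files to
    # disk; restart the services that load them so they pick up the change.
    for service in ('snap.kube-apiserver.daemon',
                    'snap.kube-controller-manager.daemon',
                    'snap.kube-scheduler.daemon'):
        host.service_restart(service)
    clear_flag('tls_client.certs.changed')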

johnsca commented Jul 1, 2019

Updated per review. As per the other PRs (copying here for completeness), this was tested manually on AWS by doing the following:

  1. Deploy CDK with the four patched charms & docker subordinate, wait to settle
  2. Run some test pod in the cluster (I used busybox and left processes running on the pod)
  3. Add Vault to the cluster, wait to settle
  4. Unseal Vault, wait to settle
  5. Remove EasyRSA, wait to settle
  6. Update kubectl config
  7. Confirm pod is still functioning

Cynerva commented Jul 1, 2019

> Unseal Vault, wait to settle

In testing, during this step, I found that there is a race that can leave kubernetes-master in a bad state where kube-apiserver is no longer able to connect to etcd.

The following sequence occurred:

  1. The certificates-relation-changed hook ran, resulting in new certs being written and kube-apiserver being restarted as expected. However, the cert information used to connect to etcd was *not* updated at that point, since it comes from the etcd relation (handled in handle_etcd_relation), not the certificates relation.
  2. The etcd-relation-changed hook ran, in which the etcd relation provided new client cert and CA info. However, start_master did not run again, so handle_etcd_relation was not called again, and the new certs were never written.

I confirmed with relation-get that the new cert info was available on the etcd relation; it was just never written again.

In this state, all kubectl calls fail. Hooks on kubernetes-master and kubernetes-worker are affected (staying in "executing" state for a long time, as they retry kubectl commands), and the cluster has not recovered on its own. Removing easyrsa did not help.
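
A hedged sketch of the kind of change detection that addresses this race (flag name assumed; get_client_credentials comes from the etcd interface layer this charm uses): notice when the etcd relation hands out new client cert/CA data and force the etcd-related config to be rewritten and the services restarted.

from charms.reactive import when, set_flag
from charms.reactive.helpers import data_changed

@when('etcd.available')
def check_etcd_client_certs(etcd):
    # If the etcd relation provides a new client cert/key/CA, record it so
    # that handle_etcd_relation runs again and kube-apiserver is restarted.
    if data_changed('etcd-client-certs', etcd.get_client_credentials()):
        set_flag('etcd.certs.changed')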

Cynerva commented Jul 1, 2019

Flannel also failed to switch to the new etcd client cert info. From journalctl -o cat -u flannel:

E0701 19:17:33.771678   29435 watch.go:171] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:34.511649   29435 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:34.787105   29435 watch.go:171] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:35.527465   29435 watch.go:43] Watch subnets: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority
; error #2: x509: certificate signed by unknown authority
E0701 19:17:35.805219   29435 watch.go:171] Subnet watch failed: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority
; error #1: x509: certificate signed by unknown authority

I suspect flannel, calico, canal, and tigera-secure-ee all need similar work to account for this transition, as all of these services connect directly to etcd.

johnsca commented Jul 1, 2019

Cynerva commented Jul 10, 2019

Ready for review. Tested the latest commit as follows:

  1. Deployed charmed-kubernetes with modified charms (bundle.yaml).
  2. Deployed vault in auto-unlock mode:
     juju deploy cs:percona-cluster
     juju deploy cs:~openstack-charmers-next/vault
     juju config vault auto-generate-root-ca-cert=true totally-unsecure-auto-unlock=true
     juju relate vault percona-cluster
  3. Waited for cluster to settle.
  4. Related vault to etcd:
     juju relate vault:certificates etcd
  5. Waited for cluster to settle (took ~5 minutes).
  6. Related vault to kubeapi-load-balancer, kubernetes-master, kubernetes-worker:
     juju relate vault:certificates kubeapi-load-balancer
     juju relate vault:certificates kubernetes-master
     juju relate vault:certificates kubernetes-worker
  7. Waited for cluster to settle (took another ~5 minutes).
  8. Removed easyrsa:
     juju remove-application easyrsa

Some comments:

  1. Addons that needed restarting: coredns, metrics-server, heapster, nginx-ingress-controller. I suspect others need it too, e.g. csi-rbdplugin. To be safe, I made the charm restart all addons that are managed by CDK. We will need to document for users that they may need to manually restart any addons they have added themselves.
  2. DNS impact is small, as CoreDNS still responds to DNS lookups while it is unable to reach k8s. Since the restart of CoreDNS is a rolling restart, there is no downtime.
  3. Ingress impact is moderate, as ingress is temporarily unavailable while nginx-ingress-controller is restarted (because it is a DaemonSet that binds to host ports).
  4. Doing the migration in two steps (etcd first, then kubernetes) reduces overall downtime, and in particular, seems to cause much less downtime for ingress.

Cynerva commented Jul 11, 2019

This now depends on charmed-kubernetes/cdk-addons#133

I updated the addon restart code to select on a new cdk-restart-on-ca-change label.

Tested again as described in #34 (comment), went well. Needs review.
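
As a rough illustration (label value and function name assumed, not the charm's actual restart_addons_for_ca code), selecting the opted-in addons could look like this, with the labeled pods deleted so their controllers recreate them against the new CA:

def restart_labeled_addons():
    # kubectl is the charms.layer.kubernetes_common.kubectl wrapper, which
    # supplies the binary and kubeconfig; only addons carrying the
    # cdk-restart-on-ca-change label are touched.
    kubectl('delete', 'pods', '--all-namespaces',
            '-l', 'cdk-restart-on-ca-change=true')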

Cynerva commented Jul 16, 2019

Do not merge, needs retest.

Updated kubectl calls in this PR to use charms.layer.kubernetes_common.kubectl, and changed the new remove_state calls to clear_flag. Needs review.
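
A small illustration of the two changes (the wrapper's exact behavior is assumed; its call convention matches the diff below):

from charms.reactive import clear_flag
from charms.layer.kubernetes_common import kubectl

# clear_flag is the current charms.reactive spelling of the old remove_state:
clear_flag('tls_client.certs.changed')

# the shared kubectl wrapper already invokes the kubectl binary with the
# charm's kubeconfig, so callers pass only the subcommand arguments:
output = kubectl('get', 'ServiceAccount', 'default', '-o', 'json')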

Cynerva commented Jul 16, 2019

Done testing, awaiting approval.

@@ -2466,7 +2466,7 @@ def restart_addons_for_ca():
     service_accounts = []
     for namespace, name in service_account_names:
         output = kubectl(
-            'kubectl', 'get', 'ServiceAccount', name,
+            'get', 'ServiceAccount', name,
totally read over that, good catch.
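
The exchange above refers to the one-word bug fixed in this diff: a wrapper along these lines (signature and kubeconfig path assumed) already supplies the binary, so the stray 'kubectl' argument would have produced a call like "kubectl kubectl get ServiceAccount ..." and failed.

from subprocess import check_output

def kubectl(*args):
    # assumed sketch of what charms.layer.kubernetes_common.kubectl does
    cmd = ['kubectl', '--kubeconfig=/root/.kube/config'] + list(args)
    return check_output(cmd)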

@Cynerva Cynerva merged commit eef3a3c into master Jul 16, 2019
@Cynerva Cynerva deleted the johnsca/bug/cert-ca-updates branch July 16, 2019 21:05
Cynerva pushed a commit that referenced this pull request Jul 30, 2019
* Handle updates to certs or CA info

If the certificates or CA info is updated (re-issued, transferred to a
different CA, etc), the charm needs to handle that by updating the files
on disk (already done) and restarting the services (added here).

* Remove kick_api_server in favor of update_certs

* Detect changes to etcd client cert and ensure restart

* Lower timeouts for kube-system pod status check

* Restart addons after CA changes

* Ensure ServiceAccount secrets are updated before restarting addons

* Use cdk-restart-on-ca-change label to select addons for restart

* Use kubernetes_common.kubectl and clear_flag

* Fix broken call to kubectl('kubectl', ...)