Helm update fails if 6443/tcp is missing from webhook egress networkpolicy #5787

Closed
ExNG opened this issue Feb 9, 2023 · 2 comments


ExNG commented Feb 9, 2023

Describe the bug:
When updating cert-manager from v1.8.0 to v1.11.0 on OKD 4.12, a NetworkPolicy with egress rules for the new webhook pod is created.
When the new webhook deployment is applied afterwards, the resulting pod cannot connect to https://172.30.0.1:443/api, so the helm install fails.
I understand this might be a bug very specific to our environment, but it's an easy fix.
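The policies the chart created can be inspected with something like this (the namespace is an assumption here, use whichever one the chart was installed into):

$ kubectl get networkpolicy -n cert-manager
$ kubectl get networkpolicy -n cert-manager -o yaml

In our case 6443/tcp was not among the allowed egress ports of the webhook policy, which matches the timeout in the log below.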

Log output before fix:

cert-manager-webhook-7c4b9d44bc-j44rl cert-manager-webhook I0209 10:18:48.934179       1 feature_gate.go:249] feature gates: &{map[]}
cert-manager-webhook-7c4b9d44bc-j44rl cert-manager-webhook W0209 10:18:48.934378       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
cert-manager-webhook-7c4b9d44bc-j44rl cert-manager-webhook E0209 10:19:18.942014       1 webhook.go:122] cert-manager "msg"="Failed initialising server" "error"="error building admission chain: Get \"https://172.30.0.1:443/api\": dial tcp 172.30.0.1:443: i/o timeout" 

And after:

cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.793543       1 feature_gate.go:249] feature gates: &{map[]}
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook W0209 10:20:35.794006       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.828322       1 webhook.go:129] cert-manager "msg"="using dynamic certificate generating using CA stored in Secret resource" "secret_name"="cert-manager-webhook-ca" "secret_namespace"="cert-manager"
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.829816       1 server.go:133] cert-manager/webhook "msg"="listening for insecure healthz connections" "address"=":6080"
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.830066       1 server.go:197] cert-manager/webhook "msg"="listening for secure connections" "address"=":10250"
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:36.842200       1 dynamic_source.go:266] cert-manager/webhook "msg"="Updated cert-manager webhook TLS certificate" "DNSNames"=["cert-manager-webhook","cert-manager-webhook.cert-manager","cert-manager-webhook.cert-manager.svc"]

Expected behaviour:
The webhook pod should start without a timeout error.

Steps to reproduce the bug:

  1. Install cert-manager v1.8.0 on OKD 4.11 using the helm chart
  2. Update to v1.11.0 (a rough sketch of the helm commands is below)
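
For reference, here is a rough sketch of the helm commands (our actual values file is environment-specific; the only toggle relevant here is the webhook network policy value):

# assumes the jetstack repo has been added: helm repo add jetstack https://charts.jetstack.io
$ helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace \
    --version v1.8.0 \
    --set installCRDs=true

# the upgrade to v1.11.0 with the webhook egress network policy enabled
# is where the webhook pod starts timing out against the API server
$ helm upgrade cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --version v1.11.0 \
    --set installCRDs=true \
    --set webhook.networkPolicy=true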

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.25
  • Cloud-provider/provisioner: OKD 4.12 using terraform
  • cert-manager version: 1.11.0
  • Install method: helm

/kind bug

@jetstack-bot added the kind/bug label on Feb 9, 2023

maelvls commented Feb 10, 2023

Here is a recap of the investigation discussion we had in the PR:

By default, i.e. when no network policy selects a pod, traffic is "allow all", including with OVN-Kubernetes (source). This is why the cert-manager controller pod can talk to the Kubernetes API server without a problem: no network policy is attached to it, so it falls under "allow all".

When using --set webhook.networkPolicy=true, traffic from and to the cert-manager webhook pod is "denied by default", with a few exceptions such as 443/TCP. But allowing 443/TCP doesn't cover traffic to the Kubernetes API server in OKD, because OpenShift and OKD clusters use port 6443 for the Kubernetes API server.

When the webhook pod tries to talk to the API server, the packet's destination IP and port are rewritten: dst 172.30.0.1:443 (the service clusterIP) becomes dst 100.64.0.1:6443 (the control plane node), and the egress policy is evaluated against that rewritten destination.

If you would like to know if you are also affected by this issue, check whether your Kubernetes API server is served on port 6443 by running the following command:

$ k get endpoints -n default kubernetes
NAME         ENDPOINTS            AGE
kubernetes   100.64.0.1:6443      26d

The reason other people haven't hit this issue until now is that most Kubernetes clusters expose kube-apiserver on port 443. For example, on GKE the Kubernetes API server listens on port 443:

$ k get endpoints -n default kubernetes
NAME         ENDPOINTS            AGE
kubernetes   104.199.89.236:443   3y140d

And here is a diagram showing the webhook trying to open a TCP connection to kube-apiserver:

 worker node                                  
 host IP:  100.64.0.2                         
 pod cidr: 10.28.0.0/24                       
 +-------------------------------------------+
 |                                           |
 |   +----------------------------------+    |
 |   |     cert-manager-webhook pod     |    |
 |   |                                  |    |
 |   | src: 10.28.0.5:60123 (podIP)     |    |
 |   | dst: 172.30.0.1:443  (clusterIP) |    |
 |   |             |                    |    |
 |   +-------------|--------------------+    |
 |                 |                         |
 |                 v                         |
 |      src: 10.28.0.5:60123    (podIP)      |
 |     -dst: 172.30.0.1:443     (clusterIP)  |
 |     +dst: 100.64.0.1:6443                 |
 |                 |                         |
 |                 |                         |
 |                 |                         |
 |                 |                         |
 +-----------------|-------------------------+
                   |                          
                   |                          
                   X   REFUSED
                   |                          
                   |                          
                   v                          
    +-----------------------------------+     
    |         kube-apiserver            |     
    +-----------------------------------+     
    control plane node                        
    host IP: 100.64.0.1   
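
Until the chart's own egress policy allows 6443/TCP, one possible workaround is to add a second policy next to it, since NetworkPolicy allow rules are additive across policies. This is only a sketch: the namespace and the pod label are assumptions based on the chart's defaults.

$ kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cert-manager-webhook-allow-6443   # illustrative name
  namespace: cert-manager                 # assumed install namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: webhook     # assumed default chart label
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 6443
          protocol: TCP
EOF

With that extra rule in place, the rewritten destination 100.64.0.1:6443 is allowed and the webhook can reach kube-apiserver.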


ExNG commented Feb 13, 2023

My patch for this issue, #5788, has been merged. @maelvls, thank you very much!
