Helm update fails if 6443/tcp is missing from webhook egress networkpolicy #5787

Closed
ExNG opened this issue Feb 9, 2023 · 2 comments


ExNG commented Feb 9, 2023

Describe the bug:
When updating cert-manager from v1.8.0 to v1.11.0 on OKD 4.12, a NetworkPolicy with egress rules for the new webhook pod is created.
When the new webhook deployment is applied afterwards, the resulting pod cannot connect to https://172.30.0.1:443/api, so the helm install fails.
I understand this might be a bug very specific to our environment, but it's an easy fix.
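The policies the chart created can be inspected with something like this (the namespace is an assumption here, use whichever one the chart was installed into):

$ kubectl get networkpolicy -n cert-manager
$ kubectl get networkpolicy -n cert-manager -o yaml

In our case 6443/tcp was not among the allowed egress ports of the webhook policy, which matches the timeout in the log below.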

Log output before fix:

cert-manager-webhook-7c4b9d44bc-j44rl cert-manager-webhook I0209 10:18:48.934179       1 feature_gate.go:249] feature gates: &{map[]}
cert-manager-webhook-7c4b9d44bc-j44rl cert-manager-webhook W0209 10:18:48.934378       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
cert-manager-webhook-7c4b9d44bc-j44rl cert-manager-webhook E0209 10:19:18.942014       1 webhook.go:122] cert-manager "msg"="Failed initialising server" "error"="error building admission chain: Get \"https://172.30.0.1:443/api\": dial tcp 172.30.0.1:443: i/o timeout" 

And after:

cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.793543       1 feature_gate.go:249] feature gates: &{map[]}
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook W0209 10:20:35.794006       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.828322       1 webhook.go:129] cert-manager "msg"="using dynamic certificate generating using CA stored in Secret resource" "secret_name"="cert-manager-webhook-ca" "secret_namespace"="cert-manager"
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.829816       1 server.go:133] cert-manager/webhook "msg"="listening for insecure healthz connections" "address"=":6080"
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:35.830066       1 server.go:197] cert-manager/webhook "msg"="listening for secure connections" "address"=":10250"
cert-manager-webhook-7c4b9d44bc-p9lwq cert-manager-webhook I0209 10:20:36.842200       1 dynamic_source.go:266] cert-manager/webhook "msg"="Updated cert-manager webhook TLS certificate" "DNSNames"=["cert-manager-webhook","cert-manager-webhook.cert-manager","cert-manager-webhook.cert-manager.svc"]

Expected behaviour:
The webhook pod should start without a timeout error.

Steps to reproduce the bug:

  1. Install cert-manager v1.8.0 on OKD 4.11 using the helm chart
  2. Update to v1.11.0 (a rough sketch of the helm commands is below)
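
For reference, here is a rough sketch of the helm commands (our actual values file is environment-specific; the only toggle relevant here is the webhook network policy value):

# assumes the jetstack repo has been added: helm repo add jetstack https://charts.jetstack.io
$ helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace \
    --version v1.8.0 \
    --set installCRDs=true

# the upgrade to v1.11.0 with the webhook egress network policy enabled
# is where the webhook pod starts timing out against the API server
$ helm upgrade cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --version v1.11.0 \
    --set installCRDs=true \
    --set webhook.networkPolicy=true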

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.25
  • Cloud-provider/provisioner: OKD 4.12 using terraform
  • cert-manager version: 1.11.0
  • Install method: helm

/kind bug

@jetstack-bot added the kind/bug label on Feb 9, 2023

maelvls commented Feb 10, 2023

Here is a recap of the investigation discussion we had in the PR:

By default, i.e. when no network policy selects a pod, traffic is "allow all", including with OVN-Kubernetes (source). This is why the cert-manager controller pod can talk to the Kubernetes API server without a problem: no network policy is attached to it, so it falls under "allow all".

When using --set webhook.networkPolicy=true, traffic from and to the cert-manager webhook pod is "denied by default", with a few exceptions such as 443/TCP. But allowing 443/TCP doesn't cover traffic to the Kubernetes API server in OKD, because OpenShift and OKD clusters use port 6443 for the Kubernetes API server.

When the webhook pod tries to talk to the API server, the packet's destination IP and port are rewritten: dst 172.30.0.1:443 (the service clusterIP) becomes dst 100.64.0.1:6443 (the control plane node), and the egress policy is evaluated against that rewritten destination.

If you would like to know if you are also affected by this issue, check whether your Kubernetes API server is served on port 6443 by running the following command:

$ k get endpoints -n default kubernetes
NAME         ENDPOINTS            AGE
kubernetes   100.64.0.1:6443      26d

The reason other people haven't hit this issue until now is that most Kubernetes clusters expose kube-apiserver on port 443. For example, on GKE the Kubernetes API server listens on port 443:

$ k get endpoints -n default kubernetes
NAME         ENDPOINTS            AGE
kubernetes   104.199.89.236:443   3y140d

And here is a diagram showing the webhook trying to open a TCP connection to kube-apiserver:

 worker node                                  
 host IP:  100.64.0.2                         
 pod cidr: 10.28.0.0/24                       
 +-------------------------------------------+
 |                                           |
 |   +----------------------------------+    |
 |   |     cert-manager-webhook pod     |    |
 |   |                                  |    |
 |   | src: 10.28.0.5:60123 (podIP)     |    |
 |   | dst: 172.30.0.1:443  (clusterIP) |    |
 |   |             |                    |    |
 |   +-------------|--------------------+    |
 |                 |                         |
 |                 v                         |
 |      src: 10.28.0.5:60123    (podIP)      |
 |     -dst: 172.30.0.1:443     (clusterIP)  |
 |     +dst: 100.64.0.1:6443                 |
 |                 |                         |
 |                 |                         |
 |                 |                         |
 |                 |                         |
 +-----------------|-------------------------+
                   |                          
                   |                          
                   X   REFUSED
                   |                          
                   |                          
                   v                          
    +-----------------------------------+     
    |         kube-apiserver            |     
    +-----------------------------------+     
    control plane node                        
    host IP: 100.64.0.1   
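
Until the chart's own egress policy allows 6443/TCP, one possible workaround is to add a second policy next to it, since NetworkPolicy allow rules are additive across policies. This is only a sketch: the namespace and the pod label are assumptions based on the chart's defaults.

$ kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cert-manager-webhook-allow-6443   # illustrative name
  namespace: cert-manager                 # assumed install namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: webhook     # assumed default chart label
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 6443
          protocol: TCP
EOF

With that extra rule in place, the rewritten destination 100.64.0.1:6443 is allowed and the webhook can reach kube-apiserver.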


ExNG commented Feb 13, 2023

My patch for this issue, #5788, has been merged. @maelvls, thank you very much!
