
Bug: Longhorn pod longhorn-admission-webhook stuck in Init state #598

Closed
Despire opened this issue Feb 21, 2023 · 12 comments
Labels
bug (Something isn't working), groomed (Task that everybody agrees to pass the gatekeeper)

Comments

@Despire
Contributor

Despire commented Feb 21, 2023

The longhorn-admission-webhook pods get stuck in the Init state forever.
[Screenshot 2023-02-21 at 08 17 26]

This issue arises occasionally in the CI, and a retry of the failed run usually resolves it.

However, while working on #575 I was changing the test-sets around and encountered this issue most of the time in the CI with a specific config: instead of keeping ts3-c-1 in test-set3/2.yaml, keep ts3-c-2 there.

Steps To Reproduce

  1. Change test-set3/2.yaml so that the ts3-c-2 cluster is left instead of ts3-c-1, i.e.:
kubernetes:
  clusters:
    - name: ts3-c-2
      version: v1.22.0
      network: 192.168.2.0/24
      pools:
        control:
          - aws-control
          - gcp-control
          - oci-control
        compute:
          - aws-compute
          - gcp-compute
          - oci-compute
  2. Run the CI with the changed manifests.
  3. Building this config in the CI will work; however, the testing framework will error out on the Longhorn timeout, as the longhorn-admission-webhook will be stuck in the Init state.

At the time of testing, the issue was 100% reproducible with the given config.

@Despire Despire added the bug label Feb 21, 2023
@MiroslavRepka
Contributor

Encountered this bug in the ts3-c-1 cluster of test set 3.
Here is some info from the cluster:

$ kc get nodes -o wide --kubeconfig c
NAME                STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
aws-compute-1       Ready    <none>                 63m   v1.22.0   192.168.2.2   <none>        Ubuntu 22.04.1 LTS   5.15.0-1026-aws      containerd://1.6.18
aws-control-1       Ready    control-plane,master   64m   v1.22.0   192.168.2.5   <none>        Ubuntu 22.04.1 LTS   5.15.0-1026-aws      containerd://1.6.18
hetzner-compute-1   Ready    <none>                 63m   v1.22.0   192.168.2.4   <none>        Ubuntu 22.04.1 LTS   5.15.0-56-generic    containerd://1.6.18
hetzner-control-1   Ready    control-plane,master   66m   v1.22.0   192.168.2.7   <none>        Ubuntu 22.04.1 LTS   5.15.0-56-generic    containerd://1.6.18
oci-compute-1       Ready    <none>                 63m   v1.22.0   192.168.2.3   <none>        Ubuntu 22.04.1 LTS   5.15.0-1021-oracle   containerd://1.6.18
oci-control-1       Ready    control-plane,master   65m   v1.22.0   192.168.2.6   <none>        Ubuntu 22.04.1 LTS   5.15.0-1021-oracle   containerd://1.6.18

$ kc --kubeconfig c get pods -n longhorn-system -o wide
NAME                                           READY   STATUS             RESTARTS        AGE     IP               NODE                NOMINATED NODE   READINESS GATES
csi-attacher-84b96d64c8-2dsrj                  1/1     Running            0               35m     10.244.132.217   hetzner-compute-1   <none>           <none>
csi-attacher-84b96d64c8-5vjj7                  1/1     Running            0               35m     10.244.132.202   hetzner-compute-1   <none>           <none>
csi-attacher-84b96d64c8-6qjgq                  1/1     Running            0               35m     10.244.132.220   hetzner-compute-1   <none>           <none>
csi-provisioner-6ccbfbf86f-9md7q               1/1     Running            0               35m     10.244.132.214   hetzner-compute-1   <none>           <none>
csi-provisioner-6ccbfbf86f-mtv79               1/1     Running            0               35m     10.244.132.212   hetzner-compute-1   <none>           <none>
csi-provisioner-6ccbfbf86f-wgzbr               1/1     Running            0               35m     10.244.132.207   hetzner-compute-1   <none>           <none>
csi-resizer-6dd8bd4c97-6nqcr                   1/1     Running            0               35m     10.244.132.216   hetzner-compute-1   <none>           <none>
csi-resizer-6dd8bd4c97-gnnbs                   1/1     Running            0               35m     10.244.132.205   hetzner-compute-1   <none>           <none>
csi-resizer-6dd8bd4c97-pxsw4                   1/1     Running            0               35m     10.244.132.210   hetzner-compute-1   <none>           <none>
csi-snapshotter-86f65d8bc-4qklp                1/1     Running            0               35m     10.244.132.211   hetzner-compute-1   <none>           <none>
csi-snapshotter-86f65d8bc-bxm7r                1/1     Running            0               35m     10.244.132.204   hetzner-compute-1   <none>           <none>
csi-snapshotter-86f65d8bc-xzwfj                1/1     Running            0               35m     10.244.132.209   hetzner-compute-1   <none>           <none>
engine-image-ei-766a591b-8kjbc                 1/1     Running            0               62m     10.244.70.130    oci-compute-1       <none>           <none>
engine-image-ei-766a591b-bqvpv                 1/1     Running            0               62m     10.244.167.9     aws-compute-1       <none>           <none>
engine-image-ei-766a591b-psp5d                 1/1     Running            0               62m     10.244.132.194   hetzner-compute-1   <none>           <none>
instance-manager-e-21e82048                    1/1     Running            0               30m     10.244.132.221   hetzner-compute-1   <none>           <none>
instance-manager-e-7b5850bf                    1/1     Running            0               31m     10.244.70.140    oci-compute-1       <none>           <none>
instance-manager-e-9399e183                    1/1     Running            0               31m     10.244.167.16    aws-compute-1       <none>           <none>
instance-manager-r-0a56cdfe                    1/1     Running            0               31m     10.244.70.141    oci-compute-1       <none>           <none>
instance-manager-r-31944f51                    1/1     Running            0               30m     10.244.132.222   hetzner-compute-1   <none>           <none>
instance-manager-r-4dbc45ad                    1/1     Running            0               31m     10.244.167.17    aws-compute-1       <none>           <none>
longhorn-admission-webhook-fcdb78d7c-fbpwc     0/1     Init:0/1           0               9m38s   10.244.70.144    oci-compute-1       <none>           <none>
longhorn-admission-webhook-fcdb78d7c-ss99j     0/1     Init:0/1           0               9m26s   10.244.167.18    aws-compute-1       <none>           <none>
longhorn-conversion-webhook-76c9cc7b6d-8w8gk   1/1     Running            0               35m     10.244.132.203   hetzner-compute-1   <none>           <none>
longhorn-conversion-webhook-76c9cc7b6d-rtnj6   1/1     Running            0               35m     10.244.132.213   hetzner-compute-1   <none>           <none>
longhorn-csi-plugin-bn27b                      2/2     Running            0               62m     10.244.70.138    oci-compute-1       <none>           <none>
longhorn-csi-plugin-k8cwh                      2/2     Running            0               62m     10.244.132.201   hetzner-compute-1   <none>           <none>
longhorn-csi-plugin-wqhwm                      2/2     Running            0               62m     10.244.167.14    aws-compute-1       <none>           <none>
longhorn-driver-deployer-74c5d667d7-vb8tv      0/1     Init:0/1           0               13m     10.244.70.143    oci-compute-1       <none>           <none>
longhorn-manager-fhll2                         1/1     Running            0               63m     10.244.132.193   hetzner-compute-1   <none>           <none>
longhorn-manager-kd4ks                         1/1     Running            0               63m     10.244.167.4     aws-compute-1       <none>           <none>
longhorn-manager-qdnp6                         1/1     Running            0               63m     10.244.70.129    oci-compute-1       <none>           <none>
longhorn-ui-85498cd6fb-sr2gf                   0/1     CrashLoopBackOff   10 (3m4s ago)   35m     10.244.132.215   hetzner-compute-1   <none>           <none>

$ kc --kubeconfig c get svc -n longhorn-system -o wide
NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE   SELECTOR
csi-attacher                  ClusterIP   10.108.127.170   <none>        12345/TCP   62m   app=csi-attacher
csi-provisioner               ClusterIP   10.105.116.109   <none>        12345/TCP   62m   app=csi-provisioner
csi-resizer                   ClusterIP   10.105.0.112     <none>        12345/TCP   62m   app=csi-resizer
csi-snapshotter               ClusterIP   10.103.151.120   <none>        12345/TCP   62m   app=csi-snapshotter
longhorn-admission-webhook    ClusterIP   10.96.207.204    <none>        9443/TCP    63m   app=longhorn-admission-webhook
longhorn-backend              ClusterIP   10.108.202.237   <none>        9500/TCP    63m   app=longhorn-manager
longhorn-conversion-webhook   ClusterIP   10.107.166.19    <none>        9443/TCP    63m   app=longhorn-conversion-webhook
longhorn-engine-manager       ClusterIP   None             <none>        <none>      63m   longhorn.io/component=instance-manager,longhorn.io/instance-manager-type=engine
longhorn-frontend             ClusterIP   10.107.239.68    <none>        80/TCP      63m   app=longhorn-ui
longhorn-replica-manager      ClusterIP   None             <none>        <none>      63m   longhorn.io/component=instance-manager,longhorn.io/instance-manager-type=replica

Checking the logs of the containers stuck in the Init state was not helpful. The longhorn-ui pod was also crash-looping; I could not find out why (the cluster got deleted).
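
For future occurrences, the init container state and any related events can usually be inspected directly (pod name taken from the listing above; the init container name is a placeholder, check it via describe first):

$ kc --kubeconfig c -n longhorn-system describe pod longhorn-admission-webhook-fcdb78d7c-fbpwc
$ kc --kubeconfig c -n longhorn-system logs longhorn-admission-webhook-fcdb78d7c-fbpwc -c <init-container-name>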

@cloudziu
Contributor

cloudziu commented Mar 8, 2023

From what I see, the initContainers have a simple while loop in which they wait for other services to reach the Ready state: https://github.com/longhorn/longhorn/blob/92fd5b54edda215b1d7b410d003c57a672a486bf/deploy/longhorn.yaml#LL4257C26-L4257C26

longhorn-admission-webhook is waiting for longhorn-conversion-webhook
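
As a reference, the wait is essentially a shell loop along these lines (paraphrased from the linked manifest, not copied verbatim; the service name and port match the longhorn-conversion-webhook service listed above, and the /v1/healthz path is an assumption):

while [ "$(curl -m 1 -s -o /dev/null -w '%{http_code}' -k https://longhorn-conversion-webhook:9443/v1/healthz)" != "200" ]; do
  echo "waiting for longhorn-conversion-webhook"   # retry until the conversion webhook answers
  sleep 2
done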

I see that in both cases above the longhorn-conversion-webhook was already in the Running state, but I can also see that in @MiroslavRepka's output both of the conversion-webhook pods were located on Hetzner nodes (I truly want to believe that this is not related in any way).

UPDATE:
Could not reproduce it locally. The bug also did not occur while running the latest E2E tests. The next time you encounter this bug, please review the Kubernetes network components (a sketch of these checks follows below):

  • Service endpoints - kubectl -n longhorn-system get ep -o wide
  • Whether a Pod can connect to another Pod by IP address
  • CoreDNS status, and whether pods can resolve DNS queries
  • Kubernetes CNI status
  • Kube-proxy status

You can also hit me up when this occurs again :)
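
A rough sketch of those checks, assuming a kubeadm-style cluster (the label selectors and the busybox image are assumptions; <pod-ip> and <port> are placeholders for a target pod from the listings above):

$ kubectl -n longhorn-system get ep -o wide
$ kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide              # CoreDNS status
$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50                # DNS query errors, if any
$ kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide            # kube-proxy status
$ kubectl -n kube-system get pods -o wide | grep -i -e calico -e cilium    # CNI status
$ kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- wget -qO- -T 2 http://<pod-ip>:<port>   # pod-to-pod connectivity by IP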

@Despire Despire self-assigned this Apr 14, 2023
@Despire
Contributor Author

Despire commented Apr 18, 2023

I wasn't able to re-create the bug.

I tried going through the test-sets multiple times locally, and also in our GCP cluster.

I went over the issues in the longhorn repository and stumbled upon a very similar issue to ours, longhorn/longhorn#5645, where the root cause was DNS.
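
If it's DNS again, a quick way to rule that in or out the next time this reproduces is to resolve the conversion webhook's service name from a throwaway pod (illustrative command; the busybox image is an assumption):

$ kubectl -n longhorn-system run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup longhorn-conversion-webhook.longhorn-system.svc.cluster.local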

@Despire Despire removed their assignment Apr 18, 2023
@MiroslavRepka
Contributor

I got this bug in a recent CI run, and I tried to restart CoreDNS, to no avail. The pods were still stuck in the Init state and there was no endpoint for the webhook. I even tried to restart Calico, but nothing helped.

$ kc get ep -n longhorn-system 
NAME                          ENDPOINTS                                                      AGE
csi-attacher                  10.244.132.197:12345,10.244.70.150:12345,10.244.70.154:12345   41m
csi-provisioner               10.244.132.198:12345,10.244.70.151:12345,10.244.70.153:12345   41m
csi-resizer                   10.244.132.201:12345,10.244.70.139:12345,10.244.70.142:12345   41m
csi-snapshotter               10.244.132.199:12345,10.244.70.145:12345,10.244.70.149:12345   41m
longhorn-admission-webhook                                                                   42m
longhorn-backend              10.244.132.193:9500,10.244.167.2:9500,10.244.70.129:9500       42m
longhorn-conversion-webhook   10.244.132.203:9443,10.244.167.22:9443                         42m
longhorn-engine-manager       10.244.132.195,10.244.167.20,10.244.70.157                     42m
longhorn-frontend             10.244.70.144:8000,10.244.70.152:8000                          42m
longhorn-recovery-backend     10.244.70.140:9600,10.244.70.146:9600                          42m
longhorn-replica-manager      10.244.132.194,10.244.167.19,10.244.70.156                     42m
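
Note that the empty longhorn-admission-webhook endpoint is expected while its pods are stuck in Init, since endpoints only list Ready pods. A more telling check (sketch only; the /v1/healthz path is an assumption and the curl image is just an example) is whether the conversion webhook service is reachable from inside the cluster:

$ kubectl -n longhorn-system run curl-test --rm -it --restart=Never --image=curlimages/curl:8.4.0 -- curl -k -m 2 -s -o /dev/null -w "%{http_code}\n" https://longhorn-conversion-webhook:9443/v1/healthz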

@Despire
Contributor Author

Despire commented May 9, 2023

I've also noticed that sometimes when we delete a node from a cluster, the deletion gets stuck forever on deleting nodes.longhorn.io ...: https://github.com/berops/claudie/blob/master/services/kuber/server/nodes/delete.go#L172

However, I'm unsure if that's relevant to this issue.
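
If that stuck deletion shows up again, it's worth checking whether a finalizer on the Longhorn node object is what blocks it; a rough sketch (the node name is a placeholder):

$ kubectl -n longhorn-system get nodes.longhorn.io
$ kubectl -n longhorn-system get nodes.longhorn.io <node-name> -o jsonpath='{.metadata.finalizers}'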

@MarioUhrik MarioUhrik added the groomed label May 12, 2023
@Despire
Contributor Author

Despire commented Aug 23, 2023

resolved by #984

@Despire Despire closed this as completed Aug 23, 2023
@defyjoy

defyjoy commented Oct 18, 2023

Why is this closed? This issue is still persisting. It is completely out of any IPs in the longhorn repository. The webhook fails due to this issue. I even installed Cilium version v1.15.0-pre.1 instead of v1.14.2, hoping it's fixed.


@Despire
Contributor Author

Despire commented Oct 18, 2023

Yes, you're right. Originally the issue was fixed when we switched to Cilium with eBPF, but that introduced other problems and we had to fall back to kube-proxy, where this issue was reintroduced.

This issue should be re-opened, which I'll do.

@Despire Despire reopened this Oct 18, 2023
@defyjoy

defyjoy commented Oct 25, 2023

Is there any resolution to this for Cilium eBPF? I don't see any traction, so I was wondering what might be the way to fix this.

@bernardhalas
Member

Currently we don't have a solution for this issue. We know that using Cilium as a CNI without kube-proxy solves this problem (though it introduces others). Any input is welcome.
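
For anyone who wants to experiment with that workaround, kube-proxy-less mode in Cilium is driven by its Helm values, roughly along these lines (a sketch only; <api-server-ip> is a placeholder and the exact value accepted by kubeProxyReplacement differs between Cilium versions):

$ helm install cilium cilium/cilium --namespace kube-system \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost=<api-server-ip> \
    --set k8sServicePort=6443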

@JKBGIT1
Contributor

JKBGIT1 commented Nov 23, 2023

Hi @defyjoy, could you provide the InputManifest with which you encountered this issue?

@Despire
Contributor Author

Despire commented May 20, 2024

This issue has not been seen since #1366.
It's possible that other bug fixes might have also resolved it.

I'll close this as done for now. If it resurfaces, I'll re-open it.

@Despire Despire closed this as completed May 20, 2024