
Bug: Longhorn pod longhorn-admission-webhook stuck in Init state #598

Closed
Despire opened this issue Feb 21, 2023 · 12 comments
Labels
bug (Something isn't working), groomed (Task that everybody agrees to pass the gatekeeper)

Comments

@Despire
Contributor

Despire commented Feb 21, 2023

The longhorn-admission-webhook pods get stuck in the Init state forever.
[Screenshot 2023-02-21 at 08 17 26]

This issue arises occasionally in the CI, and a retry of the failed run usually resolves it.

However, while working on #575 I was changing the test-sets around and encountered this issue most of the time in the CI with a specific config: instead of keeping ts3-c-1 in test-set3/2.yaml, keep ts3-c-2 there.

Steps To Reproduce

  1. Change test-set3/2.yaml so that the ts3-c-2 cluster is left instead of ts3-c-1, i.e.:
kubernetes:
  clusters:
    - name: ts3-c-2
      version: v1.22.0
      network: 192.168.2.0/24
      pools:
        control:
          - aws-control
          - gcp-control
          - oci-control
        compute:
          - aws-compute
          - gcp-compute
          - oci-compute
  2. Run the CI with the changed manifests.
  3. Building this config in the CI will work; however, the testing framework will error out on the Longhorn timeout, as the longhorn-admission-webhook will be stuck in the Init state.

At the time of testing, the issue was 100% reproducible with the given config.

@Despire Despire added the bug label Feb 21, 2023
@MiroslavRepka
Contributor

Encountered this bug in the ts3-c-1 cluster of test set 3.
Here is some info from the cluster:

$ kc get nodes -o wide --kubeconfig c
NAME                STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
aws-compute-1       Ready    <none>                 63m   v1.22.0   192.168.2.2   <none>        Ubuntu 22.04.1 LTS   5.15.0-1026-aws      containerd://1.6.18
aws-control-1       Ready    control-plane,master   64m   v1.22.0   192.168.2.5   <none>        Ubuntu 22.04.1 LTS   5.15.0-1026-aws      containerd://1.6.18
hetzner-compute-1   Ready    <none>                 63m   v1.22.0   192.168.2.4   <none>        Ubuntu 22.04.1 LTS   5.15.0-56-generic    containerd://1.6.18
hetzner-control-1   Ready    control-plane,master   66m   v1.22.0   192.168.2.7   <none>        Ubuntu 22.04.1 LTS   5.15.0-56-generic    containerd://1.6.18
oci-compute-1       Ready    <none>                 63m   v1.22.0   192.168.2.3   <none>        Ubuntu 22.04.1 LTS   5.15.0-1021-oracle   containerd://1.6.18
oci-control-1       Ready    control-plane,master   65m   v1.22.0   192.168.2.6   <none>        Ubuntu 22.04.1 LTS   5.15.0-1021-oracle   containerd://1.6.18

$ kc --kubeconfig c get pods -n longhorn-system -o wide
NAME                                           READY   STATUS             RESTARTS        AGE     IP               NODE                NOMINATED NODE   READINESS GATES
csi-attacher-84b96d64c8-2dsrj                  1/1     Running            0               35m     10.244.132.217   hetzner-compute-1   <none>           <none>
csi-attacher-84b96d64c8-5vjj7                  1/1     Running            0               35m     10.244.132.202   hetzner-compute-1   <none>           <none>
csi-attacher-84b96d64c8-6qjgq                  1/1     Running            0               35m     10.244.132.220   hetzner-compute-1   <none>           <none>
csi-provisioner-6ccbfbf86f-9md7q               1/1     Running            0               35m     10.244.132.214   hetzner-compute-1   <none>           <none>
csi-provisioner-6ccbfbf86f-mtv79               1/1     Running            0               35m     10.244.132.212   hetzner-compute-1   <none>           <none>
csi-provisioner-6ccbfbf86f-wgzbr               1/1     Running            0               35m     10.244.132.207   hetzner-compute-1   <none>           <none>
csi-resizer-6dd8bd4c97-6nqcr                   1/1     Running            0               35m     10.244.132.216   hetzner-compute-1   <none>           <none>
csi-resizer-6dd8bd4c97-gnnbs                   1/1     Running            0               35m     10.244.132.205   hetzner-compute-1   <none>           <none>
csi-resizer-6dd8bd4c97-pxsw4                   1/1     Running            0               35m     10.244.132.210   hetzner-compute-1   <none>           <none>
csi-snapshotter-86f65d8bc-4qklp                1/1     Running            0               35m     10.244.132.211   hetzner-compute-1   <none>           <none>
csi-snapshotter-86f65d8bc-bxm7r                1/1     Running            0               35m     10.244.132.204   hetzner-compute-1   <none>           <none>
csi-snapshotter-86f65d8bc-xzwfj                1/1     Running            0               35m     10.244.132.209   hetzner-compute-1   <none>           <none>
engine-image-ei-766a591b-8kjbc                 1/1     Running            0               62m     10.244.70.130    oci-compute-1       <none>           <none>
engine-image-ei-766a591b-bqvpv                 1/1     Running            0               62m     10.244.167.9     aws-compute-1       <none>           <none>
engine-image-ei-766a591b-psp5d                 1/1     Running            0               62m     10.244.132.194   hetzner-compute-1   <none>           <none>
instance-manager-e-21e82048                    1/1     Running            0               30m     10.244.132.221   hetzner-compute-1   <none>           <none>
instance-manager-e-7b5850bf                    1/1     Running            0               31m     10.244.70.140    oci-compute-1       <none>           <none>
instance-manager-e-9399e183                    1/1     Running            0               31m     10.244.167.16    aws-compute-1       <none>           <none>
instance-manager-r-0a56cdfe                    1/1     Running            0               31m     10.244.70.141    oci-compute-1       <none>           <none>
instance-manager-r-31944f51                    1/1     Running            0               30m     10.244.132.222   hetzner-compute-1   <none>           <none>
instance-manager-r-4dbc45ad                    1/1     Running            0               31m     10.244.167.17    aws-compute-1       <none>           <none>
longhorn-admission-webhook-fcdb78d7c-fbpwc     0/1     Init:0/1           0               9m38s   10.244.70.144    oci-compute-1       <none>           <none>
longhorn-admission-webhook-fcdb78d7c-ss99j     0/1     Init:0/1           0               9m26s   10.244.167.18    aws-compute-1       <none>           <none>
longhorn-conversion-webhook-76c9cc7b6d-8w8gk   1/1     Running            0               35m     10.244.132.203   hetzner-compute-1   <none>           <none>
longhorn-conversion-webhook-76c9cc7b6d-rtnj6   1/1     Running            0               35m     10.244.132.213   hetzner-compute-1   <none>           <none>
longhorn-csi-plugin-bn27b                      2/2     Running            0               62m     10.244.70.138    oci-compute-1       <none>           <none>
longhorn-csi-plugin-k8cwh                      2/2     Running            0               62m     10.244.132.201   hetzner-compute-1   <none>           <none>
longhorn-csi-plugin-wqhwm                      2/2     Running            0               62m     10.244.167.14    aws-compute-1       <none>           <none>
longhorn-driver-deployer-74c5d667d7-vb8tv      0/1     Init:0/1           0               13m     10.244.70.143    oci-compute-1       <none>           <none>
longhorn-manager-fhll2                         1/1     Running            0               63m     10.244.132.193   hetzner-compute-1   <none>           <none>
longhorn-manager-kd4ks                         1/1     Running            0               63m     10.244.167.4     aws-compute-1       <none>           <none>
longhorn-manager-qdnp6                         1/1     Running            0               63m     10.244.70.129    oci-compute-1       <none>           <none>
longhorn-ui-85498cd6fb-sr2gf                   0/1     CrashLoopBackOff   10 (3m4s ago)   35m     10.244.132.215   hetzner-compute-1   <none>           <none>

$ kc --kubeconfig c get svc -n longhorn-system -o wide
NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE   SELECTOR
csi-attacher                  ClusterIP   10.108.127.170   <none>        12345/TCP   62m   app=csi-attacher
csi-provisioner               ClusterIP   10.105.116.109   <none>        12345/TCP   62m   app=csi-provisioner
csi-resizer                   ClusterIP   10.105.0.112     <none>        12345/TCP   62m   app=csi-resizer
csi-snapshotter               ClusterIP   10.103.151.120   <none>        12345/TCP   62m   app=csi-snapshotter
longhorn-admission-webhook    ClusterIP   10.96.207.204    <none>        9443/TCP    63m   app=longhorn-admission-webhook
longhorn-backend              ClusterIP   10.108.202.237   <none>        9500/TCP    63m   app=longhorn-manager
longhorn-conversion-webhook   ClusterIP   10.107.166.19    <none>        9443/TCP    63m   app=longhorn-conversion-webhook
longhorn-engine-manager       ClusterIP   None             <none>        <none>      63m   longhorn.io/component=instance-manager,longhorn.io/instance-manager-type=engine
longhorn-frontend             ClusterIP   10.107.239.68    <none>        80/TCP      63m   app=longhorn-ui
longhorn-replica-manager      ClusterIP   None             <none>        <none>      63m   longhorn.io/component=instance-manager,longhorn.io/instance-manager-type=replica

Checking the logs of the containers stuck in the Init state was not helpful. The longhorn-ui pod was also crash-looping; I could not find out why (the cluster got deleted).
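
For future occurrences, the init container state and any related events can usually be inspected directly (pod name taken from the listing above; the init container name is a placeholder, check it via describe first):

$ kc --kubeconfig c -n longhorn-system describe pod longhorn-admission-webhook-fcdb78d7c-fbpwc
$ kc --kubeconfig c -n longhorn-system logs longhorn-admission-webhook-fcdb78d7c-fbpwc -c <init-container-name>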

@cloudziu
Contributor

cloudziu commented Mar 8, 2023

From what I see, the initContainers have a simple while loop in which they wait for other services to reach the Ready state: https://github.com/longhorn/longhorn/blob/92fd5b54edda215b1d7b410d003c57a672a486bf/deploy/longhorn.yaml#LL4257C26-L4257C26

longhorn-admission-webhook is waiting for longhorn-conversion-webhook
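
As a reference, the wait is essentially a shell loop along these lines (paraphrased from the linked manifest, not copied verbatim; the service name and port match the longhorn-conversion-webhook service listed above, and the /v1/healthz path is an assumption):

while [ "$(curl -m 1 -s -o /dev/null -w '%{http_code}' -k https://longhorn-conversion-webhook:9443/v1/healthz)" != "200" ]; do
  echo "waiting for longhorn-conversion-webhook"   # retry until the conversion webhook answers
  sleep 2
done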

I see that in both cases above the longhorn-conversion-webhook was already in the Running state, but I can also see that in @MiroslavRepka's output both of the conversion-webhook pods were located on Hetzner nodes (I truly want to believe that this is not related in any way).

UPDATE:
Could not reproduce it locally. The bug also did not occur while running the latest E2E tests. The next time you encounter this bug, please review the Kubernetes network components (a sketch of these checks follows below):

  • Service endpoints - kubectl -n longhorn-system get ep -o wide
  • Whether a Pod can connect to another Pod by IP address
  • CoreDNS status, and whether pods can resolve DNS queries
  • Kubernetes CNI status
  • Kube-proxy status

You can also hit me up when this occurs again :)
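
A rough sketch of those checks, assuming a kubeadm-style cluster (the label selectors and the busybox image are assumptions; <pod-ip> and <port> are placeholders for a target pod from the listings above):

$ kubectl -n longhorn-system get ep -o wide
$ kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide              # CoreDNS status
$ kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50                # DNS query errors, if any
$ kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide            # kube-proxy status
$ kubectl -n kube-system get pods -o wide | grep -i -e calico -e cilium    # CNI status
$ kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- wget -qO- -T 2 http://<pod-ip>:<port>   # pod-to-pod connectivity by IP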

@Despire Despire self-assigned this Apr 14, 2023
@Despire
Contributor Author

Despire commented Apr 18, 2023

I wasn't able to re-create the bug.

I tried going through the test-sets multiple times locally, and also in our GCP cluster.

I went over the issues in the longhorn repository and stumbled upon a very similar issue to ours, longhorn/longhorn#5645, where the root cause was DNS.
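
If it's DNS again, a quick way to rule that in or out the next time this reproduces is to resolve the conversion webhook's service name from a throwaway pod (illustrative command; the busybox image is an assumption):

$ kubectl -n longhorn-system run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup longhorn-conversion-webhook.longhorn-system.svc.cluster.local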

@Despire Despire removed their assignment Apr 18, 2023
@MiroslavRepka
Contributor

I got this bug in a recent CI run, and I tried to restart CoreDNS, to no avail. The pods were still stuck in the Init state and there was no endpoint for the webhook. I even tried to restart Calico, but nothing helped.

$ kc get ep -n longhorn-system 
NAME                          ENDPOINTS                                                      AGE
csi-attacher                  10.244.132.197:12345,10.244.70.150:12345,10.244.70.154:12345   41m
csi-provisioner               10.244.132.198:12345,10.244.70.151:12345,10.244.70.153:12345   41m
csi-resizer                   10.244.132.201:12345,10.244.70.139:12345,10.244.70.142:12345   41m
csi-snapshotter               10.244.132.199:12345,10.244.70.145:12345,10.244.70.149:12345   41m
longhorn-admission-webhook                                                                   42m
longhorn-backend              10.244.132.193:9500,10.244.167.2:9500,10.244.70.129:9500       42m
longhorn-conversion-webhook   10.244.132.203:9443,10.244.167.22:9443                         42m
longhorn-engine-manager       10.244.132.195,10.244.167.20,10.244.70.157                     42m
longhorn-frontend             10.244.70.144:8000,10.244.70.152:8000                          42m
longhorn-recovery-backend     10.244.70.140:9600,10.244.70.146:9600                          42m
longhorn-replica-manager      10.244.132.194,10.244.167.19,10.244.70.156                     42m
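
Note that the empty longhorn-admission-webhook endpoint is expected while its pods are stuck in Init, since endpoints only list Ready pods. A more telling check (sketch only; the /v1/healthz path is an assumption and the curl image is just an example) is whether the conversion webhook service is reachable from inside the cluster:

$ kubectl -n longhorn-system run curl-test --rm -it --restart=Never --image=curlimages/curl:8.4.0 -- curl -k -m 2 -s -o /dev/null -w "%{http_code}\n" https://longhorn-conversion-webhook:9443/v1/healthz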

@Despire
Contributor Author

Despire commented May 9, 2023

I've also noticed that sometimes when we delete a node from a cluster, the deletion gets stuck forever on deleting nodes.longhorn.io ...: https://github.com/berops/claudie/blob/master/services/kuber/server/nodes/delete.go#L172

However, I'm unsure if that's relevant to this issue.
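
If that stuck deletion shows up again, it's worth checking whether a finalizer on the Longhorn node object is what blocks it; a rough sketch (the node name is a placeholder):

$ kubectl -n longhorn-system get nodes.longhorn.io
$ kubectl -n longhorn-system get nodes.longhorn.io <node-name> -o jsonpath='{.metadata.finalizers}'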

@MarioUhrik MarioUhrik added the groomed label May 12, 2023
@Despire
Contributor Author

Despire commented Aug 23, 2023

resolved by #984

@Despire Despire closed this as completed Aug 23, 2023
@defyjoy

defyjoy commented Oct 18, 2023

Why is this closed? This issue is still persisting. It is completely out of any IPs in the longhorn repository. The webhook fails due to this issue. I even installed Cilium version v1.15.0-pre.1 instead of v1.14.2, hoping it's fixed.


@Despire
Contributor Author

Despire commented Oct 18, 2023

Yes, you're right. Originally the issue was fixed when we switched to Cilium with eBPF, but that introduced other problems and we had to fall back to kube-proxy, where this issue was reintroduced.

This issue should be re-opened, which I'll do.

@Despire Despire reopened this Oct 18, 2023
@defyjoy

defyjoy commented Oct 25, 2023

Is there any resolution to this for Cilium eBPF? I don't see any traction, so I was wondering what might be the way to fix this.

@bernardhalas
Member

Currently we don't have a solution for this issue. We know that using Cilium as a CNI without kube-proxy solves this problem (though it introduces others). Any input is welcome.
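
For anyone who wants to experiment with that workaround, kube-proxy-less mode in Cilium is driven by its Helm values, roughly along these lines (a sketch only; <api-server-ip> is a placeholder and the exact value accepted by kubeProxyReplacement differs between Cilium versions):

$ helm install cilium cilium/cilium --namespace kube-system \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost=<api-server-ip> \
    --set k8sServicePort=6443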

@JKBGIT1
Contributor

JKBGIT1 commented Nov 23, 2023

Hi @defyjoy, could you provide the InputManifest with which you encountered this issue?

@Despire
Contributor Author

Despire commented May 20, 2024

This issue has not been seen since #1366.
It's possible that other bug fixes might have also resolved it.

I'll close this as done for now. If it resurfaces, I'll re-open it.

@Despire Despire closed this as completed May 20, 2024