Bug: Longhorn pod longhorn-admission-webhook stuck in Init state
#598
Comments
Encountered this bug in
Checking the logs of containers in the init state was not helpful. The |
From what I see the
I see that in both above cases, the UPDATE:
|
I wasn't able to re-create the bug. I tried going through the test-sets multiple times locally and in our GCP cluster. I went over the issues in the Longhorn repository and found one very similar to ours, longhorn/longhorn#5645, where the issue was with the DNS |
I got this bug in a recent CI run, and restarting CoreDNS did not help. The pods were still stuck in the Init state and there was no endpoint for the webhook. I even tried restarting Calico, but nothing helped.
$ kc get ep -n longhorn-system
NAME                          ENDPOINTS                                                      AGE
csi-attacher                  10.244.132.197:12345,10.244.70.150:12345,10.244.70.154:12345   41m
csi-provisioner               10.244.132.198:12345,10.244.70.151:12345,10.244.70.153:12345   41m
csi-resizer                   10.244.132.201:12345,10.244.70.139:12345,10.244.70.142:12345   41m
csi-snapshotter               10.244.132.199:12345,10.244.70.145:12345,10.244.70.149:12345   41m
longhorn-admission-webhook    <none>                                                         42m
longhorn-backend              10.244.132.193:9500,10.244.167.2:9500,10.244.70.129:9500       42m
longhorn-conversion-webhook   10.244.132.203:9443,10.244.167.22:9443                         42m
longhorn-engine-manager       10.244.132.195,10.244.167.20,10.244.70.157                     42m
longhorn-frontend             10.244.70.144:8000,10.244.70.152:8000                          42m
longhorn-recovery-backend     10.244.70.140:9600,10.244.70.146:9600                          42m
longhorn-replica-manager      10.244.132.194,10.244.167.19,10.244.70.156                     42m
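When comparing several CI runs, it can help to pick out the services without endpoints automatically. The following is a minimal sketch, not part of the project's tooling: it assumes the plain-text output of `kubectl get ep` (columns NAME, ENDPOINTS, AGE) has been captured as a string, and the function name `services_without_endpoints` is hypothetical.

```python
from typing import List

# A few rows taken from the listing above; the middle column is empty
# (or "<none>") for a service that has no ready endpoints.
SAMPLE = """\
NAME ENDPOINTS AGE
csi-attacher 10.244.132.197:12345,10.244.70.150:12345,10.244.70.154:12345 41m
longhorn-admission-webhook <none> 42m
longhorn-backend 10.244.132.193:9500,10.244.167.2:9500,10.244.70.129:9500 42m
"""


def services_without_endpoints(table: str) -> List[str]:
    """Return the names of services whose ENDPOINTS column is empty.

    Assumes whitespace-separated NAME / ENDPOINTS / AGE columns with a
    single header line, as printed by `kubectl get ep`.
    """
    missing = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue
        # Two fields means the ENDPOINTS cell was blank; "<none>" is the
        # placeholder kubectl prints for an empty endpoint list.
        if len(fields) == 2 or fields[1] == "<none>":
            missing.append(fields[0])
    return missing


if __name__ == "__main__":
    print(services_without_endpoints(SAMPLE))
```

For the listing in this comment, only `longhorn-admission-webhook` is reported, which matches the observation that the webhook had no endpoint.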
|
I've also noticed that sometimes when we delete a node from a cluster the deletion gets stuck forever on however, unsure if that's relevant to this issue. |
resolved by #984 |
Yes, you're right. Originally the issue was fixed when we switched to Cilium with eBPF, but that introduced other problems and we had to fall back to kube-proxy, where this issue was re-introduced. This issue should be re-opened, which I'll do. |
Is there any resolution to this for Cilium eBPF? I do not see any traction, so I was wondering what the way to fix this might be. |
Currently we don't have a solution for this issue. We know that using Cilium as a CNI without kube-proxy solves this problem (though it introduces others). Any input is welcome. |
Hi @defyjoy, could you provide the InputManifest for which you encountered this issue? |
This issue has not been seen since #1366. I'll close this as done for now; if it resurfaces, I'll re-open it. |
The longhorn-admission-webhook pods get stuck in the Init state forever.
This issue arises occasionally in the CI, and a retry of the failed run usually resolves it.
However, while working on #575 I was changing the test-sets and encountered this issue most of the time in the CI with a specific config: instead of keeping ts3-c-1 in test-set3/2.yaml, keep ts3-c-2 in test-set3/2.yml.
Steps To Reproduce
test-set3/2.yaml so that the ts3-c-2 cluster will be left instead of ts3-c-1
i.e
longhorn-admission-webhook will be stuck in the Init state
At the time of testing, the issue was 100% persistent with the given config.
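To confirm which init container is holding a pod in the Init state, one option is to inspect the pod's `status.initContainerStatuses` field. The sketch below assumes the pod object has been obtained as JSON (for example via `kubectl get pod <name> -n longhorn-system -o json`) and parsed into a dict; the function name and the init container names in the example are hypothetical and not taken from this issue.

```python
from typing import List


def pending_init_containers(pod: dict) -> List[str]:
    """Return names of init containers that have not finished successfully.

    An init container counts as finished only if its state is
    `terminated` with exit code 0. Anything still `waiting` or `running`
    is reported. Note that restartable (sidecar-style) init containers
    stay running by design and would also be reported by this sketch.
    """
    pending = []
    statuses = pod.get("status", {}).get("initContainerStatuses", [])
    for status in statuses:
        terminated = status.get("state", {}).get("terminated")
        if terminated is None or terminated.get("exitCode") != 0:
            pending.append(status["name"])
    return pending


if __name__ == "__main__":
    # Hypothetical pod status: one init container done, one still running.
    example_pod = {
        "status": {
            "initContainerStatuses": [
                {"name": "init-done", "state": {"terminated": {"exitCode": 0}}},
                {"name": "init-waiting-on-service", "state": {"running": {}}},
            ]
        }
    }
    print(pending_init_containers(example_pod))
```

An init container that stays in `running` while the service it waits for has no endpoints would be consistent with the symptom described here, though this issue does not confirm the root cause.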