Occasional "exec format error" #2352
Hi, thanks for opening this issue. Is Bottlerocket 1.9.0 the first version you've used? I'm wondering if you've hit this problem in previous Bottlerocket releases, which could help us narrow down the issue. Otherwise, I'm wondering if you could check the `dmesg` output.
I'm going to try to reproduce this issue. How did you install Argo, Prometheus, etc.? Via Helm charts, or some other method?
I didn't see that in 1.8, at least for the month before 1.9 was released. I occasionally used older releases for POCs, just to present this flavour to a wider group of technical people. No user data was provided at all, nor API calls; just the vanilla AMI from AWS. For dmesg, I'll need to reprovision nodes with an SSH key first, then figure out how to run the admin container (I've never needed to run it before). I'll do that next week and come back with more details the next time I see this error.
Do you think https://kubearmor.io might be the cause as well? |
I see. Would it be possible for you to try a few things to help me narrow down the issue?
Could you let me know what you find?
FYI, I've brought my cluster to the point where I could observe that error. I'm waiting for some failures and will report back ASAP.
So this has happened again:
Container IDs:
But:
I assume that since this container is not running, I can't access its filesystem with the path pattern you mentioned, @mchaker. I thought it might be due to the disk being full, but it looks fine:
Node events:
Pods:
Also bottlerocket-update-operator:
I'm guessing these reboots are related to the operator updating nodes to the newest release.
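When ruling out a full disk, it's worth checking inode usage as well, since `df -h` alone can miss inode exhaustion. A minimal sketch; the path is the typical location of containerd's data on a Kubernetes node and is an assumption here, so adjust as needed:

```shell
# Check both block usage and inode usage on the volume that backs
# container images; layer unpacking can fail with ENOSPC on either.
# DATA_DIR is an assumed default, not confirmed for this node.
DATA_DIR=${DATA_DIR:-/var/lib/containerd}
df -h "$DATA_DIR" 2>/dev/null || df -h /
df -i "$DATA_DIR" 2>/dev/null || df -i /
```

A volume can show plenty of free blocks while being out of inodes, which produces the same "No space left on device" failures during unpack.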
I think I've got something. This is how it looks for thanos:
For prometheus:
So indeed the entrypoint is corrupted.
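A corrupted entrypoint like this can be confirmed from the node with `file` and `stat`. A sketch; `BIN` is a stand-in, since the real entrypoint path inside the container's resolved rootfs isn't shown in the thread:

```shell
# BIN is a placeholder for the container's entrypoint binary. A healthy
# binary reports something like "ELF 64-bit LSB executable"; a corrupted
# one typically reports "empty" or "data".
BIN=${BIN:-/bin/sh}
file "$BIN"
stat -c '%s bytes' "$BIN"   # 0 bytes would explain "exec format error"

# Simulating the failure mode: a zero-byte file is not a valid ELF,
# and file(1) reports it as "empty".
tmp=$(mktemp)
: > "$tmp"
file "$tmp"
rm -f "$tmp"
```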
@mchaker any luck reproducing this issue? Or maybe an idea about the corrupted entrypoints? Maybe it's worth pinging friends from the containerd project?
I think the issue might be too small a disk attached to Bottlerocket; since I extended it, this has never happened 🤦
If the container layer fails to unpack because the filesystem is out of space, I suppose that could leave zero-sized files around. I'm surprised that this wouldn't bubble up as an error from containerd, relayed to kubelet via CRI. But perhaps the partially unpacked layer might not get cleaned up and would then be reused by a later attempt to run the same pod. In any case this seems like a good direction to investigate - if it's easier to repro with a nearly full disk, that will help with finding the root cause.
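Under that hypothesis, a partially unpacked layer would leave zero-byte files behind in the snapshot. A sketch of how one might scan for them; the path is containerd's default overlayfs snapshot location and may differ on Bottlerocket:

```shell
# Scan containerd's overlayfs snapshots for zero-byte regular files in
# common binary directories; hits would be consistent with a layer that
# failed to unpack when the disk filled up. SNAP_DIR is an assumed default.
SNAP_DIR=${SNAP_DIR:-/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots}
find "$SNAP_DIR" -type f -size 0 \
  \( -path '*/bin/*' -o -path '*/sbin/*' \) \
  2>/dev/null
```

Any hit here would identify both the damaged snapshot and the image layer it came from, which would confirm the reuse-of-a-partial-layer theory.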
This happened to me again today for a container with
Not a lot to go on here. If/when this is hit again, can we capture some of the
I don't have any new information to share yet - but I will say that we've started seeing this intermittently over the last few weeks as well. We saw it mostly on Bottlerocket 1.13.1, and we have upgraded to 1.14.1. When we next see the issue, we'll do some deep diving on it. Are there particular logs or things we should look into when it happens?
We've just hit this with some redis-sentinel containers. We have confirmed that the
Image I'm using:
AMI:
bottlerocket-aws-k8s-1.22-x86_64-v1.9.0-159e4ced
EKS:
Occasionally, containers refuse to start, for example:
This happens randomly to various containers, like prometheus etc. And no, I'm not trying to run amd64-compiled ones.
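One way to rule the architecture question out conclusively is to compare the node's machine type against the binary itself; a valid ELF executable always starts with the four magic bytes `7f 45 4c 46`. A sketch, where `BIN` is a placeholder for the affected container's entrypoint:

```shell
# Node architecture (e.g. x86_64 for this AMI).
uname -m

# BIN is a placeholder for the container entrypoint under test.
# A valid ELF binary prints " 7f 45 4c 46"; an empty or truncated
# file prints nothing or garbage, which also yields "exec format error".
BIN=${BIN:-/bin/sh}
head -c 4 "$BIN" | od -An -tx1
```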
How to reproduce the problem:
Honestly, I can't figure this out; it happens randomly. I thought it might be related to kubearmor or cilium (IPAM mode), but it's not.
NOTE: I left clusters running on the AL2 image for a few days, and it never happened there.
Any help appreciated.