Failed to start container, rootfs seems not ready #1785
Comments
This doesn't always happen, and I did check some of the images; I believe those images should contain the missing files. With this fixed, our e2e tests should be much greener.
The exec error comes from runc exec'ing the process. The other passwd error probably comes from parsing the passwd file to get the uid/gid (see the sketch below). Are these two separate errors, or did they happen for the same container during a single start?
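For context on where that passwd error can come from: the runtime resolves the container user's uid/gid by parsing the image's `/etc/passwd`, so an empty rootfs surfaces as a passwd parsing failure. A minimal stdlib-only sketch of that kind of lookup (illustrative only, not runc's actual libcontainer code; the function name and path are assumptions):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// lookupUser is an illustrative stand-in for the passwd parsing the
// runtime performs: it scans an /etc/passwd-style file for a user name
// and returns the uid/gid. If the rootfs is empty or the file is
// missing, this is where a "passwd" error would surface.
func lookupUser(passwdPath, name string) (uid, gid int, err error) {
	f, err := os.Open(passwdPath)
	if err != nil {
		return 0, 0, fmt.Errorf("open passwd file: %w", err)
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		// passwd format: name:password:uid:gid:gecos:home:shell
		fields := strings.Split(s.Text(), ":")
		if len(fields) < 4 || fields[0] != name {
			continue
		}
		if uid, err = strconv.Atoi(fields[2]); err != nil {
			return 0, 0, err
		}
		gid, err = strconv.Atoi(fields[3])
		return uid, gid, err
	}
	if err := s.Err(); err != nil {
		return 0, 0, err
	}
	return 0, 0, fmt.Errorf("user %q not found in %s", name, passwdPath)
}

func main() {
	uid, gid, err := lookupUser("/etc/passwd", "root")
	fmt.Println(uid, gid, err)
}
```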
@crosbymichael Yeah, these are different failures in different test builds.
I've got more failures, such as:
I'm not sure what is happening; I will look into our code to make sure it's not a race on our side. Although it's hard to imagine a race on our side, because we don't create the rootfs ourselves.
Are these errors new or existing? I mean, new since updating to the latest beta release?
I'm not 100% sure, because our e2e tests have been flaky since the first day; we are trying to stabilize them, and they have become better now. I checked all our old builds: we updated containerd from 564600e to 1.0.0-beta.3 on Nov. 9th.
Note that the cri-containerd version did not change across the failures above. It was at the commit where we updated to containerd 1.0.0-beta.3.
I checked more builds, and this really looks like a recent regression, introduced after 564600e.
Thanks
The only thing that looks related is the lease API change. However, I have no idea how that could cause this failure. Since our build has been red since the first day, I can't say with 100% certainty that this is a regression in v1.0.0-beta.3. However, I did check most builds before we switched to containerd 1.0.0-beta.3 and didn't see this failure, while I could easily find it after the upgrade. Tell me if there is anything I could do to help identify this, e.g. adding specific debug information.
I did have one fix since then related to remapped snapshots, #1732, a regression caused by leases. I want to look at how the snapshot is being managed to make sure there is not another similar regression caused by it.
@dmcgowan Do you know how GC could cause this? If a snapshot is GCed before the rootfs is mounted, wouldn't an error be returned?
Maybe this:
@dmcgowan Do we have any new findings on this?
There is a new bug, ref #1802.
I believe #1802 should be able to fix this.
Still seeing this with #1802.
Maybe there is yet another problem. I will review it again today.
Another occurrence:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-cri-containerd-e2e-gci-gce/505
The fix doesn't seem to help much. We still need to figure this out. :(
One weird thing is that it usually happens only to a fixed group of images, not all images.
@Random-Liu What images? I'm still unable to reproduce this; I have been running stress tests with 256 execs per container and cannot reproduce it.
@crosbymichael It's not just a schema1 image:

```
$ ctr images ls | grep defaultbackend
gcr.io/google_containers/defaultbackend:1.3 application/vnd.docker.distribution.manifest.v2+json sha256:fb91f9395ddf44df1ca3adf68b25dcbc269e5d08ba14a40de9abdedafacf93d4 1.3 MiB linux/amd64
```

I'll debug it again today.
I don't think it's related to exec. It's just that sometimes the image/snapshot may be corrupted.
@Random-Liu If you can find a way to reproduce this outside of the kube test CI, that would help ;)
@dmcgowan To help you debug the issue, here are the logs for the failures mentioned in #1785 (comment).

Build log: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-cri-containerd-e2e-gci-gce/527/build-log.txt
Debug PR: containerd/cri#463
cri-containerd.log: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-cri-containerd-e2e-ubuntu-gce/620/artifacts/bootstrap-e2e-minion-group-s15x/cri-containerd.log

The rootfs has been empty since it was pulled/unpacked. After pull/unpack, create a snapshot and check it:
After
Check container snapshot and image after start failure:
It's not related to the mount namespace. The image rootfs is empty right after pull/unpack, and IIRC, once this happens, the image becomes completely unusable: all snapshots created afterwards are consistently empty. @dmcgowan @yanxuean Does the lease here actually work? https://github.com/kubernetes-incubator/cri-containerd/blob/master/vendor/github.com/containerd/containerd/image.go#L91-L95
The image has only one layer.
If the lease didn't work, there would be a window in which its parent snapshot could be GCed, and subsequent snapshot.Mounts calls would fail. So the lease should have worked.
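For reference, here is roughly how a client holds a lease so that content and snapshots created under it are protected from GC until the lease is released. This is a minimal sketch against the current containerd client API (which postdates the version discussed in this thread); the socket path, namespace, expiration, and image ref are illustrative:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/leases"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "default")

	// Attach a lease to the context: as long as the lease exists, content
	// and snapshots created under this context are protected from GC.
	ctx, done, err := client.WithLease(ctx, leases.WithExpiration(24*time.Hour))
	if err != nil {
		log.Fatal(err)
	}
	// Releasing the lease makes the resources eligible for GC again.
	defer done(ctx)

	// Pull + unpack under the lease; if the lease were not honored, the
	// parent snapshot could be GCed between unpack and Mounts, which is
	// the window discussed above.
	image, err := client.Pull(ctx, "docker.io/library/alpine:latest", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("pulled", image.Name())
}
```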
The files that are showing up later on are temporary files needed to run the container.
I see nothing in the logs that would indicate that GC is removing the snapshot. The snapshot deletion happens after the container has already been deleted, and I don't know of anything else that would delete the directory. The fact that the rootfs is empty right after unpack makes me think that the image was never unpacked correctly in the first place. I am going to look for races in the differ's apply method.
I think this could cause it: if the unmount failed, the RemoveAll would get triggered on the still-mounted data. https://github.com/containerd/containerd/blob/master/diff/walking/differ.go#L97 This should probably just remove the directory and log the error. @yanxuean wdyt?
@dmcgowan Yeah, I agree with you. We can log unmount failures.
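To make the failure mode concrete, here is a sketch of the cleanup pattern under discussion (illustrative, not the actual differ code; `cleanupUnsafe`, `cleanupSafe`, and `mountDir` are made-up names). The unsafe variant recursively deletes the directory even when the unmount failed, which wipes the still-mounted snapshot contents on disk:

```go
package cleanup

import (
	"log"
	"os"

	"github.com/containerd/containerd/mount"
)

// Dangerous cleanup: if Unmount fails (e.g. EBUSY), RemoveAll walks into
// the still-mounted snapshot and deletes its contents on disk.
func cleanupUnsafe(mountDir string) {
	mount.Unmount(mountDir, 0) // error ignored
	os.RemoveAll(mountDir)
}

// Safer cleanup, as suggested above: only remove the (now empty) mount
// point directory, and log instead of recursing when something fails.
func cleanupSafe(mountDir string) {
	if err := mount.Unmount(mountDir, 0); err != nil {
		log.Printf("failed to unmount %s: %v", mountDir, err)
		return // leave the directory alone rather than destroy live data
	}
	if err := os.Remove(mountDir); err != nil { // Remove, not RemoveAll
		log.Printf("failed to remove %s: %v", mountDir, err)
	}
}
```

Using `os.Remove` instead of `os.RemoveAll` also means a non-empty (i.e. still-mounted) directory fails loudly instead of being silently deleted.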
I think we can print the mounts returned by snapshotter.Mounts in checkSnapshot.
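Something along these lines; a hypothetical sketch of such a debug helper (`checkSnapshot` here is illustrative, not the actual code from the debug PR), assuming the current containerd snapshots API:

```go
package debugutil

import (
	"context"
	"log"

	"github.com/containerd/containerd/snapshots"
)

// checkSnapshot dumps the mounts the snapshotter reports for a key, so
// logs show exactly what the container rootfs would be assembled from.
func checkSnapshot(ctx context.Context, sn snapshots.Snapshotter, key string) error {
	mounts, err := sn.Mounts(ctx, key)
	if err != nil {
		return err
	}
	for _, m := range mounts {
		log.Printf("snapshot %q mount: type=%s source=%s options=%v",
			key, m.Type, m.Source, m.Options)
	}
	return nil
}
```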
@dmcgowan Ya, I think that's it. I looked at the code for those spec functions, and they use the container's snapshot key, so if unmount fails, the RemoveAll will delete everything in the snapshot. I thought about this but initially dismissed it, because I thought we were using a temporary snapshot key rather than the container's original snapshot. I'm going to work on a patch for unmount that handles EBUSY and retries, as well as seeing if we can use a read-only view for these functions.
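A minimal sketch of what such a retrying unmount could look like (illustrative, not the actual patch; the retry count and backoff values are arbitrary):

```go
package mountutil

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// unmountWithRetry retries the unmount while the target is busy (EBUSY),
// backing off between attempts, instead of giving up immediately.
func unmountWithRetry(target string, attempts int) error {
	delay := 50 * time.Millisecond
	for i := 0; i < attempts; i++ {
		err := unix.Unmount(target, 0)
		if err == nil {
			return nil
		}
		if err != unix.EBUSY {
			return err // a real failure, not worth retrying
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
	return fmt.Errorf("failed to unmount %s: still busy after %d attempts", target, attempts)
}
```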
This is another WIP to fix containerd#1785. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Did you all add cadvisor support around the same time as the last containerd version bump on the CI? |
Looks likely. Hopefully we can fix it this time. :)
In the current integration, cAdvisor doesn't monitor the containerd rootfs; we still get rootfs stats from containerd. However, could containerd snapshot
Since we think this is fixed in the current master, I am going to move this to the 1.0.1 milestone for confirmation.
I think this has been fixed; I don't see this failure any more.
@Random-Liu
@yanxuean I think this bug can be closed. The specific errors in this issue are addressed. I think the remaining issues can be split between containerd and cri-containerd. After watching the cri-containerd tests, they are still pretty flaky, and those failures need to be resolved to get accurate results. It feels like there could be some races in how the code is used. Our CI is very stable, and it's very rare for a failure not to be legitimate. Our stress tests are also stable with accurate results, so we should look at getting the cri-containerd tests stable and double-checking for races in that codebase as well.
@crosbymichael
Our e2e test is still fairly flaky; many of the failures are caused by similar runc errors:
It seems that the rootfs is not complete. Do you know what could cause this? Have you seen this before? Do you have any suggestions on how to debug it?
/cc @stevvooe @crosbymichael @dmcgowan