'failed to reserve container name' #4604
It looks like there is a container with name web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0 and id … Could you show …
Did you manage to resolve this issue @sadortun? We are experiencing the same, also on GKE with the containerd runtime.
We are deploying the same image to multiple deployments (30-40 pods) at the same time. We see no such issues with the docker runtime. Eventually kubelet is able to resolve the conflict without manual intervention, but it significantly slows the rollout of new images during a release (an extra 2-3 minutes to resolve name conflicts).
Hi @pfuhrmann, we investigated this quite deeply with the GKE dev team and were not able to reproduce it. That said, we are pretty convinced the issue comes from one of the two following issues:
Unfortunately, after a month of back and forth with the GKE devs, we were not able to find a solution. The good news is that, for us, refactoring our application let us reduce the number of simultaneously starting pods from about 20 down to 5. Since then, we have had no issues. You might also want to increase the node boot drive size; it seems to help too.
Any update on this? Did anybody manage to solve it? We are facing the same issue.
We are also seeing the same issue, GKE with containerd. It does seem to be correlated with starting many pods at once. Switching from cos_containerd back to cos (docker based) seems to have resolved the situation, at least in the short term.
Same for us. Once we switched back to cos with docker, everything worked.
In the end we still had occasional issues, and we also had to switch back to …
Jotting down some notes here, apologies if it's lengthy. Let me try to explain/figure out why you got "failed to reserve container name": kubelet tried to create a container that it had already asked containerd to create at least once. When containerd tried the first time, it received a variable in the container-create metadata named … What should have happened is that kubelet should have incremented the … So, skimming over the kubelet code, I believe this is the code that decides which attempt number we are on: https://github.com/kubernetes/kubernetes/blame/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L173-L292 In my skim, I think I see a window where kubelet will try attempt 0 a second time after the first create attempt fails with a context timeout. But I may be reading the code wrong? @dims @feiskyer @Random-Liu
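To make the collision concrete, here is a minimal sketch of how the underscore-joined container name in the error above is composed; the field layout is inferred from the error string itself, not copied from the containerd source, and the trailing number is the attempt count kubelet is expected to increment:

```
package main

import (
	"fmt"
	"strings"
)

// makeContainerName mirrors the layout visible in the error above:
// <container>_<pod>_<namespace>_<pod-uid>_<attempt>. Inferred from the
// error string, not taken verbatim from the containerd CRI code.
func makeContainerName(container, pod, namespace, podUID string, attempt uint32) string {
	return strings.Join([]string{container, pod, namespace, podUID, fmt.Sprint(attempt)}, "_")
}

func main() {
	// Attempt 0 was already reserved by the first (timed-out) CreateContainer
	// call, so retrying with the same attempt number collides.
	fmt.Println(makeContainerName("web", "apps-abcd-6b6cb5876b-nn9md", "default",
		"3dc00fd6-0c5d-42be-bec8-e4f6cad616da", 0))
}
```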
Bumped into this issue as well. Switching back to cos with docker.
Fwiw, we're hitting this this week. k8s 1.20.8-gke.900; containerd://1.4.3
In my case, the pod is owned by a (batch/v1) Job, and the Job by a (batch/v1beta1) CronJob. The "reserved for" item only appears in the error; nothing else seems to know about it. Using Google Cloud Logging, I can search:
with a search range of 2021-08-22 01:58:00.000 AM EDT .. 2021-08-22 02:03:00.000 AM EDT. This is the first hit:
And this is the second hit:
There are additional hits, but they aren't exciting. For reference, this search (with the same time params) yields nothing:
This search
yields two entries:
(There are additional hits if I extend the time window forward, but as they appear to be identical other than the timestamp, I don't see any value in repeating them.) Relevant log events: the best query I've found is:
(The former is to limit which part of GCloud to search, and the latter is the search.)
Same, switching back to docker.
Same issue here, with GKE version 1.20.9-gke.701.
@dims @feiskyer @Random-Liu @ehashman FYI: kubelet is trying to create the same container twice with the same start count "0", and it also does not react to the "failed to reserve container name" error message. Thoughts?
Having this issue as well with GKE. I appreciate the detailed research folks have posted here.
Is there a reproducer? Can you please file an issue against k/k?
I confirm that moving back to docker solves the problem:
This also happens with UBUNTU_CONTAINERD, not just COS_CONTAINERD.
@mikebrow I investigated a reported issue in k/k before: kubernetes/kubernetes#94085. My summary is that kubelet has correct logic for incrementing the restart number, which is set to "current_restart + 1". See this kubelet code.
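As a rough illustration of that increment rule, a simplified sketch condensing the linked kubelet logic (the type and loop here are assumptions for illustration, not the real implementation). Note how it also exposes the window discussed above: if the first CreateContainer timed out client-side before any status became visible, the loop never sees it and recomputes attempt 0.

```
// ContainerStatus is a stand-in for the status kubelet inspects;
// only the field relevant to this sketch is kept.
type ContainerStatus struct {
	RestartCount uint32
}

// nextAttempt condenses the "current_restart + 1" rule: take the highest
// restart count among known statuses and add one. If the earlier create
// left no visible status (client-side timeout), this returns 0 again,
// reproducing the duplicate-name conflict.
func nextAttempt(statuses []ContainerStatus) uint32 {
	var attempt uint32
	for _, s := range statuses {
		if s.RestartCount+1 > attempt {
			attempt = s.RestartCount + 1
		}
	}
	return attempt
}
```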
Hello @qiutongs, my k8s cluster is used to run Spark jobs, and recently we got a lot of …
Does anyone have any other solution to this issue except …
Unfortunately, the only working solution is to move back to cos with docker.
Have a look at how many containers are starting at the same time, and how much memory each of them takes at startup. Make sure you have plenty left. In our case, even if we were short on RAM, having more free RAM seemed to help, but we still had to revert to COS.
Same problem here
Just had this happen to me on GKE.
@danielfoehrKn If the root cause is slow disk, I wonder why docker does not suffer from the same issue.
At least for us, the disk was most likely the reason why the initial request to the CRI timed out. We have not tested it on docker (we cannot switch either). However, if dockerd/docker-shim does not have the same problem due to keeping some false "state", then subsequent requests by the kubelet to create the PodSandbox could work, given the reason for the initial …
Per CRI, containerd receives and processes requests in parallel. Kubelet has a couple of bugs/issues here: it makes too many parallel requests even under disk pressure, and then, when a timeout occurs on its side, it fails to recognize that the problem was of its own creation. It presumes that asking the CRI to do the same thing again will make it work this time, even though the CRI is still doing what was previously requested, and containerd reports the error. The dockershim path (kubelet down through the docker API and then through containerd, vs. directly to containerd) has more serialization and different code in the now-deleted dockershim (even though containerd was still in the path), thus producing different behavior. Sometimes more serialization is faster, such as when too much resource pressure results in thrashing (for example, memory swapping to disk while creating snapshots, new containers loading up, and garbage collection of memory used for older requests, etc.). Kubelet should be modified to recognize the timeout situation and avoid subsequent duplicate requests. Alternatively, we could modify the CRI API to serialize service requests (I do not recommend we go this way), or change the API from parallel requests with client-side timeouts to subscription requests with a first ack response and subsequent status-change event responses. We could also change the CRI API to request (by policy) serialization (through queuing) of requests when under pressure, or serialization by "failing" requests that arrive while parallel requests are being processed under pressure (also not recommended). If we want to "manage" requests at the CRI level we can do so, but then we're going to want to talk about node-management policies.
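To make the queuing option concrete, here is a hypothetical sketch of a server-side gRPC interceptor that caps in-flight requests with a semaphore. Both `limitInFlight` and the `maxInFlight` knob are invented for illustration; nothing like this exists in containerd or the CRI today.

```
import (
	"context"

	"google.golang.org/grpc"
)

// limitInFlight caps the number of concurrently processed unary requests
// using a buffered channel as a semaphore. Queued callers give up cleanly
// when their context is cancelled or times out.
func limitInFlight(maxInFlight int) grpc.UnaryServerInterceptor {
	sem := make(chan struct{}, maxInFlight)
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		select {
		case sem <- struct{}{}: // acquire a slot
			defer func() { <-sem }() // release when the handler returns
			return handler(ctx, req)
		case <-ctx.Done(): // caller gave up while queued
			return nil, ctx.Err()
		}
	}
}
```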
I have also seen something similar with Windows sandbox creation in the Windows e2e tests recently. I looked for an issue to track possible changes to kubelet and couldn't find one, so I created kubernetes/kubernetes#107561.
@mikebrow Do you have more details to share? More specifically, in terms of rate control of container runtime requests, how do dockershim and the containerd CRI plugin behave differently?
This happened to me too. It's a really serious bug when you run your GitLab CI/CD runners on containerd-based k8s, because some pipelines are designed to run multiple containers in parallel and this bug happens very often. Is going back to docker really the only option here?
Yes, for now it looks like it.
Hi @matti, @kubino148, @sadortun and all subscribers, would you mind providing the goroutine stack of containerd when you see the error (sending SIGUSR1 to the containerd process should dump the goroutine stacks into its log)? Thanks.
Amended Theory 1 (see the original theory 1 in #4604 (comment)). Docker has a similar mechanism of "reserving" a container name to prevent conflicts. However, dockershim handles a name conflict in a different way from the containerd CRI implementation.
https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/dockershim/helpers.go#L284 In fact, this difference in retry behavior leads to significantly different CRI request rates between dockershim and containerd. In containerd, the … This theory echoes a similar bug solved in CRI-O (https://bugzilla.redhat.com/show_bug.cgi?id=1785399), whose resolution says: "Now, when systems are under load, CRI-O does everything it can to slow down the Kubelet and reduce load on the system." I believe our direction is also to slow down a kubelet that sends too many requests. This might be aligned with Mike's comment: #4604 (comment)
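For readers who don't want to chase the link, the dockershim behavior can be paraphrased roughly like this. It is a simplified sketch of the recover-and-retry pattern; the function names are invented, and the real kubernetes code also parses the conflicting container ID out of the docker error text.

```
// createWithConflictRecovery paraphrases the dockershim pattern: if the
// create fails because the name is already taken, remove the stale
// container and retry once. The containerd CRI plugin instead returns
// the conflict to kubelet, which backs off and retries much later.
func createWithConflictRecovery(create func() error, isNameConflict func(error) bool,
	removeStale func() error) error {
	err := create()
	if err == nil || !isNameConflict(err) {
		return err
	}
	if rmErr := removeStale(); rmErr != nil {
		return err // keep the original error if cleanup fails
	}
	return create() // one retry after clearing the conflict
}
```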
On GCP, only for a little while longer, though. Just got an email:
If I'm reading https://cloud.google.com/kubernetes-engine/docs/release-schedule and https://cloud.google.com/kubernetes-engine/versioning#lifecycle correctly, 1.23 will be supported through approximately 2023-06, which is a bit in the future. (This would require you to have created a static cluster as opposed to a regularly upgrading one.) Personally, I'd rather figure out what's wrong here and get it fixed (but I'm currently snowed under a bunch of other tasks, sorry).
from GKE email:
How can we get someone from GKE on this thread?
Hi Matti, I am from GKE. We are fully aware of this issue and are prioritizing it.
@sadortun I was trying to reproduce it with high disk IO, but sorry, I didn't manage to reproduce it. :( If CreateContainer doesn't return, it may be hanging in one of two syscalls:
Docker uses lazy umount, which might hide the problem.
Thanks
Thanks for your time on this issue. Unfortunately, I stopped using COS back in 2020 after we could not find a solution. I'm 97% sure we were using … Sorry about that.
In the Linux kernel, unmounting a writable mountpoint triggers a sync-fs to make sure dirty pages reach the underlying filesystem. Many umount actions at the same time may introduce performance issues on an IOPS-limited disk. When the CRI plugin creates a container, it temp-mounts the rootfs to read the UID/GID info for the entrypoint. The rootfs is a writable snapshot, so after the read, the umount invokes a sync-fs action. For example, using overlayfs on ext4, use bcc-tools to monitor ext4_sync_fs calls:

```
// uname -a
Linux chaofan 5.13.0-27-generic containerd#29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

// open terminal 1
kubectl run --image=nginx --image-pull-policy=IfNotPresent nginx-pod

// open terminal 2
/usr/share/bcc/tools/stackcount ext4_sync_fs -i 1 -v -P

ext4_sync_fs
sync_filesystem
ovl_sync_fs
__sync_filesystem
sync_filesystem
generic_shutdown_super
kill_anon_super
deactivate_locked_super
deactivate_super
cleanup_mnt
__cleanup_mnt
task_work_run
exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall.Syscall.abi0
github.com/containerd/containerd/mount.unmount
github.com/containerd/containerd/mount.UnmountAll
github.com/containerd/containerd/mount.WithTempMount.func2
github.com/containerd/containerd/mount.WithTempMount
github.com/containerd/containerd/oci.WithUserID.func1
github.com/containerd/containerd/oci.WithUser.func1
github.com/containerd/containerd/oci.ApplyOpts
github.com/containerd/containerd.WithSpec.func1
github.com/containerd/containerd.(*Client).NewContainer
github.com/containerd/containerd/pkg/cri/server.(*criService).CreateContainer
github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).CreateContainer
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler.func1
github.com/containerd/containerd/services/server.unaryNamespaceInterceptor
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler
google.golang.org/grpc.(*Server).processUnaryRPC
google.golang.org/grpc.(*Server).handleStream
google.golang.org/grpc.(*Server).serveStreams.func1.2
runtime.goexit.abi0
containerd [34771]
1
```

If several create requests arrive at once, the umount actions can put high IO pressure on /var/lib/containerd's underlying disk. After checking the kernel code [1], the kernel will not call __sync_filesystem if the mount is readonly. Based on this, containerd should use a readonly mount to get the UID/GID information.

Reference:
* [1] https://elixir.bootlin.com/linux/v5.13/source/fs/sync.c#L61

Closes: containerd#4604
Signed-off-by: Wei Fu <fuweid89@gmail.com>
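The shape of the fix can be sketched as follows. This is an illustrative reduction assuming containerd's `mount` package (`mount.Mount` and `mount.WithTempMount` are real); the function name and the blanket "ro" handling are assumptions, since the actual patch deals with overlay specifics rather than this generic option rewrite.

```
import (
	"context"

	"github.com/containerd/containerd/mount"
)

// readUserInfoReadonly prepends "ro" to the snapshot mount options before
// the temp mount, so the later umount can skip __sync_filesystem (see the
// kernel reference above). A sketch of the patch's idea, not the literal diff.
func readUserInfoReadonly(ctx context.Context, mounts []mount.Mount,
	read func(root string) error) error {
	ro := make([]mount.Mount, 0, len(mounts))
	for _, m := range mounts {
		m.Options = append([]string{"ro"}, m.Options...)
		ro = append(ro, m)
	}
	return mount.WithTempMount(ctx, ro, read)
}
```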
@sadortun I filed a PR to improve this: #6478 (comment). I'm not sure what differs between docker and containerd on GKE, sorry about that.
We're on GKE 1.21.6-gke1500 and we've been seeing this problem for the last 1-2 months |
I got some good results showing this patch improves the latency of "CreateContainer".
Please note this is based on a couple of experiments, not an ample data set.
Summary (2022/02)
The "failed to reserve container name" error is returned by the containerd CRI if there is an in-flight … Don't panic: given sufficient time, the container and pod will be created successfully, as long as you are using …
Root Cause and Fix
Slow disk operations (e.g. disk throttling on GKE) are the culprit. The heavy disk IO can come from a number of sources: the user's disk-heavy workload, large image pulls, and the containerd CRI implementation itself. An unnecessary … Please note there are perhaps other undiscovered reasons contributing to this problem.
Mitigation
Description
Hi!
We are running containerd on GKE with pretty much all defaults. A dozen nodes and a few hundred pods. Plenty of memory and disk free.
We started to have many pods fail with a
failed to reserve container name
error in the last week or so. I do not recall any specific changes to the cluster, or to the containers themselves. Any help will be greatly appreciated!
Steps to reproduce the issue:
I have no clue how to specifically reproduce this issue.
The cluster has nothing special and the deployment is straightforward. The only thing that could be relevant is that our images are quite large, around 3 GB.
I got a few more details here : https://serverfault.com/questions/1036683/gke-context-deadline-exceeded-createcontainererror-and-failed-to-reserve-contai
Describe the results you received:
Describe the results you expected:
Live a happy life, error free :)
Output of containerd --version:
Any other relevant information: