
'failed to reserve container name' #4604

Closed
sadortun opened this issue Oct 7, 2020 · 50 comments · Fixed by #6478
@sadortun

sadortun commented Oct 7, 2020

Description

Hi!

We are running containerd on GKE with pretty much all defaults. A dozen nodes and a few hundred pods. Plenty of memory and free disk.

In the last week or so, many pods have started to fail with the failed to reserve container name error. I do not recall any specific changes to the cluster or to the containers themselves.

Any help will be greatly appreciated!

Steps to reproduce the issue:
I have no clue how to specifically reproduce this issue.

The cluster has nothing special and the deployment is straightforward. The only thing that could be relevant is that our images are quite large, around 3 GB.

I posted a few more details here: https://serverfault.com/questions/1036683/gke-context-deadline-exceeded-createcontainererror-and-failed-to-reserve-contai

Describe the results you received:

2020-10-07T08:01:45Z Successfully assigned default/apps-abcd-6b6cb5876b-nn9md to gke-bap-mtl-1-preemptible-e2-s4-e6a8ddb4-ng3v I 
2020-10-07T08:01:50Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:16:45Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:18:45Z Error: context deadline exceeded W 
2020-10-07T08:18:45Z Container image "redis:4.0-alpine" already present on machine I 
2020-10-07T08:18:53Z Created container redis I 
2020-10-07T08:18:53Z Started container redis I 
2020-10-07T08:18:53Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:02Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:02Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:03Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:20Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:20Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:21Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:34Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:34Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:35Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:44Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:44Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:54Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:08Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:08Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:20:18Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:30Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:30Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:21:19Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:26:35Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:31:36Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:36:26Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:41:18Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:46:41Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 

Describe the results you expected:
Live a happy life, error free :)

Output of containerd --version:

containerd github.com/containerd/containerd 1.3.2 ff48f57fc83a8c44cf4ad5d672424a98ba37ded6

Any other relevant information:

@windniw

windniw commented Nov 20, 2020

It looks like there is a container with name web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0 and ID 8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f in containerd. While kubelet wants to create a new one with the same name, the CRI plugin fails on it.

Could you show the output of docker ps -a or ctr c list?

@pfuhrmann

pfuhrmann commented Dec 27, 2020

Did you manage to resolve this issue, @sadortun? We are experiencing the same, also on GKE with the containerd runtime.

  Normal   Scheduled  2m21s               default-scheduler  Successfully assigned ***-77c979f8bf-px4v9 to gke-***-389a7e33-t1hl
  Warning  Failed     20s                 kubelet            Error: context deadline exceeded
  Normal   Pulled     7s (x3 over 2m20s)  kubelet            Container image "***" already present on machine
  Warning  Failed     7s (x2 over 19s)    kubelet            Error: failed to reserve container name ***-77c979f8bf-px4v9_***": name "***-77c979f8bf-px4v9_***" is reserved for "818fcfef09165d91ac8c86ed88714bb159a8358c3eca473ec07611a51d72b140"

We are deploying the same image to multiple deployments (30-40 pods) at the same time. No such issues with the docker runtime.

Eventually, kubelet is able to resolve this without manual intervention; however, it significantly slows the rollout of new images during a release (an extra 2-3 minutes to resolve the name conflicts).

@sadortun
Author

Hi @pfuhrmann

We investigated this quite deeply with the GKE dev team and were not able to reproduce it.

That said, we are pretty convinced the issue comes from one of the following two issues:

  • Disk I/O is too high and containerd times out because of that
  • Starting 10-20+ pods at the same time on a single node causes a memory usage spike, and at some point the containerd process gets killed.

Unfortunately, after a month of back and forth with the GKE devs, we were not able to find a solution.

The good news is that, for us, refactoring our application let us reduce the number of pods starting at the same time from about 20 down to 5. Since then, we have had no issues.

You might also want to increase the node boot drive size. It seems to help too.

@kmarji

kmarji commented Apr 25, 2021

Any update on this? Did anybody manage to solve it? We are facing the same issue.

@chrisroat

We are also seeing the same issue, GKE with containerd. It does seem to be correlated with starting many pods at once.

Switching from cos_containerd back to cos (docker based) seems to have resolved the situation, at least in the short term.

@kmarji

kmarji commented May 22, 2021

We are also seeing the same issue, GKE with containerd. It does seem to be correlated with starting many pods at once.

Switching from cos_containerd back to cos (docker based) seems to have resolved the situation, at least in the short term.

Same for us. Once we switched back to cos with docker, everything worked.

@sadortun
Author

sadortun commented May 22, 2021

Same for us. Once we switched back to cos with docker, everything worked.

In the end we still had occasional issues, and we also had to switch back to cos.

@mikebrow
Member

jotting down some notes here, apologies if it's lengthy:

Let me try to explain/figure out the reason you got "failed to reserve container name" ..

Kubelet tried to create a container that it had already asked containerd to create at least once. When containerd tried the first time, it received a variable in the container-create metadata named attempt, and that variable held the default value 0. Containerd then reserved the unique name for attempt 0 that you see in your log (note the _0 at the end of the name): "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0". Something then caused a context timeout between kubelet and containerd. The kubelet context timeout is configurable ("--runtime-request-timeout duration Default: 2m0s"), and a 2-minute timeout could happen for any number of reasons: an unusually long garbage collection, a file-system hiccup, locked files, deadlocks while waiting, some very expensive init operation occurring on the node for one of your other containers... who knows? That's why we have/need recovery procedures.

What should have happened is that kubelet should've incremented the attempt number (or at least that's how I see it from this side, the containerd side, of the CRI API). But kubelet did not increment the attempt number, and containerd was still trying to create the container from the first request. The create on the containerd side may even have finished by this point; it is possible the timeout only happened on the kubelet side and containerd continued finishing the create, possibly even attempting to return the success result. If containerd had actually failed, it would have released the reservation, because the immediate thing after we reserve it in containerd is to defer its removal on any error in the create: https://github.com/containerd/containerd/blob/master/pkg/cri/server/container_create.go#L65-L84
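To make the reserve-then-defer-release flow just described concrete, here is a minimal, self-contained sketch of the pattern (not containerd's actual registrar code; names and types here are illustrative). The key point: the name is only released if the create returns an error, so a retry that reuses the same name and attempt number while the first create is still in flight gets rejected with exactly this error.

```go
package main

import (
	"fmt"
	"sync"
)

// nameRegistrar maps a reserved container name to the container ID holding it.
type nameRegistrar struct {
	mu    sync.Mutex
	names map[string]string
}

func newNameRegistrar() *nameRegistrar {
	return &nameRegistrar{names: map[string]string{}}
}

// Reserve fails if another container ID already holds the name.
func (r *nameRegistrar) Reserve(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if owner, ok := r.names[name]; ok && owner != id {
		return fmt.Errorf("failed to reserve container name %q: name is reserved for %q", name, owner)
	}
	r.names[name] = id
	return nil
}

// Release frees the name so a later create attempt can claim it.
func (r *nameRegistrar) Release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.names, name)
}

// createContainer mimics the control flow: reserve up front, and only release
// via the deferred cleanup if this create fails.
func createContainer(reg *nameRegistrar, name, id string) (retErr error) {
	if err := reg.Reserve(name, id); err != nil {
		return err
	}
	defer func() {
		if retErr != nil {
			reg.Release(name) // released only on error
		}
	}()
	// ... long-running create work; if it never returns (or succeeds),
	// the name stays reserved.
	return nil
}

func main() {
	reg := newNameRegistrar()
	name := "web_pod_default_uid_0" // hypothetical kubelet-generated name for attempt 0
	_ = createContainer(reg, name, "container-a")

	// A kubelet retry with the same attempt number hits the reservation:
	if err := createContainer(reg, name, "container-b"); err != nil {
		fmt.Println(err)
	}
}
```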

So, OK, skimming over the kubelet code: I believe this is the code that decides what attempt number we are on? https://github.com/kubernetes/kubernetes/blame/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L173-L292

In my skim I think I see a window where kubelet will try attempt 0 a second time after the first create attempt fails with a context timeout. But I may be reading the code wrong? @dims @feiskyer @Random-Liu

@CyberHippo

Bumped into this issue as well. Switching back to cos with docker.

@jsoref

jsoref commented Aug 26, 2021

Fwiw, we're hitting this this week.

k8s 1.20.8-gke.900; containerd://1.4.3

Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.8-gke.900", GitCommit:"28ab8501be88ea42e897ca8514d7cd0b436253d9", GitTreeState:"clean", BuildDate:"2021-06-30T09:23:36Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}

kubectl get nodes -o json | jq '.items[].status.nodeInfo.containerRuntimeVersion' |uniq
"containerd://1.4.3"

In my case, the pod is owned by a (batch/v1)job, and the job by a (batch/v1beta1)cronjob.

The "is reserved for" ID only appears in the error; nothing else seems to know about it.

Using Google cloud logging, I can search:

"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" OR "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"

w/ a search range of 2021-08-22 01:58:00.000 AM EDT..2021-08-22 02:03:00.000 AM EDT

This is the first hit:

⚠️ Warning 2021-08-22 02:02:44.000 EDT
backup-test-db-1629612000-cz8ks
"Error: failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32""

And this is the second hit:

🌟 Default
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:44.792364 1653 remote_runtime.go:227] CreateContainer in sandbox "c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"

There are additional hits, but they aren't exciting.

For reference, this search (with the same time params) yields nothing:

("backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" OR "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32")
-"Attempt:0"
"Attempt"

This search

("backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" OR "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32")
"Attempt:0"

yields two entries:

Default
2021-08-22 02:02:45.219 EDT
gke-default-cluster-default-pool-c90133be-6xkd
time="2021-08-22T06:02:44.792102443Z" level=error msg="CreateContainer within sandbox \"c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128\" for &ContainerMetadata{Name:backup-db,Attempt:0,} failed" error="failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""
Default
2021-08-22 02:02:56.899 EDT
gke-default-cluster-default-pool-c90133be-6xkd
time="2021-08-22T06:02:56.899853062Z" level=error msg="CreateContainer within sandbox \"c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128\" for &ContainerMetadata{Name:backup-db,Attempt:0,} failed" error="failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""

(There are additional hits if I extend the time window forward, but as they appear to be identical other than the timestamp, I don't see any value in repeating them.)

Relevant log events

The best query I've found is:

resource.labels.cluster_name="default-cluster"
"backup-test-db-1629612000-cz8ks_test"

(The former is to limit which part of GCloud to search, and the latter is the search.)

Default
2021-08-22 02:00:03.259 EDT
gke-default-cluster-default-pool-c90133be-6xkd
I0822 06:00:03.259074 1653 kubelet.go:1916] SyncLoop (ADD, "api"): "backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"
Default
2021-08-22 02:00:03.569 EDT
gke-default-cluster-default-pool-c90133be-6xkd
I0822 06:00:03.569830 1653 kuberuntime_manager.go:445] No sandbox for pod "backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)" can be found. Need to start a new one
Default
2021-08-22 02:00:43.213 EDT
gke-default-cluster-default-pool-c90133be-6xkd
I0822 06:00:43.213133    1653 kubelet.go:1954] SyncLoop (PLEG): "backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)", event: &pleg.PodLifecycleEvent{ID:"efe343a0-5641-427c-8a65-1b7dc939432d", Type:"ContainerStarted", Data:"c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128"}
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:44.792364 1653 remote_runtime.go:227] CreateContainer in sandbox "c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"
Default
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:42.840645 1653 pod_workers.go:191] Error syncing pod efe343a0-5641-427c-8a65-1b7dc939432d ("backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"), skipping: failed to "StartContainer" for "backup-db" with CreateContainerError: "context deadline exceeded"
Default
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:44.792589 1653 pod_workers.go:191] Error syncing pod efe343a0-5641-427c-8a65-1b7dc939432d ("backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"), skipping: failed to "StartContainer" for "backup-db" with CreateContainerError: "failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""
Default
2021-08-22 02:02:56.900 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:56.900057 1653 remote_runtime.go:227] CreateContainer in sandbox "c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"
Default
2021-08-22 02:02:56.900 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:56.900309 1653 pod_workers.go:191] Error syncing pod efe343a0-5641-427c-8a65-1b7dc939432d ("backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"), skipping: failed to "StartContainer" for "backup-db" with CreateContainerError: "failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""

@matti

matti commented Sep 17, 2021

same, switching back to docker

@bduclaux

Same issue here, with gke version 1.20.9-gke.701 .

@mikebrow
Member

@dims @feiskyer @Random-Liu @ehashman

FYI.. kubelet is trying to create the same container twice with the same start count "0", and also does not respond to the "failed to reserve container name" error message. Thoughts?

@donnyv12

Having this issue as well with GKE. Appreciate the detailed research folks have posted here.

@ehashman

@dims @feiskyer @Random-Liu @ehashman

FYI.. kubelet is trying to create the same container twice with the same start count "0", and also does not respond to the "failed to reserve container name" error message. Thoughts?

Is there a reproducer? Can you please file an issue against k/k?

@bduclaux

I confirm that moving back to docker solves the problem:

gcloud container clusters upgrade mycluster --image-type cos --node-pool mynodepool --zone myzone

@matti

matti commented Sep 30, 2021

This also happens with UBUNTU_CONTAINERD, not just COS_CONTAINERD

@qiutongs
Contributor

qiutongs commented Oct 27, 2021

@mikebrow I previously investigated a reported issue in k/k: kubernetes/kubernetes#94085

My summary is that kubelet has correct logic for incrementing the restart number, which is set to "current_restart + 1". See this kubelet code, and the sketch after the list below.

  1. If the CreateContainer request eventually succeeds on the containerd side, kubelet will see it and increment the restart count on the next iteration of SyncPod. The pod will eventually be ready.
  2. If the CreateContainer request eventually fails on the containerd side, containerd should release the name. On the next iteration of kubelet's SyncPod, it shouldn't see the "failed to reserve container name" error.
  3. If the CreateContainer request is stuck on the containerd side, the name is never released. Then kubelet will keep seeing the "failed to reserve container name" error.
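A rough sketch of the attempt-number rule described above (illustrative only; these are not kubelet's real names or types): the attempt is derived from the latest container status the runtime has reported, so if the first CreateContainer never produced a visible container (case 3), the retry reuses attempt 0 and collides with the still-reserved name.

```go
package main

import "fmt"

// containerStatus is a stand-in for the status a runtime reports for a container.
type containerStatus struct {
	RestartCount int
}

// nextAttempt mirrors the "current_restart + 1" rule: bump the latest known
// restart count, or start at 0 when the runtime has never reported this
// container (e.g. the first create is still stuck inside the runtime).
func nextAttempt(latest *containerStatus) int {
	if latest == nil {
		return 0 // no visible container yet, so a retry reuses attempt 0
	}
	return latest.RestartCount + 1
}

func main() {
	fmt.Println(nextAttempt(nil))                               // 0: collides with the reserved name
	fmt.Println(nextAttempt(&containerStatus{RestartCount: 0})) // 1: a fresh name, no conflict
}
```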

@dungdm93

dungdm93 commented Nov 8, 2021

Hello @qiutongs, my k8s cluster is used to run Spark jobs, and recently we have gotten a lot of CreateContainerError errors.
[screenshot]
As in your report, I found that on the first request containerd created the container, but it is stuck in the Created state.
[screenshot]
I also checked the logs of both containerd and kubelet but did not find any more details.
Do you have any guide, instructions, or recommendations to help me dig deeper? Thanks.

@navnitDevOps

Does anyone have any other solution to this issue besides moving back to cos with docker? Eventually we have to upgrade our GKE cluster, and we might face this issue again in the future.

@bduclaux

Unfortunately, the only working solution is to move back to cos with docker.
Amazing to see that this critical bug was opened more than a year ago, and there is still no fix.

@sadortun
Author

sadortun commented Nov 16, 2021

@navnitDevOps

Does anyone have any other solution

Have a look at how many containers are starting at the same time, and how much memory each of them takes at startup. Make sure you have plenty left.

In our case we were short on RAM; having more free RAM seemed to help, but we still had to revert to COS.

@eduardoaw

eduardoaw commented Nov 30, 2021

Same problem here

containerd github.com/containerd/containerd 1.4.6 d71fcd7d8303cbf684402823e425e9dd2e99285d

Amazon EKS 1.21

@elucidsoft

Just had this happen to me on GKE.

@dungdm93

@danielfoehrKn If the root cause is slow disk, I wonder why docker does not suffer from the same issue.

@danielfoehrKn

danielfoehrKn commented Jan 14, 2022

If the root cause is slow disk, I wonder why docker does not suffer from the same issue.

At least for us, the disk was most likely the reason why the initial request to the CRI timed out.
After it timed out, the pod could not be created because each subsequent API call by the kubelet ran into the failed to reserve container name problem.

We have not tested it on docker (we cannot switch either). However, if dockerd/dockershim does not have the same problem due to keeping some false "state", then subsequent requests by the kubelet to create the PodSandbox could work, given that the reason for the initial DeadlineExceeded error from the container runtime (such as disk I/O overload) is no longer present.
This could be the case when many I/O-heavy pods are scheduled to the node at the same time, all causing image pulls, doing initialization work, and so on.

@mikebrow
Member

@danielfoehrKn If the root cause is slow disk, I wonder why docker does not suffer from the same issue.

Per CRI, containerd receives and processes requests in parallel. Kubelet has a couple of bugs/issues here: it makes too many parallel requests even under disk pressure, and when a timeout occurs on its side it fails to recognize that the problem was of its own creation; it presumes that asking the CRI to do the same thing again will make it work this time, even though the CRI is still doing what was previously requested, and containerd reports the error. The dockershim path (kubelet down through the docker API and then through containerd, versus directly to containerd) has more serialization and different code in the now-deleted dockershim (even though containerd was still in the path), thus producing different behavior. Sometimes more serialization is faster, for example when too much resource pressure results in thrashing of resources: memory swapping to disk while creating snapshots, new containers loading up, garbage collection of memory used for older requests, etc.

Kubelet should be modified to recognize the timeout situation and avoid subsequent duplicate requests. Alternatively, we could modify the CRI API to serialize service requests (I do not recommend we go this way), or change the API from parallel requests with client-side timeouts to subscription requests with a first ack response and subsequent status-change event responses. We could also change the CRI API to request (by policy) serialization (through queuing) of requests when under pressure, or serialization by "failing" requests when already processing parallel requests under pressure (also not recommended). If we want to "manage" requests at the CRI level we can do so, but then we're going to want to talk about node-management policies.
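One possible client-side shape for "avoid subsequent duplicate requests" (a sketch only, not something adopted by kubelet or proposed in this thread): collapse concurrent creates for the same container name into one in-flight call, so a caller that times out and retries joins the existing call instead of issuing a second identical request. The sketch uses golang.org/x/sync/singleflight; the slow RPC is a hypothetical stand-in for a CRI CreateContainer call.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// createContainerOnce deduplicates CreateContainer calls by container name.
// If a previous call for the same name is still running, a retry waits on
// that call (or times out again) instead of sending a new RPC.
func createContainerOnce(ctx context.Context, name string, rpc func() (string, error)) (string, error) {
	ch := group.DoChan(name, func() (interface{}, error) {
		return rpc() // the single shared in-flight call
	})
	select {
	case res := <-ch:
		if res.Err != nil {
			return "", res.Err
		}
		return res.Val.(string), nil
	case <-ctx.Done():
		// The caller gives up, but the shared call keeps running; a later
		// retry with the same name attaches to it rather than duplicating it.
		return "", ctx.Err()
	}
}

func main() {
	slowRPC := func() (string, error) { // hypothetical slow CreateContainer RPC
		time.Sleep(3 * time.Second)
		return "container-id", nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()
	if _, err := createContainerOnce(ctx, "web_pod_default_uid_0", slowRPC); err != nil {
		fmt.Println("first attempt:", err) // context deadline exceeded
	}

	// The retry joins the still-running call instead of starting a second one.
	id, err := createContainerOnce(context.Background(), "web_pod_default_uid_0", slowRPC)
	fmt.Println("retry:", id, err)
}
```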

@jsturtevant
Contributor

I have also seen something similar with Windows sandbox creation in the Windows e2e tests recently. I looked for an issue to track possible changes to kubelet and couldn't find one, so I created kubernetes/kubernetes#107561

@qiutongs
Contributor

The dockershim path (kubelet down through the docker API and then through containerd, versus directly to containerd) has more serialization and different code in the now-deleted dockershim (even though containerd was still in the path), thus producing different behavior. Sometimes more serialization is faster

@mikebrow Do you have more details to share? More specifically, in terms of rate-controlling container runtime requests, how do dockershim and the containerd CRI plugin behave differently?

@kubino148

Happened to me too. It's a really serious bug when you are running your GitLab CI/CD runners on containerd-based k8s, because some pipelines are designed to run multiple containers in parallel, and this bug happens very often. Is going back to docker really the only option here?

@matti

matti commented Jan 24, 2022

Is going back to docker really the only option here?

yes, now it looks like it.

@fuweid
Member

fuweid commented Jan 24, 2022

Hi @matti, @kubino148, @sadortun, and all subscribers, could you provide the goroutine stack of containerd when you see the error? Thanks.

kill -USR1 $(pidof containerd) will trigger the dump; then check the containerd log to get the stack.

@fuweid fuweid self-assigned this Jan 24, 2022
@qiutongs
Contributor

Amended Theory 1

(See the original theory 1 in #4604 (comment))

Docker has a similar mechanism of "reserving container names" to prevent conflicts. However, dockershim handles it differently from the containerd CRI implementation (pseudocode below):

// Pseudocode for dockershim's handling of a container-name conflict:
err = docker.CreateContainer(name)
if err == "container name conflict" {
    // Try to remove the stale container that is holding the name.
    removeErr = docker.RemoveContainer(name)
    if removeErr == nil {
        // Removed it; return the original error and let kubelet retry later.
        return err
    }
    if removeErr == "Container not found" {
        // The conflicting container is already gone; retry right away with a
        // randomized name: XYZ -> XYZ_<RANDOM_SUFFIX>.
        return docker.CreateContainer(name + "_<RANDOM_SUFFIX>")
    }
    return err
}

https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/dockershim/helpers.go#L284

In fact, this difference in retry behavior leads to significantly different CRI request rates between dockershim and containerd. With containerd, a CreateContainer request arrives about every 10-20s (see the example in #4604 (comment)). In the dockershim case, a CreateContainer request arrives about every 2 min. This is because requests that hit "failed to reserve name" return quickly in containerd, while requests with a new randomized container name can take 2 min in dockershim. This applies to RunPodSandbox as well. So the load of CRI requests on containerd is roughly 10x the load on dockershim, and I infer this makes the node even more overloaded.

This theory echoes a similar bug solved in CRI-O (https://bugzilla.redhat.com/show_bug.cgi?id=1785399), in which the solution says "Now, when systems are under load, CRI-O does everything it can to slow down the Kubelet and reduce load on the system."

I believe our direction is also to slow down the Kubelet when it sends too many requests. This might be aligned with Mike's comment: #4604 (comment)

@chrisroat

Is going back to docker really the only option here?

yes, now it looks like it.

On GCP, only for a little while longer, though. Just got an email:

[Action Required] Migrate to Containerd node images before GKE v1.24

Support for Docker as a container runtime on Kubernetes nodes will be removed from OSS Kubernetes and GKE starting with v1.24. Please migrate your GKE workloads to Containerd as soon as possible.

@jsoref

jsoref commented Jan 25, 2022

If I'm reading https://cloud.google.com/kubernetes-engine/docs/release-schedule and https://cloud.google.com/kubernetes-engine/versioning#lifecycle correctly, 1.23 will be supported through approximately 2023-06, which is a bit in the future. (This would require you to have created a static cluster as opposed to a regular upgrading cluster.)

Personally, I'd rather figure out what's wrong here and get it fixed (but I'm currently snowed under a bunch of other tasks, sorry).

@matti

matti commented Jan 25, 2022

from GKE email:

Most user workloads do not have dependencies on the container runtime. If you use a node image based on Dockershim, please migrate your GKE workloads to a Containerd node image as soon as possible.

How can we get someone from GKE on this thread?

@qiutongs
Contributor

How can we get someone from GKE on this thread?

Hi Matti, I am from GKE. We are fully aware of this issue and are prioritizing it.

@fuweid
Member

fuweid commented Jan 25, 2022

That said, we are pretty convinced the issue comes from one of the following two issues:

Disk I/O is too high and containerd times out because of that
Starting 10-20+ pods at the same time on a single node causes a memory usage spike, and at some point the containerd process gets killed.

@sadortun I was trying to reproduce it with high disk I/O, but sorry, I couldn't reproduce it. :(

If the CreateContainer doesn't return, it may be hanging in one of two syscalls:

  • umount: CreateContainer will mount the rootfs to read the user ID. If the rootfs is overlayfs, umount will sync the device the rootfs lives on. Since the I/O pressure is already high, unmounting the overlayfs makes it worse.
  • fsync: containerd uses boltdb to store container metadata, but it only syncs file data, so this is unlikely.

Docker uses a lazy umount, which might hide the problem.
If someone sees this issue, please provide more details, like:

  • kernel version
  • container snapshotter name: overlayfs or other
  • cat /proc/$(pidof containerd)/task/*/stack

Thanks

@sadortun
Author

@fuweid

Thanks for your time on this issue.

Unfortunately, I stopped using COS back in 2020 after we could not find a solution.

I'm 97% sure we were using overlayfs, and as for the rest I have no way to retrieve that historical data.

Sorry about that.

fuweid added a commit to fuweid/containerd that referenced this issue Jan 26, 2022
In the Linux kernel, unmounting a writable mountpoint triggers a sync-fs
to make sure that dirty pages are flushed to the underlying filesystem.
Many umount actions at the same time may introduce performance issues on
an IOPS-limited disk.

When the CRI plugin creates a container, it temp-mounts the rootfs to read
the UID/GID info for the entrypoint. The rootfs is a writable snapshot, so
after the read, the umount invokes a sync-fs action.

For example, using overlayfs on ext4, use bcc-tools to monitor the
ext4_sync_fs call:

```
// uname -a
Linux chaofan 5.13.0-27-generic containerd#29~20.04.1-Ubuntu SMP Fri Jan 14 00:32:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

// open terminal 1
kubectl run --image=nginx --image-pull-policy=IfNotPresent nginx-pod

// open terminal 2
/usr/share/bcc/tools/stackcount ext4_sync_fs -i 1 -v -P

  ext4_sync_fs
  sync_filesystem
  ovl_sync_fs
  __sync_filesystem
  sync_filesystem
  generic_shutdown_super
  kill_anon_super
  deactivate_locked_super
  deactivate_super
  cleanup_mnt
  __cleanup_mnt
  task_work_run
  exit_to_user_mode_prepare
  syscall_exit_to_user_mode
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
  syscall.Syscall.abi0
  github.com/containerd/containerd/mount.unmount
  github.com/containerd/containerd/mount.UnmountAll
  github.com/containerd/containerd/mount.WithTempMount.func2
  github.com/containerd/containerd/mount.WithTempMount
  github.com/containerd/containerd/oci.WithUserID.func1
  github.com/containerd/containerd/oci.WithUser.func1
  github.com/containerd/containerd/oci.ApplyOpts
  github.com/containerd/containerd.WithSpec.func1
  github.com/containerd/containerd.(*Client).NewContainer
  github.com/containerd/containerd/pkg/cri/server.(*criService).CreateContainer
  github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).CreateContainer
  k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler.func1
  github.com/containerd/containerd/services/server.unaryNamespaceInterceptor
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1
  k8s.io/cri-api/pkg/apis/runtime/v1._RuntimeService_CreateContainer_Handler
  google.golang.org/grpc.(*Server).processUnaryRPC
  google.golang.org/grpc.(*Server).handleStream
  google.golang.org/grpc.(*Server).serveStreams.func1.2
  runtime.goexit.abi0
    containerd [34771]
    1
```

If several create requests come in at once, the umount actions may bring
high I/O pressure on the disk underlying /var/lib/containerd.

After checking the kernel code[1], the kernel will not call
__sync_filesystem if the mount is read-only. Based on this, containerd
should use a read-only mount to get the UID/GID information.
Reference:

* [1] https://elixir.bootlin.com/linux/v5.13/source/fs/sync.c#L61

Closes: containerd#4604

Signed-off-by: Wei Fu <fuweid89@gmail.com>
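For illustration, here is a minimal sketch of the direction described in the commit message above: mount the snapshot layers read-only (a lowerdir-only overlay) just to read /etc/passwd, so the later unmount has no dirty pages to sync. This is a standalone example using raw mount syscalls, not containerd's actual temp-mount code; the layer paths are hypothetical, and it needs root on Linux.

```go
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// readPasswdReadonly mounts the image layers as a read-only overlay (lowerdir
// only, no upperdir/workdir), reads /etc/passwd for UID/GID lookup, and then
// unmounts. Because the mount is read-only, the kernel skips sync_filesystem
// on unmount (see fs/sync.c), so this adds no extra write I/O pressure.
func readPasswdReadonly(lowerDirs []string) ([]byte, error) {
	tmp, err := os.MkdirTemp("", "ro-rootfs-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(tmp)

	opts := "lowerdir=" + strings.Join(lowerDirs, ":")
	if err := unix.Mount("overlay", tmp, "overlay", unix.MS_RDONLY, opts); err != nil {
		return nil, fmt.Errorf("mount overlay: %w", err)
	}
	// Detach lazily, similar to docker's lazy umount mentioned earlier.
	defer unix.Unmount(tmp, unix.MNT_DETACH)

	return os.ReadFile(tmp + "/etc/passwd")
}

func main() {
	// Hypothetical snapshot layer paths; replace with real layer directories.
	layers := []string{"/var/lib/example/snapshots/2/fs", "/var/lib/example/snapshots/1/fs"}
	data, err := readPasswdReadonly(layers)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("read %d bytes of /etc/passwd\n", len(data))
}
```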
fuweid added a commit to fuweid/containerd that referenced this issue Jan 28, 2022
fuweid added a commit to fuweid/containerd that referenced this issue Jan 28, 2022
@fuweid
Member

fuweid commented Jan 28, 2022

@sadortun I filed a PR to enhance this: #6478 (comment)

Not sure what differs between docker and containerd in GKE, sorry about that.

@derekperkins

We're on GKE 1.21.6-gke1500 and we've been seeing this problem for the last 1-2 months

@qiutongs
Contributor

qiutongs commented Feb 1, 2022

@sadortun I filed a PR to enhance this: #6478 (comment)

I got some good results showing this patch improves the latency of "CreateContainer".

  • Setup: GKE 1.20 ubuntu_containerd node with 10 GB boot disk
  • Execution A
    • Add some disk I/O: stress-ng --io 1 -d 1 --timeout 7200 --hdd-bytes 8M
    • Create a deployment of nginx with 25 replicas
    • Check containerd log
  • Execution B: repeat above steps with a newly built containerd with this patch
  • Result A
    • Seen "failed to reserve container name"
    • The last pod is ready after 15min
  • Result B
    • Not seen "failed to reserve container name"
    • All CreateContainer complete within 2 mins
    • The last pod is ready after 10min

Please note this is based on a couple of experiments, not an ample data set. stress-ng doesn't produce stable IOPS, so the disk state cannot be exactly the same in the two cases.

@qiutongs
Contributor

qiutongs commented Feb 1, 2022

Summary (2022/02)

"failed to reserve container name" error is returned by containerd CRI if there is an in-flight CreateContainer request reserving the same container name (like below).
T1: 1st CreateContainer(XYZ) request is sent. (Timeout on Kubelet side)
T2: 2nd CreateContainer(XYZ) request is sent (Kubelet retry)
T3: 2nd CreateContainer request returns "failed to reserve container name XYZ" error
T4: 1st CreateContainer request is still in-flight…

Don't panic. Given sufficient time, the container and pod will be created successfully, as long as you are using restartPolicy:Always or restartPolicy:OnFailure in PodSpec.

Root Cause and Fix

Slow disk operations (e.g. disk throttling on GKE) are the culprit. The heavy disk I/O can come from a number of sources: the user's disk-heavy workloads, pulls of big images, and the containerd CRI implementation itself.

An unnecessary sync-fs operation was found in the CreateContainer stack; it is where CreateContainer gets stuck. The sync-fs is removed in #6478. Not only does this make CreateContainer return faster, it also reduces the disk I/O generated by containerd.

Please note there may be other undiscovered reasons contributing to this problem.

Mitigation

  1. If pods have failed, consider using restartPolicy:Always or restartPolicy:OnFailure in the PodSpec
  2. Increase the boot disk IOPS (e.g. upgrade the disk type or increase the disk size)
  3. Upgrade containerd with the patch "oci: use readonly mount to read user/group info" #6478, which will be available in 1.6+ and 1.5.x (backport in progress)

@containerd containerd locked as resolved and limited conversation to collaborators Feb 1, 2022
qiutongs pushed a commit to qiutongs/containerd that referenced this issue Feb 3, 2022
wllenyj pushed a commit to wllenyj/containerd that referenced this issue Feb 8, 2022
katiewasnothere pushed a commit to katiewasnothere/containerd that referenced this issue Apr 21, 2022