
runtime/v2/runc: handle early exits w/o big locks #8617

Merged
merged 1 commit on Jun 14, 2023

Conversation

corhere
Contributor

@corhere corhere commented May 31, 2023

eventSendMu is causing severe lock contention when multiple processes start and exit concurrently. Replace it with a different scheme for maintaining causality w.r.t. start and exit events for a process which does not rely on big locks for synchronization.

Keep track of all processes for which a Task(Exec)Start event has been published and which have not yet exited in a map, keyed by their PID. Processing exits is then as simple as looking up which process corresponds to the PID. If no started process is known with that PID, the PID must either belong to a process which was started by s.Start() but has not yet been added to the map of running processes, or to a reparented process which we don't care about. Handle the former case by having each s.Start() call subscribe to "early exit" events before starting the process. It checks whether the PID exited in the time between starting the process and publishing the TaskStart event, and handles the exit if it has. Exit events for reparented processes received when no s.Start() calls are in flight are immediately discarded, and events received during an s.Start() call are discarded when the s.Start() call returns.
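To make the scheme concrete, here is a minimal, self-contained sketch of the idea in Go. It is not the actual shim code: the names `pendingStarts`, `earlyExits`, `publishStart`, and `publishExit` are illustrative stand-ins (the real implementation hands out a `handleStarted` callback, referenced later in this thread), and only `running` and `lifecycleMu` correspond to fields discussed in the review below.

```go
package main

import "sync"

type containerProcess struct {
	containerID string
	execID      string
}

type exit struct{ pid, status int }

type service struct {
	lifecycleMu   sync.Mutex
	running       map[int][]containerProcess // started, not-yet-exited processes, keyed by PID
	pendingStarts int                        // Start() calls currently in flight (illustrative)
	earlyExits    []exit                     // exits seen while a Start() is in flight (illustrative)
}

// handleExit processes an exit notification for some PID.
func (s *service) handleExit(e exit) {
	s.lifecycleMu.Lock()
	procs := s.running[e.pid]
	if len(procs) == 0 {
		if s.pendingStarts > 0 {
			// Possibly a process that an in-flight Start() launched but has not
			// yet recorded; park it so that Start() can find it.
			s.earlyExits = append(s.earlyExits, e)
		}
		// Otherwise it is a reparented process we don't care about: drop it.
		s.lifecycleMu.Unlock()
		return
	}
	delete(s.running, e.pid)
	s.lifecycleMu.Unlock()
	for _, p := range procs {
		publishExit(p, e) // TaskExit is always published after the start event
	}
}

// start shows the ordering that preserves start-before-exit causality.
func (s *service) start(p containerProcess, launch func() int) {
	s.lifecycleMu.Lock()
	s.pendingStarts++ // subscribe to early exits before starting the process
	s.lifecycleMu.Unlock()

	pid := launch()      // actually start the process
	publishStart(p, pid) // TaskStart / TaskExecStarted

	s.lifecycleMu.Lock()
	exited := false
	var early exit
	for _, e := range s.earlyExits {
		if e.pid == pid { // it exited before we could record it as running
			exited, early = true, e
			break
		}
	}
	if !exited {
		s.running[pid] = append(s.running[pid], p)
	}
	s.pendingStarts--
	if s.pendingStarts == 0 {
		s.earlyExits = nil // leftover exits belong to reparented processes: discard
	}
	s.lifecycleMu.Unlock()

	if exited {
		publishExit(p, early) // handle the early exit ourselves
	}
}

func publishStart(containerProcess, int) {}
func publishExit(containerProcess, exit) {}

func main() {
	_ = &service{running: map[int][]containerProcess{}}
}
```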

Big thanks to @laurazard for analyzing the root cause of #8557 and for also developing a fix. I would not have come up with the idea for this solution without inspiration from hers.

@k8s-ci-robot

Hi @corhere. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@corhere
Contributor Author

corhere commented May 31, 2023

/cc @laurazard @thaJeztah @cpuguy83 @dmcgowan

@k8s-ci-robot

@corhere: GitHub didn't allow me to request PR reviews from the following users: laurazard.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @laurazard @thaJeztah @cpuguy83 @dmcgowan


@corhere corhere force-pushed the reduce-exec-lock-contention branch from 1476721 to 16698bf on June 1, 2023
@corhere
Contributor Author

corhere commented Jun 2, 2023

I was able to get rid of the timed garbage collector by leveraging the fact that the only early exits which matter for a given s.Start call are the exits which are received in between container.Start() and the PID being added to s.running. I will squash down the branch if tests pass and reviewers are happy with the new approach.

@thaJeztah
Member

Looks like one commit is missing a DCO sign-off

@corhere corhere force-pushed the reduce-exec-lock-contention branch 3 times, most recently from 53dca61 to adbb1b1 on June 2, 2023
@laurazard
Member

/cc @dmcgowan @fuweid @dcantah PTAL

I chatted with @corhere, he's a bit busy this week and the next but I can follow up on any comments (would be great to get this in!)

@k8s-ci-robot

@laurazard: GitHub didn't allow me to request PR reviews from the following users: PTAL.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @dmcgowan @fuweid @dcantah PTAL

I chatted with @corhere, he's a bit busy this week and the next but I can follow up on any comments (would be great to get this in!)


@dcantah
Member

dcantah commented Jun 8, 2023

Looks like this caused TestRegressionIssue4769 to go a little haywire; it's failing for the runc, crun, and rockylinux runs. @laurazard Can you look into this?

@laurazard
Member

Aaahh, I see and can replicate it locally. I'll take a look, thanks @dcantah.

@laurazard laurazard force-pushed the reduce-exec-lock-contention branch 2 times, most recently from adbb1b1 to fc17ca0 on June 8, 2023
```diff
@@ -113,7 +113,7 @@ type service struct {
 	containers map[string]*runc.Container
 
 	lifecycleMu sync.Mutex
-	running map[int][]containerProcess // pid -> running process, guarded by lifecycleMu
+	running map[int]map[containerProcess]struct{} // pid -> running process, guarded by lifecycleMu
```
Member

@laurazard laurazard Jun 9, 2023

Commenting here to get some input: the fix for the failure mentioned in #8617 (comment) has to do with us creating duplicate entries for the same containerProcess in s.running (first in s.Create, then in s.Start), which then causes us to send exitEvents for the same containerProcess twice.

I fixed this in c43966c by replacing the containerProcess slice with a map and using the containerProcess as the key, so that we don't keep the same containerProcess there twice. This works fine, but I wonder if we can make it simpler. I don't think we can assume PIDs are unique, although if we could, we could do:

Suggested change
```diff
-	running map[int]map[containerProcess]struct{} // pid -> running process, guarded by lifecycleMu
+	running map[int]containerProcess // pid -> running process, guarded by lifecycleMu
```

If we can't assume that, I still wonder if we should do something like

Suggested change
```diff
-	running map[int]map[containerProcess]struct{} // pid -> running process, guarded by lifecycleMu
+	running map[int]map[string]containerProcess // pid -> running process, guarded by lifecycleMu
```

where the key for the inner map is something like container.ID, since I think containerID+processID should be unique.
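As a small standalone illustration of the de-duplication point (not the shim code; containerProcess is simplified here to a comparable struct): adding the same process twice to a set-valued map leaves a single entry, where a slice would grow to two.

```go
package main

import "fmt"

type containerProcess struct {
	containerID string
	execID      string
}

func main() {
	running := map[int]map[containerProcess]struct{}{}
	p := containerProcess{containerID: "c1"}

	// Insert the same process twice, e.g. once from Create and once from Start.
	for i := 0; i < 2; i++ {
		if running[1234] == nil {
			running[1234] = map[containerProcess]struct{}{}
		}
		running[1234][p] = struct{}{}
	}

	fmt.Println(len(running[1234])) // 1: the map key de-duplicates the entries
}
```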

Member

@corhere thoughts?

Member

Silly question; if pid here is containerProcess.Process.Pid, could it then be map[containerProcess]struct{}? (Or map[containerID]map[containerProcess]struct{} if we need to index per container?)

Member

@laurazard laurazard Jun 9, 2023

It's not :'(. We have multiple processes per container, and when we get the exit event back from runC we only get the specific process PID, so the "outer" map must be keyed by PID: when we get an exit event, we check our s.running map with the exit event's PID, get the containerProcesses we know are running for that PID, and process the exit event for them.

Member

Ah, ugh... gotcha

Contributor Author

@laurazard and I have concluded that the duplicate entries in s.running are really a symptom of an underlying bug in my implementation, which could manifest as an exit event being published before a start event in certain cases. Deduplicating the running map values by containerProcess does not address the root cause. The correct solution, we believe, is to have (*service).Start() remove the container init process from s.running before calling container.Start() and add it back afterwards, so that a process exiting before (*service).Start() publishes the TaskStart event is correctly handled as an early exit.
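A rough sketch of that ordering (standalone and heavily simplified; the function and field names are illustrative, not the actual patch): the init-process entry added at Create time is removed before starting the container, so an exit that lands before the TaskStart event is published falls into the early-exit path instead of being matched against a stale entry.

```go
package main

import "fmt"

type process struct{ id string }

type service struct {
	running map[int]process // PID -> running process (simplified; no locking shown)
}

// start drops the Create()-time entry before starting the container, publishes
// the start event, and only then either re-adds the process or handles a very
// early exit.
func (s *service) start(pid int, p process, launch func() (exitedEarly bool)) {
	delete(s.running, pid) // remove the init process added by Create()

	exitedEarly := launch() // container.Start(); the process may exit immediately
	fmt.Println("published TaskStart for", p.id)

	if exitedEarly {
		fmt.Println("published TaskExit for", p.id) // early exit handled after the start event
		return
	}
	s.running[pid] = p // still running: add it back so later exits can be matched
}

func main() {
	s := &service{running: map[int]process{42: {id: "init"}}}
	s.start(42, process{id: "init"}, func() bool { return true })
}
```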

@corhere corhere force-pushed the reduce-exec-lock-contention branch from c43966c to 5c52599 on June 9, 2023
eventSendMu is causing severe lock contention when multiple processes
start and exit concurrently. Replace it with a different scheme for
maintaining causality w.r.t. start and exit events for a process which
does not rely on big locks for synchronization.

Keep track of all processes for which a Task(Exec)Start event has been
published and have not yet exited in a map, keyed by their PID.
Processing exits then is as simple as looking up which process
corresponds to the PID. If there are no started processes known with
that PID, the PID must either belong to a process which was started by
s.Start() and before the s.Start() call has added the process to the map
of running processes, or a reparented process which we don't care about.
Handle the former case by having each s.Start() call subscribe to exit
events before starting the process. It checks if the PID has exited in
the time between it starting the process and publishing the TaskStart
event, handling the exit if it has. Exit events for reparented processes
received when no s.Start() calls are in flight are immediately
discarded, and events received during an s.Start() call are discarded
when the s.Start() call returns.

Co-authored-by: Laura Brehm <laurabrehm@hey.com>
Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the reduce-exec-lock-contention branch from 5c52599 to 5cd6210 on June 9, 2023
@estesp estesp added the cherry-picked/1.6.x (PR commits are cherry-picked into release/1.6 branch) and cherry-picked/1.7.x (PR commits are cherry-picked into release/1.7 branch) labels and removed the cherry-pick/1.6.x (Change to be cherry picked to release/1.6 branch) and cherry-pick/1.7.x (Change to be cherry picked to release/1.7 branch) labels on Jun 20, 2023
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 16, 2023
After PR containerd#8617, the create handler of containerd-shim-runc-v2
calls handleStarted() to record the init process and handle its exit.
Under normal circumstances the init process wouldn't quit so early, but
if this scenario occurs, handleStarted() will call handleProcessExit().
This causes a deadlock because create() has already acquired s.mu and
handleProcessExit() will try to lock it again.

I found that after PR containerd#8617, handleProcessExit() no longer
accesses s.containers, so we can remove the unnecessary lock guard to
avoid the deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
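The deadlock described in that commit message comes down to re-locking a mutex the same call chain already holds; Go's sync.Mutex is not reentrant. A minimal, hypothetical reproduction (the names mirror the commit message, but this is not the shim code):

```go
package main

import "sync"

type service struct {
	mu sync.Mutex
}

// handleProcessExit locks s.mu itself, mirroring the pre-fix shim code.
func (s *service) handleProcessExit() {
	s.mu.Lock() // blocks forever if the caller already holds s.mu
	defer s.mu.Unlock()
	// ... publish the exit event, update state ...
}

// handleStarted calls handleProcessExit directly when the process has
// already exited by the time it is recorded.
func (s *service) handleStarted(alreadyExited bool) {
	if alreadyExited {
		s.handleProcessExit()
	}
}

func (s *service) create() {
	s.mu.Lock()
	defer s.mu.Unlock()
	// If the init process exited very early, this re-locks s.mu: deadlock.
	s.handleStarted(true)
}

func main() {
	(&service{}).create() // the runtime reports "all goroutines are asleep - deadlock!"
}
```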
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 24, 2023
After PR containerd#8617, the create handler of containerd-shim-runc-v2
calls handleStarted() to record the init process and handle its exit.
The init process wouldn't normally quit so early, but if this scenario
occurs, handleStarted() calls handleProcessExit(), which deadlocks
because create() has already acquired s.mu and handleProcessExit() will
try to lock it again.

So I added a handleProcessExitNoLock() function which does not lock
s.mu. It can safely be called from the create handler without deadlock.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
cyyzero added a commit to cyyzero/containerd that referenced this pull request Sep 27, 2023
After PR containerd#8617, the create handler of containerd-shim-runc-v2
calls handleStarted() to record the init process and handle its exit.
The init process wouldn't normally quit so early, but if this scenario
occurs, handleStarted() calls handleProcessExit(), which deadlocks
because create() has already acquired s.mu and handleProcessExit() will
try to lock it again.

So I added a muLocked parameter to handleStarted to indicate whether
s.mu is currently locked, which decides whether to lock it when calling
handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
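For illustration, a minimal sketch of the muLocked idea (hypothetical code under assumed names, not the actual patch): the caller tells handleStarted whether it already holds s.mu, so the exit path can skip re-locking.

```go
package main

import "sync"

type service struct {
	mu sync.Mutex
}

// handleProcessExitLocked assumes the caller already holds s.mu.
func (s *service) handleProcessExitLocked() {
	// ... publish the exit event, update state ...
}

// handleProcessExit is the locking wrapper used when s.mu is not yet held.
func (s *service) handleProcessExit() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.handleProcessExitLocked()
}

// handleStarted takes muLocked so it knows whether it may lock s.mu itself.
func (s *service) handleStarted(alreadyExited, muLocked bool) {
	if !alreadyExited {
		return
	}
	if muLocked {
		s.handleProcessExitLocked() // caller (e.g. create) already holds s.mu
	} else {
		s.handleProcessExit()
	}
}

func (s *service) create() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.handleStarted(true, true) // no deadlock: we tell it s.mu is already held
}

func main() {
	(&service{}).create()
}
```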
cyyzero added a commit to cyyzero/containerd that referenced this pull request Oct 1, 2023
After PR containerd#8617, the create handler of containerd-shim-runc-v2
calls handleStarted() to record the init process and handle its exit.
The init process wouldn't normally quit so early, but if this scenario
occurs, handleStarted() calls handleProcessExit(), which deadlocks
because create() has already acquired s.mu and handleProcessExit() will
try to lock it again.

For historical reasons, the use of s.mu is a bit confusing: sometimes
it protects s.containers only, and sometimes it serves as a mutex for
whole functions.

According to the analysis by @corhere in containerd#9103:

Locking s.mu in the Create() handler was introduced in containerd#1452 as a
solution for the missed early-exits problem: Create() holds the mutex to
block checkProcesses() from handling exit events until after Create()
has added the container process to the s.processes map. This locking
logic was copied into the v2 shim implementation. containerd#8617 solves the same
problem using a strategy that does not rely on mutual exclusion between
container create and exit-event handling.

As for Shutdown(), the implementation was added in containerd#3004. In
this initial implementation the mutex is locked to safely access
s.containers; it is not unlocked because nothing matters after
os.Exit(0). Then ae87730 changed Shutdown() to return instead of
exiting, followed by containerd#4988 to unlock upon return. If the intention is
to block containers from being created while the shim's task service is
shutting down, locking a mutex is a poor solution since it makes the
create requests hang instead of returning an error, and is ineffective
after Shutdown() returns as the mutex is unlocked.

If Create() or handleProcessExit() runs again after Shutdown() has
unlocked s.mu and returned, the shim service will panic by sending on
a closed channel.

So I removed the unnecessary lock guards in service and renamed mu to
containersMu to make it clear that it protects s.containers only.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
thaJeztah pushed a commit to thaJeztah/containerd that referenced this pull request Oct 10, 2023
After PR containerd#8617, the create handler of containerd-shim-runc-v2
calls handleStarted() to record the init process and handle its exit.
The init process wouldn't normally quit so early, but if this scenario
occurs, handleStarted() calls handleProcessExit(), which deadlocks
because create() has already acquired s.mu and handleProcessExit() will
try to lock it again.

So I added a muLocked parameter to handleStarted to indicate whether
s.mu is currently locked, which decides whether to lock it when calling
handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 68dd47e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024
Labels
cherry-picked/1.6.x (PR commits are cherry-picked into release/1.6 branch)
cherry-picked/1.7.x (PR commits are cherry-picked into release/1.7 branch)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concurrent task execs take significantly longer than sequentially
10 participants