
Add support for lock debugging #18796

Merged
merged 5 commits into containers:main from lock_debugging on Jun 7, 2023

Conversation

@mheon (Member) commented Jun 5, 2023

Deadlocks are among the most annoying parts of Podman to debug (second only to the intricacies of the REST attach endpoints, in my opinion). Part of this is that there wasn't really a good way to tell what was going on - all our locks are in-memory, so an strace just shows futex calls on inscrutable memory addresses (which blend into the futex calls that the Go runtime is doing on its own, making things extra fun).

This PR attempts to remedy this in 3 ways:

  • Add the number of the lock used by a container/pod/volume to the output of podman inspect - allows for quick spot checks to verify that numbers assigned look sane
  • Add the number of available locks to podman info - so we can easily see if the system has run entirely out of locks (usually because folks forget to prune volumes)
  • Add a new, hidden debug command, podman system locks, to check for potential deadlocks due to duplicate lock assignment and identify any locks that are currently in use. This is hidden because the output isn't really meant for users, but for developers attempting to debug a deadlock.
Release note: The number of free locks available is now displayed in `podman info` to help debug lock exhaustion scenarios.
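For the first item, the new field can be queried directly through the inspect format (`.LockNumber` is the field documented later in this PR; the container name and output below are hypothetical):

$ podman inspect --format '{{.LockNumber}}' some-container
3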

mheon added 3 commits June 5, 2023 12:28
Being able to easily identify what lock has been allocated to a
given Libpod object is only somewhat useful for debugging lock
issues, but it's trivial to expose and I don't see any harm in
doing so.

Signed-off-by: Matt Heon <mheon@redhat.com>
This is a nice quality-of-life change that should help to debug
situations where someone runs out of locks (usually when a bunch
of unused volumes accumulate).

Signed-off-by: Matt Heon <mheon@redhat.com>
This is a general debug command that identifies any lock
conflicts that could lead to a deadlock. It's only intended for
Libpod developers (while it does tell you if you need to run
`podman system renumber`, you should never have to do that
anyways, and the next commit will include a lot more technical
info in the output that no one except a Libpod dev will want).
Hence, hidden command, and only implemented for the local driver
(recommend just running it by SSHing into a `podman machine` VM
in the unlikely case it's needed by remote Podman).

These conflicts should normally never happen, but having a
command like this is useful for debugging deadlock conditions
when they do occur.

Signed-off-by: Matt Heon <mheon@redhat.com>
@openshift-ci bot (Contributor) commented Jun 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 5, 2023.
@mheon (Member, Author) commented Jun 5, 2023

Sample output from podman system locks on a system that has nothing wrong, but is creating/removing a lot of containers:

No lock conflicts have been detected, system safe from deadlocks.

Lock 0 is presently being held

If there were an actual conflict, it would be listed, and the output would note that `podman system renumber` should be used. Any additional locks being held would be listed below the "Lock 0" line.

@robbmanes (Contributor) commented:
😍❤️

$ id -u
1000

$ ./bin/podman system locks
No lock conflicts have been detected, system safe from deadlocks.

$ ./bin/podman info | grep -i FreeLocks
  freeLocks: 1943

$ for i in {1..10000}; do podman volume create vol$i; done
[...]
Error: allocating lock for new volume: allocation failed; exceeded num_locks (2048)

$ ./bin/podman info | grep -i FreeLocks
  freeLocks: 0

@mheon force-pushed the lock_debugging branch 2 times, most recently from 451a962 to 40eed7c, on June 5, 2023 20:52
To debug a deadlock, we really want to know what lock is actually
locked, so we can figure out what is using that lock. This PR
adds support for this, using trylock to check whether every lock
on the system is free or in use. It will need to be run a few
times in quick succession to verify that a lock is not just
transiently held but actually stuck, but that's not a big deal.

Signed-off-by: Matt Heon <mheon@redhat.com>
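The real implementation probes Podman's SHM-backed locks, but the trylock idea generalizes. A minimal, self-contained Go sketch using sync.Mutex.TryLock (Go 1.18+); the lock array and names here are illustrative, not Podman's actual code:

package main

import (
	"fmt"
	"sync"
)

// lockInUse probes a lock without blocking: if TryLock succeeds, the
// lock was free, so release it and report false; otherwise another
// holder has it.
func lockInUse(m *sync.Mutex) bool {
	if m.TryLock() {
		m.Unlock()
		return false
	}
	return true
}

func main() {
	locks := make([]sync.Mutex, 4)
	locks[2].Lock() // simulate a container currently holding lock 2

	for i := range locks {
		if lockInUse(&locks[i]) {
			fmt.Printf("Lock %d is presently being held\n", i)
		}
	}
}

As the commit message notes, a single probe is only a snapshot; repeating it in quick succession is what distinguishes a transiently held lock from a stuck one.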
	if len(report.LockConflicts) > 0 {
		fmt.Printf("\nLock conflicts have been detected. Recommend immediate use of `podman system renumber` to resolve.\n\n")
	} else {
		fmt.Printf("\nNo lock conflicts have been detected, system safe from deadlocks.\n\n")
	}
Review comment (Member):

As much as I understand this wording, I would not claim "system safe from deadlocks"; this only checks for SHM lock conflicts. We can still have ABBA deadlocks, or any other deadlock between different kinds of locks, such as mutexes, Go channels, WaitGroups, or even c/storage locks.

	for (i = 0; i < shm->num_bitmaps; i++) {
		// Short-circuit to catch fully-empty bitmaps quick.
		if (shm->locks[i].bitmap == 0) {
			free_locks += 32;
Review comment (Member):

should this be s/32/sizeof(bitmap_t)/? I mean it works but this would make it more clear where 32 is coming from.

			count++;
		}

		free_locks += 32 - count;
Review comment (Member):

same here
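Both review comments concern the same hardcoded width. For reference, the counting logic can be sketched standalone; this Go version assumes 32-bit bitmaps as in the C code above, with math/bits.OnesCount32 standing in for the manual bit-counting loop:

package main

import (
	"fmt"
	"math/bits"
)

// bitsPerBitmap is the hardcoded 32 the reviewer flagged: it must
// match the width of a single bitmap entry.
const bitsPerBitmap = 32

// countFreeLocks tallies unset bits across all lock bitmaps. A zero
// bitmap short-circuits to a full bitsPerBitmap free locks, mirroring
// the C fast path above.
func countFreeLocks(bitmaps []uint32) int {
	free := 0
	for _, bm := range bitmaps {
		if bm == 0 {
			free += bitsPerBitmap
			continue
		}
		free += bitsPerBitmap - bits.OnesCount32(bm)
	}
	return free
}

func main() {
	// 32 free + 24 free + 0 free = 56
	fmt.Println(countFreeLocks([]uint32{0x0, 0xFF, 0xFFFFFFFF}))
}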

Comment on lines 1209 to 1214
	locksArr, ok := locksInUse[lockNum]
	if ok {
		locksInUse[lockNum] = append(locksArr, ctrString)
	} else {
		locksInUse[lockNum] = []string{ctrString}
	}
Review comment (Member):

this can really just be simplified to `locksInUse[lockNum] = append(locksInUse[lockNum], ctrString)`; same for the other two branches for pods/volumes
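The suggestion works because indexing a Go map with a missing key yields the value type's zero value, and append treats a nil slice as empty. A small self-contained demonstration (the key and strings are illustrative):

package main

import "fmt"

func main() {
	locksInUse := make(map[uint32][]string)

	// No existence check needed: a missing key yields a nil slice,
	// and append allocates fresh backing storage for it.
	locksInUse[7] = append(locksInUse[7], "container a")
	locksInUse[7] = append(locksInUse[7], "container b")

	fmt.Println(locksInUse[7]) // [container a container b]
}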

test/e2e/info_test.go: review thread resolved
The inspect format for `.LockNumber` needed to be documented.

Signed-off-by: Matt Heon <mheon@redhat.com>
@mheon (Member, Author) commented Jun 6, 2023

Restarted two tests that looked like flakes. Green otherwise.

@containers/podman-maintainers PTAL

@Luap99 (Member) left a comment

LGTM

@rhatdan (Member) commented Jun 7, 2023

/lgtm

The openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Jun 7, 2023.
@openshift-merge-robot merged commit 76f4571 into containers:main on Jun 7, 2023.
87 checks passed.
The github-actions bot added the "locked - please file new issue/PR" label and locked the conversation as resolved, limiting it to collaborators, on Sep 6, 2023.
Labels: approved, lgtm, locked - please file new issue/PR, release-note
5 participants