tests/basic: useradd: lock /etc/group.lock already used by PID 4 #1250
hit this over in CI for coreos/coreos-assembler#3010.
This seems to happen more commonly when we change the groups definition. But I've been working on tweaking/changing the groups definition for a couple of weeks, and never observed this locally.
Some more notes:
Should we (temporarily) work around that in a post-process script that makes sure this lock file does not exist in the final image?
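For illustration, a cleanup of that kind could be as small as the sketch below; the tree path and the set of files are assumptions, not an existing coreos-assembler hook, and the per-PID temp files (e.g. /etc/group.4) are not handled here.

```bash
#!/bin/bash
# Hypothetical post-process snippet: drop any stray shadow-utils lock files
# left behind in the composed tree before the image is sealed.
set -euo pipefail

tree="${1:?usage: $0 <rootfs>}"
for db in passwd shadow group gshadow; do
    rm -f "${tree}/etc/${db}.lock"
done
```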
That won't fix it. The failures happen intermittently at various times. If the lock file got left in the image that was created, then the failures would consistently happen during Ignition when the users and groups are created. It appears the file can get left behind by the compose process or by an interrupted invocation.
OK. Maybe rpm-ostree could make sure that this file does not exist after each step in a compose?
Luca's working on useradd/groupadd wrappers, so maybe that could be part of that.
Not really, the first PR was coreos/rpm-ostree#3778 and it merged on July 27th.
This could also be from a kernel or FUSE regression (via the FUSE layer used during composes).
Sorry, I meant that the work you've done around that might help us come up with a workaround for this one.
Peanut gallery here: we've seen this in a non-Fedora distribution and it became so problematic that we added a 3-count retry to the useradd call in Ignition as a temporary mitigation. We tried various efforts to determine what could be causing this, including setting up auditd to watch file access in the initramfs, but we never found a root cause. Without our local workaround, it occurs nearly 10% of the time when launching instances. With the workaround, we've never seen it crop up again.
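Translated to a small shell wrapper, that mitigation is roughly the following sketch; the actual change lives in their Ignition code, and the retry count and sleep are illustrative.

```bash
#!/bin/bash
# Sketch of a bounded retry around useradd; all arguments are passed through.
max_attempts=3
for ((i = 1; i <= max_attempts; i++)); do
    if useradd "$@"; then
        exit 0
    fi
    echo "useradd failed (attempt ${i}/${max_attempts}), retrying..." >&2
    sleep 1
done
echo "useradd still failing after ${max_attempts} attempts" >&2
exit 1
```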
Thanks for commenting! I think what's interesting here is that this is happening both at build time (rpm-ostree compose tree) and inside the initramfs on instances; this strongly seems to point to a bug/regression in shadow-utils.
I have access to a machine where this was observed. I've been trying to observe this by running composes in a loop. I did notice an FD leak, now tracked at coreos/rpm-ostree#3962, but I don't think it is in any way related to this bug. I'm now having a look at a standalone reproducer.
I'm still not sure about the exact failure mode, but I'm leaving some in-progress notes here for my future self. I do suspect this is related to hardlinks, but I couldn't pinpoint it for sure yet. Starting from the first visible error log (the `lock /etc/group.lock already used by PID 4` message) and reading through the locking logic, it seems there are a couple of ways a leftover or contended lock file could trigger this. Overall this logic is a bit sketchy in several regards, and it doesn't use FDs for its operations. Unless we manage to reproduce it somehow, I think the best course of action for the moment is to augment those error paths with more debugging details.
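For context, the locking scheme in question is the classic link(2)-based protocol. A simplified shell rendition of the idea looks like the sketch below; the real implementation is C in shadow's lib/commonio.c and does more than this, so treat it only as an outline of the moving parts.

```bash
#!/bin/bash
# Outline of the link(2)-style lock protocol used for /etc/group and friends.
db="/etc/group"
lock="${db}.lock"
tmp="${db}.$$"                     # per-PID temp file, e.g. /etc/group.1234

echo $$ > "$tmp"                   # record our PID in the temp file
if ln "$tmp" "$lock" 2>/dev/null; then
    echo "locked ${db}"
else
    other=$(cat "$lock" 2>/dev/null)
    if [ -n "$other" ] && ! kill -0 "$other" 2>/dev/null; then
        rm -f "$lock"              # holder is gone: reclaim the stale lock
        ln "$tmp" "$lock" && echo "reclaimed stale lock"
    else
        echo "lock ${lock} already used by PID ${other}" >&2
    fi
fi
rm -f "$tmp"
```

A leftover .lock file in the tree, or a recorded PID that happens to belong to a live but unrelated process, ends up in the `already used by PID` branch, which is the message seen in the failures above.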
Hit the #1265 version of this in bump-lockfile#44.
I'm pretty reliably hitting this on a non-Fedora platform, but the printf I added is not being hit.
I also threw in another printf, but that's not hitting either.
Hopefully this helps create a testcase; I'll update the thread if I find anything interesting.
Added more debug, and the failure is at https://github.com/shadow-maint/shadow/blob/master/lib/commonio.c#L423. I inserted a printf just before that point; forgive the simple names, but the traces detail how it reaches that failure.
@mark-au it sounds like you may be hitting a different problem than the one I was chasing. Also, the lckpwdf-based flow does not retry on failures, and I don't think glibc retries internally either.
[rawhide][s390x] run-202 hit the same failure as in issue 1265, FYI.
I ran groupadd in a 3x loop with a 20-second pause in between, and saw no improvement.
@mark-au at this point I believe you are really chasing a different bug, possibly the same one that @justinkinney saw. For that new ticket, it may be interesting to distill a minimal Butane or Ignition config with just the disk setup and the user/group manipulation. If we can observe the behavior in a fresh VM, it should become easier to track down.
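As a starting point for that, something like the sketch below could work; the user/group names are placeholders, and the disk-setup portion of the failing config would still need to be copied in.

```bash
# Hypothetical minimal Butane config, compiled to Ignition for a fresh VM.
cat > minimal.bu <<'EOF'
variant: fcos
version: 1.4.0
passwd:
  groups:
    - name: testgroup
  users:
    - name: testuser
      groups:
        - testgroup
EOF
butane --pretty --strict minimal.bu > minimal.ign
```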
shadow-maint/shadow#562 merged; once it lands in Fedora, it should shed some more light on what's going on.
Seeing a pretty weird instance of this as well.
We also saw this in build-arch#395. Just noting here that the logs can be misleading: when I originally looked at the logs I was focusing on a later error, but that is just a symptom of the original problem earlier in the log file.
Saw this in build-arch#398, which leads to the kola basic tests failing in Ignition with the same lock error.
I believe we are seeing the same failure here.
Could we backport this?
Saw a similar failure.
Seen today in the SCOS Tekton build pipeline as well.
I'm seeing this in FC37 Kinoite today for brlapi.
Me too, but we should probably file a bug on Kinoite rather than replying to one on CoreOS. The appropriate bug tracker is https://pagure.io/fedora-kde/SIG/issues?tags=kinoite
Hitting this again today in multiple runs of the SCOS pipeline.
The non-fatal third time showed the same error.
This is at least in https://github.com/shadow-maint/shadow/releases/tag/4.13, which is in F38/rawhide now.
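For anyone checking whether a given build host or composed tree already carries that release, a quick query along these lines should do (the Fedora package name is shadow-utils; the root path is illustrative):

```bash
# Version on the build host itself.
rpm -q shadow-utils
# Version inside an already-composed/deployed tree (path is a placeholder).
rpm -q --root /path/to/rootfs shadow-utils
```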
The openvswitch %pre scriptlet adds the openvswitch user to the hugetlbfs group. Since %pre runs without `set -e` by default, the failures are ignored, resulting in worker nodes that do not come online during a cluster install. This seems to be happening only on RHCOS based on RHEL 9.2 on ppc64le. These errors are showing up during the rpm-ostree compose:
14:30:05 openvswitch3.1.prein: usermod.rpmostreesave: /etc/passwd.6: lock file already used
14:30:05 openvswitch3.1.prein: usermod.rpmostreesave: cannot lock /etc/passwd; try again later.
xref: coreos/fedora-coreos-tracker#1250
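For comparison, a more defensive %pre along the lines of the sketch below would at least surface the failure instead of silently ignoring it; the retry count, sleep, and exact commands are assumptions, not the actual openvswitch scriptlet.

```bash
# Hypothetical hardened %pre body: retry the group membership change and
# fail loudly rather than letting a transient lock error pass unnoticed.
getent group hugetlbfs >/dev/null || groupadd -r hugetlbfs

for attempt in 1 2 3; do
    if usermod -a -G hugetlbfs openvswitch; then
        exit 0
    fi
    echo "usermod failed (attempt ${attempt}/3), retrying..." >&2
    sleep 2
done
echo "could not add openvswitch to hugetlbfs" >&2
exit 1
```

Note that a non-zero exit from %pre aborts the package install, so whether failing hard is preferable to retrying silently is a separate trade-off.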
I'm again seeing this on builds on a system where the disk I/O is very slow, and so far it has happened in every run. If anyone would like access to the system to reproduce this, please let me know.
There seems to be a build flake which manifests itself as a failure in the kola basic tests, with the console showing the lock error. The actual failure, though, seems to be happening at compose time, where something is going wrong with the `groupadd` calls in the scriptlets.