tests/basic: useradd: lock /etc/group.lock already used by PID 4 #1250

Open
lucab opened this issue Jul 11, 2022 · 40 comments

@lucab
Contributor

lucab commented Jul 11, 2022

There seems to be a build flake which manifests as a test failure like this:

=== RUN   basic
--- FAIL: basic (14.11s)
        harness.go:1452: mach.Start() failed: machine 81ddc678-5017-4641-97d9-5b3192d9810e entered emergency.target in initramfs

and the console showing:

[    7.818420] ignition[948]: CRITICAL : files: ensureUsers: op(1): [failed]   creating or modifying user "core":
exit status 10:
Cmd: "useradd" "--root" "/sysroot" "--create-home" "--password" "*" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-journal,wheel" "core"
Stdout: ""
Stderr: "useradd: lock /etc/group.lock already used by PID 4\nuseradd: cannot lock /etc/group; try again later.\n"

The actual failure, though, seems to happen at compose time, where something goes wrong with the groupadd calls in the scriptlets:

[2022-07-11T13:16:29.346Z] Running pre scripts...20 done
[2022-07-11T13:17:15.931Z] clevis.prein: groupadd: /etc/group.4: lock file already used
[2022-07-11T13:17:15.931Z] clevis.prein: groupadd: cannot lock /etc/group; try again later.
[2022-07-11T13:17:15.931Z] clevis.prein: useradd: group 'clevis' does not exist
[2022-07-11T13:17:15.931Z] clevis.prein: groups: 'clevis': no such user
[2022-07-11T13:17:15.931Z] moby-engine.prein: groupadd: lock /etc/group.lock already used by PID 4
[2022-07-11T13:17:15.931Z] moby-engine.prein: groupadd: cannot lock /etc/group; try again later.
@dustymabe
Member

Hit this over in CI for coreos/coreos-assembler#3010.

[    8.379743] ignition[862]: CRITICAL : files: ensureUsers: op(1): [failed]   creating or modifying user "core": exit status 10:
Cmd: "useradd" "--root" "/sysroot" "--create-home" "--password" "*" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-journal,wheel" "core"
Stdout: ""
Stderr: "useradd: lock /etc/group.lock already used by PID 4\nuseradd: cannot lock /etc/group; try again later.\n"

@lucab
Contributor Author

lucab commented Jul 28, 2022

This seems to happen more commonly when we change the manifest/group content, last occurrence was coreos/fedora-coreos-config#1866.

But I've been tweaking/changing group definitions for a couple of weeks, and never observed this locally.

@lucab
Contributor Author

lucab commented Aug 4, 2022

Some more notes:

  • this seems to happen more frequently (or only?) on s390x
  • we have observed this failing in ext.config.extensions.module too, with:
Aug  4 01:40:51.094420 rpm-ostree(nginx-filesystem.prein)[1945]: groupadd: /etc/group.4: lock file already used
Aug  4 01:40:51.098809 rpm-ostree(nginx-filesystem.prein)[1945]: groupadd: cannot lock /etc/group; try again later.
Aug  4 01:40:51.130761 rpm-ostree(nginx-filesystem.prein)[1949]: useradd: group 'nginx' does not exist
  • it has now been observed as a flake, i.e. it passed in the automatic rerun in the same job
  • a change in manifests/passwd may have triggered it as well, even though we didn't touch group entries (and there were no errors at compose time)

@travier
Member

travier commented Aug 11, 2022

Should we (temporarily) work around this with a post-process script that makes sure the lock file does not exist in the final image?

@dustymabe
Member

That won't fix it. The failures happen intermittently at various times. If the lock file were left behind in the created image, the failures would consistently happen during Ignition when the core user is created.

It appears the file can get left behind by the compose process or when an rpm-ostree transaction is taking place (like overlaying RPMs).

@travier
Member

travier commented Aug 11, 2022

OK. Maybe rpm-ostree could make sure that this file does not exist after each step in a compose?

@travier
Member

travier commented Aug 11, 2022

Luca"s working on useradd/groupadd wrappers so maybe that could be part of that.

@lucab
Contributor Author

lucab commented Aug 11, 2022

Luca"s working on useradd/groupadd wrappers so maybe that could be part of that.

Not really, the first PR was coreos/rpm-ostree#3778 and it merged on July 27th.
That's two weeks after this ticket was opened.

@jlebon
Member

jlebon commented Aug 11, 2022

This could also be from a kernel or FUSE regression (via rofiles-fuse).

@travier
Member

travier commented Aug 11, 2022

Not really, the first PR was coreos/rpm-ostree#3778 and it merged on July 27th. That's two weeks after this ticket was opened.

Sorry, I meant that the work you've done in that area might help us come up with a workaround for this one.

@justinkinney

justinkinney commented Aug 11, 2022

Peanut gallery here: we've seen this in a non-Fedora distribution, and it became so problematic that we added a 3-count retry to the useradd call in Ignition as a temporary mitigation. We made various attempts to determine what could be causing this, including setting up auditd to watch file access in the initramfs, but we never found a root cause.

Without our local workaround, it occurs nearly 10% of the time when launching instances. With the workaround, we've never seen it crop up again.

@cgwalters
Member

Thanks for commenting! I think what's interesting here is that this is happening both at build time (rpm-ostree compose tree) and inside the initramfs on instances; that strongly points to a bug/regression in shadow-utils.

@lucab
Contributor Author

lucab commented Aug 25, 2022

I have access to an s390x node running FCOS (testing, 36.20220806.2.0 with rpm-ostree-2022.12-2).

I've been trying to observe this by running a loop of rpm-ostree install unbound-anchor && rpm-ostree install nginx && rpm-ostree cleanup -p (two packages adding users/groups not present in the base OS). After letting it run for ~6 hours, I haven't observed any kind of groupadd/useradd-related failure.

I did notice an FD leakage, now tracked at coreos/rpm-ostree#3962, but I don't think it is in any way related to this bug.

I'm now having a look at standalone groupadd behavior and its source code, to see if I can get a good guess at what's happening.

@lucab
Contributor Author

lucab commented Aug 26, 2022

I'm still not sure about the exact failure mode, but I'm leaving some in-progress notes here for my future self. I do suspect this is related to hardlinks, but I couldn't yet pin that down for sure.

Looking at the first visible error (/etc/group.4: lock file already used), it means that there was a successful call to link(file, lock), but the subsequent check_link_count(file) logic failed:

	/* Hardlink the per-PID temp file (e.g. /etc/group.4) to the lock file;
	 * on success, verify via the link count that we really own the lock,
	 * then drop the temp file. */
	if (link (file, lock) == 0) {
		retval = check_link_count (file);
		if ((0==retval) && log) {
			(void) fprintf (shadow_logfd,
			                "%s: %s: lock file already used\n",
			                shadow_progname, file);
		}
		unlink (file);
		return retval;
	}

/* Returns 1 only if `file` now has exactly two hardlinks (the temp file
 * itself plus the lock we just created); 0 on stat() failure or any
 * other link count. */
static int check_link_count (const char *file)
{
	struct stat sb;

	if (stat (file, &sb) != 0) {
		return 0;
	}
	if (sb.st_nlink != 2) {
		return 0;
	}
	return 1;
}

So it seems that either:

  1. the link() call succeeded, but the subsequent stat() call failed
  2. the link() call succeeded, but the link count is not the expected value 2

Overall this logic is a bit sketchy in several regards, and it doesn't use FDs for its operations.
I'm not exactly sure why it double-checks the link count after linking, and maybe it would benefit from an fsync().

Unless we manage to reproduce it somehow, I think the best course of action for the moment is to augment the above check_link_count logic to log the specific check which failed.
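
As a rough sketch of that (this is not upstream's actual change, and the extra message wording is my own assumption; it needs <errno.h>, <stdio.h>, <string.h>, and <sys/stat.h>, and reuses the shadow-internal shadow_logfd/shadow_progname globals seen above):

static int check_link_count (const char *file)
{
	struct stat sb;

	if (stat (file, &sb) != 0) {
		/* Check 1 failed: link() succeeded but the file can't be stat()ed. */
		(void) fprintf (shadow_logfd, "%s: %s: stat failed: %s\n",
		                shadow_progname, file, strerror (errno));
		return 0;
	}
	if (sb.st_nlink != 2) {
		/* Check 2 failed: the link count isn't the expected 2. */
		(void) fprintf (shadow_logfd, "%s: %s: unexpected link count %lu\n",
		                shadow_progname, file, (unsigned long) sb.st_nlink);
		return 0;
	}
	return 1;
}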

@dustymabe
Member

Hit the #1265 version of this in bump-lockfile#44.

@mark-au

mark-au commented Sep 8, 2022

I'm pretty reliably hitting this on a non-Fedora platform, but the .ign is too large to easily sanitize. I'll do that if I don't make progress, but for the time being I'll share some informal observations.

  • The group to be created is defined in the Ignition config's passwd.groups; I see the original issue came from an RPM, and @lucab was testing that approach. There are also passwd.users entries, one of which uses the group created in passwd.groups.
  • I saw it once or twice in ~40 provisions when disks were non-mirrored. After switching to md raid1 and mounting by /dev/md/ instead of /dev/disk/by-partlabel, I'm seeing a 60% failure rate. ignition-disks finishes without error, and while (as expected) the resync isn't complete when ignition-files starts, the filesystems are mounted read-write.
  • Like @justinkinney, I have a loop to retry groupadd.
  • I've been trying to use @lucab's shadow branch to gather more info, but I'm not seeing anything other than the expected:
Sep 07 09:10:44 localhost systemd[1]: Starting Ignition (files)...
Sep 07 09:10:44 localhost ignition[2537]: INFO     : Ignition 2.14.0
Sep 07 09:10:44 localhost ignition[2537]: INFO     : Stage: files
Sep 07 09:10:44 localhost ignition[2537]: INFO     : no config dir at "/usr/lib/ignition/base.d"
Sep 07 09:10:44 localhost ignition[2537]: INFO     : no config dir at "/usr/lib/ignition/base.platform.d/file"
Sep 07 09:10:44 localhost ignition[2537]: INFO     : files: ensureGroups: op(1): [started]  adding group "examplegroupname"
Sep 07 09:10:44 localhost ignition[2537]: DEBUG    : files: ensureGroups: op(1): executing: "groupadd" "--root" "/sysroot" "--gid" "25" "--password" "*" "examplegroupname"
Sep 07 09:10:44 localhost ignition[2537]: CRITICAL : files: ensureGroups: op(1): [failed]   adding group "examplegroupname": exit status 10: Cmd: "groupadd" "--root" "/sysroot" "--gid" "25" "--password" "*" "examplegroupname" Stdout: "" Stderr: "groupadd: cannot lock /etc/group; try again later.\n"
Sep 07 09:10:48 localhost ignition[2537]: INFO     : files: ensureGroups: op(2): [started]  adding group "examplegroupname"
Sep 07 09:10:48 localhost ignition[2537]: DEBUG    : files: ensureGroups: op(2): executing: "groupadd" "--root" "/sysroot" "--gid" "25" "--password" "*" "examplegroupname"
Sep 07 09:10:48 localhost ignition[2537]: CRITICAL : files: ensureGroups: op(2): [failed]   adding group "examplegroupname": exit status 10: Cmd: "groupadd" "--root" "/sysroot" "--gid" "25" "--password" "*" "examplegroupname" Stdout: "" Stderr: "groupadd: cannot lock /etc/group; try again later.\n"
Sep 07 09:10:58 localhost ignition[2537]: INFO     : files: ensureGroups: op(3): [started]  adding group "examplegroupname"
Sep 07 09:10:58 localhost ignition[2537]: DEBUG    : files: ensureGroups: op(3): executing: "groupadd" "--root" "/sysroot" "--gid" "25" "--password" "*" "examplegroupname"
Sep 07 09:10:58 localhost ignition[2537]: CRITICAL : files: ensureGroups: op(3): [failed]   adding group "examplegroupname": exit status 10: Cmd: "groupadd" "--root" "/sysroot" "--gid" "25" "--password" "*" "examplegroupname" Stdout: "" Stderr: "groupadd: cannot lock /etc/group; try again later.\n"
Sep 07 09:10:58 localhost ignition[2037]: files failed

I also threw in another printf, but that's not hitting either:

diff --git a/lib/commonio.c b/lib/commonio.c
index f7fb1597..26390d3b 100644
--- a/lib/commonio.c
+++ b/lib/commonio.c
@@ -450,6 +450,7 @@ int commonio_lock (struct commonio_db *db)
 #define LOCK_SLEEP 1
 #endif
        for (i = 0; i < LOCK_TRIES; i++) {
+        (void) fprintf (shadow_logfd, "%s: try %i\n", shadow_progname, i);
                if (i > 0) {
                        sleep (LOCK_SLEEP);     /* delay between retries */
                }

Hopefully this helps create a test case; I'll update the thread if I find anything interesting.

@mark-au

mark-au commented Sep 8, 2022

Added more debugging; the failure is at https://github.com/shadow-maint/shadow/blob/master/lib/commonio.c#L423

If I insert a printf before the lckpwdf() comparison and another at line 432, I only see the one before.

@mark-au

mark-au commented Sep 8, 2022

Forgive the simple debug names, but this details how far it gets:

diff --git a/lib/commonio.c b/lib/commonio.c
index f7fb1597..53224934 100644
--- a/lib/commonio.c
+++ b/lib/commonio.c
@@ -406,20 +406,25 @@ cleanup_ENOMEM:
 int commonio_lock (struct commonio_db *db)
 {
 	int i;
-
+		(void) fprintf (shadow_logfd, "%s: entered commonio_lock\n",shadow_progname);
 #ifdef HAVE_LCKPWDF
+		(void) fprintf (shadow_logfd, "%s: HAVE_LCKPWDF \n",shadow_progname);
+
 	/*
 	 * Only if the system libc has a real lckpwdf() - the one from
 	 * lockpw.c calls us and would cause infinite recursion!
 	 * It is also not used with the prefix option.
 	 */
 	if (!db->setname) {
+			(void) fprintf (shadow_logfd, "%s: GOT_INTO\n",shadow_progname);
 		/*
 		 * Call lckpwdf() on the first lock.
 		 * If it succeeds, call *_lock() only once
 		 * (no retries, it should always succeed).
 		 */
 		if (0 == lock_count) {
+		   	(void) fprintf (shadow_logfd, "%s: COUNTZERO \n",shadow_progname);
+
 			if (lckpwdf () == -1) {
 				if (geteuid () != 0) {
 					(void) fprintf (shadow_logfd,
@@ -429,15 +434,21 @@ int commonio_lock (struct commonio_db *db)
 				return 0;	/* failure */
 			}
 		}
+		   	(void) fprintf (shadow_logfd, "%s: COUNTNOTZERO \n",shadow_progname);

 		if (commonio_lock_nowait (db, true) != 0) {
 			return 1;	/* success */
 		}
+		   	(void) fprintf (shadow_logfd, "%s: ALMOST_OUT_FUNC \n",shadow_progname);

 		ulckpwdf ();
 		return 0;		/* failure */
 	}
+			(void) fprintf (shadow_logfd, "%s: MISSED IT \n",shadow_progname);
+
 #endif				/* !HAVE_LCKPWDF */
+		(void) fprintf (shadow_logfd, "%s: PASSED HAVE_LCKPWDF \n",shadow_progname);


Sep 08 08:04:09 localhost ignition[2067]: CRITICAL : files: ensureGroups: op(1): [failed]   adding group "examplegroupname": exit status 10: Cmd: "groupadd" "--root" "/sysroot" "--gid" "35" "--password" "*" "examplegroupname" Stdout: "groupadd: Entering open_files.\n" Stderr: "groupadd: entered commonio_lock\ngroupadd: HAVE_LCKPWDF \ngroupadd: GOT_INTO \ngroupadd: COUNTZERO \ngroupadd: cannot lock /etc/group; try again later.\n"

@lucab
Contributor Author

lucab commented Sep 8, 2022

@mark-au it sounds like you may be hitting a different problem than what I was chasing.
It looks like in your case lckpwdf() is somehow failing; it may be good to log the errno reason when that happens to understand what is going on.

Also, the lckpwdf-based flow does not retry on failures, and I don't think glibc retries internally either.
Adding some retry logic there will likely fix your issue.
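
As a rough illustration of that idea (a sketch only, not actual shadow code; the function name and the errno logging format are mine):

/* Needs <errno.h>, <shadow.h> (for lckpwdf()), <stdio.h>, <string.h>,
 * and <unistd.h> (for sleep()). */
#define LCK_TRIES 3
#define LCK_SLEEP 1	/* seconds between attempts */

static int lckpwdf_retry (void)
{
	int i;

	for (i = 0; i < LCK_TRIES; i++) {
		if (i > 0) {
			sleep (LCK_SLEEP);	/* delay between retries */
		}
		if (lckpwdf () == 0) {
			return 0;	/* lock acquired */
		}
		(void) fprintf (stderr, "lckpwdf attempt %d failed: %s\n",
		                i + 1, strerror (errno));
	}
	return -1;	/* still locked after all attempts */
}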

@LakshmiRavichandran1

[rawhide][s390x] run-202 hit the same failure as in issue 1265

FYI, ext.config.extensions.package failed:

Sep  7 13:22:10.470338 rpm-ostree(unbound-anchor.prein)[2070]: groupadd.rpmostreesave: /etc/group.4: lock file already used
Sep  7 13:22:10.476582 rpm-ostree(unbound-anchor.prein)[2070]: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
Sep  7 13:22:10.545888 rpm-ostree(unbound-anchor.prein)[2079]: useradd.rpmostreesave: group 'unbound' does not exist

@mark-au

mark-au commented Sep 9, 2022

I ran groupadd in a 3x loop with a 20-second pause between attempts, with no improvement.
Added more debugging around
https://github.com/shadow-maint/shadow/blob/master/lib/commonio.c#L423-L429

  • lckpwdf() returns -1
  • geteuid() is 0

which means it falls into the failure path at line 429. Looking into errno now.

@mark-au

mark-au commented Sep 9, 2022

errno is curious: Read-only file system

@lucab
Contributor Author

lucab commented Sep 9, 2022

@mark-au at this point I believe you are really chasing a different bug, possibly the same one @justinkinney saw.
I'd recommend opening a new ticket dedicated to that.

For that new ticket, it may be interesting to distill a minimal Butane or Ignition config with just the disk setup and the user/group manipulation (see the sketch below). If we can observe the behavior in a fresh VM, it should become easier to track down.
Also, if the -EROFS is a persistent failure, then it may be worth checking the console logs and the journal in case there is a filesystem / memory corruption that the kernel is encountering.
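
Purely as a starting point, a hypothetical minimal Butane sketch (the variant/version, names, and gid are placeholders drawn from the logs above, not your actual config; the md raid1 storage setup would go alongside):

variant: fcos
version: 1.4.0
passwd:
  groups:
    - name: examplegroupname
      gid: 25
  users:
    - name: exampleuser
      groups:
        - examplegroupname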

@lucab
Contributor Author

lucab commented Sep 15, 2022

shadow-maint/shadow#562 has merged; once it lands in Fedora, it should shed some more light on what's going on.
Upstream wasn't really convinced by my guess at the reason for the failure, so there's no real logic change for now.

@dustymabe
Member

Seeing a pretty weird group error in bump-lockfile#90 on testing-devel:

[2022-09-17T01:50:45.381Z] error: While applying overrides for pkg libutempter: Could not find group 'utempter' in group file
[2022-09-17T01:50:45.382Z] libutempter.prein: groupadd.rpmostreesave: /etc/gshadow.3: lock file already used
[2022-09-17T01:50:45.382Z] libutempter.prein: groupadd.rpmostreesave: cannot lock /etc/gshadow; try again later.
[2022-09-17T01:50:45.382Z] libutempter.prein: groupadd.rpmostreesave: /etc/group.4: lock file already used
[2022-09-17T01:50:45.382Z] libutempter.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] dbus-broker.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] dbus-broker.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] openssh-server.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] openssh-server.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] dnsmasq.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] dnsmasq.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] dnsmasq.prein: useradd.rpmostreesave: group 'dnsmasq' does not exist
[2022-09-17T01:50:45.382Z] samba-common.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] samba-common.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] rpcbind.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] rpcbind.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] tpm2-tss.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] tpm2-tss.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] tpm2-tss.prein: useradd.rpmostreesave: group 'tss' does not exist
[2022-09-17T01:50:45.382Z] clevis.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] clevis.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] clevis.prein: useradd.rpmostreesave: group 'clevis' does not exist
[2022-09-17T01:50:45.382Z] moby-engine.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-17T01:50:45.382Z] moby-engine.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-17T01:50:45.382Z] error: failed to execute cmd-build: exit status 1

@dustymabe
Member

We also saw this in build-arch#395. Just noting here that the logs can be misleading: when I originally looked at them, I was focusing on:

Sep 26 22:20:53.556926 rpm-ostree[2636]: /usr/lib/tmpfiles.d/pkg-nginx-core.conf:1: Failed to resolve user 'nginx': No such process
Sep 26 22:20:53.558711 rpm-ostree[2636]: /usr/lib/tmpfiles.d/pkg-nginx-core.conf:2: Failed to resolve user 'nginx': No such process
Sep 26 22:20:53.559437 rpm-ostree[2636]: /usr/lib/tmpfiles.d/pkg-setup.conf:1: Duplicate line for path "/run/motd.d", ignoring.
Sep 26 22:20:53.560109 rpm-ostree[2636]: /usr/lib/tmpfiles.d/setup.conf:2: Duplicate line for path "/run/motd.d", ignoring.
Sep 26 22:20:53.561272 rpm-ostree[2636]: /usr/lib/tmpfiles.d/tmp.conf:12: Duplicate line for path "/var/tmp", ignoring.
Sep 26 22:20:53.561776 rpm-ostree[2636]: /usr/lib/tmpfiles.d/var.conf:14: Duplicate line for path "/var/log", ignoring.
Sep 26 22:20:53.561882 rpm-ostree[2636]: /usr/lib/tmpfiles.d/var.conf:19: Duplicate line for path "/var/cache", ignoring.
Sep 26 22:20:53.561916 rpm-ostree[2636]: /usr/lib/tmpfiles.d/var.conf:21: Duplicate line for path "/var/lib", ignoring.
Sep 26 22:20:53.561944 rpm-ostree[2636]: /usr/lib/tmpfiles.d/var.conf:23: Duplicate line for path "/var/spool", ignoring.
Sep 26 22:20:53.615267 systemd[1]: rpm-ostree-tmpfiles.service: Main process exited, code=exited, status=65/DATAERR
Sep 26 22:20:53.615391 systemd[1]: rpm-ostree-tmpfiles.service: Failed with result 'exit-code'.

But that is just a symptom of the original problem in the log file:

Sep 26 22:20:19.052413 rpm-ostree(nginx-filesystem.prein)[2128]: groupadd.rpmostreesave: /etc/group.4: lock file already used
Sep 26 22:20:19.056677 rpm-ostree(nginx-filesystem.prein)[2128]: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
Sep 26 22:20:19.122757 rpm-ostree(nginx-filesystem.prein)[2132]: useradd.rpmostreesave: group 'nginx' does not exist

@dustymabe
Member

Saw this in build-arch#398:

[2022-09-27T13:13:39.401Z] Running pre scripts...19 done
[2022-09-27T13:13:51.538Z] tpm2-tss.prein: useradd.rpmostreesave: /etc/group.12: lock file already used
[2022-09-27T13:13:51.538Z] tpm2-tss.prein: useradd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-27T13:13:51.538Z] openssh-server.prein: groupadd.rpmostreesave: /etc/group.4: lock file already used
[2022-09-27T13:13:51.538Z] openssh-server.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-27T13:13:51.538Z] clevis.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-27T13:13:51.538Z] clevis.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-27T13:13:51.538Z] clevis.prein: useradd.rpmostreesave: group 'clevis' does not exist
[2022-09-27T13:13:51.538Z] clevis.prein: groups: 'clevis': no such user
[2022-09-27T13:13:51.538Z] dnsmasq.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-27T13:13:51.538Z] dnsmasq.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-27T13:13:51.538Z] dnsmasq.prein: useradd.rpmostreesave: group 'dnsmasq' does not exist
[2022-09-27T13:13:51.538Z] samba-common.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-27T13:13:51.538Z] samba-common.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-27T13:13:51.538Z] rpcbind.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-27T13:13:51.538Z] rpcbind.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
[2022-09-27T13:13:51.538Z] moby-engine.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
[2022-09-27T13:13:51.538Z] moby-engine.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.

which led to the kola basic tests failing in Ignition with:

[    7.354272] ignition[920]: CRITICAL : Ignition failed: failed to create users/groups:
                 failed to configure users: failed to create user "core": exit status 10:
                 Cmd: "useradd" "--root" "/sysroot" "--create-home" "--password" "*" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-journal,wheel" "core"
                 Stdout: "" Stderr: "useradd: lock /etc/group.lock already used by PID 4\nuseradd: cannot lock /etc/group; try again later.\n"

@dustymabe
Member

I believe we are seeing the unbound-anchor version of this in build-arch#404.

@dustymabe
Member

shadow-maint/shadow#562 has merged; once it lands in Fedora, it should shed some more light on what's going on. Upstream wasn't really convinced by my guess at the reason for the failure, so there's no real logic change for now.

Could we backport this to rawhide so we can see if we get any more information there?

@dustymabe
Member

I believe we are seeing the unbound-anchor version of this in build-arch#444.

@dustymabe
Member

dustymabe commented Nov 14, 2022

Saw a similar unbound-anchor failure in coreos-ci (x86_64) today for a PR: link

@LorbusChris
Contributor

LorbusChris commented Nov 24, 2022

Seen today in the SCOS Tekton build pipeline:

Running %prein for unbound-libs: bwrap(/bin/sh): Child process killed by signal 6
unbound-libs.prein: groupadd.rpmostreesave: /etc/group.4: lock file already used
unbound-libs.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
unbound-libs.prein: useradd.rpmostreesave: group 'unbound' does not exist
error: failed to execute cmd-build: exit status 1

@tidux

tidux commented Feb 23, 2023

I'm seeing this in FC37 Kinoite today for brlapi.

@lua86

lua86 commented Feb 27, 2023

I'm seeing this in FC37 Kinoite today for brlapi.

Me too, but we should probably file a bug against Kinoite rather than replying to a CoreOS one. The appropriate bug tracker: https://pagure.io/fedora-kde/SIG/issues?tags=kinoite

@LorbusChris
Contributor

LorbusChris commented Feb 28, 2023

Hitting this again today in multiple runs of the SCOS pipeline:

error: Running %prein for unbound-libs: bwrap(/bin/sh): Child process killed by signal 6
unbound-libs.prein: groupadd.rpmostreesave: /etc/group.4: lock file already used
unbound-libs.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
unbound-libs.prein: useradd.rpmostreesave: group 'unbound' does not exist
error: failed to execute cmd-build: exit status 1

@LorbusChris
Contributor

A third, non-fatal occurrence showed:

glusterfs.prein: groupadd.rpmostreesave: /etc/group.4: lock file already used
glusterfs.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
glusterfs.prein: useradd.rpmostreesave: group 'gluster' does not exist
clevis.prein: groupadd.rpmostreesave: lock /etc/group.lock already used by PID 4
clevis.prein: groupadd.rpmostreesave: cannot lock /etc/group; try again later.
clevis.prein: useradd.rpmostreesave: group 'clevis' does not exist
clevis.prein: groups: 'clevis': no such user

@dustymabe
Member

shadow-maint/shadow#562 has merged; once it lands in Fedora, it should shed some more light on what's going on.

This is at least in https://github.com/shadow-maint/shadow/releases/tag/4.13, which is in F38/rawhide now.

mike-nguyen added a commit to mike-nguyen/os that referenced this issue May 8, 2023

The openvswitch %pre scriptlet adds the openvswitch user to the hugetlbfs group. Since %pre runs
without `set -e` by default, the failures are ignored, resulting in worker nodes that do not
come online during a cluster install. This seems to happen only on RHCOS based on RHEL
9.2 on ppc64le.

These errors show up during the rpm-ostree compose:

14:30:05  openvswitch3.1.prein: usermod.rpmostreesave: /etc/passwd.6: lock file already used
14:30:05  openvswitch3.1.prein: usermod.rpmostreesave: cannot lock /etc/passwd; try again later.

xref: coreos/fedora-coreos-tracker#1250
mike-nguyen added two further commits with the same message to mike-nguyen/os on May 8, 2023, and openshift-cherrypick-robot pushed a commit with the same message to openshift-cherrypick-robot/os on May 9, 2023.
@LorbusChris
Contributor

I'm again seeing this on builds on a system where the disk I/O is very slow; so far it has happened in all runs.

If anyone would like access to the system to reproduce this, please let me know.
