tests/basic: useradd: lock /etc/group.lock already used by PID 4 #1250
hit this over in CI for coreos/coreos-assembler#3010.
This seems to happen more commonly when we change the groups definition. But I've been working on tweaking/changing the groups definition for a couple of weeks, and never observed this locally.
Some more notes:
Should we (temporarily) work around that in a post-process script that makes sure this lock file does not exist in the final image?
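For illustration, a cleanup of that kind could be as small as the sketch below; the tree path and the set of files are assumptions, not an existing coreos-assembler hook, and the per-PID temp files (e.g. /etc/group.4) are not handled here.

```bash
#!/bin/bash
# Hypothetical post-process snippet: drop any stray shadow-utils lock files
# left behind in the composed tree before the image is sealed.
set -euo pipefail

tree="${1:?usage: $0 <rootfs>}"
for db in passwd shadow group gshadow; do
    rm -f "${tree}/etc/${db}.lock"
done
```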
That won't fix it. The failures happen intermittently at various times. If the lock file got left in the image that was created, then the failures would consistently happen during Ignition when the users and groups are created. It appears the file can get left behind by the compose process or by an interrupted invocation.
OK. Maybe rpm-ostree could make sure that this file does not exist after each step in a compose?
Luca's working on useradd/groupadd wrappers, so maybe that could be part of that.
Not really, the first PR was coreos/rpm-ostree#3778 and it merged on July 27th.
This could also be from a kernel or FUSE regression (via the FUSE layer used during composes).
Sorry, I meant that the work you've done around that might help us come up with a workaround for this one.
Peanut gallery here: we've seen this in a non-Fedora distribution and it became so problematic that we added a 3-count retry to the useradd call in Ignition as a temporary mitigation. We tried various efforts to determine what could be causing this, including setting up auditd to watch file access in the initramfs, but we never found a root cause. Without our local workaround, it occurs nearly 10% of the time when launching instances. With the workaround, we've never seen it crop up again.
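Translated to a small shell wrapper, that mitigation is roughly the following sketch; the actual change lives in their Ignition code, and the retry count and sleep are illustrative.

```bash
#!/bin/bash
# Sketch of a bounded retry around useradd; all arguments are passed through.
max_attempts=3
for ((i = 1; i <= max_attempts; i++)); do
    if useradd "$@"; then
        exit 0
    fi
    echo "useradd failed (attempt ${i}/${max_attempts}), retrying..." >&2
    sleep 1
done
echo "useradd still failing after ${max_attempts} attempts" >&2
exit 1
```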
Thanks for commenting! I think what's interesting here is that this is happening both at build time (rpm-ostree compose tree) and inside the initramfs on instances; this strongly seems to point to a bug/regression in shadow-utils.
I have access to a machine where this was observed. I've been trying to observe this by running composes in a loop. I did notice an FD leak, now tracked at coreos/rpm-ostree#3962, but I don't think it is in any way related to this bug. I'm now having a look at a standalone reproducer.
I'm still not sure about the exact failure mode, but I'm leaving some in-progress notes here for my future self. I do suspect this is related to hardlinks, but I couldn't pinpoint it for sure yet. Starting from the first visible error log (the `lock /etc/group.lock already used by PID 4` message) and reading through the locking logic, it seems there are a couple of ways a leftover or contended lock file could trigger this. Overall this logic is a bit sketchy in several regards, and it doesn't use FDs for its operations. Unless we manage to reproduce it somehow, I think the best course of action for the moment is to augment those error paths with more debugging details.
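For context, the locking scheme in question is the classic link(2)-based protocol. A simplified shell rendition of the idea looks like the sketch below; the real implementation is C in shadow's lib/commonio.c and does more than this, so treat it only as an outline of the moving parts.

```bash
#!/bin/bash
# Outline of the link(2)-style lock protocol used for /etc/group and friends.
db="/etc/group"
lock="${db}.lock"
tmp="${db}.$$"                     # per-PID temp file, e.g. /etc/group.1234

echo $$ > "$tmp"                   # record our PID in the temp file
if ln "$tmp" "$lock" 2>/dev/null; then
    echo "locked ${db}"
else
    other=$(cat "$lock" 2>/dev/null)
    if [ -n "$other" ] && ! kill -0 "$other" 2>/dev/null; then
        rm -f "$lock"              # holder is gone: reclaim the stale lock
        ln "$tmp" "$lock" && echo "reclaimed stale lock"
    else
        echo "lock ${lock} already used by PID ${other}" >&2
    fi
fi
rm -f "$tmp"
```

A leftover .lock file in the tree, or a recorded PID that happens to belong to a live but unrelated process, ends up in the `already used by PID` branch, which is the message seen in the failures above.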
Hit the #1265 version of this in bump-lockfile#44.
I'm pretty reliably hitting this on a non-Fedora platform, but the printf I added is not being hit.
I also threw in another printf, but that's not hitting either.
Hopefully this helps create a testcase; I'll update the thread if I find anything interesting.
Added more debug, and the failure is at https://github.com/shadow-maint/shadow/blob/master/lib/commonio.c#L423. I inserted a printf just before that point; forgive the simple names, but the traces detail how it reaches that failure.
@mark-au it sounds like you may be hitting a different problem than the one I was chasing. Also, the lckpwdf-based flow does not retry on failures, and I don't think glibc retries internally either.
[rawhide][s390x] run-202 hit the same failure as in issue 1265, FYI.
I ran groupadd in a 3x loop with a 20-second pause in between, and saw no improvement.
@mark-au at this point I believe you are really chasing a different bug, possibly the same one that @justinkinney saw. For that new ticket, it may be interesting to distill a minimal Butane or Ignition config with just the disk setup and the user/group manipulation. If we can observe the behavior in a fresh VM, it should become easier to track down.
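As a starting point for that, something like the sketch below could work; the user/group names are placeholders, and the disk-setup portion of the failing config would still need to be copied in.

```bash
# Hypothetical minimal Butane config, compiled to Ignition for a fresh VM.
cat > minimal.bu <<'EOF'
variant: fcos
version: 1.4.0
passwd:
  groups:
    - name: testgroup
  users:
    - name: testuser
      groups:
        - testgroup
EOF
butane --pretty --strict minimal.bu > minimal.ign
```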
shadow-maint/shadow#562 merged; once it lands in Fedora, it should shed some more light on what's going on.
Seeing a pretty weird instance of this as well.
We also saw this in build-arch#395. Just noting here that the logs can be misleading: when I originally looked at the logs I was focusing on a later error, but that is just a symptom of the original problem earlier in the log file.
Saw this in build-arch#398, which leads to the kola basic tests failing in Ignition with the same lock error.
I believe we are seeing the same failure here.
Could we backport this?
Saw a similar failure.
Seen today in the SCOS Tekton build pipeline as well.
I'm seeing this in FC37 Kinoite today for brlapi.
Me too, but we should probably file a bug on Kinoite rather than replying to one on CoreOS. The appropriate bug tracker is https://pagure.io/fedora-kde/SIG/issues?tags=kinoite
Hitting this again today in multiple runs of the SCOS pipeline.
The non-fatal third time showed the same error.
This is at least in https://github.com/shadow-maint/shadow/releases/tag/4.13, which is in F38/rawhide now.
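For anyone checking whether a given build host or composed tree already carries that release, a quick query along these lines should do (the Fedora package name is shadow-utils; the root path is illustrative):

```bash
# Version on the build host itself.
rpm -q shadow-utils
# Version inside an already-composed/deployed tree (path is a placeholder).
rpm -q --root /path/to/rootfs shadow-utils
```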
The openvswitch %pre scriptlet adds the openvswitch user to the hugetlbfs group. Since %pre runs without `set -e` by default, the failures are ignored, resulting in worker nodes that do not come online during a cluster install. This seems to be happening only on RHCOS based on RHEL 9.2 on ppc64le. These errors are showing up during the rpm-ostree compose:
14:30:05 openvswitch3.1.prein: usermod.rpmostreesave: /etc/passwd.6: lock file already used
14:30:05 openvswitch3.1.prein: usermod.rpmostreesave: cannot lock /etc/passwd; try again later.
xref: coreos/fedora-coreos-tracker#1250
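For comparison, a more defensive %pre along the lines of the sketch below would at least surface the failure instead of silently ignoring it; the retry count, sleep, and exact commands are assumptions, not the actual openvswitch scriptlet.

```bash
# Hypothetical hardened %pre body: retry the group membership change and
# fail loudly rather than letting a transient lock error pass unnoticed.
getent group hugetlbfs >/dev/null || groupadd -r hugetlbfs

for attempt in 1 2 3; do
    if usermod -a -G hugetlbfs openvswitch; then
        exit 0
    fi
    echo "usermod failed (attempt ${attempt}/3), retrying..." >&2
    sleep 2
done
echo "could not add openvswitch to hugetlbfs" >&2
exit 1
```

Note that a non-zero exit from %pre aborts the package install, so whether failing hard is preferable to retrying silently is a separate trade-off.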
I'm again seeing this on builds on a system where the disk I/O is very slow, and so far it has happened in every run. If anyone would like access to the system to reproduce this, please let me know.
There seems to be a build flake which manifests itself as a failure in the kola basic tests, with the console showing the lock error. The actual failure, though, seems to be happening at compose time, where something is going wrong with the `groupadd` calls in the scriptlets.