
internal/exec/stages/disks: prevent races with udev #1319

Merged: 1 commit merged into coreos:main on Sep 29, 2023

Conversation

@pothos pothos commented Feb 16, 2022

The "udevadm settle" command, used to wait for udev to process the
disk changes and recreate the entries under /dev, was still prone to
races where udev had not yet been notified of the final event to wait
for. This caused boot with a btrfs root filesystem created by Ignition
to fail almost every time on certain hardware.

Issue tagged events and wait for them to be processed by udev. This is
actually meaningful in all stages: not only for the other parts of the
initramfs that may be surprised by device nodes briefly disappearing
(as was the case with systemd's fsck service), but also for the
inter-stage dependencies, which currently use the waiter for systemd
device units; that does not really prevent races with udev device node
recreation. Thus, these changes are complementary to the existing
waiter, whose main purpose is to wait for unmodified devices. For
newly created RAIDs we can wait for the new node to become available,
as udev will not recreate it.
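The tagged-event pattern from the commit message can be sketched roughly as below. This is an illustrative sketch, not the code merged in this PR: the function names, the sysfs path handling, and the use of a plain `udevadm settle` after the synthetic event are assumptions. Kernels since 4.13 accept `ACTION [UUID [KEY=VALUE ...]]` written to a device's `uevent` file, and the UUID is exposed to udev as `SYNTH_UUID`, so the event can be recognized as our own marker:

```go
// Illustrative sketch only; not Ignition's actual implementation.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// ueventPayload builds the string written to a device's sysfs uevent
// file to request a synthetic "change" event. Kernels >= 4.13 accept
// "ACTION [UUID [KEY=VALUE ...]]"; the UUID appears in the event
// environment as SYNTH_UUID.
func ueventPayload(uuid string) string {
	return fmt.Sprintf("change %s", uuid)
}

// settleAfterSyntheticEvent queues our own marker event for the device
// and then waits for the udev queue to drain. Uevents for a device are
// delivered in order, so by the time our synthetic event is processed,
// every earlier event (e.g. from partitioning or mkfs) has been
// processed too; "udevadm settle" can no longer return early on a
// queue that simply has not been filled yet.
func settleAfterSyntheticEvent(sysfsUeventPath, uuid string) error {
	if err := os.WriteFile(sysfsUeventPath, []byte(ueventPayload(uuid)), 0200); err != nil {
		return err
	}
	return exec.Command("udevadm", "settle").Run()
}

func main() {
	// Hypothetical marker UUID; writing the payload to e.g.
	// /sys/class/block/sda/uevent requires root and a real device,
	// so only the payload construction is shown here.
	fmt.Println(ueventPayload("11111111-2222-3333-4444-555555555555"))
	_ = settleAfterSyntheticEvent
}
```

The key property is that the sysfs write emits the uevent synchronously, so a settle issued afterwards is guaranteed to see a non-empty queue containing at least our marker event.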

pothos commented Apr 22, 2022

FYI, Flatcar has already been using this for a while now.


@jlebon jlebon left a comment


Thanks for upstreaming this and writing a detailed commit message! Always fun to debug udev races. :)

Had a minor comment, but this looks sane to me overall.

One concern is that we're calling this on every separate filesystem/disk/LUKS/RAID device. Would it be more efficient to instead do it only once per section for the last device in each associated list?

(Review comment on internal/exec/stages/disks/disks.go: outdated, resolved)
pothos commented Apr 27, 2022

> One concern is that we're calling this on every separate filesystem/disk/LUKS/RAID device. Would it be more efficient to instead do it only once per section for the last device in each associated list?

That would rely on inotify for the other devices, and I think there is no guarantee that the inotify event enters the queue and gets fully processed when we trigger a tagged event and wait for that event's completion. Maybe if we checked all the implementation details we would find that it works for the current udev implementation, but I would rather be on the safe side, since the absence of the race condition is not verifiable through testing.


@jlebon jlebon left a comment


LGTM! Will let @bgilbert take a look as well.

@pothos pothos force-pushed the kai/udev-race-prevention branch 3 times, most recently from d14a76a to fba4ff4 on May 13, 2022 13:04
@pothos pothos requested a review from bgilbert October 10, 2022 15:00
@bgilbert

I won't be able to look at this for some weeks, unfortunately, but it's still on my list and I'll circle back as soon as I can.

@bgilbert bgilbert removed their request for review July 28, 2023 06:53
bgilbert commented Aug 1, 2023

Your reasoning makes sense to me, but I haven't thought it through carefully. I'll defer to re-review from @jlebon, which actually I should have done months ago. 😳 Thanks for your patience.

PR needs rebase to pick up CI changes.

@pothos pothos force-pushed the kai/udev-race-prevention branch 2 times, most recently from d12147c to 3c6297c on August 1, 2023 14:02

pothos commented Aug 1, 2023

I added a release note entry, as required.

pothos commented Sep 28, 2023

@jlebon Can you do the approval (actually, you did already, not sure you need to submit again) and hit the merge button?


@jlebon jlebon left a comment


Thanks for this. It (still) makes sense to me and in fact we do something similar in other places of the stack for similar reasons. I'm pretty sure we should be able to drop some of those with this.

I just tweaked the error messages and the release note item. Let me know if you disagree with some of them.

@jlebon jlebon merged commit de4da0e into coreos:main Sep 29, 2023
10 checks passed