Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sgdisk: Run partx after partition changes #1717

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

pothos
Copy link
Contributor

@pothos pothos commented Sep 29, 2023

  • disks: Refuse to modify disks/partitions in use

    When a partition or the whole disk is in use, sgdisk should not execute
    the destructive operation.
    Add a check that errors out when a disk in use or a partition in use is
    to be destroyed.

  • sgdisk: Run partx after partition changes

    The sgdisk tool does not update the kernel partition table with BLKPG in
    contrast to other similar tools but only uses BLKRRPART which fails as
    soon as one partition of the disk is mounted.
    Update the kernel partition table with partx when we know that a
    partition of the disk is in use.

pothos added a commit to flatcar/bootengine that referenced this pull request Oct 2, 2023
The sgdisk tool does not update the kernel partition table in contrast
to other similar tools. Often udev can detect the changes but not always
as experienced when adding a new partition on Flatcar's boot disk.
Instead of implicitly relying on some other component to re-read the
kernel partition table, trigger the re-read with partprobe. This is
proposed in coreos/ignition#1717
@pothos pothos marked this pull request as ready for review October 2, 2023 13:39
@pothos pothos force-pushed the kai/partprobe branch 2 times, most recently from 8b75dab to 78efdc2 Compare October 2, 2023 13:45
pothos added a commit to flatcar/scripts that referenced this pull request Oct 2, 2023
pothos added a commit to flatcar/scripts that referenced this pull request Oct 2, 2023
pothos added a commit to flatcar/scripts that referenced this pull request Oct 2, 2023
pothos added a commit to flatcar/bootengine that referenced this pull request Oct 2, 2023
The sgdisk tool does not update the kernel partition table in contrast
to other similar tools. Often udev can detect the changes but not always
as experienced when adding a new partition on Flatcar's boot disk.
Instead of implicitly relying on some other component to re-read the
kernel partition table, trigger the re-read with partprobe. This is
proposed in coreos/ignition#1717
pothos added a commit to flatcar/scripts that referenced this pull request Oct 2, 2023
pothos added a commit to flatcar/bootengine that referenced this pull request Oct 4, 2023
The sgdisk tool does not update the kernel partition table in contrast
to other similar tools. Often udev can detect the changes but not always
as experienced when adding a new partition on Flatcar's boot disk.
Instead of implicitly relying on some other component to re-read the
kernel partition table, trigger the re-read with partprobe. This is
proposed in coreos/ignition#1717
pothos added a commit to flatcar/scripts that referenced this pull request Oct 4, 2023
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this makes sense to me and may actually fix #1729. Some comments but the idea overall LGTM.

docs/release-notes.md Outdated Show resolved Hide resolved
@@ -15,10 +15,12 @@ nav_order: 9

### Changes

- The Dracut module now installs partprobe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer we use blockdev --rereadpt which is part of util-linux and more likely to be already available (e.g. FCOS doesn't ship partprobe).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this and it doesn't achieve the same thing because it can fail with BLKRRPART: Device or resource busy while partprobe succeeds in forcing a kernel partition re-read. If you know that you don't have this problem you could use the introduced command parameter to run blockdev instead, does this work for you?

Copy link
Contributor Author

@pothos pothos Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this requires a wrapper to pass the --rereadpt argument. Your initrd could ship something like that as blockdev-rereadpt and then you configure it as partprobeCmd:

#!/bin/sh
exec blockdev --rereadpt "$@"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this and it doesn't achieve the same thing because it can fail with BLKRRPART: Device or resource busy while partprobe succeeds in forcing a kernel partition re-read.

Not sure why it's hitting EBUSY. The common case would be if one of the filesystems on that disk is mounted, but since we were just able to modify the partition table, that shouldn't be the case. Possibly we're racing with udev?

And actually on that topic, would it work here to play the same trick as #1319 and emit a tagged event against the disk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, one of the filesystems is mounted and this happens when adding a new partition to the boot disk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so what works is partx -u - /dev/vda and this should have the same effect as partprobe which also adds the individual partitions to the kernel. Since partx should be available everywhere where util-linux is, should I change the PR to use partx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that sgdisk doesn't do that itself while sfdisk does - is there a big reason to use sgdisk?
Anyway, since it doesn't hurt to call partx I would suggest to use it here and can update the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that sgdisk doesn't do that itself while sfdisk does - is there a big reason to use sgdisk?

I'm not sure. It seems like sfdisk didn't support GPT in the past so that may be related, but I didn't check the timelines. Regardless, changing it at this point is not worth the risk. The implementation is quite tied to sgdisk intricacies.

Anyway, since it doesn't hurt to call partx I would suggest to use it here and can update the PR.

Before we do that, did you try out this suggestion:

And actually on that topic, would it work here to play the same trick as #1319 and emit a tagged event against the disk?

?

That'd be my preferred solution if it works.

I know you're aware, but for posterity: the difference with partx (and partprobe) vs blockdev --rereadpt is that they manually parse the table and tell the kernel about it instead of letting the kernel do it (BLKPG vs BLKRRPART). While there shouldn't be a perceptible difference, I'd rather we keep relying on the kernel and not make the first boot special. (And of course, it avoids another dependency.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And actually on that topic, would it work here to play the same trick as #1319 and emit a tagged event against the disk?

This here is not a race, it's really because the BLKRRPART ioctl on the block fails when one partition is mounted (E.g., the ESP or root partition). In this case, to my understanding, the partitioning program should add the partition objects one by one with BLKPG (I would argue that normally this is even the preferred way because it avoids the full re-read and async waiting logic needed when wanting to access the created partition).

I haven't tried if supplying sfdisk as sgdisk command would work but anyway, I think we need to work around the deficiencies of sgdisk and run partx/partprobe from Ignition.

Copy link
Contributor Author

@pothos pothos Oct 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

udevadm trigger --settle does not result in the new partition showing up, maybe that wasn't clear from my previous comments
(After a reboot, the new partition shows up without problems of course)

internal/sgdisk/sgdisk.go Outdated Show resolved Hide resolved
@jlebon
Copy link
Member

jlebon commented Oct 20, 2023

OK, I think I understand better now.

sgdisk does call BLKRRPART at the end of its operation, but will still exit 0 if it gets EBUSY (and print out its "Warning: The kernel is still using the old partition table." message). So the partition table never gets reloaded. And this happens in your environment because something is keeping one or more filesystems busy on the target disk we're editing.

Now, is that by design (e.g. you're knowingly mounting filesystems before the disks stage runs), or a bug (e.g. something is racing, maybe udev, and probing filesystems between the time sgdisk edits the table and the time it emits BLKRRPART)?

@pothos
Copy link
Contributor Author

pothos commented Oct 21, 2023

Now, is that by design (e.g. you're knowingly mounting filesystems before the disks stage runs), or a bug (e.g. something is racing, maybe udev, and probing filesystems between the time sgdisk edits the table and the time it emits BLKRRPART)?

Correct, it's by design because users want to add a new partition on the boot disk where in our case /usr is already mounted in the initrd, and maybe more (OEM or ESP, I would have to check).

@jlebon
Copy link
Member

jlebon commented Oct 25, 2023

Correct, it's by design because users want to add a new partition on the boot disk where in our case /usr is already mounted in the initrd, and maybe more (OEM or ESP, I would have to check).

Gotcha. And did your initrd change recently to do this mounting earlier than it used to? I'm trying to understand how this ever worked before. I don't think there's a udev rule that will call out to partx/partprobe.

Thinking on this, it's not clear to me whether we should even try to support this (partitioning a busy disk). It makes Ignition less deterministic since depending on the config, it may or may not be able to update the kernel. Though if we can ensure that we correctly error out in the latter case, that'd be OK I guess... I'll try to gather thoughts from a few others.

@pothos pothos changed the title sgdisk: Run partprobe after partition changes sgdisk: Run partx after partition changes Oct 30, 2023
@pothos
Copy link
Contributor Author

pothos commented Oct 30, 2023

Our initrd uses the /usr partition already at that point. Yes, this was a recent change. By chance, repartitioning the root partition worked because the start point stayed the same even though the "old" partition was still used by the kernel. Adding a new one didn't work anymore which we only found out later.
To me this is really a limitation in sgdisk and with sfdisk this wouldn't occur. I've updated the patch to use partx from util-linux which FCOS probably already has in the initrd (Flatcar did).

@jlebon
Copy link
Member

jlebon commented Nov 6, 2023

I'm OK with this I think if we can confirm that Ignition will correctly error out if we try to nuke/modify partitions that are currently busy (i.e. with mounted filesystems).

In other words, the invariant here is that the loaded partition table must always match the on-disk state at the end of the operation. And fail if that invariant cannot hold.

@pothos
Copy link
Contributor Author

pothos commented Nov 7, 2023

I'm OK with this I think if we can confirm that Ignition will correctly error out if we try to nuke/modify partitions that are currently busy (i.e. with mounted filesystems).

This is orthogonal: currently sgdisk will not error out if you remove or resize a partition in use. Detecting this is also not easy because it's not only mounted filesystems but also about device mapper usage.
This can be used to check for unused disks/partitions:

lsblk -lnpb -o NAME | while IFS= read -r line; do
drive=$(echo "$line" | awk '{print $1}')
mountpoints=$(lsblk -ln -o MOUNTPOINT "$drive")
device_mapper_usages=$(lsblk -ln -o PATH "$drive" |  grep -v "^$drive")
if [[ -z "$mountpoints" ]] && [[ -z "${device_mapper_usages}" ]]; then
  echo "$drive"
fi
done

In other words, the invariant here is that the loaded partition table must always match the on-disk state at the end of the operation. And fail if that invariant cannot hold.

Yes, that's the aim of this PR, as this wasn't upheld before.

@pothos
Copy link
Contributor Author

pothos commented Nov 7, 2023

This is orthogonal: currently sgdisk will not error out if you remove or resize a partition in use. Detecting this is also not easy because it's not only mounted filesystems but also about device mapper usage.

I created a new issue for this: #1745

@jlebon
Copy link
Member

jlebon commented Nov 8, 2023

Let me reword my comment a different way: I think the fact that partitioning a busy disk "additively" worked was a happy accident and not a design choice. No downstream distro until now needed it. This PR is making it a design choice, and I'm OK with that, but then we should try to implement it properly.

I had hoped that partx -u would error out if a busy partition was deleted, but testing it, that's not the case. An strace shows it getting EBUSY for BLKPG_DEL_PARTITION but the command itself doesn't error out.

Re. #1745, I think a simpler way to check for whether the partition is busy is to check if the holders/ directory in sysfs for that device is empty. We do something similar in coreos-installer: https://github.com/coreos/coreos-installer/blob/686e0320f04a5a3e18ab0bb6278a34137eb22f96/src/blockdev.rs#L132

As a bonus, checking this upfront means that we can call partx only when strictly necessary and otherwise rely on BLKRRPART from sgdisk.

Also, it looks like partx is failing in CI on the RAID1 test and some others:

[ 23.100241] ignition[958]: Ignition failed: create partitions failed: re-reading partitions failed: exit status 1: Cmd: "partx" "-u" "-" "/run/ignition/dev_aliases/dev/vda" Stdout: "" Stderr: "partx: /run/ignition/dev_aliases/dev/vda: failed to read partition table\n"

@pothos
Copy link
Contributor Author

pothos commented Nov 23, 2023

When testing I found out that partx -u is not a replacement for partprobe because it does not add new partitions or remove deleted ones. I need separate invocations with partx -d and partx -a with the exact partition numbers to work on and when at it it might make sense to also use the exact partitions for partx -u but that's a minor optimization.

@pothos
Copy link
Contributor Author

pothos commented Dec 15, 2023

The Go 1.21 linter suddenly can't find certain fields anymore, not sure where this comes from. Looks unrelated.

@pothos
Copy link
Contributor Author

pothos commented Dec 15, 2023

I'm still testing this, and will say when it's ready for review

@pothos
Copy link
Contributor Author

pothos commented Dec 18, 2023

I'm still testing this, and will say when it's ready for review

It's ready for review. I've tested it with Flatcar's kola suite and also manually: adding a partition on the boot device, updating a partition (resize), and deleting, and also that modifications of partitions in-use is producing an error and won't be touching the data.

@prestist
Copy link
Collaborator

prestist commented Jan 8, 2024

I'm still testing this, and will say when it's ready for review

It's ready for review. I've tested it with Flatcar's kola suite and also manually: adding a partition on the boot device, updating a partition (resize), and deleting, and also that modifications of partitions in-use is producing an error and won't be touching the data.

Thank you so much for going through the testing. I will try and take a look today.

When you can, please make sure to re-base, as some of our CI has been updated , and will prevent a merge without it.

Thank you again for your patience and diligence.

@pothos
Copy link
Contributor Author

pothos commented Jan 8, 2024

Thanks for the note, I've rebased it.

Copy link
Collaborator

@prestist prestist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, I have a few asks and nits if you would not mind taking a look.

Nothing is blocking minus release notes.

docs/release-notes.md Outdated Show resolved Hide resolved
docs/release-notes.md Outdated Show resolved Hide resolved
dracut/30ignition/module-setup.sh Show resolved Hide resolved
internal/distro/distro.go Show resolved Hide resolved
internal/distro/distro.go Show resolved Hide resolved
internal/exec/stages/disks/partitions.go Outdated Show resolved Hide resolved
internal/exec/stages/disks/partitions.go Show resolved Hide resolved
@jlebon jlebon self-requested a review January 9, 2024 19:01
@jlebon
Copy link
Member

jlebon commented Jan 9, 2024

I've added a review request from myself as well. I should be able to review it soon too.

@pothos pothos force-pushed the kai/partprobe branch 2 times, most recently from 1b80a4c to fb2c16a Compare January 12, 2024 13:38
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor details, but this looks sane overall. Thanks!

return false, fmt.Errorf("failed to resolve %q: %v", blockDev, err)
}

mounts, err := os.Open("/proc/mounts")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit heavy to parse /proc/mounts on every block device. Ideally, we'd do this once. In practice, I don't think it matters too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there shouldn't be millions of mount points in the initrd - if that ever is the case it can be cached.

internal/exec/stages/disks/partitions.go Outdated Show resolved Hide resolved
internal/exec/stages/disks/partitions.go Outdated Show resolved Hide resolved
internal/exec/stages/disks/partitions.go Outdated Show resolved Hide resolved
internal/exec/stages/disks/partitions.go Outdated Show resolved Hide resolved
internal/exec/stages/disks/partitions.go Outdated Show resolved Hide resolved
docs/release-notes.md Outdated Show resolved Hide resolved
@prestist
Copy link
Collaborator

@pothos wanted to see if you had the chance to see @jlebon's change requests?

@pothos
Copy link
Contributor Author

pothos commented Feb 29, 2024

I'll try to have a look next week (edit: in the next months…).

@pothos pothos force-pushed the kai/partprobe branch 3 times, most recently from 338169e to 0c8f970 Compare May 6, 2024 12:46
pothos added 2 commits May 6, 2024 22:00
When a partition or the whole disk is in use, sgdisk should not execute
the destructive operation.
Add a check that errors out when a disk in use or a partition in use is
to be destroyed.
The sgdisk tool does not update the kernel partition table with BLKPG in
contrast to other similar tools but only uses BLKRRPART which fails as
soon as one partition of the disk is mounted.
Update the kernel partition table with partx when we know that a
partition of the disk is in use.
@pothos
Copy link
Contributor Author

pothos commented May 7, 2024

The patches work well in Flatcar PR where a backported them to replace the first patch posted here.

Tested three cases, the overwriting of USR-A was detected and an error thrown, the other two changes passed without problem because the USR-B is unused and can be overwritten and the new partition is also ok to add (confirmed with lsblk that the kernel has the updated values thanks to partx).

variant: flatcar
version: 1.0.0
storage:
  disks:
  - device: /dev/vda
    partitions:
    - label: USR-A
      number: 3
      size_mib: 1
      resize: true
variant: flatcar
version: 1.0.0
storage:
  disks:
  - device: /dev/vda
    partitions:
    - label: USR-B
      number: 4
      size_mib: 1
      resize: true
variant: flatcar
version: 1.0.0
storage:
  disks:
  - device: /dev/vda
    partitions:
    - label: VAR
      number: 10
      size_mib: 1
  filesystems:
  - device: /dev/disk/by-partlabel/VAR
    format: vfat
    path: /var
    label: VAR
    with_mount_unit: true

@pothos pothos requested review from jlebon and prestist May 11, 2024 06:37
@prestist
Copy link
Collaborator

@pothos Thank you so much for testing the cases and addressing @jlebon 's comments. I am going to take a pass at this today and see if I notice anything.

Copy link
Collaborator

@prestist prestist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thank you again for being so diligent. @jlebon wdyt?

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but looks great overall!

}

// Expects a /dev/xyz path
func partitionNumberPrefix(blockDevResolved string) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no error path in this function so we could also simplify this to just return a string.

scanner := bufio.NewScanner(mounts)
for scanner.Scan() {
mountSource := strings.Split(scanner.Text(), " ")[0]
if strings.Contains(mountSource, "/") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: likely functionally equivalent, but it'd be clearer/more self-explanatory if instead we did strings.HasPrefix(mountSource, "/").

func blockDevMounted(blockDevResolved string) (bool, error) {
mounts, err := os.Open("/proc/mounts")
if err != nil {
return false, fmt.Errorf("failed to open mounts: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: maybe "failed to open /proc/mounts"?

}
var partitions []string
for _, entry := range entries {
if strings.HasPrefix(entry.Name(), blockDevNode) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is that the canonical way to get disk partitions? I wonder how libblkid does it. Which actually, we do link to it in Ignition.

So another way to get this info is to use e.g. getPartitionMap() in this file which calls out to libblkid. But I think that function assumes that the passed device is a disk so it wouldn't work when recursing into the partition in blockDevInUse(). But we could also restructure blockDevInUse() so that the held and mounted checks are their own functions and then not recurse (which anyway feels a bit like a weird thing to do to avoid even looking for partitions on partitions).

Not a blocker.

}
var activePartitions []string
for _, partition := range partitions {
partInUse, _, err := blockDevInUse(partition)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least here, we could pass a bool to make it not even check for partitions when recursing here.

@prestist
Copy link
Collaborator

@pothos have you had a chance to see @jlebon's comments?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants