
feat(disk_setup) add timeout #4673

Closed
flokli wants to merge 2 commits

Conversation

flokli

@flokli flokli commented Dec 7, 2023

Proposed Commit Message

In a cloud environment, sometimes disks will attach while cloud-init is running and get missed. This adds a configurable timeout to wait for those disks.

Fixes GH-3386
LP: #1832645

Additional Context

This is #710 resurrected.
Fixes #3386.
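
For illustration, a minimal sketch of the wait-with-timeout idea (hypothetical helper name; the actual change lives in cloud-init's disk_setup handling and may differ in detail):

# Illustrative only: poll for the device node to appear, for up to
# `timeout` seconds, before partitioning/formatting proceeds.
import os
import time

def wait_for_device(path: str, timeout: float) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(1)
    return os.path.exists(path)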

Test Steps

Tested with the following cloud-config file on azure:

disk_setup:
  /dev/disk/by-lun/10:
    layout: false
    timeout: 60

fs_setup:
 - filesystem: ext4
   partition: auto
   device: /dev/disk/by-lun/10
   label: jenkins

mounts:
 - ["/dev/disk/by-label/jenkins", "/var/lib/jenkins"]

I used an azurerm_virtual_machine_data_disk_attachment Terraform resource, which attaches the disk partway through boot due to hashicorp/terraform-provider-azurerm#6117, but this should also work by setting a reasonably large timeout and manually attaching the disk (at that LUN).

Checklist

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

@holmanb
Member

holmanb commented Dec 7, 2023

Ugh. This problem demonstrates that cloud-init's current code in this area is a mess.

The proposed code might work here, but I can't say that I like it.

  • timeouts -> unreliable, ugly
  • this makes cloud-init block remaining cloud-init runtime (and all of boot) until the disk is ready

Plus, if you're using a systemd-based instance, you should be able to accomplish this already using systemd's builtin mount point generation from fstab. Something like this should do it:

mounts:
 - ["/dev/disk/by-label/jenkins", "/var/lib/jenkins", "ext4", "defaults,nofail,x-systemd.makefs"]

Since systemd creates mount units from fstab, and mount units get automatically ordered, I think the above should "just work".
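
For reference, the generator would emit a mount unit roughly like this (a sketch, not verbatim generator output; x-systemd.makefs is consumed by the generator and turned into a formatting dependency rather than kept as a mount option):

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)

[Mount]
What=/dev/disk/by-label/jenkins
Where=/var/lib/jenkins
Type=ext4
Options=defaults,nofail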

At least, that would work if cloud-init's cc_mount module didn't assume that mounts need to be available when cloud-init.service is running in order for them to get written into /etc/fstab. I really don't think that cloud-init's cc_mounts.py::sanitize_devname() should be booting out unavailable devices. If systemd can do the job correctly (and cloud-init can't), cloud-init should step out of the way and allow the user to write an fstab entry that gets the job done.

Thanks for proposing this @flokli. This might be an interim solution that we can take, but I think a more elegant solution is desirable long term.

@flokli
Author

flokli commented Dec 7, 2023

Using systemd-fstab-generator unfortunately won't work at all, no - it's the approach I tried before.

The fstab line creates a .mount unit which depends on a .device unit that doesn't exist while systemd plans the transaction, because there's no corresponding /dev/disk/by-lun/10 while the disk is not yet attached - so all units with RequiresMountsFor= (either explicit, or implicit via StateDirectory=) will fail due to a missing dependency.

This is not an issue on the second boot, or if I re-attempt to reach multi-user.target (or just restart the .service unit using the data disk), but the only way to get this working straight from the first boot was to do the mounting via cloud-init, not systemd.
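
To make the implicit case concrete, a hypothetical consumer unit like this picks up the dependency without naming the mount at all:

[Unit]
Description=Jenkins (hypothetical consumer of the data disk)

[Service]
# StateDirectory=jenkins gives the service /var/lib/jenkins and, as
# noted above, an implicit RequiresMountsFor= on that path - so the
# transaction must resolve var-lib-jenkins.mount (and its backing
# .device unit) at planning time.
StateDirectory=jenkins
ExecStart=/usr/bin/jenkins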

Let me know if you'd be fine to accept this, happy to then take another pass and fix the linters.

@holmanb
Member

holmanb commented Dec 8, 2023

The fstab line creates a .mount unit which depends on a .device unit that doesn't exist while systemd plans the transaction, because there's no corresponding /dev/disk/by-lun/10 while the disk is not yet attached - so all units with RequiresMountsFor= (either explicit, or implicit via StateDirectory=) will fail due to a missing dependency.

You're right that depending on only the generator output wouldn't work for first boot. I didn't consider initial transaction failure due to missing device unit.

However, the .device unit should later be created by systemd-udev[1], which will cause subsequent transaction re-calculation to succeed. I think that is why restarting the .service unit, the second boot, and re-attempting multi-user.target all worked. I think that systemctl daemon-reload should also do that any time after the .device unit is created.

An alternative that could take advantage of better systemd ordering would be to use ENV{SYSTEMD_WANTS}+="your-fs.mount"[2] in a udev rule that matches on the specified block device. This seems like something that systemd-fstab-generator could feasibly do: a special mount option with a name like x-systemd.first-boot or x-systemd.new-device which generates a udev rule for the device which activates the .mount when available - the rule could be disabled / deleted if the .device file already exists.
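
A hypothetical sketch of that udev approach for the by-lun/10 disk from the example above (the match keys are illustrative; the metadata udev exposes for the device varies by platform):

# /etc/udev/rules.d/99-jenkins-disk.rules (hypothetical)
# When a block device whose ID_PATH ends in LUN 10 appears, tag it for
# systemd and pull in the mount unit for /var/lib/jenkins
# (unit name per systemd-escape --path /var/lib/jenkins).
ACTION=="add", SUBSYSTEM=="block", ENV{ID_PATH}=="*:10", TAG+="systemd", ENV{SYSTEMD_WANTS}+="var-lib-jenkins.mount"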

This is not an issue on the second boot, or if I re-attempt to reach multi-user.target (or just restart the .service unit using the data disk), but the only way to get this working straight from the first boot was to do the mounting via cloud-init, not systemd.

[1] systemd.device

systemd will dynamically create device units for all kernel devices that are marked with the "systemd" udev tag (by default all block and network devices, and a few others).

[2] systemd-fstab-generator

These properties may be used to activate arbitrary units when a specific device becomes available.

@holmanb
Member

holmanb commented Dec 8, 2023

Your proposal should be easy to review, so for now if you want to move forward with this approach, please make sure to add the new key to the jsonschema.

@flokli
Author

flokli commented Dec 8, 2023

I added the field to the jsonschema, addressed the linter warning, and renamed "need" to "missing" to be more consistent with the other users of this method. PTAL.

In a cloud environment, sometimes disks will attach while cloud-init is
running and get missed. This adds a configurable timeout to wait for
those disks.

Signed-off-by: Florian Klink <flokli@flokli.de>
@holmanb holmanb self-assigned this Dec 11, 2023
@holmanb
Member

holmanb commented Dec 11, 2023

@flokli I don't see your username flokli connected to any users that have signed the CLA. Could you please do that so that we can accept this contribution to cloud-init?

@holmanb holmanb added the CLA not signed The submitter of the PR has not (yet) signed the CLA label Dec 11, 2023
@flokli
Author

flokli commented Dec 11, 2023

Sorry, please check again, just did it.

@holmanb holmanb added CLA signed The submitter of the PR has signed the CLA and removed CLA not signed The submitter of the PR has not (yet) signed the CLA labels Dec 12, 2023
Member

@holmanb holmanb left a comment


Thanks for this contribution @flokli, and welcome to cloud-init!

@holmanb
Member

holmanb commented Dec 12, 2023

@flokli Sorry about this one more thing. I just noticed that commit 9b2e3dc has @nibalizer as the author, but @flokli's DCO. This project requires CLA, not DCO. Whose code is this, @flokli's or @nibalizer's? If @nibalizer, I need you to sign the CLA please.

Apologies for the hold up on this. I wish this wasn't part of the process but unfortunately it is.

@flokli
Author

flokli commented Dec 12, 2023

Correct, the changes were resurrected from @nibalizer's PR, as mentioned in the PR description:

This is #710 resurrected.

@nibalizer

nibalizer commented Dec 12, 2023

This project requires CLA, not DCO. Whose code is this, @flokli's or @nibalizer's? If @nibalizer, I need you to sign the DCO please.

I assume you need CLA not DCO.

I opened this PR while at IBM. At the time, I don't think IBM had signed a CCLA with Canonical. I no longer work there. Is it possible we could get this in under "trivial commit"? Or perhaps all this code is authored by @flokli, who can sign the CLA.

cc @jjasghar @bradtopol

@flokli
Author

flokli commented Dec 13, 2023

I realized you actually can change the timeout of the .device unit by using the x-systemd.device-timeout mount option in your fstab line, or the DefaultDeviceTimeout= systemd.conf setting, so I might not need this workaround after all, at least for my specific use case.
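
Combined with the mounts example from earlier, that would look something like this (the 300s value is illustrative):

mounts:
 - ["/dev/disk/by-label/jenkins", "/var/lib/jenkins", "ext4", "defaults,nofail,x-systemd.makefs,x-systemd.device-timeout=300s"]
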

flokli added a commit to flokli/ghaf-infra that referenced this pull request Dec 13, 2023
Apparently canonical/cloud-init#4673 and more
hacks are not needed, we can simply ramp up the timeout that systemd is
willing to wait for the .device unit to appear.
@holmanb
Member

holmanb commented Dec 13, 2023

I realized you actually can change the timeout of the .device unit by using the x-systemd.device-timeout mount option in your fstab line, or the DefaultDeviceTimeout= systemd.conf setting, so I might not need this workaround after all, at least for my specific use case.

@flokli Let me know if that works. It looks like the existence of the .device unit is required for this command (much like x-systemd.makefs that I pointed out above), so I'm guessing that this won't help since the transaction will still fail to include the mount unit because the device unit still won't exist on first boot.

@holmanb holmanb added CLA not signed The submitter of the PR has not (yet) signed the CLA and removed CLA signed The submitter of the PR has signed the CLA labels Dec 13, 2023
@holmanb
Member

holmanb commented Dec 13, 2023

I assume you need CLA not DCO.

Yes, that's what I meant.

I opened this PR while at IBM. At the time, I don't think IBM had signed a CCLA with Canonical. I no longer work there. Is it possible we could get this in under "trivial commit"? Or perhaps all this code is authored by @flokli, who can sign the CLA.

I'll check internally, but I'm not sure that there is much that we can do unless we can get a CLA signature from IBM.

cc @jjasghar @bradtopol

Any insight would be appreciated.

@flokli
Author

flokli commented Dec 13, 2023

I realized you actually can change the timeout of the .device unit by using the x-systemd.device-timeout mount option in your fstab line, or the DefaultDeviceTimeout= systemd.conf setting, so I might not need this workaround after all, at least for my specific use case.

@flokli Let me know if that works. It looks like the existence of the .device unit is required for this command (much like x-systemd.makefs that I pointed out above), so I'm guessing that this won't help since the transaction will still fail to include the mount unit because the device unit still won't exist on first boot.

You need to follow one more reference ;-) is_device_path just matches the passed string with some known-good path strings: https://github.com/systemd/systemd/blob/1fdab6c3069ab945259d70e22a29e80da8370288/src/basic/path-util.c#L1285

The approach using x-systemd.device-timeout worked in principle - but I then ran into the issue that sometimes the azurerm_linux_virtual_machine resource /itself/ blocked returning until the machine finished booting up - which it never does if it waits for the block device to be attached, which only happens after the resource returns 🙈

In the end I gave up on this journey, and resorted to the azurerm_virtual_machine resource, which does support attaching data disks to VMs at creation time.

I assume keeping systemd in the dark about the mountpoints, and doing all the waiting / filesystem creation / mounting in cloud-init, would also work, so I still think this PR is useful to have.

@holmanb
Member

holmanb commented Dec 13, 2023

You need to follow one more reference ;-) is_device_path just matches the passed string with some known-good path strings:

you're right again :P

The approach using x-systemd.device-timeout worked in principle - but I then ran into the issue that sometimes the azurerm_linux_virtual_machine resource /itself/ blocked returning until the machine finished booting up - which it never does if it waits for the block device to be attached, which only happens after the resource returns 🙈

oof

In the end I gave up on this journey, and resorted to the azurerm_virtual_machine resource, which does support attaching data disks to VMs at creation time.

👍, glad you got something figured out for your use case

I assume keeping systemd in the dark about the mountpoints, and doing all the waiting / filesystem creation / mounting in cloud-init, would also work, so I still think this PR is useful to have.

Agreed

flokli added a commit to flokli/ghaf-infra that referenced this pull request Dec 19, 2023
flokli added a commit to flokli/ghaf-infra that referenced this pull request Dec 19, 2023
flokli added a commit to flokli/ghaf-infra that referenced this pull request Dec 20, 2023
@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Jan 4, 2024
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Jan 4, 2024
@canonical canonical deleted a comment from github-actions bot Jan 4, 2024
henrirosten pushed a commit to henrirosten/ghaf-infra that referenced this pull request Jan 8, 2024
henrirosten pushed a commit to tiiuae/ghaf-infra that referenced this pull request Jan 15, 2024
github-actions bot commented

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Jan 19, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label Jan 22, 2024

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Feb 6, 2024
henrirosten pushed a commit to henrirosten/ghaf-infra that referenced this pull request Feb 7, 2024
henrirosten pushed a commit to tiiuae/ghaf-infra that referenced this pull request Feb 9, 2024
@github-actions github-actions bot closed this Feb 13, 2024
@holmanb holmanb reopened this Feb 13, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label Feb 13, 2024
Member

@holmanb holmanb left a comment


Setting this to changes requested for now. Conversations with legal don't appear to be moving quickly, but this is still something we'd like to add.


@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Feb 28, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label Feb 28, 2024

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Mar 14, 2024
@TheRealFalcon
Member

We're still waiting on the CLA for this PR, correct? Any updates?

@flokli
Author

flokli commented Mar 14, 2024

This project requires CLA, not DCO. Whose code is this, @flokli's or @nibalizer's? If @nibalizer, I need you to sign the DCO please.

I assume you need CLA not DCO.

I opened this PR while at IBM. At the time, I don't think IBM had signed a CCLA with Canonical. I no longer work there. Is it possible we could get this in under "trivial commit"? Or perhaps all this code is authored by @flokli, who can sign the CLA.

cc @jjasghar @bradtopol

This PR is stuck on this. I don't work for IBM, and don't plan to. If this can't be considered a trivial contribution that doesn't need a CLA, someone at Canonical needs to convince IBM to sign it.

@github-actions github-actions bot closed this Mar 22, 2024
Labels
CLA not signed: The submitter of the PR has not (yet) signed the CLA
stale-pr: Pull request is stale; will be auto-closed soon
Development

Successfully merging this pull request may close these issues.

fs_setup/disk_setup: option to wait for the device to exist before continuing
4 participants