Ignition times out waiting on EBS volumes with AWS "Nitro"-based EC2 instance types (NVMe) #2531
Comments
The udev rules aren't actually being installed in the initramfs. The docs for the dracut
Does that mean that the rules aren't supposed to be installed there, or that they are supposed to be but, by mistake or historical omission, are not yet installed there?
I see the file 90-cloud-storage.rules mentioned in dracut/30ignition/module-setup.sh, but I don't know where any of the rules files mentioned there come from. It sounds like you're saying that we referenced them but didn't put them in place. There's a comment in coreos/bootengine#148 that says as much, and it doesn't look like the follow-on coreos/bootengine#149 adds those rules files either. Reading it again tonight, #2481 sounds like it was intended to address this problem, as I mentioned above. That one takes us to coreos/bootengine#149.
coreos/init#268 introduced the file udev/rules.d/90-cloud-storage.rules.
Sorry, #2531 (comment) was meant as a note for the record while I tracked this down; I didn't intend to send you on a wild goose chase. This is a bug in coreos/coreos-overlay#3396 and coreos/coreos-overlay#3456: bootengine expects to copy the udev rules from coreos-init into the initramfs, but there was nothing ensuring that coreos-init is installed before bootengine runs. Because Dracut's

Thanks for the careful and detailed report!
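For context, a sketch of the shape such a fix could take (this is illustrative, not the actual contents of coreos/coreos-overlay#3499): have the bootengine ebuild declare a build-time dependency on coreos-init, so that coreos-init's udev rules are installed before bootengine assembles the initramfs.

```
# Hypothetical fragment of a bootengine ebuild. Declaring coreos-init as a
# build-time dependency guarantees the ordering that was previously missing.
DEPEND="coreos-base/coreos-init"
```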
That's fantastic news. Thank you for the fast response. I'll keep an eye on coreos/coreos-overlay#3499.
This will be in alpha 1995.0.0. Thanks again for reporting! |
Thank you! How long do you estimate the delay will be before there's an EC2 AMI available for that version? I see we're at version 1981.0.0 today. |
1995.0.0 should land in about two weeks |
It works! Thank you again for the fix. |
Issue Report
Bug
Container Linux Version
AWS EC2 AMI built atop the latest Container Linux AMI selected by the following filter:
This yields the following /etc/os-release file content:
Environment
AWS EC2 in the "us-east-1" region.
Expected Behavior
Container Linux should boot on a "Nitro"-based EC2 instance type such as the "m5" and "m5a" families, with Ignition creating a filesystem on an attached EBS volume, per a configuration stanza like this:
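For reference, a minimal sketch of such a stanza (Ignition spec 2.x; the filesystem name, format, and wipe flag here are illustrative assumptions, not the exact configuration used):

```json
{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "filesystems": [
      {
        "name": "data",
        "mount": {
          "device": "/dev/sdf",
          "format": "ext4",
          "wipeFilesystem": true
        }
      }
    ]
  }
}
```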
Actual Behavior
When the EC2 instance attaches the EBS volume via NVMe, Ignition times out waiting for the systemd device unit "dev-sdf.device" to start: the call to `conn.StartUnit` in `internal/systemd.WaitOnDevices` fails with a result string of "timeout."

Note, however, that if I replace "/dev/sdf" with, say, "/dev/nvme1n1" in the above Ignition configuration's "storage.filesystems.mount.device" field, the EC2 instance boots as expected and Ignition creates the filesystem on the attached EBS volume.
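As background on why the unit is named "dev-sdf.device": systemd derives a device unit's name by escaping the device path (see systemd.unit(5) and systemd-escape(1)). A simplified sketch of that mapping follows; the real escaping also hex-encodes characters such as "-" inside path components, which this toy version skips.

```shell
# Toy version of systemd's path-to-unit-name escaping: strip the leading
# slash, turn remaining slashes into dashes, and append the unit suffix.
# (`systemd-escape --path --suffix=device` does this properly.)
path_to_device_unit() {
    printf '%s.device\n' "$(printf '%s' "$1" | sed -e 's:^/::' -e 's:/:-:g')"
}

path_to_device_unit /dev/sdf      # prints "dev-sdf.device"
path_to_device_unit /dev/nvme1n1  # prints "dev-nvme1n1.device"
```

So when the Ignition configuration names "/dev/sdf", Ignition ends up waiting on "dev-sdf.device", a unit that only becomes active once udev has created the /dev/sdf symbolic link.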
After the instance has booted and that filesystem is created, we can see that the symbolic link from the /dev/sdf device path I specified to AWS to /dev/nvme1n1 does in fact get created (as I had reported in #2481):
The NVMe-related udev rules are in place:
However, #2481 noted that these rules may not be present in initramfs, though coreos/bootengine#149 and coreos/coreos-overlay#3456 were supposed to solve that problem. I don't know how to determine whether these rules are in fact present in initramfs. It does seem that when Ignition has systemd go looking for these symbolic link device paths, though, that they're not there yet.
I also tried updating my grub.cfg file to increase the values of the `nvme_core.io_timeout` and `nvme_core.max_retries` kernel parameters, but that didn't change the behavior. When I specify a device path of /dev/nvme1n1 for Ignition's "storage.filesystems.mount.device" field, it completes successfully and quickly, so I don't think we're timing out waiting for the EBS volume to attach. Rather, I think we're timing out because we're looking for a symbolic link that hasn't been created yet when Ignition runs.

Reproduction Steps
Failing Case
In the "user data" for the instance, include an Ignition configuration with a "filesystem.mount" stanza, specifying a device path like "/dev/sdf."
When attaching the EBS volume, again specify a device path matching the Ignition configuration, like "/dev/sdf."
If the instance isn't responsive after the first minute or so, it never will be.
Run the following command repeatedly until it prints an integer value in the hundreds rather than seven, then inspect the destination file /tmp/ec2-console.txt.
Successful Case
In the "user data" for the instance, include an Ignition configuration with a "filesystem.mount" stanza, this time specifying the device path "/dev/nvme1n1."
When attaching the EBS volume, specify a device path like "/dev/sdf" as before; inside the guest the volume surfaces via NVMe as "/dev/nvme1n1." (Here we're assuming that the number of devices attached via NVMe is small, and that the EBS volume will be the second such device, landing at index 1.)
Other Information
I've seen this same failure occur on EC2 instances in the "m5a" family as well, and I expect the behavior to be the same on all instance types using NVMe.
Again, by my reading of #2481, we thought we had this fixed—at least for Azure—but I'm not finding it to work for me in AWS.
Here's the EC2 instance console log from a failing case, asking Ignition to create a filesystem on the device path "/dev/sdf."