Ubuntu 23.10 (Mantic) minimized/minimal cloud images do not receive IP address during provisioning in certain environments #4451
Comments
Thanks @philroche for the bug here. I can reproduce this problem with the test procedure you provided and confirm the bug. This race in network setup represents a general problem for two scenarios:
In both of these cases, minimal images/kernels will be forced to rely on udev discovering NICs, loading required modules and bringing links up as a reaction to Given that neither To be honest, this same condition exists for both
Additionally, the systemd unit From cloud-init's perspective, we can toss around ideas in this issue that could be workarounds for known platforms or datasources that require network config to be functional. But the proposals below feel like hacks around shortcomings in the setup and config of udev-configured network devices. Ideally, images containing kernels and/or initramfs that aim to support certain platforms/solutions out of the box should probably provide either kernel built-ins or initramfs support for the required drivers, to ensure fast and efficient boot on those hardware platforms. Proposals/investigations for coping with late udev net subsystem adds ordered in reverse priority when no devices seen yet in
@TheRealFalcon @holmanb @philroche @enr0n and @xnox if you have any alternative suggestions that we may pursue please raise them as we think through this. I'm leaning toward option 4 as a final check in only certain conditions, but it really feels like this might be something that should be better handled in the kernel or in
In this case,
Thanks @enr0n! I'm deprioritizing this issue here for cloud-init as we won't need to perform this lookup and udevadm settle when no network devices are present, as the upstream systemd commit for systemd/systemd#27822 performs this check prior to completing systemd-networkd-wait-online.service once present in distributions. This behavior will correctly block The cases where this race remains are lower priority and don't warrant generalizing a fix in cloud-init at this time:
@blackboxsw have you escalated the systemd issue back to foundations? They could have fixed it in time for the mantic release, no? @enr0n is there a bug report that systemd is tracking for this?
I am now tracking https://bugs.launchpad.net/cloud-images/+bug/2036968 for Mantic.
Outstanding concerns
Do we have any known cloud-init use cases that have both <v253.6 and no initrd? I don't think we do. On Ubuntu we currently release back to focal which has newer than that, so I don't think that this is relevant anymore.
Waiting on
Proposals
All of the proposed solutions depend on
[1] I don't believe that we can know which ones to wait for without either a. platform-specific assumptions or b. relying on a (possibly incorrect) network configuration which in the broken case would require boot to hang indefinitely. We do need to make cloud-init resilient to device enumeration failures due to late device uevents, but that is tracked in a separate bug report.
The conclusion reached in this bug was that cloud-init expects a primary interface to be available via builtin module availability or dependency on @blackboxsw I don't see any issues or action items in scope for cloud-init related to this original bug. I think that we should close it. If there is remaining work to do related to, we should file a new issue which describes the user-visible behavior that is broken in cloud-init. Either way, unless I missed something significant, this issue should be closed.
For future Ubuntu-specific bug reports please file reports on Launchpad.
Since the issue raised here resulted from changes to external projects on Ubuntu, and the fix was also addressed in external Ubuntu projects, I don't see any action left to take for this bug report, so I'm going to close it. If further Ubuntu-related issues arise related to this, please file a new report on Launchpad.
Bug report
This is a cloud-init specific bug from the cloud-images/kernel bug @ https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036968
Steps to reproduce the problem
Following a recent change from the linux-kvm kernel to the linux-generic kernel in the mantic minimized images, there is a reproducible bug where a guest VM does not have an IP address assigned as part of cloud-init provisioning.
This is easiest to reproduce when emulating arm64 on an amd64 host. The bug is a race condition, so there could exist fast enough virtualisation on fast enough hardware where this bug is not present, but in all my testing I have been able to reproduce it.
The latest mantic minimized images from http://cloud-images.ubuntu.com/minimal/daily/mantic/ force initrdless boot and have no initrd to fall back to.
This bug is not present in the non-minimized/base images @ http://cloud-images.ubuntu.com/mantic/ as these boot with an initrd that has the required drivers for virtio-net.
Reproducer
You will then be able to log in with user `ubuntu` and password `passw0rd`.
You can run `ip a` and see that there is a network interface present (separate to `lo`) but no IP address has been assigned.
This is because when cloud-init is trying to configure network interfaces it doesn't find any, so it doesn't configure any. By the time boot is complete the network interface is present, but cloud-init provisioning has already completed.
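To make the race concrete, here is a minimal Python sketch of what "look once and give up" versus "wait briefly for a late device" looks like when enumerating interfaces from sysfs. This is an illustration only, not cloud-init's actual code: the function names `list_nics` and `wait_for_nics`, the timeout values, and the polling approach itself are assumptions made for this example.

```python
import os
import time


def list_nics(sysfs_net="/sys/class/net"):
    """Return non-loopback entries currently visible under sysfs.

    A single call is a snapshot: if the driver module has not been
    loaded yet, the NIC is simply absent from the result.
    """
    try:
        names = os.listdir(sysfs_net)
    except FileNotFoundError:
        return []
    return sorted(name for name in names if name != "lo")


def wait_for_nics(timeout=10.0, interval=0.5, sysfs_net="/sys/class/net"):
    """Poll until at least one NIC appears or the timeout expires.

    Hypothetical helper: returns whatever is visible at that point,
    which may still be an empty list.
    """
    deadline = time.monotonic() + timeout
    while True:
        nics = list_nics(sysfs_net)
        if nics or time.monotonic() >= deadline:
            return nics
        time.sleep(interval)
```

On an affected guest, a single `list_nics()` call early in boot would return an empty list, while the same call after boot completes would include the late-arriving interface.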
You can verify this by running
sudo cloud-init clean && sudo cloud-init init
You can then see a successfully configured network interface
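If you prefer to check the result programmatically rather than reading `ip a` output by eye, the following Python sketch parses the JSON form of the same command (`ip -j addr show`) and reports interfaces that have no IPv4 address. The helper names are hypothetical, and the sketch assumes the `ifname`, `addr_info`, `family` and `local` fields that iproute2 emits in JSON mode.

```python
import json
import subprocess


def read_ip_json():
    """Return the JSON text printed by `ip -j addr show` (needs iproute2)."""
    proc = subprocess.run(
        ["ip", "-j", "addr", "show"],
        check=True, capture_output=True, text=True,
    )
    return proc.stdout


def ipv4_by_interface(ip_json):
    """Map each non-loopback interface name to its list of IPv4 addresses."""
    result = {}
    for link in json.loads(ip_json):
        name = link.get("ifname")
        if name is None or name == "lo":
            continue
        result[name] = [
            addr["local"]
            for addr in link.get("addr_info", [])
            if addr.get("family") == "inet" and "local" in addr
        ]
    return result


def unconfigured_interfaces(ip_json):
    """Return interfaces that are present but have no IPv4 address."""
    by_name = ipv4_by_interface(ip_json)
    return sorted(name for name, addrs in by_name.items() if not addrs)
```

Calling `unconfigured_interfaces(read_ip_json())` before and after the `cloud-init clean` / `cloud-init init` step should show the interface disappearing from the list once it has been configured.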
There is work ongoing to include the `virtio-net` driver as a built-in in the mantic generic kernel, which will solve the problem for use cases using the `virtio-net` driver. But the problem still exists with cloud-init and how to handle this race.
There are no plans to include the `e1000` driver, so this bug is still reproducible using that driver: use the `launch-qcow2-image-qemu-arm64-e1000.sh` reproducer.
The bug is also reproducible with an amd64 guest on an amd64 host on older/slower hardware.
Environment details
cloud-init logs
Attached
minimal-arm64-cloud-init.tar.gz