Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud-init runs before network interface is ready on RHEL 8.7 #4050

Closed
ubuntu-server-builder opened this issue May 12, 2023 · 5 comments
Closed
Labels
launchpad Migrated from Launchpad

Comments

@ubuntu-server-builder
Copy link
Collaborator

This bug was originally filed in Launchpad as LP: #1998655

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = 2023-02-22T16:52:30.310856+00:00
date_created = 2022-12-02T22:41:54.160805+00:00
date_fix_committed = 2022-12-13T16:55:52.660665+00:00
date_fix_released = 2023-02-22T16:52:30.310856+00:00
id = 1998655
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1998655
milestone = None
owner = mark-rogers-h
owner_name = Mark Rogers
private = False
status = fix_released
submitter = mark-rogers-h
submitter_name = Mark Rogers
tags = []
duplicates = []

Launchpad user Mark Rogers(mark-rogers-h) wrote on 2022-12-02T22:41:54.160805+00:00

We're running RHEL 8.7 VMs on Apache CloudStack. The cloud-init service waits for Network Manager to start, but it runs before Network Manager is done. Because the network isn't ready yet, the CloudStack data source is unable to determine the virtual router address. This causes cloud-init to fall back to using DataSourceNone. That leaves the VM without the proper configuration.

I see one bug, and one potential bug here. The bug is that cloud-init is starting too early in the boot process. The potential bug is that the CloudStack data source doesn't have any retry logic when it looks for the virtual router.

To fix the timing bug, I recommend adding After=NetworkManager-wait-online.service to the cloud-init.service file. That will make it wait until after the network interface is ready. Note that it's already waiting for systemd-networkd-wait-online.service, but that service doesn't exist in RHEL 8.7, so systemd ignores that line.

You can see the timing issue in this excerpt from /var/log/messages:

Dec 2 18:27:04 mark-test-fs NetworkManager[723]: [1670005624.5790] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Dec 2 18:27:04 mark-test-fs NetworkManager[723]: [1670005624.6212] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Dec 2 18:27:04 mark-test-fs NetworkManager[723]: [1670005624.6213] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Dec 2 18:27:04 mark-test-fs NetworkManager[723]: [1670005624.6215] manager: NetworkManager state is now CONNECTED_SITE
Dec 2 18:27:04 mark-test-fs NetworkManager[723]: [1670005624.6217] device (ens3): Activation: successful, device activated.
Dec 2 18:27:04 mark-test-fs NetworkManager[723]: [1670005624.6220] manager: NetworkManager state is now CONNECTED_GLOBAL
Dec 2 18:27:04 mark-test-fs NetworkManager[723]: [1670005624.6222] manager: startup complete
Dec 2 18:27:04 mark-test-fs systemd[1]: Started Network Manager Wait Online.
Dec 2 18:27:04 mark-test-fs systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,027 - util.py[WARNING]: Failed to fetch password from virtual router 10.11.8.0
Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,043 - stages.py[INFO]: Loaded datasource DataSourceCloudStack - DataSourceCloudStack
Dec 2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,093 - stages.py[INFO]: Applying network configuration from fallback bringup=True: {'ethernets': {'ens3': {'dhcp4': True, 'set-name': 'ens3', 'match': {'macaddress': '1e:00:29:00:02:b4'}}}, 'version': 2}
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.1480] agent-manager: agent[f74dfcab0129a270,:1.13/nmcli-connect/0]: agent registered
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.1487] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.1489] manager: NetworkManager state is now DISCONNECTING
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.1494] device (ens3): disconnecting for new activation request.
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.1494] audit: op="connection-activate" uuid="21d47e65-8523-1a06-af22-6f121086f085" name="System ens3" pid=975 uid=0 result="success"
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2131] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2247] dhcp4 (ens3): canceled DHCP transaction
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2247] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2247] dhcp4 (ens3): state changed no lease
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2770] manager: NetworkManager state is now DISCONNECTED
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2771] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085)
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2777] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2778] manager: NetworkManager state is now CONNECTING
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2779] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2790] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2791] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2855] dhcp4 (ens3): state changed new lease, address=10.11.8.179
Dec 2 18:27:05 mark-test-fs NetworkManager[723]: [1670005625.2858] policy: set 'System ens3' (ens3) as default for IPv4 routing and DNS

@ubuntu-server-builder ubuntu-server-builder added the launchpad Migrated from Launchpad label May 12, 2023
@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Max Dribinsky(maxdrib) wrote on 2022-12-06T16:59:45.929551+00:00

Adding some additional details here, we were able to identify that this issue is not just limited to RHEL8.7, but also can be observed in RHEL8.6. The root cause emerges from the cloud-init source code with rhel. Looking at cloud-init releases for rhel in https://git.centos.org/rpms/cloud-init/releases, we saw the introduction of this systemd After= condition here (https://git.centos.org/rpms/cloud-init/blob/e2481008492b618ff37907f88dc718d0a2154724/f/SOURCES/0007-Remove-race-condition-between-cloud-init-and-Network.patch) to prevent the NetworkManager/cloud-init race condition, and its subsequent removal here (https://git.centos.org/rpms/cloud-init/blob/6af34fd764c66bb42d83d6fb706b92a1cd50cec1/f/SOURCES/ci-Remove-rhel-specific-files.patch). We have identified the issue to affect cloud-init versions >cloud-init-22.1-1.el8 (https://git.centos.org/rpms/cloud-init/c/1992bd68fc512e22df783ff955edccbad66dbd60?branch=c8s). We recommend updating the systemd service template for cloud-init to account for this additional After= condition.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user James Falcon(falcojr) wrote on 2022-12-06T19:12:15.242007+00:00

If I'm understanding this correctly, there was a patch downstream that had added the correct After= condition. That patch was removed from downstream, but that behavior never made it upstream. So you would like upstream (presumably in cloud-init.service.tmpl) to include 'After=NetworkManager-wait-online.service' for any RHEL based distro?

If that's the case, I created a PR here:
#1893

Please let me know if this works for you.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user shixuantong(sxt1001) wrote on 2022-12-07T12:10:51.835656+00:00

#1893

This modification will take effect on all "almalinux", "centos", "cloudlinux", "eurolinux", "fedora","miraclelinux", "openEuler", "openmandriva", "rhel", "rocky","virtuozzo" systems, not only "rhel". Does it affect other distro?

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Alberto Contreras(aciba) wrote on 2023-02-22T16:52:31.036719+00:00

This bug is believed to be fixed in cloud-init in version 23.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Abhilash Chittathukatil(chittath) wrote on 2023-04-27T23:28:37.143058+00:00

I am getting similar error in RHEl8.7 where cloud-init it is unable to bring up the network interface. just curious if it related. Max version I can install from rhel-8-appstream-rhui-rpms is loud-init-22.1-6.el8_7.2.noarch.

rpm -q cloud-init

cloud-init-22.1-6.el8_7.2.noarch


  Starting Network Manager Script Dispatcher Service...

[ OK ] Started Network Manager Script Dispatcher Service.
[ OK ] Started Network Manager Wait Online.
Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Dynamic System Tuning Daemon.
[ 8.617423] cloud-init[1245]: Cloud-init v. 21.1-15.el8 running 'init' at Thu, 27 Apr 2023 18:59:22 +0000. Up 8.56 seconds.
[ 8.617508] cloud-init[1245]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[ 8.617537] cloud-init[1245]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 8.617563] cloud-init[1245]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 8.617588] cloud-init[1245]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 8.617612] cloud-init[1245]: ci-info: | eth0 | False | . | . | . |
[ 8.617635] cloud-init[1245]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
[ 8.617660] cloud-init[1245]: ci-info: | lo | True | ::1/128 | . | host | . |
[ 8.617684] cloud-init[1245]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 8.617707] cloud-init[1245]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[ 8.617730] cloud-init[1245]: ci-info: +-------+-------------+---------+-----------+-------+
[ 8.617754] cloud-init[1245]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[ 8.617777] cloud-init[1245]: ci-info: +-------+-------------+---------+-----------+-------+
[ 8.617804] cloud-init[1245]: ci-info: +-------+-------------+---------+-----------+-------+
[ 8.617829] cloud-init[1245]: 2023-04-27 18:59:22,354 - util.py[WARNING]: Failed loading yaml blob. unacceptable character #xdca0: special characters are not allowed
[ 8.617855] cloud-init[1245]: in "<unicode string>", position 0
[ 8.620265] cloud-init[1245]: 2023-04-27 18:59:22,354 - util.py[WARNING]: Failed at merging in cloud config part from cloud-config.txt
[ 8.644770] cloud-init[1245]: 2023-04-27 18:59:22,359 - util.py[WARNING]: Failed calling handler ShellScriptPartHandler: [['text/x-shellscript']] (text/x-shellscript, userdata.txt, 2) with frequency always
[ OK ] Started Initial cloud-init job (metadata service crawler).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
launchpad Migrated from Launchpad
Projects
None yet
Development

No branches or pull requests

1 participant