
Unable to find a system nic while running cloud-init #4125

Open
xiaoge1001 opened this issue May 18, 2023 · 7 comments
Assignees
Labels
bug Something isn't working correctly priority Fix soon Stale

Comments

@xiaoge1001
Contributor

xiaoge1001 commented May 18, 2023

Bug report

I ran "cloud-init status" and found that cloud-init had failed to execute.

Steps to reproduce the problem

Reboot the compute nodes

Environment details

  • Cloud-init version: 21.4
  • Operating System Distribution: openEuler

cloud-init logs

2023-05-18 02:48:15,451 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 689, in status_wrapper
ret = functor(name, args)
File "/usr/lib/python3.9/site-packages/cloudinit/cmd/main.py", line 398, in main_init
init.apply_network_config(bring_up=bring_up_interfaces)
File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 789, in apply_network_config
netcfg, src = self._find_networking_config()
File "/usr/lib/python3.9/site-packages/cloudinit/stages.py", line 740, in _find_networking_config
if self.datasource and hasattr(self.datasource, 'network_config'):
File "/usr/lib/python3.9/site-packages/cloudinit/sources/DataSourceConfigDrive.py", line 158, in network_config
self._network_config = openstack.convert_net_json(
File "/usr/lib/python3.9/site-packages/cloudinit/sources/helpers/openstack.py", line 698, in convert_net_json
raise ValueError("Unable to find a system nic for %s" % d)
ValueError: Unable to find a system nic for {'mtu': 1500, 'type': 'physical', 'subnets': [{'type': 'static', 'netmask': '255.255.255.0', 'routes': [{'netmask': '0.0.0.0', 'network': '0.0.0.0', 'gateway': '192.168.50.1'}], 'address': '192.168.50.19', 'ipv4': True}], 'mac_address': 'fa:16:3e:7c:49:9f'}
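The error above fires when the `mac_address` from the OpenStack network JSON cannot be matched against any MAC visible in userspace. A minimal sketch of that lookup on Linux (illustrative only; the helper names are assumptions, not cloud-init's actual code, which lives in `convert_net_json`):

```python
import os

def system_macs():
    """Map MAC address -> interface name, as read from sysfs."""
    macs = {}
    for nic in os.listdir("/sys/class/net"):
        try:
            with open(f"/sys/class/net/{nic}/address") as f:
                macs[f.read().strip().lower()] = nic
        except OSError:
            continue  # interface vanished or exposes no address file
    return macs

def find_nic(mac):
    """Return the interface name for `mac`, or fail like convert_net_json."""
    nic = system_macs().get(mac.lower())
    if nic is None:
        raise ValueError("Unable to find a system nic for %s" % mac)
    return nic
```

If the NIC's driver has not been loaded by the time this runs, the MAC is simply absent from sysfs and the lookup raises, even though the configuration itself is valid.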

@xiaoge1001 xiaoge1001 added bug Something isn't working correctly new An issue that still needs triage labels May 18, 2023
@xiaoge1001
Contributor Author

xiaoge1001 commented May 18, 2023

I add "ExecStartPre=sleep 30" in Service section of cloud-init-local.service. This avoids the problem. But I think that it's not a fundamental solution.
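For reference, the delay described above can be applied as a systemd drop-in instead of editing the shipped unit (the path follows the standard drop-in convention; the 30-second value is the arbitrary one from the workaround, not a recommendation):

```ini
# /etc/systemd/system/cloud-init-local.service.d/10-delay.conf
[Service]
ExecStartPre=/bin/sleep 30
```

Run `systemctl daemon-reload` afterwards. As noted, this only papers over the race between NIC driver loading and cloud-init's init-local stage.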

@radeksm

radeksm commented May 19, 2023

I see a similar problem on 22.1-5.el8.0.1; this is what comes from RHEL 8.6:

May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd-hostnamed[7896]: Changed static host name to 'overcloud-controller-cbis24-375-0'
May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd-hostnamed[7896]: Changed host name to 'overcloud-controller-cbis24-375-0'
May 18 12:06:41 overcloud-controller-cbis24-375-0 ovs-vsctl[7898]: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: Cloud-init v. 22.1-5.el8.0.1 running 'init-local' at Thu, 18 May 2023 16:06:40 +0000. Up 91.27 seconds.
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: 2023-05-18 16:06:41,214 - util.py[WARNING]: failed stage init-local
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: failed run of stage init-local
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ------------------------------------------------------------
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: Traceback (most recent call last):
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/cmd/main.py", line 761, in status_wrapper
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ret = functor(name, args)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/cmd/main.py", line 433, in main_init
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: init.apply_network_config(bring_up=bring_up_interfaces)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 869, in apply_network_config
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: netcfg, src = self._find_networking_config()
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 812, in _find_networking_config
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: if self.datasource and hasattr(self.datasource, "network_config"):
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/sources/DataSourceConfigDrive.py", line 169, in network_config
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: self.network_json, known_macs=self.known_macs
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: File "/usr/lib/python3.6/site-packages/cloudinit/sources/helpers/openstack.py", line 731, in convert_net_json
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: raise ValueError("Unable to find a system nic for %s" % d)
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ValueError: Unable to find a system nic for {'type': 'physical', 'mtu': 9000, 'subnets': [{'type': 'dhcp4'}], 'mac_addre>
May 18 12:06:41 overcloud-controller-cbis24-375-0 cloud-init[7877]: ------------------------------------------------------------
May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd[1]: cloud-init-local.service: Main process exited, code=exited, status=1/FAILURE
May 18 12:06:41 overcloud-controller-cbis24-375-0 dbus-daemon[7630]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.>
May 18 12:06:41 overcloud-controller-cbis24-375-0 systemd[1]: cloud-init-local.service: Failed with result 'exit-code'.
May 18 12:06:41 overcloud-controller-cbis24-375-0 dbus-daemon[7630]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'

While I do see the interface with the expected MAC address later, my guess is that the NIC with the expected MAC address is not ready, or does not exist, at the time cloud-init tries to configure it. This is likely to happen when you have SR-IOV NICs with a large number of virtual functions.

@aciba90
Contributor

aciba90 commented May 24, 2023

Thanks @xiaoge1001 for reporting this issue.

I am not able to reproduce this exact behavior. Could you please provide steps to reproduce and attach the logs?

I mark this issue as incomplete in the meanwhile.

@aciba90 aciba90 added incomplete Action required by submitter and removed new An issue that still needs triage labels May 24, 2023
@xiaoge1001
Contributor Author

Thanks @xiaoge1001 for reporting this issue.

I am not able to reproduce this exact behavior. Could you please provide steps to reproduce and attach the logs?

I mark this issue as incomplete in the meanwhile.

Sorry, this was an occasional problem and I cannot reproduce the issue. But I adopted the suggestion here at the time. Link: https://askubuntu.com/questions/1400527/unable-to-find-a-system-nic-while-running-cloud-init

@holmanb
Member

holmanb commented Apr 27, 2024

This is a duplicate of #3523.

@holmanb
Member

holmanb commented Apr 27, 2024

Cloud-init doesn't hit this codepath until after systemd-networkd-wait-online.service (and the NetworkManager equivalent). While this guarantees that an interface will be ready in userspace, it does not guarantee that all expected interfaces will be ready in userspace. Interfaces may be brought up late by the kernel when the driver is not loaded from the initramfs. This is what causes this error.

This exception apparently exists to validate that invalid network configuration isn't passed to cloud-init. It gets thrown when rendering a configuration for OpenStack that references a network device which hasn't been loaded yet.

When an interface is missing, cloud-init has a few options:

a) assume (incorrectly) that a device not yet visible in userspace means writing out a configuration with it would break network backends, and traceback without configuring the network at all.

Pros: cloud-init can log warnings when openstack passes invalid configuration
Cons: breaks perfectly valid configurations (as reported multiple times).

b) warn about the device not existing and drop it from the configuration (previously proposed)

Pros: breaks perfectly valid configurations slightly less often than the current state, by rendering network configuration for all interfaces that are currently up
Cons: still breaks valid configurations that include late-loaded interfaces

c) block until the interface that openstack told us about is available

Pros: will always "work" when valid configuration is passed
Cons: worst case failure path with invalid configuration (instance hangs), unnecessarily slow boot

d) trust that network daemons can handle configurations which reference interfaces which might load late[1]: don't throw an exception, don't log an error, but do log an info/debug about the missing interface in case openstack did pass invalid network configuration

Pros: will always "work" when valid configuration is passed
Cons: no warning logged or exception thrown when invalid configuration is received from openstack
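Option c) above could be sketched as a bounded poll of sysfs for the expected MAC (illustrative only; the function name, timeout, and poll interval are assumptions, not proposed cloud-init code):

```python
import os
import time

def wait_for_nic(mac, timeout=10.0, poll=0.5):
    """Poll /sys/class/net until an interface with `mac` appears.

    Returns the interface name, or None once `timeout` seconds elapse.
    The None path is the invalid-configuration failure mode: the
    instance boots slowly instead of tracing back in init-local.
    """
    mac = mac.lower()
    deadline = time.monotonic() + timeout
    while True:
        for nic in os.listdir("/sys/class/net"):
            try:
                with open(f"/sys/class/net/{nic}/address") as f:
                    if f.read().strip().lower() == mac:
                        return nic
            except OSError:
                continue  # interface disappeared mid-scan
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

The worst-case behavior is visible here: with an invalid configuration, every missing MAC costs the full timeout, which is exactly the "unnecessarily slow boot" listed under c)'s cons.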

There is no clear "best" choice; this is an issue with engineering trade-offs to consider.

Cloud-init currently does a). The trade-offs are between behaving correctly, boot speed, failure-path behavior, and failure-path ease of debugging.

If working correctly when correct configuration is passed is the priority, then c) or d) are probably superior. I prefer d) for its better failure-path behavior and lack of impact on boot speed.

If the priority is logging noisily when invalid network configuration is passed (which would then require users to build the drivers into the kernel or modify the initrd to load them), then a) or b) are probably superior.

I haven't investigated when OpenStack might pass invalid configuration. Can users provide freeform network configuration, or is this machine-generated configuration produced either dynamically or via structured user input forms? If users can't pass freeform configuration, then a) and b) would seem significantly less practical. Even if they can, I think I would prefer to do the right thing where possible rather than maintain the ability to warn but break on false positives.

[1] They must - interfaces load late all the time on many platforms. Common offenders include virtio and high speed network adapters.

@holmanb holmanb added priority Fix soon new An issue that still needs triage and removed incomplete Action required by submitter labels Apr 27, 2024
@TheRealFalcon TheRealFalcon removed the new An issue that still needs triage label Apr 29, 2024
@holmanb holmanb added the new An issue that still needs triage label May 3, 2024
@a-dubs a-dubs removed the new An issue that still needs triage label Jun 24, 2024
@github-actions github-actions bot added the Stale label Sep 23, 2024