Initial hotplug support (SC-252) #936
Conversation
```
@@ -0,0 +1,220 @@
# This file is part of cloud-init. See LICENSE file for license information.
```
Some code (mainly the classes) in this file is a little more abstract than it needs to be. It's for the future support of disk events. Given that Ryan initially wrote it that way and it works and makes sense for future additions, it didn't make sense to make it any simpler.
```
@@ -29,6 +29,7 @@ class EventType(Enum):
    BOOT = "boot"
    BOOT_NEW_INSTANCE = "boot-new-instance"
    BOOT_LEGACY = "boot-legacy"
    HOTPLUG = 'hotplug'
```
It may be worth scoping this event type as HOTPLUG_NETWORK, since we have plans to handle disks at some point; then folks have the ability to opt into hotplug disk events vs. network events. What do you think? Or should both of those events be activated with a single HOTPLUG "event" subscription name?
Above this enum, there's an EventScope that is used with EventType for defining when we update. I think the scope is where we'd differentiate between network vs. disk hotplug.
@blackboxsw, I was able to successfully run tests on EC2. I also tested Debian and
Not in this PR, but it might be nice at some point to have either:

```
$ sudo cloud-init query userdata
#cloud-config
updates:
  network:
    when: [hotplug]
```
This really looks great... and solid job on the integration test created. I think there are a couple of things to think about, or general questions:

- Why, on a launched VM, do I not see the user-customized event updates logged in the "Allowed events" logs? What I currently see is:

  ```
  2021-07-09 03:43:38,606 - stages.py[DEBUG]: Allowed events: {<EventScope.NETWORK: 'network'>: {<EventType.BOOT_NEW_INSTANCE: 'boot-new-instance'>}}
  ```

- Do we want to surface structured hotplug status or config details in either `cloud-init status --long` or instance-data.json?
- Device removes seem to get us into quite a noisy log loop:

  ```
  2021-07-09 03:19:21,199 - hotplug_hook.py[DEBUG]: subsystem=net update attempt 2/5
  ```

  while waiting for the device to disappear from metadata across retries, since we perform the full `_get_data` operation on the datasource. I wonder if it might be worth the datasource growing a `_get_data_network` or `is_network_data_changed` method to make that query more specific in scope and purpose, which would also reduce the overall log saturation from walking the entire IMDS on each attempt.
- `cloud-init analyze show` will incorrectly start representing each new hotplug event as a separate boot event because it is a repeated log entry key. I think we might have to instrument something like the PR #915 comment to avoid repeated "boot" sections from analyze for device adds/removes. I'll probably re-open the recently closed PR "tune azure ds telemetry annotation" #915 to avoid this case for hotplug-reported events as well as Azure's telemetry events.
Example invalid `cloud-init analyze show` output due to 2x add/remove of the same NIC:

```
Starting stage: hotplug-hook
|`->restored from cache with run check: DataSourceOpenStackLocal [net,ver=2] @00.00100s +00.00600s
Finished stage: (hotplug-hook) 06.70300 seconds
Total Time: 6.70300 seconds
-- Boot Record 04 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.
Starting stage: hotplug-hook
|`->restored from cache with run check: DataSourceOpenStackLocal [net,ver=2] @00.00000s +00.00500s
Finished stage: (hotplug-hook) 21.50300 seconds
Total Time: 21.50300 seconds
4 boot records analyzed
```
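The narrower metadata query suggested above could look something like the sketch below. Neither `is_network_data_changed` nor `_fetch_network_metadata` exists in cloud-init; both names, and the `FakeDataSource` class, are hypothetical illustrations of reducing the retry loop's scope to network metadata only.

```python
# Hypothetical sketch only: a datasource method that compares just the
# network portion of metadata, instead of re-running the full _get_data
# crawl on every hotplug retry.

class FakeDataSource:
    """Stand-in datasource holding the last-seen network metadata."""

    def __init__(self, network_metadata):
        self.network_metadata = network_metadata

    def _fetch_network_metadata(self):
        # A real implementation would query only the network section of
        # IMDS here, keeping log output and crawl time small per retry.
        return self.network_metadata

    def is_network_data_changed(self, candidate):
        """Return True if the fetched network metadata differs."""
        return self._fetch_network_metadata() != candidate


ds = FakeDataSource({"eth0": {"mac": "aa:bb"}})
print(ds.is_network_data_changed({"eth0": {"mac": "aa:bb"}}))  # False
print(ds.is_network_data_changed({}))  # True
```

The hotplug retry loop would then poll this narrow check instead of the full datasource refresh.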
cloudinit/cmd/devel/hotplug_hook.py
```
def get_parser(parser=None):
    """Build or extend and arg parser for hotplug-hook utility.
```
```
- """Build or extend and arg parser for hotplug-hook utility.
+ """Build or extend an arg parser for hotplug-hook utility.
```
```
if is_finished; then
    # open cloud-init's hotplug-hook fifo rw
    exec 3<>/run/cloud-init/hook-hotplug-cmd
```
It'd be nice to log some bread crumbs to /var/log/cloud-init.log about when this FIFO was created.
IIRC, the systemd service cloud-hotplugd is where these details are ... and I suspect we want to keep the hotplugd log separate ...
@raharper Right-o, James and I ended up talking later on Friday about this and determined as well that interacting with journalctl and/or systemctl was useful enough, and there's enough logging happening when the handler is instantiated after hotplug that we get the gist of what's going on. If we add more visibility via `cloud-init status --long` or even instance-data.json to datasource configuration options, I think that meets my desire for discoverability of ongoing "services" that cloud-init has active/enabled.
```
@@ -0,0 +1,11 @@
[Unit]
```
The perms on this file need to change, per the complaints seen in journalctl:

```
Jul 09 03:09:23 test-hotplug systemd[1]: Configuration file /lib/systemd/system/cloud-init-hotplugd.service is marked executable. Please remove executable permission bits. Proceeding anyway.
```
Beyond the nits that I mentioned throughout, I think we might want to touch up `cloud-init devel hotplug-hook` a bit to better handle bogus values for those testing on the command line. As it currently is, I hit the following nasty traceback just trying to learn more about the subcommand:
```
ubuntu@test-hotplug:~$ sudo cloud-init devel hotplug-hook
2021-07-09 04:08:22,223 - hotplug_hook.py[ERROR]: Received fatal exception handling hotplug!
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 211, in handle_args
    handle_hotplug(
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 140, in handle_hotplug
    raise Exception(
Exception: hotplug-hook: cannot handle events for subsystem: None
Traceback (most recent call last):
  File "/usr/bin/cloud-init", line 33, in <module>
    sys.exit(load_entry_point('cloud-init==21.2', 'console_scripts', 'cloud-init')())
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 920, in main
    retval = util.log_time(
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2392, in log_time
    ret = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 211, in handle_args
    handle_hotplug(
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 140, in handle_hotplug
    raise Exception(
Exception: hotplug-hook: cannot handle events for subsystem: None
```
While not critical to functionality (since our call chain from systemd units/services is structured), it'd be nice to do a little validation on some of the arg values to better handle failure modes.
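One low-effort way to get that validation is argparse's built-in `choices=`, which rejects bogus values with a usage error instead of a raw traceback. This is a sketch, not the real parser in `cloudinit/cmd/devel/hotplug_hook.py`; the flag names mirror the arguments seen in the traceback above, and the short options are assumptions.

```python
# Sketch: tighter CLI validation for a hotplug-hook-style subcommand.
# Assumes 'net' is the only supported subsystem, as in this PR.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="hotplug-hook")
    # choices= makes argparse emit a clean "invalid choice" error
    # rather than raising deep inside the handler.
    parser.add_argument("-s", "--subsystem", required=True,
                        choices=["net"])
    parser.add_argument("-u", "--udevaction", required=True,
                        choices=["add", "remove"])
    parser.add_argument("-d", "--devpath", required=True)
    return parser

args = build_parser().parse_args(
    ["--subsystem", "net", "--udevaction", "add", "--devpath", "/sys/x"])
print(args.subsystem)  # net
```

Running the subcommand with no arguments would then print usage and exit with status 2 instead of the exception above.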
```
LOG.debug(
    '%s called with the following arguments:\n'
    'udevaction: %s\n'
    'subsystem: %s\n'
    'devpath: %s',
    name, args.udevaction, args.subsystem, args.devpath
)
```
Maybe it's just me, but I find multiline logs hard to parse in cloud-init.log. Can we drop the newlines from this somehow?
Suggested single-line form:

```
LOG.debug(
    '%s called with the following arguments: '
    '{udevaction: %s, subsystem: %s, devpath: %s}',
    name, args.udevaction, args.subsystem, args.devpath
)
```
```
[Service]
Type=simple
ExecStart=/bin/bash -c 'read args <&3; echo "args=$args"; \
```
Man, I had to re-read the bash FIFO and file descriptor docs a bit to get my bearings here.
We should update the service file with some doc describing what's happening. IIRC, we are reading from FD 3 into the variable args. This allows passing key/value pairs from the hotplug hook, which contains the udev environment context when the event is processed, to the hotplugd service, which then invokes cloud-init's hotplug handler with the event args.
+1 will add doc comments to these files to give better context.
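The FIFO handshake described above can be modeled in a few lines: one side (the udev hook) writes a single line of event arguments, and the other side (the service) blocks on a read until that line arrives. The path and argument string below are placeholders; the real plumbing lives in the udev rule, the hook-hotplug script, and the cloud-init-hotplugd socket/service units.

```python
# Toy model (Linux only) of the hook -> FIFO -> service handoff.
# A temp-dir FIFO stands in for /run/cloud-init/hook-hotplug-cmd.
import os
import tempfile
import threading

fifo_path = os.path.join(tempfile.mkdtemp(), "hook-hotplug-cmd")
os.mkfifo(fifo_path)

def udev_side():
    # Hook side: opening for write blocks until a reader appears,
    # then the udev event context is passed as one line.
    with open(fifo_path, "w") as w:
        w.write("--subsystem=net --udevaction=add --devpath=/sys/x\n")

t = threading.Thread(target=udev_side)
t.start()
# Service side: 'read args <&3' in the unit corresponds to this readline.
with open(fifo_path) as r:
    args = r.readline().strip()
t.join()
print(args)
```

Because both `open()` calls block until the other end connects, the service naturally waits for events and the hook naturally waits for the service, which is the serialization property the FIFO provides.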
cloudinit/cmd/devel/hotplug_hook.py
```
if 'reporting' in hotplug_init.cfg:
    reporting.update_configuration(hotplug_init.cfg.get('reporting'))

# Logging isn't be setup until now
```
```
- # Logging isn't be setup until now
+ # Logging isn't going to be setup until now
```
cloudinit/net/activators.py
```
@@ -15,6 +15,19 @@
LOG = logging.getLogger(__name__)
```
def _alter_interface(cmd, device_name): |
Might as well type hint it too
```
- def _alter_interface(cmd, device_name):
+ def _alter_interface(cmd, device_name) -> bool:
```
```
@staticmethod
@abstractmethod
def bring_down_interface(device_name: str) -> bool:
    raise NotImplementedError()
```
It'd be nice in these methods if we documented in the docstrings that True means success and False means failure to perform the operation. Please also add/update docs.
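A minimal sketch of that docstring convention, using a trivial concrete subclass to exercise the interface (the `NoopActivator` class here is invented for illustration; only the activator method names come from the diff above):

```python
# Sketch: document that the bool returns mean success/failure, which is
# otherwise easy to misread as "the interface is up/down".
from abc import ABC, abstractmethod

class NetworkActivator(ABC):
    @staticmethod
    @abstractmethod
    def bring_up_interface(device_name: str) -> bool:
        """Bring up the named interface.

        :return: True on success, False if the operation failed.
        """

    @staticmethod
    @abstractmethod
    def bring_down_interface(device_name: str) -> bool:
        """Bring down the named interface.

        :return: True on success, False if the operation failed.
        """

class NoopActivator(NetworkActivator):
    """Do-nothing activator used only to exercise the interface."""

    @staticmethod
    def bring_up_interface(device_name: str) -> bool:
        return True  # report success without touching the system

    @staticmethod
    def bring_down_interface(device_name: str) -> bool:
        return True  # report success without touching the system
```

With the return contract in the docstring, callers can write `if not activator.bring_up_interface(name): ...` without guessing what False means.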
This is one aspect of these events that's confusing, and I haven't found a better way to represent it in the logs. If you search later in the logs for "Allowed events", you'll likely see hotplug in there. The very first time we check whether we can fetch/update metadata, we haven't even processed userdata yet, so we have to rely on the datasource defaults. HOTPLUG initially won't be represented there until after we've processed the userdata.
There are a few things in play here.

1. We're not enabling hotplug support by default. If we did, then the on-disk config of cloud-init would include hotplug in the events.
2. We likely don't want to react to hotplug events on first boot (boot-new-instance or boot). In either of these cases, the metadata service either has the new NIC config or it doesn't. If it's present in the metadata, then we don't need to handle any hotplug event (regular network config generation and blocking will handle it). If it isn't in the metadata service, we have to wait for that anyhow, which would block boot. I don't recall if we leave the hotplug event in the FIFO, but we could process hotplug events that occurred during first boot after cloud-final is completed, or even just sleep for some time until cloud-init.target is reached.
3. We want to let user-data control whether hotplug events are handled, so the counter log sequence is one where we see hotplug initially listed if the OS/image has it enabled by default, only for post-user-data processing to see it being removed. We may want to enumerate the state (enabled/disabled) such that the log would show a state transition. That may be less confusing.
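The two "Allowed events" log lines differ because the datasource defaults and the user-data opt-ins are merged at different times. A toy model of that merge, with enum values mirroring cloudinit/event.py but the `allowed_events` helper itself invented for illustration:

```python
# Toy model: user-data 'updates' opt-ins are unioned with the
# datasource's default events, so the allowed set legitimately grows
# after user-data is processed.
from enum import Enum

class EventScope(Enum):
    NETWORK = "network"

class EventType(Enum):
    BOOT_NEW_INSTANCE = "boot-new-instance"
    HOTPLUG = "hotplug"

def allowed_events(datasource_defaults, userdata_updates):
    """Union datasource default events with user-data opt-ins."""
    allowed = {scope: set(events)
               for scope, events in datasource_defaults.items()}
    for scope, events in userdata_updates.items():
        allowed.setdefault(scope, set()).update(events)
    return allowed

# Before user-data: only the defaults are known.
defaults = {EventScope.NETWORK: {EventType.BOOT_NEW_INSTANCE}}
# After user-data with "updates: {network: {when: [hotplug]}}":
userdata = {EventScope.NETWORK: {EventType.HOTPLUG}}
print(allowed_events(defaults, userdata))
```

The early log line corresponds to `defaults` alone; the later one corresponds to the merged result, which is the state transition suggested above.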
@raharper @TheRealFalcon It doesn't seem like I can push to this branch. I added a couple of commit fixes for sysconfig changes that I tested on Rocky (CentOS) and pushed to upstream/james-hotplug. I ran into an SELinux-related policy issue with systemd.socket reading from a FIFO, which I opened as https://bugs.launchpad.net/cloud-init/+bug/1936229. I've pushed a comparable branch with a couple of extra commits here to see general hotplug on Rocky on EC2 work:
n/m, PEBCAK. I can push to this branch.
Adds a udev script which will invoke a hotplug hook script on all net add events. The script will write some udev arguments to a systemd FIFO socket (to ensure we have only one instance of cloud-init running at a time), which is then read by a new service that calls a new 'cloud-init devel hotplug-hook' command to handle the new event. This hotplug-hook command will:

- Fetch the pickled datasource
- Verify that the hotplug event is supported/enabled
- Update the metadata for the datasource
- Ensure the hotplugged device exists within the datasource
- Apply the config change on the datasource metadata
- Bring up the new interface (or apply global network configuration)
- Save the updated metadata back to the pickle cache

Also scattered in some unrelated typing to make my life easier.
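The handler steps in the commit message can be condensed into a sketch. All function and method names below (`handle_hotplug`, `FakeDS`, and its methods) are placeholders for illustration, not the real hotplug_hook.py API:

```python
# Condensed sketch of the hotplug-hook flow: verify, retry metadata
# updates until the device (dis)appears, apply config, re-pickle.
import time

def handle_hotplug(datasource, devpath, udevaction, retries=5, wait=1):
    if not datasource.supports_hotplug():      # verify supported/enabled
        raise RuntimeError("hotplug not enabled for this datasource")
    for attempt in range(1, retries + 1):
        datasource.update_metadata()           # re-fetch metadata
        # On 'add' we wait for the device to appear in metadata;
        # on 'remove' we wait for it to disappear.
        if datasource.device_detected(devpath) == (udevaction == "add"):
            datasource.apply_network_config()  # bring up / reconfigure
            datasource.save_to_cache()         # re-pickle the datasource
            return True
        time.sleep(wait)                       # metadata not settled yet
    return False

class FakeDS:
    """Minimal stand-in datasource for demonstrating the flow."""
    def __init__(self):
        self.saved = False
    def supports_hotplug(self):
        return True
    def update_metadata(self):
        pass
    def device_detected(self, devpath):
        return True
    def apply_network_config(self):
        pass
    def save_to_cache(self):
        self.saved = True

print(handle_hotplug(FakeDS(), "/sys/x", "add", wait=0))  # True
```

The retry-with-wait structure is what produces the `update attempt 2/5` log lines discussed earlier in the review.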
2. Remove the PCI device restriction
3. Increase the wait times when updating metadata from hotplug
4. Add a log message for the selected activator
@blackboxsw: I am +1 on your changes. If you think this is in a good spot, I still need an approval from you.
+1 on this @TheRealFalcon. Let's document the related SELinux bug in a follow-up PR. We can iterate on the rest of the "gaps" in the future.
Proposed Commit Message
Additional Context
Currently enabled for OpenStack and EC2. Haven't tested anywhere else. Testing via LXD won't be easy/possible: new ethernet devices on LXD show up as one device and are then quickly moved to another device, so the udev add event doesn't give a valid device name. That may be a scenario we want to handle at some point, but it doesn't seem necessary right now.
Test Steps
Created the following cloud-config:
then:
on instance:
once the instance is back up, on the host:
`ip a` should show the new NIC configured appropriately, with the relevant logs at the bottom of cloud-init.log.

Checklist: