
Initial hotplug support (SC-252) #936

Merged 15 commits into canonical:main on Jul 19, 2021

Conversation


@TheRealFalcon TheRealFalcon commented Jul 2, 2021

Proposed Commit Message

Initial hotplug support

Adds a udev script which will invoke a hotplug hook script on all net
add events. The script will write some udev arguments to a systemd FIFO
socket (to ensure we have only one instance of cloud-init running at a
time), which is then read by a new service that calls a new 'cloud-init
devel hotplug-hook' command to handle the new event.

This hotplug-hook command will:
- Fetch the pickled datasource
- Verify that the hotplug event is supported/enabled
- Update the metadata for the datasource
- Ensure the hotplugged device exists within the datasource
- Apply the config change on the datasource metadata
- Bring up the new interface (or apply global network configuration)
- Save the updated metadata back to the pickle cache

Also scattered in some unrelated typing to make my life easier.
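For orientation, here is a minimal Python sketch of the flow the commit message describes. This is illustrative pseudocode only; the datasource helper names (supported_update_events, update_metadata, device_paths, apply_network_config, persist) are approximations, not cloud-init's actual API.

```python
# Illustrative sketch of the hotplug-hook flow described in the commit
# message above. All datasource helper names are approximations of
# cloud-init's API, not the real implementation.
from enum import Enum


class EventType(Enum):
    HOTPLUG = "hotplug"


def handle_hotplug(datasource, devpath: str, udevaction: str) -> None:
    # Verify that the hotplug event is supported/enabled for this datasource
    if EventType.HOTPLUG not in datasource.supported_update_events:
        raise RuntimeError("hotplug not supported by this datasource")

    # Update the metadata for the datasource (re-crawl the metadata service)
    datasource.update_metadata()

    # Ensure the hotplugged device exists within the refreshed metadata
    if devpath not in datasource.device_paths():
        raise RuntimeError("device %s not found in metadata" % devpath)

    # Apply the config change and bring up the new interface
    datasource.apply_network_config(bring_up=(udevaction == "add"))

    # Save the updated metadata back to the pickle cache
    datasource.persist()
```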

Additional Context

Currently enabled for OpenStack and EC2. Haven't tested anywhere else. Testing via LXD won't be easy/possible: new ethernet devices on LXD show up under one name and are then quickly renamed, so the udev add event doesn't give a valid device name. That may be a scenario we want to handle at some point, but it doesn't seem necessary right now.

Test Steps

Created the following cloud-config:

#cloud-config
updates:
  network:
    when: ['hotplug']

then:

packages/bddeb
openstack server create --flavor m1.medium --image  cda80fdb-aa6a-4f82-8c89-352d6849766e --key-name <my_key> --nic net-id=<my_network> <server_name> --wait --user-data /my-user-data.yaml
openstack server add floating ip <id> 66650537-fbbb-4932-82fb-1b0ed9361cb2
scp cloud-init_all.deb ubuntu@<ip>:/tmp/cloud-init.deb

on instance:

dpkg -i /tmp/cloud-init.deb
cloud-init clean --logs --reboot

once the instance is back up, on the host:

openstack port create --network <network> <port_name>
openstack server add port <server_name> <port_name>

`ip a` should show the new NIC configured appropriately, with the relevant logs at the bottom of cloud-init.log

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@TheRealFalcon TheRealFalcon added the "wip" (Work in progress, do not land) label Jul 2, 2021
@TheRealFalcon TheRealFalcon changed the title from "Initial hotplug support" to "Initial hotplug support (SC-19)" Jul 2, 2021
@@ -0,0 +1,220 @@
# This file is part of cloud-init. See LICENSE file for license information.
Member Author

Some code (mainly the classes) in this file is a little more abstract than it needs to be. It's for the future support of disk events. Given that Ryan initially wrote it that way and it works and makes sense for future additions, it didn't make sense to make it any simpler.

(Resolved review threads on bash_completion/cloud-init and cloudinit/cmd/devel/hotplug_hook.py)
@@ -29,6 +29,7 @@ class EventType(Enum):
BOOT = "boot"
BOOT_NEW_INSTANCE = "boot-new-instance"
BOOT_LEGACY = "boot-legacy"
HOTPLUG = 'hotplug'
Collaborator

It may be worth scoping this event type as HOTPLUG_NETWORK since we have plans to handle disks at some point; then folks have the ability to opt into hotplug disk events vs network events. What do you think? Or should both of those events be activated with a single HOTPLUG "event" subscription name?

Member Author

Above this enum, there's an EventScope that is used with EventType for defining when we update. I think the Scope is where we'd differentiate between network vs disk hotplug.
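To make that concrete, here is a small sketch of how a scope-to-events mapping can distinguish network from disk hotplug; note that EventScope.DISK is hypothetical future work, not part of this PR. The dict mirrors the shape of the "Allowed events" structure seen in the logs.

```python
# Sketch only: EventScope.DISK is hypothetical future work, not in this PR.
from enum import Enum


class EventScope(Enum):
    NETWORK = "network"
    DISK = "disk"  # hypothetical future scope


class EventType(Enum):
    BOOT_NEW_INSTANCE = "boot-new-instance"
    HOTPLUG = "hotplug"


# A single HOTPLUG EventType, differentiated by scope rather than by
# separate HOTPLUG_NETWORK / HOTPLUG_DISK event names:
allowed_events = {
    EventScope.NETWORK: {EventType.BOOT_NEW_INSTANCE, EventType.HOTPLUG},
    EventScope.DISK: set(),  # user has not opted into disk hotplug
}
```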

@TheRealFalcon
Member Author

@blackboxsw , I was able to successfully run tests on EC2. I also tested Debian and ifupdown was called and worked as expected.

@blackboxsw
Collaborator

Not in this PR, but it might be nice at some point to have either:

  • cloud-init status --long reporting something in the details about whether the hotplug FIFO is running/active
  • or maybe we can also discuss whether instance-data.json should surface structured hotplug configuration too.
    The only config content related to whether hotplug is active in instance-data is:
$ sudo cloud-init query userdata
#cloud-config
updates:
        network:
                when: [hotplug]

@blackboxsw blackboxsw left a comment

This really looks great... and solid job on the integration test you created.

I think there are a couple of things to think about or general questions:

  • Why on a launched VM do I not see the user-customized event updates logged in the "Allowed events" logs? What I currently see is: 2021-07-09 03:43:38,606 - stages.py[DEBUG]: Allowed events: {<EventScope.NETWORK: 'network'>: {<EventType.BOOT_NEW_INSTANCE: 'boot-new-instance'>}}
  • Do we want to surface structured hotplug status or config details in either cloud-init status --long or instance-data.json?
  • device removes seem to get us into quite a noisy log loop, given 2021-07-09 03:19:21,199 - hotplug_hook.py[DEBUG]: subsystem=net update attempt 2/5 while waiting for the device to disappear from metadata across retries, since we perform the full _get_data operation on a datasource. I wonder if it might be worth the datasource growing a _get_data_network or is_network_data_changed method to make that query more specific in scope and purpose, which would also reduce the overall log saturation from walking the entire IMDS each attempt (see the sketch after the analyze output below).
  • cloud-init analyze show will incorrectly start representing each new hotplug event as a separate boot event because it is a repeated log entry key. I think we might have to instrument something like the PR #915 comment to avoid repeated "boot" sections from analyze for device adds/removes. I'll probably re-open the recently closed PR #915 (tune azure ds telemetry annotation) to avoid this case for hotplug-reported events as well as Azure's telemetry events.

Example of invalid cloud-init analyze show output due to 2x add/remove of the same NIC:

Starting stage: hotplug-hook
|`->restored from cache with run check: DataSourceOpenStackLocal [net,ver=2] @00.00100s +00.00600s
Finished stage: (hotplug-hook) 06.70300 seconds

Total Time: 6.70300 seconds

-- Boot Record 04 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: hotplug-hook
|`->restored from cache with run check: DataSourceOpenStackLocal [net,ver=2] @00.00000s +00.00500s
Finished stage: (hotplug-hook) 21.50300 seconds

Total Time: 21.50300 seconds

4 boot records analyzed
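To make the third bullet concrete, here is a purely hypothetical sketch of the kind of narrower datasource query being proposed. Nothing like this exists in the PR, which performs the full _get_data crawl on each retry; the method and helper names are invented for illustration.

```python
# Hypothetical narrower refresh, as floated above; not part of this PR.
class DataSource:
    def __init__(self):
        self._cached_network_md = None

    def _fetch_network_metadata(self) -> dict:
        """Hypothetical: fetch only the network section of the IMDS."""
        raise NotImplementedError()

    def is_network_data_changed(self) -> bool:
        """Compare a fresh network-only fetch against the cached copy,
        avoiding a full _get_data() crawl (and its log noise) per retry."""
        fresh = self._fetch_network_metadata()
        changed = fresh != self._cached_network_md
        self._cached_network_md = fresh
        return changed
```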

(Resolved review thread on cloudinit/net/activators.py)


def get_parser(parser=None):
"""Build or extend and arg parser for hotplug-hook utility.
Collaborator

Suggested change
"""Build or extend and arg parser for hotplug-hook utility.
"""Build or extend an arg parser for hotplug-hook utility.


if is_finished; then
    # open cloud-init's hotplug-hook fifo rw
    exec 3<>/run/cloud-init/hook-hotplug-cmd
Collaborator

It'd be nice to log some bread crumbs to /var/log/cloud-init.log about when this FIFO was created.

Collaborator

IIRC, the systemd service cloud-init-hotplugd is where these details are ... and I suspect we want to keep the hotplugd log separate ...

Collaborator

@raharper: Right-o, James and I ended up talking later on Friday about this and determined as well that interacting with journalctl and/or systemctl was useful enough, and there's enough logging happening when the handler is instantiated after hotplug that we get the gist of what's going on. If we add more visibility via cloud-init status --long or even instance-data.json to datasource configuration options, I think that meets my desire for discoverability of ongoing "services" that cloud-init has active/enabled.

@@ -0,0 +1,11 @@
[Unit]
Collaborator

The perms on this file need to change, as shown by the complaints seen in journalctl:

Jul 09 03:09:23 test-hotplug systemd[1]: Configuration file /lib/systemd/system/cloud-init-hotplugd.service is marked executable. Please remove executable permission bits. Proceeding anyway.

@blackboxsw blackboxsw left a comment

Beyond the nits that I mentioned throughout, I think we might want to touch up `cloud-init devel hotplug-hook` a bit to better field bogus values for those testing on the command line.

As it is currently, I hit the following nasty traceback just trying to learn more about the subcommand:

ubuntu@test-hotplug:~$ sudo cloud-init devel hotplug-hook 
2021-07-09 04:08:22,223 - hotplug_hook.py[ERROR]: Received fatal exception handling hotplug!
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 211, in handle_args
    handle_hotplug(
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 140, in handle_hotplug
    raise Exception(
Exception: hotplug-hook: cannot handle events for subsystem: None
Traceback (most recent call last):
  File "/usr/bin/cloud-init", line 33, in <module>
    sys.exit(load_entry_point('cloud-init==21.2', 'console_scripts', 'cloud-init')())
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 920, in main
    retval = util.log_time(
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2392, in log_time
    ret = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 211, in handle_args
    handle_hotplug(
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 140, in handle_hotplug
    raise Exception(
Exception: hotplug-hook: cannot handle events for subsystem: None

While not critical to functionality (since our call chain from systemd units/services is structured), it'd be nice to do a little validation on some of the arg values to better handle failure modes.
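For instance, a sketch of choice-constrained, required arguments that would fail at parse time with a usage message instead of a traceback. The argument names come from the debug log quoted below, but whether they are flags or positionals, and the exact choices, are assumptions for illustration.

```python
# Sketch: constrain the hotplug-hook arguments so that bogus CLI input fails
# at parse time with a usage message instead of a traceback. The exact flag
# spellings and choices here are illustrative, not the PR's final interface.
import argparse


def get_parser(parser=None):
    if not parser:
        parser = argparse.ArgumentParser(prog="hotplug-hook")
    parser.add_argument(
        "--devpath", required=True, metavar="PATH",
        help="Sysfs path to the hotplugged device")
    parser.add_argument(
        "--subsystem", required=True,
        choices=["net"])  # only the net subsystem is handled so far
    parser.add_argument(
        "--udevaction", required=True,
        choices=["add", "remove"])
    return parser
```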

Comment on lines +201 to +216
LOG.debug(
    '%s called with the following arguments:\n'
    'udevaction: %s\n'
    'subsystem: %s\n'
    'devpath: %s',
    name, args.udevaction, args.subsystem, args.devpath
)
Collaborator

Maybe it's just me, but I find multiline logs hard to parse in cloud-init.log. Can we drop the newlines from this somehow?

Suggested change
LOG.debug(
    '%s called with the following arguments:\n'
    'udevaction: %s\n'
    'subsystem: %s\n'
    'devpath: %s',
    name, args.udevaction, args.subsystem, args.devpath
)
LOG.debug(
    '%s called with the following arguments: {udevaction: %s, subsystem: %s, devpath: %s}', name, args.udevaction, args.subsystem, args.devpath
)


[Service]
Type=simple
ExecStart=/bin/bash -c 'read args <&3; echo "args=$args"; \
Collaborator

Man, had to re-read bash fifo and file handler docs a bit again to get my bearings here on this.

Collaborator

We should update the service file with some doc describing what's happening. IIRC, we are reading from FD 3 into the variable args. This allows passing key/value pairs from the hotplug hook (which contain the udev environment context for the event being processed) to the hotplugd service, which then invokes cloud-init's hotplug handler with those event args.

Collaborator

+1, will add doc comments to these files to give better context.

if 'reporting' in hotplug_init.cfg:
    reporting.update_configuration(hotplug_init.cfg.get('reporting'))

# Logging isn't be setup until now
Collaborator

Suggested change
# Logging isn't be setup until now
# Logging isn't going to be setup until now

@@ -15,6 +15,19 @@
LOG = logging.getLogger(__name__)


def _alter_interface(cmd, device_name):
Collaborator

Might as well type hint it too

Suggested change
def _alter_interface(cmd, device_name):
def _alter_interface(cmd, device_name) -> bool:

@staticmethod
@abstractmethod
def bring_down_interface(device_name: str) -> bool:
    raise NotImplementedError()
Collaborator

It'd be nice if these methods' docstrings noted that True means success and False means failure to perform the operation.
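Something along these lines, as a sketch (the docstring wording is only a suggestion):

```python
# Sketch of the requested docstring; the wording is a suggestion only.
from abc import ABC, abstractmethod


class NetworkActivator(ABC):
    @staticmethod
    @abstractmethod
    def bring_down_interface(device_name: str) -> bool:
        """Bring down the named interface.

        Return True if the operation succeeded, False if it failed.
        """
        raise NotImplementedError()
```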

@blackboxsw
Collaborator

Please also add/update docs:

  • doc/rtd/topics/events.rst to cover the new event type/scoping support
  • maybe something in doc/rtd/topics/cli.rst for hotplug? (not certain we really want to start heavily advertising that functionality)

@TheRealFalcon
Member Author

Why on a launched VM do I not see the user-customized event updates logged in the "Allowed events" logs? What I currently see is: 2021-07-09 03:43:38,606 - stages.py[DEBUG]: Allowed events: {<EventScope.NETWORK: 'network'>: {<EventType.BOOT_NEW_INSTANCE: 'boot-new-instance'>}}

This is one aspect of these events that's confusing, and I haven't found a better way to represent it in the logs. If you search later in the logs for "Allowed events", you'll likely see hotplug in there. The very first time we check whether we can fetch/update metadata, we haven't even processed userdata yet, so we have to rely on the datasource defaults. HOTPLUG won't be represented there until after we've processed the userdata.

@raharper
Collaborator

Why on a launched VM do I not see the user-customized event updates logged in the "Allowed events" logs? What I currently see is: 2021-07-09 03:43:38,606 - stages.py[DEBUG]: Allowed events: {<EventScope.NETWORK: 'network'>: {<EventType.BOOT_NEW_INSTANCE: 'boot-new-instance'>}}

This is one aspect of these events that's confusing, and I haven't found a better way to represent it in the logs. If you search later in the logs for "Allowed events", you'll likely see hotplug in there. The very first time we check whether we can fetch/update metadata, we haven't even processed userdata yet, so we have to rely on the datasource defaults. HOTPLUG won't be represented there until after we've processed the userdata.

There are a few things in play here:

  1. We're not enabling hotplug support by default. If we did, then the on-disk config of cloud-init would include hotplug in the events.
  2. We likely don't want to react to hotplug events on first boot (boot-new-instance or boot). In either of these cases the metadata service either has the new nic config or it doesn't. If it's present in the metadata, then we don't need to handle any hotplug event (regular network config generation and blocking will handle it). If it isn't in the metadata service, we have to wait for it anyhow, which would block boot. I don't recall if we leave the hotplug event in the FIFO, but we could process hotplug events that occurred during first boot after cloud-final is completed, or even just sleep for some time until cloud-init.target is reached.
  3. We want to let user-data control whether hotplug events are handled, so the counter log sequence is one where we see hotplug initially listed (if the OS/image has it enabled by default), only to see it removed after user-data processing. We may want to enumerate the state (enabled/disabled) such that the log would show a state transition. That may be less confusing.

@blackboxsw
Collaborator

@raharper @TheRealFalcon it doesn't seem like I can push to this branch. I added a couple of commit fixes for sysconfig changes that I tested on Rocky (CentOS) and pushed to upstream/james-hotplug. I did run into an SELinux-related policy issue with systemd.socket reading from a FIFO, which I opened as https://bugs.launchpad.net/cloud-init/+bug/1936229

I've pushed a comparable branch with a couple of extra commits here to see general hotplug work on Rocky on EC2. The combined diff is:

diff --git a/cloudinit/net/activators.py b/cloudinit/net/activators.py
index 095d2e1f..84aaafc9 100644
--- a/cloudinit/net/activators.py
+++ b/cloudinit/net/activators.py
@@ -135,7 +135,7 @@ class NetworkManagerActivator(NetworkActivator):
 
         Return True is successful, otherwise return False
         """
-        cmd = ['nmcli', 'connection', 'up', device_name]
+        cmd = ['nmcli', 'connection', 'up', 'ifname', device_name]
         return _alter_interface(cmd, device_name)
 
     @staticmethod
diff --git a/packages/redhat/cloud-init.spec.in b/packages/redhat/cloud-init.spec.in
index 16138012..b930709b 100644
--- a/packages/redhat/cloud-init.spec.in
+++ b/packages/redhat/cloud-init.spec.in
@@ -119,6 +119,12 @@ version_pys=$(cd "$RPM_BUILD_ROOT" && find . -name version.py -type f)
 ( cd "$RPM_BUILD_ROOT" &&
   sed -i "s,@@PACKAGED_VERSION@@,%{version}-%{release}," $version_pys )
 
+# patch hotplug /usr/libexec script path
+hotplug_file=$(cd "$RPM_BUILD_ROOT" && find . -name 10-cloud-init-hook-hotplug.rules -type f)
+
+( cd "$RPM_BUILD_ROOT" &&
+  sed -i "s,/usr/lib,%{_libexecdir}," $hotplug_file )
+
 %clean
 rm -rf $RPM_BUILD_ROOT
 
@@ -172,6 +178,7 @@ fi
 %files
 
 /lib/udev/rules.d/66-azure-ephemeral.rules
+/lib/udev/rules.d/10-cloud-init-hook-hotplug.rules
 
 %if "%{init_system}" == "systemd"
 /usr/lib/systemd/system-generators/cloud-init-generator
diff --git a/systemd/cloud-init-hotplugd.service b/systemd/cloud-init-hotplugd.service
index 6f231cdc..b64632ef 100644
--- a/systemd/cloud-init-hotplugd.service
+++ b/systemd/cloud-init-hotplugd.service
@@ -1,3 +1,14 @@
+# Paired with cloud-init-hotplugd.socket to read from the FIFO
+# /run/cloud-init/hook-hotplug-cmd which is created during a udev network
+# add or remove event as processed by 10-cloud-init-hook-hotplug.rules.
+
+# On start, read args from the FIFO, process and provide structured arguments
+# to `cloud-init devel hotplug-hook` which will setup or teardown network
+# devices as configured by user-data.
+
+# Known bug with an enforcing SELinux policy: LP: #1936229
+# cloud-init-hotplugd.service will read args from file descriptor 3
+
 [Unit]
 Description=cloud-init hotplug hook daemon
 After=cloud-init-hotplugd.socket
diff --git a/systemd/cloud-init-hotplugd.socket b/systemd/cloud-init-hotplugd.socket
index f8f10486..aa093016 100644
--- a/systemd/cloud-init-hotplugd.socket
+++ b/systemd/cloud-init-hotplugd.socket
@@ -1,3 +1,8 @@
+# cloud-init-hotplugd.socket listens on the FIFO file
+# /run/cloud-init/hook-hotplug-cmd which is created during a udev network
+# add or remove event as processed by 10-cloud-init-hook-hotplug.rules.
+
+# Known bug with an enforcing SELinux policy: LP: #1936229
 [Unit]
 Description=cloud-init hotplug hook socket

@blackboxsw
Collaborator

n/m, PEBCAK, I can push to this branch.

TheRealFalcon and others added 14 commits July 19, 2021 08:33
Adds a udev script which will invoke a hotplug hook script on all net
add events. The script will write some udev arguments to a systemd FIFO
socket (to ensure we have only one instance of cloud-init running at a
time), which is then read by a new service that calls a new 'cloud-init
devel hotplug-hook' command to handle the new event.

This hotplug-hook command will:
- Fetch the pickled datasource
- Verify that the hotplug event is supported/enabled
- Update the metadata for the datasource
- Ensure the hotplugged device exists within the datasource
- Apply the config change on the datasource metadata
- Bring up the new interface (or apply global network configuration)
- Save the updated metadata back to the pickle cache

Also scattered in some unrelated typing to make my life easier.
Co-authored-by: Chad Smith <chad.smith@canonical.com>
2. Remove the PCI device restriction
3. Increase the wait times when updating metadata from hotplug
4. Add a log message for the selected activator
@TheRealFalcon
Member Author

@blackboxsw: I am +1 on your changes. If you think this is in a good spot, I still need an approval from you.

@TheRealFalcon TheRealFalcon changed the title from "Initial hotplug support (SC-19)" to "Initial hotplug support (SC-252)" Jul 19, 2021
@TheRealFalcon TheRealFalcon removed the "wip" (Work in progress, do not land) label Jul 19, 2021
@blackboxsw
Collaborator

+1 on this @TheRealFalcon. Let's document the related SELinux bug in a follow-up PR. We can iterate on the rest of the "gaps" in the future.

@TheRealFalcon TheRealFalcon merged commit 184c836 into canonical:main Jul 19, 2021
@TheRealFalcon TheRealFalcon deleted the hotplug branch July 19, 2021 19:13