Services (apparmor, snapd.seeded, ...?) fail to start in nested lxd container #3813

Closed
ubuntu-server-builder opened this issue May 12, 2023 · 10 comments
Labels
launchpad Migrated from Launchpad

Comments

@ubuntu-server-builder
Collaborator

This bug was originally filed in Launchpad as LP: #1905493

Launchpad details
affected_projects = ['apparmor', 'autopkgtest', 'snapd', 'dbus (Ubuntu)', 'lxd (Ubuntu)', 'systemd (Ubuntu)']
assignee = None
assignee_name = None
date_closed = 2021-03-17T19:54:56.525422+00:00
date_created = 2020-11-25T00:04:53.707777+00:00
date_fix_committed = None
date_fix_released = None
id = 1905493
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1905493
milestone = None
owner = anonymouse67
owner_name = Ian Johnson
private = False
status = invalid
submitter = anonymouse67
submitter_name = Ian Johnson
tags = []
duplicates = []

Launchpad user Ian Johnson(anonymouse67) wrote on 2020-11-25T00:04:53.707777+00:00

When booting a nested lxd container inside another lxd container (just a normal container, not a VM; i.e. just L2), running cloud-init status --wait prints "." indefinitely and never returns.
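
When scripting around a hang like this, it can help to bound the wait instead of blocking forever. A minimal sketch, assuming Python 3 is available in the container; the helper name run_with_timeout and the 300-second limit in the usage note are illustrative and not part of cloud-init:

```python
import subprocess
from typing import Optional, Sequence


def run_with_timeout(argv: Sequence[str], timeout_s: float) -> Optional[int]:
    """Run argv; return its exit code, or None if it did not finish in time.

    Hypothetical helper: subprocess.run() kills the child when the timeout
    expires, so a command that never returns cannot block the caller.
    """
    try:
        return subprocess.run(list(argv), timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

For example, run_with_timeout(["cloud-init", "status", "--wait"], 300) would return None in the situation described here, at which point systemctl list-jobs is a reasonable next thing to look at.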

@ubuntu-server-builder ubuntu-server-builder added the launchpad Migrated from Launchpad label May 12, 2023

Launchpad user Dan Watkins(oddbloke) wrote on 2020-12-01T22:52:35.628306+00:00

Hi Ian,

I've just launched such a container and I see a bunch of non-cloud-init errors in the log. When I examine systemctl list-jobs, I see that the two running jobs are systemd-logind.service and snapd.seeded.service:

root@certain-cod:~# systemctl list-jobs
JOB UNIT                                 TYPE  STATE
114 cloud-final.service                  start waiting
125 snapd.autoimport.service             start waiting
143 systemd-update-utmp-runlevel.service start waiting
116 cloud-config.service                 start waiting
  1 graphical.target                     start waiting
691 systemd-logind.service               start running
 99 unattended-upgrades.service          start waiting
110 cloud-init.target                    start waiting
115 snapd.seeded.service                 start running
  2 multi-user.target                    start waiting

10 jobs listed.
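
The same check can be scripted. The following is only a sketch: it assumes the four-column layout shown in the listing above, and the names Job, parse_list_jobs and running_units are made up for illustration.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    job_id: int
    unit: str
    job_type: str
    state: str


def parse_list_jobs(text: str) -> List[Job]:
    """Parse `systemctl list-jobs` output into Job tuples.

    Only lines with exactly four whitespace-separated fields whose first
    field is a number count as jobs, so the header row and the
    "N jobs listed." footer are skipped.
    """
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[0].isdigit():
            jobs.append(Job(int(fields[0]), fields[1], fields[2], fields[3]))
    return jobs


def running_units(jobs: List[Job]) -> List[str]:
    """Units whose job is executing right now rather than queued."""
    return [job.unit for job in jobs if job.state == "running"]


# Excerpt of the listing above.
SAMPLE = """\
JOB UNIT TYPE STATE
114 cloud-final.service start waiting
691 systemd-logind.service start running
110 cloud-init.target start waiting
115 snapd.seeded.service start running

10 jobs listed.
"""

print(running_units(parse_list_jobs(SAMPLE)))
# prints ['systemd-logind.service', 'snapd.seeded.service']
```

Everything in the "waiting" state is queued behind the running jobs, which is why those two units are the ones worth investigating.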

Examining the journal, I see that systemd-logind.service is in a restart loop:

root@certain-cod:~# journalctl -u systemd-logind.service | grep Failed\ w
Dec 01 22:37:43 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:39:13 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:40:44 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:42:14 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:43:44 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:45:14 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:46:45 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:48:15 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
Dec 01 22:49:45 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.
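
The failures above are roughly 90 seconds apart, which would be consistent with systemd's default 90-second start timeout, assuming that default has not been overridden in this image. A small sketch to compute the spacing; the function name is illustrative, and the year is passed in explicitly because the journal's short format omits it (2020 is taken from the date of this comment):

```python
from datetime import datetime
from typing import List


def failure_intervals(journal_lines: List[str], year: int) -> List[float]:
    """Seconds between consecutive "Failed with result 'timeout'" entries."""
    stamps = []
    for line in journal_lines:
        if "Failed with result 'timeout'" not in line:
            continue
        # journalctl's default short format starts with "Mon DD HH:MM:SS".
        month, day, clock = line.split()[:3]
        stamps.append(
            datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        )
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# First four entries from the journal excerpt above.
SAMPLE = [
    "Dec 01 22:37:43 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.",
    "Dec 01 22:39:13 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.",
    "Dec 01 22:40:44 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.",
    "Dec 01 22:42:14 certain-cod systemd[1]: systemd-logind.service: Failed with result 'timeout'.",
]

print(failure_intervals(SAMPLE, year=2020))
# prints [90.0, 91.0, 90.0]
```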

This is blocking boot before cloud-init's later stages run, so cloud-init is correctly indicating that it hasn't yet completed. I'm therefore marking this Invalid for cloud-init. I'll add a systemd task instead, as that looks to be the source of the issue.

Cheers,

Dan

Launchpad user Dan Streetman(ddstreet) wrote on 2021-03-17T18:41:52.657003+00:00

The systemd-logind problem is due to dbus defaulting to apparmor mode 'enabled', but apparmor can't do much of anything inside a container so it fails to start, and dbus can't contact it.

In the 2nd level container, create a file like '/etc/dbus-1/system.d/no-apparmor.conf' with content:

Then restart the 2nd level container and recheck systemd-logind, which should now work.
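
The file content is not shown in this copy of the comment. As a guess at what such a file could look like, the sketch below relies on the apparmor element that dbus-daemon(1) documents for bus configuration files (modes "enabled", "disabled" and "required"). Both the content and the helper name write_no_apparmor_conf are assumptions, not the commenter's exact text:

```python
import xml.etree.ElementTree as ET

# ASSUMED content, not recovered from the original comment: turn off
# dbus-daemon's AppArmor mediation via its documented <apparmor> element.
NO_APPARMOR_CONF = """\
<busconfig>
  <apparmor mode="disabled"/>
</busconfig>
"""


def write_no_apparmor_conf(path: str) -> None:
    """Write the assumed snippet to path after checking it is well-formed XML."""
    ET.fromstring(NO_APPARMOR_CONF)  # raises ParseError if the snippet is broken
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(NO_APPARMOR_CONF)
```

Inside the 2nd level container this would be called as write_no_apparmor_conf("/etc/dbus-1/system.d/no-apparmor.conf"), which needs root.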

Of course, a proper fix in dbus should be a bit smarter and disable its use of apparmor only if it's inside a container.

However, cloud-init status --wait still hangs after systemd-logind starts up, so that wasn't the original problem (or at least wasn't the only problem).

Launchpad user Dan Watkins(oddbloke) wrote on 2021-03-17T19:26:24.013040+00:00

Given that the logind issue is an AppArmor issue and, per my previous comment, "the two running jobs are systemd-logind.service and snapd.seeded.service", I suspect that we'll find that snapd is running into similar sorts of issues. I'll take a quick look now.

Launchpad user Ian Johnson(anonymouse67) wrote on 2021-03-17T19:51:36.237331+00:00

FWIW I know what the snapd issue is: snapd does not and will not work in a nested LXD container, so we need to add code to make snapd.seeded.service die/exit gracefully in this situation.

Launchpad user Dan Watkins(oddbloke) wrote on 2021-03-17T19:54:48.252169+00:00

Yep, that's what I've found; cloud-init is just waiting for its later stages to run, which are blocked until snapd.seeded.service exits.

Launchpad user Dan Streetman(ddstreet) wrote on 2021-03-17T21:14:04.959162+00:00

it's interesting that apparmor appears to work ok in the first-level container, but fails in the nested container, e.g.:

$ lxc shell lp1905493-f
root@lp1905493-f:~# systemctl status apparmor
● apparmor.service - Load AppArmor profiles
Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2021-03-17 18:17:44 UTC; 2h 53min ago
Docs: man:apparmor(7)
https://gitlab.com/apparmor/apparmor/wikis/home/
Process: 118 ExecStart=/lib/apparmor/apparmor.systemd reload (code=exited, status=0/SUCCESS)
Main PID: 118 (code=exited, status=0/SUCCESS)

Mar 17 18:17:44 lp1905493-f systemd[1]: Starting Load AppArmor profiles...
Mar 17 18:17:44 lp1905493-f apparmor.systemd[118]: Restarting AppArmor
Mar 17 18:17:44 lp1905493-f apparmor.systemd[118]: Reloading AppArmor profiles
Mar 17 18:17:44 lp1905493-f apparmor.systemd[129]: Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Mar 17 18:17:44 lp1905493-f systemd[1]: Finished Load AppArmor profiles.
root@lp1905493-f:~# lxc shell layer2
root@layer2:~# systemctl status apparmor
● apparmor.service - Load AppArmor profiles
Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2021-03-17 18:40:16 UTC; 2h 31min ago
Docs: man:apparmor(7)
https://gitlab.com/apparmor/apparmor/wikis/home/
Main PID: 105 (code=exited, status=1/FAILURE)

Mar 17 18:40:15 layer2 apparmor.systemd[147]: /sbin/apparmor_parser: Unable to replace "nvidia_modprobe". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:15 layer2 apparmor.systemd[157]: /sbin/apparmor_parser: Unable to replace "/usr/bin/man". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:15 layer2 apparmor.systemd[164]: /sbin/apparmor_parser: Unable to replace "/usr/sbin/tcpdump". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[150]: /sbin/apparmor_parser: Unable to replace "/usr/lib/NetworkManager/nm-dhcp-client.action". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[161]: /sbin/apparmor_parser: Unable to replace "mount-namespace-capture-helper". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[161]: /sbin/apparmor_parser: Unable to replace "/usr/lib/snapd/snap-confine". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[105]: Error: At least one profile failed to load
Mar 17 18:40:16 layer2 systemd[1]: apparmor.service: Main process exited, code=exited, status=1/FAILURE
Mar 17 18:40:16 layer2 systemd[1]: apparmor.service: Failed with result 'exit-code'.
Mar 17 18:40:16 layer2 systemd[1]: Failed to start Load AppArmor profiles.
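
To see at a glance which profiles were rejected, the journal lines above can be reduced to a list of profile names. A small sketch that assumes the message wording shown in this log; the function name failed_profiles is illustrative:

```python
import re
from typing import List

# Matches the apparmor_parser message format seen in the log above.
_UNABLE_TO_REPLACE = re.compile(
    r'apparmor_parser: Unable to replace "([^"]+)"\.\s+Permission denied'
)


def failed_profiles(journal_text: str) -> List[str]:
    """Profile names that apparmor_parser could not replace, in log order."""
    return _UNABLE_TO_REPLACE.findall(journal_text)


# Three of the entries from the nested container's journal.
SAMPLE = """\
Mar 17 18:40:15 layer2 apparmor.systemd[147]: /sbin/apparmor_parser: Unable to replace "nvidia_modprobe". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:15 layer2 apparmor.systemd[157]: /sbin/apparmor_parser: Unable to replace "/usr/bin/man". Permission denied; attempted to load a profile while confined?
Mar 17 18:40:16 layer2 apparmor.systemd[105]: Error: At least one profile failed to load
"""

print(failed_profiles(SAMPLE))
# prints ['nvidia_modprobe', '/usr/bin/man']
```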

Launchpad user Dan Streetman(ddstreet) wrote on 2021-03-17T21:17:34.368658+00:00

I wonder if this is actually a problem with the specific apparmor profile that's created by lxd; maybe it doesn't provide enough permissions to allow the container's lxd to correctly pass the apparmor profile down to the nested container. That would be similar to how lxd locks down containers a bit too tightly by default and requires enabling 'security.nesting' just to be able to create a nested container.
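
For reference, 'security.nesting' is set per container with lxc config set. The sketch below only builds that command line; the helper name is illustrative, and the container name is the first-level one from the transcript earlier in this thread. Enabling nesting is what allows creating the nested container in the first place, and nothing here suggests it changes the apparmor behaviour discussed above.

```python
import shlex
from typing import List


def nesting_command(container: str, enabled: bool = True) -> List[str]:
    """argv for toggling security.nesting on an LXD container."""
    value = "true" if enabled else "false"
    return ["lxc", "config", "set", container, "security.nesting", value]


print(shlex.join(nesting_command("lp1905493-f")))
# prints: lxc config set lp1905493-f security.nesting true
```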

Launchpad user Launchpad Janitor(janitor) wrote on 2022-06-13T17:17:37.756437+00:00

Status changed to 'Confirmed' because the bug affects multiple users.

Launchpad user Christian Ehrhardt (paelzer) wrote on 2022-06-14T05:17:41.716650+00:00

Due to a ping on IRC I wanted to summarize the situation here as it seems this still affects people.

In nested LXD containers we seem to have multiple issues:

  • apparmor service failing to start (might need to work with LXD to sort out why and how to fix it)
    • if it doesn't work, it should at least fail to start more gracefully
    • comment 2 has a workaround to make dbus not insist on apparmor, but that is not a real fix we could generally apply
  • snapd: snapd.seeded.service needs code to die/exit gracefully in this situation (as it won't work)
    • see comment 7; this might have changed since then, but it is worth a revisit

Launchpad user Jose Manuel Santamaria Lema(panfaust) wrote on 2022-06-18T18:58:36.314328+00:00

Hi there,

thanks for the update. Just in case anyone else here is interested in a temporary workaround, this is what I did for my use case:

After that, "runlevel", "systemctl is-system-running" and "cloud-init status --wait" should work.
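
The three commands named above can be run as one check. This is a sketch only: it treats exit status 0 as success, which is stricter than it may look (systemctl is-system-running also exits non-zero for a "degraded" system), and the names exit_code and boot_report and the 60-second limit are illustrative.

```python
import subprocess
from typing import Callable, Dict, List, Optional

# The three commands named in the comment above.
CHECKS: List[List[str]] = [
    ["runlevel"],
    ["systemctl", "is-system-running"],
    ["cloud-init", "status", "--wait"],
]


def exit_code(argv: List[str], timeout_s: float) -> Optional[int]:
    """Exit status of argv, or None if it cannot be run or exceeds timeout_s."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return done.returncode


def boot_report(
    run: Callable[[List[str], float], Optional[int]] = exit_code,
    timeout_s: float = 60.0,
) -> Dict[str, bool]:
    """Map each check's command line to True when it exited with status 0."""
    return {" ".join(argv): run(argv, timeout_s) == 0 for argv in CHECKS}
```

Calling boot_report() inside the nested container gives a quick pass/fail view; the run parameter exists so the logic can be exercised without systemd present.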

Last but not least, I would like to mention that this issue affects running autopkgtests in nested containers (autopkgtest uses "runlevel" to detect whether the container has started properly). I had an interesting conversation about this here (thanks a lot ddstreet for the help!):
https://irclogs.ubuntu.com/2022/06/13/%23ubuntu-devel.html#t15:05

@ubuntu-server-builder ubuntu-server-builder closed this as not planned May 12, 2023