binfmts_misc: Too many symlinks #142

Closed
ShadowEO opened this issue Apr 16, 2021 · 10 comments
Labels
awaiting-response Awaiting response to suggestion/solution. bug Something isn't working

Comments

@ShadowEO

Windows version (build number):
10.0.21354.1

Linux distribution:
Ubuntu-20.04 from Store

Genie version:
1.36

Describe the bug
Attempting to access binfmt_misc gives me the error "Too many symlinks", which puts a damper on building foreign chroots while inside the bottle. I also tried accessing it after disabling proc-sys-fs-binfmt_misc.mount and proc-sys-fs-binfmt_misc.automount, which lets me mount binfmt_misc properly, but only as root. If I attempt to mount it using sudo, the directory stays empty.

If the bug involves systemctl or a service running under systemd, confirm that you are running inside the bottle:
inside

To Reproduce
Steps to reproduce the behavior:

  1. Enter your bottle
  2. Attempt to access /proc/sys/fs/binfmt_misc

Expected behavior
I should see the binfmt_misc directory structure and be able to write to the register object inside it.

@ShadowEO ShadowEO added the bug Something isn't working label Apr 16, 2021
@cerebrate
Member

This appears to be a problem specific to Ubuntu distributions that results in either binfmt_misc not being mounted properly, or else the mount breaking somehow when systemd starts under Ubuntu. Unfortunately, I don't have much more than that to go on, as there doesn't appear to be anything obviously wrong with the startup sequence.

I do, however, have a workaround. Simply remounting binfmt_misc over the top of the existing mount with:

sudo mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc

restores access. If this works for you, you can automate it by adding:

ls /proc/sys/fs/binfmt_misc > /dev/null 2>&1 || \
  mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc

to your /etc/rc.local file, creating it if it does not exist.
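
As a rough illustration, a complete /etc/rc.local built around that check might look like the sketch below. The function name `remount_if_broken` and the `BINFMT_DIR` and `MOUNT_CMD` variables are made up for this example and are not genie settings. `MOUNT_CMD` defaults to a dry run that only prints the mount command, so the logic can be tried without root.

```shell
#!/bin/sh
# Sketch of an /etc/rc.local applying the workaround above.
# BINFMT_DIR and MOUNT_CMD are illustrative knobs, not genie settings.
# MOUNT_CMD defaults to a dry run; set MOUNT_CMD=mount to really remount.
# Note: rc.local generally has to be executable to be run at boot.
BINFMT_DIR="${BINFMT_DIR:-/proc/sys/fs/binfmt_misc}"
MOUNT_CMD="${MOUNT_CMD:-echo mount}"

remount_if_broken() {
    # A healthy mount can be listed; the broken one fails with
    # "Too many levels of symbolic links".
    if ls "$1" >/dev/null 2>&1; then
        echo "ok: $1 is readable"
    else
        # Intentionally unquoted so "echo mount" splits into two words.
        $MOUNT_CMD -t binfmt_misc binfmt_misc "$1"
    fi
}

remount_if_broken "$BINFMT_DIR"
```

Wrapping the check in a function keeps the path in one place and makes it easy to point the same logic at a scratch directory when experimenting.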

@cerebrate cerebrate added the awaiting-response Awaiting response to suggestion/solution. label Apr 23, 2021
@jitingcn

I do, however, have a workaround. Simply remounting binfmt_misc over the top of the existing mount with:

sudo mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc

I am using ArchWSL and hit the same problem. This workaround worked for me. To make systemd-binfmt.service work:

$ sudo systemctl edit systemd-binfmt.service --full

And replace ExecStart part with:

ExecStart=/bin/sh -c "ls /proc/sys/fs/binfmt_misc > /dev/null 2>&1 || \
  mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc; \
  /usr/lib/systemd/systemd-binfmt"

@esgie

esgie commented Jun 7, 2021

@jitingcn I like your solution, as it uses the preferred method of editing systemd configs (overriding the defaults using drop-in files) and does not require enabling rc.local on boot (which IS possible on Arch using the rc-local package from AUR, except it is considered deprecated…). Thanks. However, I feel it would be a little more elegant to put the additional mount command in a separate ExecStartPre setting rather than in ExecStart, which can be done by adding the line below to the [Service] section:
ExecStartPre=/bin/sh -c "ls /proc/sys/fs/binfmt_misc >/dev/null 2>&1 || mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc"
and leaving the ExecStart property unchanged. ExecStartPre is the dedicated property for commands that should be executed before the service is started with the command defined in ExecStart, so it should work just as well as your solution.
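
A minimal sketch of how that ExecStartPre line could be installed as a drop-in file follows. The helper name `write_binfmt_dropin` and the file name `10-remount-binfmt.conf` are illustrative choices. The directory shown in the trailing comment is the usual drop-in location for this unit, but the function takes the directory as an argument so it can be tried against a scratch path first.

```shell
#!/bin/sh
# Sketch: write the ExecStartPre override as a systemd drop-in file.
# Helper name and file name are illustrative, not taken from the thread.
write_binfmt_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    # Quoted heredoc delimiter: the shell writes the contents verbatim.
    cat > "$dropin_dir/10-remount-binfmt.conf" <<'EOF'
[Service]
ExecStartPre=/bin/sh -c "ls /proc/sys/fs/binfmt_misc >/dev/null 2>&1 || mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc"
EOF
}

# Real use (as root), followed by a reload so systemd picks the file up:
#   write_binfmt_dropin /etc/systemd/system/systemd-binfmt.service.d
#   systemctl daemon-reload
```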

@esgie

esgie commented Jun 7, 2021

I have looked deeper into the problem. Basically, WSL seems to mount and use /proc/sys/fs/binfmt_misc to provide Windows interop. It even registers a WSLInterop rule that handles Windows executables with /init. This seems to be a built-in WSL mechanism that runs at some point during WSL initialization.

But from inside genie (assuming we have masked all the systemd binfmt-related units and do not touch the mountpoint at all), something is wrong. The mountpoint is listed in the output of commands like mount or cat /proc/mounts, but the directory no longer contains entries like 'register' or 'status', and if I try to unmount binfmt_misc, I get an error saying it is not mounted. Executing Windows exes no longer works either.

This is the reason for the systemd units' failures. The proc-sys-fs-binfmt_misc.mount unit, which is the unit responsible for mounting binfmt_misc, reports that it has executed successfully, but in fact it does not perform any action, because it detects binfmt_misc as already mounted. Then systemd-binfmt.service performs some magic over the "successfully mounted" binfmt_misc. I don't know the details, but it looks like some kind of overlay built from the current binfmt_misc contents plus some additions, mounted in the same location. The overlay created on top of an empty dir is corrupted, and any attempt to access its contents results in a "Too many levels of symbolic links" error.

So the root cause of the failure is that, from inside genie, the default WSL binfmt_misc is reported as mounted, but it behaves as if it weren't when accessed.
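
The symptom described here (listed as mounted, yet missing 'register' and 'status') can be checked with a small script. The sketch below is only an illustration: the function name `binfmt_state` and the three labels it prints are invented for this example. It takes the mounts file and the directory as arguments so it can also be pointed at test fixtures.

```shell
#!/bin/sh
# Sketch: classify the state of the binfmt_misc mount.
# Function name and output labels are illustrative.
binfmt_state() {
    mounts_file="$1"   # normally /proc/mounts
    dir="$2"           # normally /proc/sys/fs/binfmt_misc
    # Field 2 of each /proc/mounts line is the mount point.
    if ! awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts_file"; then
        echo "not-mounted"
    elif [ -e "$dir/register" ] && [ -e "$dir/status" ]; then
        echo "healthy"
    else
        # Listed as mounted, but the control files cannot be reached.
        echo "listed-but-unusable"
    fi
}

binfmt_state /proc/mounts /proc/sys/fs/binfmt_misc
```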

In fact, all that has to be done to fix it is to unmount the binfmt_misc mounted by WSL itself (or prevent WSL from mounting it at all, if possible) before initializing genie (from outside of it). This should be enough for the binfmt-related systemd units to start successfully and work as expected, with no manual intervention.

Please note that the binfmt_misc created by systemd is not WSL-specific, so it is not preconfigured to support Windows executables like the one we unmounted. But we can easily bring Windows interoperability back by adding the config manually, e.g. by creating /etc/binfmt.d/99-WSLInterop.conf with the contents below:
:WSLInterop:M::MZ::/init:F
The rule should be registered by systemd-binfmt.service automatically on genie initialization.
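
For readers unfamiliar with the rule syntax, the kernel's binfmt_misc registration format is `:name:type:offset:magic:mask:interpreter:flags`. The sketch below splits the rule above into those fields. The helper name `describe_rule` is made up for this example.

```shell
#!/bin/sh
# Sketch: split a binfmt_misc rule into its fields.
# Kernel format: :name:type:offset:magic:mask:interpreter:flags
describe_rule() {
    # The leading ':' produces an empty first field, captured in "_empty".
    IFS=: read -r _empty name type offset magic mask interpreter flags <<EOF
$1
EOF
    echo "name=$name type=$type offset=$offset magic=$magic mask=$mask interpreter=$interpreter flags=$flags"
}

describe_rule ':WSLInterop:M::MZ::/init:F'
# Prints:
# name=WSLInterop type=M offset= magic=MZ mask= interpreter=/init flags=F
```

Here type `M` means matching on magic bytes, `MZ` is the signature at the start of Windows executables, and the `F` flag asks the kernel to open the interpreter when the rule is registered rather than when a binary is executed.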

At last I have qemu-user-static, Windows interoperability, and full systemd all working under WSL at once, without errors and without manual edits to systemd units.

@cerebrate
Member

@esgie Thanks muchly for your work on solving this one, too.

I've added your fix to genie 1.43.

cerebrate added a commit that referenced this issue Jun 9, 2021
cerebrate added a commit that referenced this issue Jun 9, 2021
@esgie

esgie commented Jun 9, 2021

Please take note of the findings below. There is some additional stuff that needs to be handled to avoid losing flexibility.
We unmount WSL's binfmt_misc and initialize the container. From inside the container everything goes smoothly. But if I create a new session outside the container (which could be a very frequent use case: whenever you want to run a separate Linux command as quickly as possible, just as WSL was designed for, independently of the full systemd session running on the same distro), binfmt_misc remains unmounted and empty, so WSLInterop isn't working.

But what I have discovered is that after mounting binfmt_misc again using 'sudo mount -t binfmt_misc none /proc/sys/fs/binfmt_misc' from outside the container, it is not only mounted successfully, but it also gets populated with the format definitions registered in the systemd session inside the container. Meanwhile, it does not seem to break or interfere with anything already running inside the container.

I was also worried about the systemd container shutdown and how it would affect outside sessions. Thankfully, closing the container / stopping the units responsible for binfmt does not break binfmt_misc for sessions run outside genie. The only effect I've noticed is that all registered formats disappear, including WSLInterop (which is understandable, as systemd-binfmt unregisters everything on stop). Therefore, we have to fix things again by echoing the :WSLInterop:M::MZ::/init:F string to /proc/sys/fs/binfmt_misc/register.

And in case we want to init systemd again, we unmount binfmt_misc again, do the init, mount again, etc.

So to sum up, if we don't want to break things for sessions run outside of genie, the algorithm for initializing genie would be something like the one below. I assume we are starting WSL, so we are outside genie and it is not running. I also assume that WSLInterop.conf has been installed to /etc/binfmt.d or another supported location, so that interop will work under systemd (I guess one in /usr/local/share or something will be more suitable if you plan to include it in the package). The steps for initializing genie are:

  • unmount binfmt_misc
  • init the genie
  • wait until it starts (it may be enough to wait for systemd-binfmt.service, I guess)
  • still from outside of genie, mount binfmt_misc again.
    binfmt_misc should now work from both outside and inside the genie. Plus, the definitions registered in the container will work outside of it as well, including WSLInterop. In fact, binfmt_misc should be unavailable only for the limited time required for systemd init.
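
The start-up steps above could be scripted roughly as follows. This is a sketch under assumptions, not the implementation that shipped in genie: it assumes `genie -i` initializes the bottle and `genie -c` runs a command inside it, and the wait loop is just one guess at "wait until it starts". The `run` helper and the `DRY_RUN` variable are invented for the example. With the default `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Sketch of the start-up sequence, run from OUTSIDE the bottle.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
# Assumptions: "genie -i" initializes the bottle, "genie -c" runs a
# command inside it. The wait loop has no timeout in this sketch.
BINFMT_DIR=/proc/sys/fs/binfmt_misc

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

start_genie_with_binfmt() {
    # 1. Drop the binfmt_misc mount that WSL created.
    run sudo umount "$BINFMT_DIR"
    # 2. Initialize the bottle.
    run genie -i
    # 3. Wait until systemd-binfmt.service is active inside the bottle.
    until run genie -c systemctl is-active --quiet systemd-binfmt.service; do
        sleep 1
    done
    # 4. Still outside the bottle, mount binfmt_misc again.
    run sudo mount -t binfmt_misc binfmt_misc "$BINFMT_DIR"
}

start_genie_with_binfmt
```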

The algorithm for closing the container:

  • trigger the genie's shutdown
  • wait until the stop is completed
  • echo the WSLInterop string to /proc/sys/fs/binfmt_misc/register
    binfmt_misc should now work outside genie again, with the default rule enabled and running. During the shutdown, interoperability is unavailable only for a limited time: from the moment systemd-binfmt unregisters all the formats on stop until genie has closed successfully and we have echoed the interop string.
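
A matching sketch for the shutdown steps, again under assumptions: it assumes `genie -u` shuts the bottle down and returns once the stop has completed. The `run` helper and `DRY_RUN` variable are invented for the example, and the default is a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the shutdown sequence, run from OUTSIDE the bottle.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
# Assumption: "genie -u" returns once the bottle has stopped.
REGISTER=/proc/sys/fs/binfmt_misc/register
WSL_RULE=':WSLInterop:M::MZ::/init:F'

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

stop_genie_restore_interop() {
    # 1-2. Trigger the shutdown and wait for it to finish.
    run genie -u
    # 3. Re-register the interop rule that systemd-binfmt removed on stop.
    # tee is used because "sudo echo ... > file" would redirect unprivileged.
    run sh -c "echo '$WSL_RULE' | sudo tee $REGISTER >/dev/null"
}

stop_genie_restore_interop
```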

That said, I think Microsoft's binfmt_misc implementation is somewhat non-standard and involves some interaction with the underlying system distro, which results in some strange behavior when combined with yet another layer of containerization. I have to say I do not completely understand the relations between all the elements, and they seem to behave quite strangely sometimes. For example, the original WSLInterop rule refers to an interpreter location named /tools/init, and by default, when running mount, you can see that /init is a mountpoint created from some odd instance named 'tools'. I guess it is Microsoft's custom init, which has a built-in feature for handling Windows exes. However, after we start genie, the /init available from inside the container is totally different. In fact, it must be different: it must be a systemd-enabled /init running as pid 1, as this is the fundamental requirement for systemd to run... Microsoft's custom /init does not seem to be available inside the container, as it is overridden by the container's own. So why does Windows interoperability work fine inside the container after I register /init as an interpreter for Windows exes, even though the /init that was able to handle them has been overridden by one that shouldn't be able to? I don't know.

However, I found that the above steps work nicely, providing as stable and flexible an environment for using binfmt_misc as possible, without breaking things for longer than needed. But there are still things to understand.

I am sorry for not providing you with any code, but I am a rather poor programmer ;(
For my own purposes I handle this with bash, and the operations to be performed are rather basic, so I guess the description will be far more useful to you than those scripts.

Please also note that the same problem, i.e. systemd breaking things for outside sessions and vice versa, applies to the systemd-resolved issue as well: each type of session requires a different resolv.conf for DNS to work, so creating the systemd one will result in non-working DNS outside, and reverting to the default one will break it for systemd.
So surely in that case we need to handle some extra stuff as well.
But I haven't looked into it deeper yet.
Regards.

@esgie

esgie commented Jun 21, 2021

Hi,
I found the solution included in 1.43 quite unstable.
In fact, it seemed to work at the beginning, but at some point in the session (not a distant one, like a couple of minutes to half an hour in) interop inside genie was refusing to work with an "invalid argument" error, which required me to run wsl.exe --shutdown and restart.
That is caused by the fact that the rule in binfmt.d points to /init as the interpreter. There are two inits: WSL's one, which is customized to handle Windows exes, and systemd's one, which is required to run as pid 1 etc. and isn't able to run any Windows stuff.
So the init inside genie can't run Windows exes, and that gives the "invalid argument" error.
It seems that the solution works only very early after creating the container. I guess something gets "refreshed" at some point and, bang, the systemd container no longer sees the old init, at least in some of its parts, and there is no more interop from then on.
I tried doing some gentle things with Microsoft's init (which is a mountpoint created in some non-standard way, before all the userspace stuff happens, I guess), like hardlinking it or bind-mounting it elsewhere. But what actually worked, fixed interop altogether, and made it stable was… just copying /init to some other location before initializing genie, e.g. cp /init /init.wsl, and changing the binfmt.d/WSLInterop.conf rule so it uses /init.wsl instead of /init as the interpreter.
Now initializing genie, and whatever happens to /init, will not affect WSL interoperability, which will be performed by a cold copy of the genuine Microsoft interpreter saved to /init.wsl.
And that works. Microsoft's init is mounted in a very unusual way, but all in all it does not seem to care about its filename or location, and works fine either way.
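
A minimal sketch of that change: the helper below (its name `wsl_interop_rule` is made up for this example) builds the interop rule for an arbitrary interpreter path, and the trailing comments show how it might be combined with the copy step described above.

```shell
#!/bin/sh
# Sketch: build the WSLInterop rule for a given interpreter path.
# The helper name is illustrative; /init.wsl is the path from the comment above.
wsl_interop_rule() {
    printf ':WSLInterop:M::MZ::%s:F\n' "$1"
}

wsl_interop_rule /init.wsl
# Prints:
# :WSLInterop:M::MZ::/init.wsl:F

# Before initializing genie (outside the bottle, as root) one would then run:
#   cp /init /init.wsl
#   wsl_interop_rule /init.wsl > /etc/binfmt.d/99-WSLInterop.conf
```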

@cerebrate
Member

cerebrate commented Jun 23, 2021

Interesting. I can't repro this behavior; I have a WSL/genie/systemd session with four days of uptime on it right now and it still interops just fine.

But the weird thing, so far as I'm concerned, is the apparent two-init issue you're seeing. Running systemd as init (pid 1) inside the container shouldn't do anything to the /init file; it's invoked directly as /usr/lib/systemd/systemd, as you can see if you look at /proc/(systemdpid)/exe from outside the bottle. /init shouldn't be touched by anything genie does, and so far as I know, neither systemd nor any other init feels the need to rewrite itself.

And indeed when I check in various ways for differences between inside and outside the bottle, the mount is the same:

tools on /init type 9p (ro,relatime,dirsync,aname=tools;fmask=022,loose,access=client,msize=65536,trans=fd,rfd=7,wfd=7)

the inode number (ls -i) is the same

8162774325450668 init

and when I use md5sum on them, they also come out with the identical checksum.

❯ md5sum /init
0258dfe90ae79a649a5c0d6aac80bbf9  /init

...so my next step, here, has to be to ask you if you can run those same steps on your system when/if interop stops working, and let me have the results. And if those show that /init actually has become a different file, I guess we'll need to find out how and when?
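
To make that comparison easy to repeat, the inode and checksum checks could be wrapped in one small helper. This is only a sketch: the function name `fingerprint` is invented here, and it assumes GNU `stat` and `md5sum` are available, as they are on typical Ubuntu and Arch installs.

```shell
#!/bin/sh
# Sketch: print "<inode> <md5>" for a file, so the results from inside and
# outside the bottle can be compared. Assumes GNU stat and md5sum.
fingerprint() {
    inode=$(stat -c %i "$1") || return 1
    sum=$(md5sum < "$1") || return 1
    # md5sum prints "<hash>  -" when reading stdin; keep only the hash.
    echo "$inode ${sum%% *}"
}

# Run once outside and once inside the bottle, then compare the two lines:
#   fingerprint /init
```

Identical output in both places would indicate the same file. A copy such as /init.wsl would show the same checksum but a different inode.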

@esgie

esgie commented Jun 24, 2021 via email

@cerebrate
Member

I'm using a custom kernel myself (although compiled using a reconfiguration of the Microsoft config, with modules & DKMS - those units are recommended-disabled just for people running the stock kernel), so FWIW, it seems unlikely to be that.

Although, that said, I've never seen any of those other errors with my custom kernel, so.
