Kernel error: unregister_netdevice: waiting for mesh-vpn to become free. Usage count = 1 #1258
Comments
Several such bugs were found (and newly introduced?) up and down the Linux network stack, for instance 8397ed36b7, 202f59afd44147 or f186ce61bb8235. And then there was an even bigger cleanup from David Miller: cf124db566e6b. Most of this seems to be present only since 4.12, with no backport to stable kernels yet. So it would be interesting to retry with some kernel >= 4.12.
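For anyone unsure whether their kernel is recent enough, a rough check against the 4.12 threshold mentioned above can be sketched like this (it only compares version numbers; distro kernels may have backported individual fixes, so a commit-level check would be more precise):

```shell
#!/bin/sh
# Rough check: is the running kernel >= 4.12, where most of the
# refcounting fixes mentioned above first landed?
kver=$(uname -r)
major=${kver%%.*}
rest=${kver#*.}
minor=${rest%%.*}
# strip any non-numeric suffix, e.g. "14-2-amd64" -> "14"
minor=${minor%%[!0-9]*}
if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 12 ]; }; then
    echo "kernel $kver: >= 4.12, fixes likely included"
else
    echo "kernel $kver: < 4.12, consider retrying with a newer kernel"
fi
```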
I also thought this error might be gone.
I have been seeing this on real hardware in the field (TP-Link TL-WR841N/ND v11) when the tunneldigger broker is restarted several times (about 3 times) within a period of roughly 30 minutes.
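The restart cycle described above can be scripted on the broker host, roughly like this (a dry-run sketch: the actual restart command for your broker and the ~10-minute spacing are assumptions based on this report, so adjust both):

```shell
#!/bin/sh
# Sketch of the reproduction: restart the tunneldigger broker ~3 times
# within ~30 minutes, then check the node for the kernel message.
# Dry-run by default; set RESTART_CMD to the real restart command, e.g.
#   RESTART_CMD="systemctl restart tunneldigger-broker"
restart_cmd="${RESTART_CMD:-echo systemctl restart tunneldigger-broker}"
delay="${DELAY:-1}"   # in the field: ~600 seconds between restarts
i=1
while [ "$i" -le 3 ]; do
    echo "restart #$i"
    $restart_cmd
    sleep "$delay"
    i=$((i + 1))
done
echo "now check the node: dmesg | grep unregister_netdevice"
```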
Just wanted to add some information: Yes, this message points to a kernel bug; something is not decreasing a reference counter correctly there. And yes, some years ago we found similar issues in batman-adv and afterwards did a thorough review of this. People tried the corresponding patches and reported their issue as fixed. It seems that this issue has so far reappeared only for setups using tunneldigger instead of fastd? Could it be that this time the l2tp kernel module does not handle its teardown properly (as the issue is not reproducible with fastd)?
Sounds plausible.
I just noticed this in the firmware for Freifunk Hamburg (Linux kernel 4.4.93) with two nodes. Both were affected after running `wifi`.
@tokudan: Which batman-adv and Gluon version were those two nodes running? |
@T-X Freifunk Hamburg is at Gluon v2017.1.5 with batman-adv 2017.2 (from gluon-mesh-batman-adv-15).
The last message kept repeating until a reboot, and commands like `batctl o` hung. [Update: Gluon 2017.1.5, not ..2.5, of course...]
It seems a new refcounting issue was introduced a while ago, which would explain why reports have become more frequent recently: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=999bb66b20b03c753801ecebf1ec2a03c6a63c96 Please test if the issue is still reproducible with latest master or v2017.1.x.
I just tried with 2017.1.7 and could not reproduce this in my VM with a suspend-resume cycle. However, this is also no longer the same host machine that I used back in November. |
When testing the latest v2017.1.x (1304907) yesterday, I ran into:
Never mind, I am seeing this again in my VM. Probably the machine wasn't suspended long enough previously to trigger this, now it's been >90min and the error shows up again. |
It would be great if someone who can reproduce this issue with v2017.1.x or master could check the next branch, so we know if kernel 4.9.y/4.14.y are affected as well. Also include the hardware target and the exact kernel version of your build with your feedback. |
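A small snippet for collecting the details requested above in one go could look like this (the guards let it run outside OpenWrt too; the `/tmp/sysinfo/model` and `/lib/gluon/gluon-version` paths are assumptions about where OpenWrt and Gluon store this information, so verify them on your build):

```shell
#!/bin/sh
# Collect the details requested for a report:
# hardware target, exact kernel version, Gluon and batman-adv versions.
echo "kernel: $(uname -r)"
if [ -r /tmp/sysinfo/model ]; then
    echo "model: $(cat /tmp/sysinfo/model)"
fi
if [ -r /lib/gluon/gluon-version ]; then
    echo "gluon: $(cat /lib/gluon/gluon-version)"
fi
if command -v batctl >/dev/null 2>&1; then
    echo "batctl: $(batctl -v)"
fi
```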
@RalfJung can reproduce it, and he described how in detail, so he or anyone else with time should be able to reproduce it.
I don't have a tunneldigger setup at my disposal; fastd is not affected by the issue in the same way, as the mesh-vpn interface is not torn down on reconnects with fastd. I might be able to get the same behaviour by changing the fastd config a bit...
It happened to me in a fastd setup on 2018-05-05 while debugging another thing, but I'm not sure how to reproduce it as I was focused on the other thing...
I might have time to try this later. @NeoRaider if you like, you could use our site file from https://git.hacksaar.de/FreifunkSaar/gluon-site and test with our tunneldigger servers.
I compiled the current next branch (eef493d) and ran it in a VM. The issue occurred, as usual, after leaving the VM host suspended for the night.
I also have the error message now on my home router, though not with tunneldigger: EDIT: happened again today
On 2018/05/25, almost all of our devices running v2017.1.x were affected by this issue, with mesh0/mesh1 not becoming free.
I have invested much more time now, hoping to figure this out.
The reproduction is possibly unrelated to hardware, as I could reproduce it with the following devices:
Also, I could reproduce it with the following different settings:
My reproduction works as follows:
I used the community network of Freifunk Altdorf.
I just noticed the same after more and more nodes were detected as down by respondd/yanic (meshviewer) while the bat-mesh was still working. Devices so far:
Error: I am using tunneldigger on latest gluon master (no fw customizing, no custom packages).
@rotanid You wanted to test batman-adv from gluon 2016.2.x. I've added it as branch batadv/2016.2.x in https://github.com/FreifunkVogtland/gluon/tree/batadv/2016.2.x. I think it is reasonable to just use kvm for the tests, as wifi mesh neighbors don't seem to be relevant here ("$" is on the host system, "#" is in OpenWrt/LEDE).
You can find more batman-adv versions between https://github.com/FreifunkVogtland/gluon/tree/batadv/2016.2.x and https://github.com/FreifunkVogtland/gluon/tree/batadv/2018.1.
Interesting! I guess this is what I "achieved" by putting the laptop running the node (in a VM) into suspend -- effectively no packets can go in or out any more. Just putting it into suspend for a couple seconds didn't do anything, but leaving it in suspend for half an hour reliably triggers the problem. |
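The suspend trigger can also be driven from the VM host with libvirt instead of suspending the whole laptop. A dry-run sketch (the domain name `node` and the half-hour pause are assumptions based on the observation above; set `VIRSH=virsh` to actually run it):

```shell
#!/bin/sh
# Pause the guest long enough for its VPN/mesh peers to time out,
# then resume it and check for the unregister_netdevice message.
# Dry-run by default: set VIRSH=virsh to execute for real.
virsh="${VIRSH:-echo virsh}"
domain="${DOMAIN:-node}"
pause="${PAUSE:-1}"   # in the field: ~1800 seconds (half an hour)
$virsh suspend "$domain"
sleep "$pause"
$virsh resume "$domain"
echo "resumed; now check 'dmesg' inside the guest"
```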
@rotanid Can you test whether it:
I have now also added the branches with all the patches from 2016.4 as separate commits:
@rotanid: I found a big problem in the gateway netlink code from Andrew. I'm not sure whether it is the cause of your issue, but it can at least cause such symptoms in gluon 2017.1.x. I will post a patch later today.
The patches were submitted to batman-adv. The branches for gluon master and 2017.1.x (yes, they need slightly different patches) can be found at:
@ecsv thanks for your work! @RalfJung please also test your case with at least one of those two branches.
Merged for both master and v2017.1.x, thanks to everyone who helped get to the root of this and fix it!
And thanks to the UPS vendor used by Hetzner for the failing UPS on 2018/05/24, which led to the exact length of downtime that triggered the bug for us ;)
I tried the 2017.1.x branch and was unable to reproduce.
I get it reproducibly on a regular Linux machine with kernel
with Debian buster, using only packages from the distribution on kernel
Cheers, |
@smoe Kernel 4.16 doesn't have the most recent batman-adv fixes, and it might take a while until it gets them, if ever.
@rotanid, I was not completely sure how to interpret your reply. Is there something to be done to shorten the time between batman-adv development and an installable package for Debian unstable/testing? It should be possible for me to help with something along those lines, aiming at Debian experimental. I would like to consult you and the Debian kernel folks about the best workflows first, though. Concerning the exact issue at hand, I should not attempt to fix it myself; I would not know where to start. I only meant to help with an easy way to get it reproduced.
@smoe For official Debian repositories, you will have to wait until there is a new official kernel including a new batman-adv release; there's little we can do to accelerate this. The alternative is to build a Debian package with the latest batman-adv kernel module yourself, which I did for Debian Jessie with kernel 4.9 from backports.
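The manual route looks roughly like this (a dry-run sketch: the URL is the official batman-adv repository, but the exact make targets should be double-checked against the batman-adv README, and you need git, build-essential and the linux-headers package for your running kernel):

```shell
#!/bin/sh
# Sketch of building the out-of-tree batman-adv module on a Debian host.
# Dry-run by default: set RUN=1 to execute the commands for real.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}
run git clone https://git.open-mesh.org/batman-adv.git
run make -C batman-adv
run make -C batman-adv install   # installs into the running kernel's module dir
```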
@rotanid Well, yes, if that could be automated, then I offer to upload to Debian Experimental (or sign and upload elsewhere, but since we have all other bits of the Freifunk gateway infrastructure in the distribution, I would like to stay close). I don't have any interaction with the Debian Kernel Team; they may have additional ideas, and I would want to consult them about whatever plan we/you/I come up with. For instance, if Debian got the batman-adv module back as a separate package, then there could also be a batman-adv package distributed via the experimental section. Feel free to contact me via moeller@d.o to discuss things further.
@ecsv probably knows best, but since batadv became part of the upstream kernel, the dkms package for Debian was discontinued around 2011. FWIW: we would appreciate having a recent version packaged through dkms.
I am seeing the following kernel error on some of my nodes:

In the past, I used to see an error very much like this on gateways, usually in the context of doing a `reboot` and then finding the machine stuck (and recoverable only via the server console). Batman had some bugs where it did not properly detach some interfaces. However, I haven't seen this error on gateways ever since we updated them to Batman 2017.x.

I am running two nodes, both with Gluon 2017.1.3. One, for testing, is an x86 node running in a VM via libvirt. There, I can pretty much reproduce the above error as follows:

I then have to do a `reboot` of the VM (i.e., the node) to get it into a working state again.

Just now, I saw this error for the first time on real hardware (TP-Link TL-WR841N/ND v10). I was trying to do `/etc/init.d/tunneldigger restart`, and it got stuck. `logread` shows that tunneldigger got the termination signal, but then failed to terminate. I sent `kill -KILL` to the result of `pidof tunneldigger`, to no avail. `dmesg` shows the above error, and it kept repeating. Ultimately I had to `reboot` the device.

Now I am a little worried that other devices might run into a similar issue when stopping services for an auto-update, and will just not work until someone notices and manually power cycles them.
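To limit the time a stuck node stays down, a crude watchdog along these lines could help (a sketch only: the message pattern is taken from this issue's title, the reboot is stubbed out by default, and a kernel wedged in this state may not even complete a clean reboot, so treat this as a last resort, not a fix):

```shell
#!/bin/sh
# Crude watchdog sketch: if the kernel log shows the
# "unregister_netdevice: waiting for ... to become free" message,
# trigger a reboot. Stubbed by default; set REBOOT_CMD=reboot to arm it.
reboot_cmd="${REBOOT_CMD:-echo would reboot}"

check_log() {
    if printf '%s\n' "$1" | grep -q 'unregister_netdevice: waiting for'; then
        echo "stuck device detected"
        $reboot_cmd
        return 0
    fi
    return 1
}

# normally you would feed it the real log, e.g.:
#   check_log "$(dmesg | tail -n 50)"
check_log "unregister_netdevice: waiting for mesh-vpn to become free. Usage count = 1"
```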