
Kernel error: unregister_netdevice: waiting for mesh-vpn to become free. Usage count = 1 #1258

Closed
RalfJung opened this issue Nov 12, 2017 · 38 comments
Labels: 0. type: bug, 9. meta: upstream issue
Comments

@RalfJung
Contributor

I am seeing the following kernel error on some of my nodes:

unregister_netdevice: waiting for mesh-vpn to become free. Usage count = 1

In the past, I used to see an error very much like this on gateways; usually in the context of doing a reboot and then finding the machine stuck (and recoverable only via the server console). Batman had some bugs where it did not properly detach some interfaces. However, I haven't seen this error on gateways ever since we updated them to Batman 2017.x.

I am running two nodes, both with Gluon 2017.1.3. One, for testing, is an x86 node running in a VM via libvirt. There, I can pretty much reproduce the above error as follows:

  • Start the VM, make sure it is connected.
  • Put the host (i.e., my laptop) into stand-by.
  • Wait a little while (I usually get on the bus and ride home, or something like that).
  • Resume the host. Check the VM output, and I see the above error.

I then have to do a reboot of the VM (i.e., the node) to get it into a working state again.
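For reference, here is a rough sketch of that reproduction in terms of commands; the libvirt domain name "gluon-node" is only a placeholder:

$ virsh start gluon-node      # boot the node VM and let it establish its VPN connection
$ systemctl suspend           # put the host (my laptop) into stand-by
  (resume the host after a longer while, e.g. half an hour or more)
$ virsh console gluon-node    # the unregister_netdevice message appears on the console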

Just now, I saw this error for the first time on real hardware (TP-Link TL-WR841N/ND v10). I was trying to do /etc/init.d/tunneldigger restart, and it got stuck. logread shows that tunneldigger got the termination signal, but then failed to terminate. I sent kill -KILL to the result of pidof tunneldigger, to no avail. dmesg shows the above error, which kept repeating. Ultimately I had to reboot the device.
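Roughly, the sequence on the device was (exact invocations approximated):

# /etc/init.d/tunneldigger restart    # hangs; logread shows tunneldigger got SIGTERM but never exited
# kill -KILL $(pidof tunneldigger)    # no effect either
# dmesg                               # shows the repeating unregister_netdevice message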

Now I am a little worried that other devices might run into a similar issue when stopping services for an auto-update, and will just not work until someone notices and manually power cycles them.

@T-X
Contributor

T-X commented Nov 13, 2017

There were several such bugs found (and newly introduced?) up and down the Linux network stack, for instance 8397ed36b7, 202f59afd44147 or f186ce61bb8235. And then there was an even bigger cleanup from David Miller: cf124db566e6b. Most of these fixes seem to be present only since 4.12, with no backport to stable kernels yet. So it would be interesting to retry with a kernel >= 4.12.

@rotanid added the "0. type: bug" and "9. meta: upstream issue" labels on Nov 14, 2017
@FreifunkUFO

I also thought this error had been dealt with.
That batman-adv behaviour is the main reason to recommend using "reboot -f" instead of "reboot" whenever batman-adv is in use!

@kaechele
Contributor

kaechele commented Jan 7, 2018

I have been seeing this on real hardware in the field (TP-Link TL-WR841N/ND v11) when the tunneldigger broker is restarted several times (ca. 3 times) within a period of about 30 minutes.

@T-X
Contributor

T-X commented Jan 13, 2018

Just wanted to add some information: Yes, this message points to a kernel bug; something is not decreasing a reference counter correctly there.

And yes, some years ago we found similar issues in batman-adv and afterwards did a thorough review of this. People tried the corresponding patches and reported their issue as fixed.

It seems that this issue has so far reappeared only for setups using tunneldigger instead of fastd? Could it be that this time the l2tp kernel module does not handle its teardown properly (as the issue is not reproducible with fastd)?

@RalfJung
Contributor Author

Could it be that this time the l2tp kernel module does not handle its teardown properly (as the issue is not reproducible with fastd)?

Sounds plausible.

@tokudan
Contributor

tokudan commented Feb 14, 2018

I just noticed this in the firmware for Freifunk Hamburg (Linux kernel 4.4.93) with two nodes. Both were affected after running "wifi".
It seems that batman was responsible for the issues there, as we do not use tunneldigger, and "batctl o" and some other batman commands hung after that.
It's not (yet?) reliably reproducible though. After rebooting the node, running wifi did not trigger the issue again.
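In other words, the trigger on those nodes was roughly:

# wifi                        # reload the wireless configuration
# dmesg | grep unregister     # the unregister_netdevice message starts appearing
# batctl o                    # hangs once the node is in this state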

@T-X
Contributor

T-X commented Feb 15, 2018

@tokudan: Which batman-adv and Gluon version were those two nodes running?

@tokudan
Contributor

tokudan commented Feb 15, 2018

@T-X Freifunk Hamburg is at Gluon v2017.1.5 with batman-adv 2017.2 (from gluon-mesh-batman-adv-15).
The exact messages I saw on a TPLink 4900 while and after running "wifi":

Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client0: interface state ENABLED->DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client0: AP-DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client0: CTRL-EVENT-TERMINATING
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: nl80211: deinit ifname=client0 disabled_11b_rates=0
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.851473] device client0 left promiscuous mode
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.856335] br-client: port 4(client0) entered disabled state
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: interface state ENABLED->DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED c0:bd:d1:d2:f9:be
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED d8:5b:2a:84:a6:61
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED d0:65:ca:74:fd:09
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-STA-DISCONNECTED 00:71:cc:cd:4b:8f
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: AP-DISABLED
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: client1: CTRL-EVENT-TERMINATING
Wed Feb 14 23:11:38 2018 daemon.notice hostapd: nl80211: deinit ifname=client1 disabled_11b_rates=0
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.871863] device client1 left promiscuous mode
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.876851] br-client: port 5(client1) entered disabled state
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Network device 'client0' link is down
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Network device 'client1' link is down
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.937270] batman_adv: bat0: Interface deactivated: mesh1
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Network device 'mesh1' link is down
Wed Feb 14 23:11:38 2018 daemon.notice netifd: Interface 'mesh_radio1' has link connectivity loss
Wed Feb 14 23:11:38 2018 kern.info kernel: [174633.977777] batman_adv: bat0: Interface deactivated: mesh0
Wed Feb 14 23:11:39 2018 daemon.notice netifd: Network device 'mesh0' link is down
Wed Feb 14 23:11:39 2018 daemon.notice netifd: Interface 'mesh_radio0' has link connectivity loss
Wed Feb 14 23:11:39 2018 kern.info kernel: [174634.052896] batman_adv: bat0: Removing interface: mesh1
Wed Feb 14 23:11:39 2018 daemon.notice netifd: Interface 'mesh_radio1' is disabled
Wed Feb 14 23:11:39 2018 kern.info kernel: [174634.179498] batman_adv: bat0: Removing interface: mesh0
Wed Feb 14 23:11:39 2018 daemon.info respondd[27822]: unable to read providers from '/usr/lib/respondd', ignoring
Wed Feb 14 23:11:49 2018 kern.emerg kernel: [174644.354817] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1
Wed Feb 14 23:11:59 2018 kern.emerg kernel: [174654.598617] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1
Wed Feb 14 23:12:09 2018 kern.emerg kernel: [174664.846348] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1

The last message kept repeating until a reboot, and commands like "batctl o" hung.
Respondd seems to have failed as well, resulting in the node appearing offline on the map.

[Update: Gluon 2017.1.5 not ..2.5 of course...]

@neocturne
Member

It seems a new refcounting issue was introduced a while ago, which would explain why reports have become more frequent recently: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=999bb66b20b03c753801ecebf1ec2a03c6a63c96

Please test if the issue is still reproducible with latest master or v2017.1.x.

@rotanid added the "5. needs: testing" label on May 5, 2018
@RalfJung
Contributor Author

RalfJung commented May 6, 2018

I just tried with 2017.1.7 and could not reproduce this in my VM with a suspend-resume cycle. However, this is also no longer the same host machine that I used back in November.

@rotanid
Member

rotanid commented May 6, 2018

When testing latest v2017.1.x (1304907) yesterday, I ran into
Sat May 5 20:34:21 2018 kern.emerg kernel: [ 3944.265393] unregister_netdevice: waiting for primary0 to become free. Usage count = 1
during tests with the autoupdater-wifi-fallback and ssid-changer.
It didn't occur again after another reboot, though.

@RalfJung
Contributor Author

RalfJung commented May 6, 2018

Never mind, I am seeing this again in my VM. Probably the machine wasn't suspended long enough previously to trigger this; now it's been >90 min and the error shows up again.

@neocturne removed the "5. needs: testing" label on May 10, 2018
@neocturne
Member

It would be great if someone who can reproduce this issue with v2017.1.x or master could check the next branch, so we know if kernel 4.9.y/4.14.y are affected as well. Also include the hardware target and the exact kernel version of your build with your feedback.

@rotanid
Member

rotanid commented May 12, 2018

@RalfJung can reproduce it and has described in detail how, so he or anyone else with time should be able to reproduce it.

@neocturne
Member

I don't have a tunneldigger setup at my disposal; fastd is not affected by the issue in the same way, as the mesh-vpn interface is not torn down on reconnects with fastd. I might be able to get the same behaviour by changing the fastd config a bit...

@rotanid
Member

rotanid commented May 12, 2018

It happened to me in a fastd setup on 2018-05-05 while debugging another thing, but I'm not sure how to reproduce it as I was focused on the other thing...

@RalfJung
Contributor Author

I might have time to try this later. @NeoRaider if you like, you could use our site file from https://git.hacksaar.de/FreifunkSaar/gluon-site and test with our tunneldigger servers.

@RalfJung
Contributor Author

I compiled the current next branch (eef493d) and ran it in a VM. The issue occurred, as usual, after leaving the VM host suspended for the night.

@rotanid
Member

rotanid commented May 21, 2018

I also have the error message now with my home router, but not with tunneldigger:
[336077.155375] unregister_netdevice: waiting for mesh1 to become free. Usage count = 1
running Gluon v2017.1.7, kernel 4.4.129, batman-adv compat15, fastd, router WDR4900 (dual-band ath9k), some wifi-mesh neighbors.

EDIT: happened again today

@rotanid
Member

rotanid commented May 25, 2018

On 2018-05-25 almost all of our devices running v2017.1.x were affected by this issue, with mesh0/mesh1 not becoming free.
Maybe commit f4e278e made things worse, not better? It's not the cause of the issue though; I tried without it.

@rotanid
Member

rotanid commented Jun 1, 2018

I have invested much more time now, hoping to figure this out.
I can reproduce the issue reliably now, as I will describe below.
I could reproduce it with several Gluon versions:

  • several versions of the v2017.1.x branch, starting with the release v2017.1 up to the latest update in the branch (kernel 4.4.x, batman-adv 2017.1 and 2017.2)
  • Gluon master branch (upcoming v2018.1, kernel 4.4.x, batman-adv 2018.0)

I couldn't reproduce it with Gluon v2016.2.7 (kernel 3.18.x, batman-adv 2016.2), although this doesn't mean the bug isn't there.

The reproduction is possibly unrelated to hardware, as I could reproduce it with the following devices:

  • TP-Link WDR4900 (mpc85xx-generic dual-band)
  • TP-Link WDR3600 (ar71xx-generic ath9k dual-band)
  • TP-Link TL-WR1043ND v4 (ar71xx-generic ath9k single-band)
  • TP-Link Archer C59 (ar71xx-generic ath10k dual-band)

Also, I could reproduce it with the following different settings:

  • Wireless Mesh on/off
  • LAN Mesh on/off
  • fastd VPN on/off (though it wasn't easy to test without fastd VPN enabled)

My reproduction works as follows (a command sketch of step 3 is given at the end of this comment):

  1. start (or reboot) the node
  2. wait until the gateway connection is pretty good ( batctl gwl )
  3. DROP all UDP packets coming from the Gluon node (either at your home router or on the gateway by using the public WAN IP address of the node)
  4. wait at least 8 minutes (more to be safe; it wasn't always reproducible when the waiting time was too short)
  5. execute /etc/init.d/network restart on the node
  6. check dmesg | grep unregister

I used the community network of Freifunk Altdorf.
site.conf: https://github.com/tecff/site-ffa
example affected build: http://fw1.tecff.de/issue-1258-affected/
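As a minimal sketch of step 3 on the gateway (198.51.100.23 is only a placeholder for the node's public WAN IP address):

$ sudo iptables -A INPUT -s 198.51.100.23 -p udp -j DROP    # drop all UDP from the node
  (wait at least 8 minutes, then restart the network on the node)
$ sudo iptables -D INPUT -s 198.51.100.23 -p udp -j DROP    # remove the rule again afterwards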

@kevin-olbrich
Contributor

kevin-olbrich commented Jun 1, 2018

I just noticed the same after more and more nodes were detected as down by respondd / yanic (meshviewer) while bat-mesh was working.

Devices so far:

  • TP-LINK TL-WA801ND v2
  • Ubiquiti UniFi

Error:
unregister_netdevice: waiting for mesh0 to become free. Usage count = 1

I am using tunneldigger on latest gluon master (no firmware customization, no custom packages).
If this is an issue with reconnects to the broker because of packet loss on UDP (like the test case above), there seems to be a second bug. The uplink where this node is located is stable and reliable and was not disconnected for the last two days (also, there are no errors visible in our logs and monitoring).

@ecsv
Contributor

ecsv commented Jun 2, 2018

@rotanid You wanted to test batman-adv from gluon 2016.2.x. I've added it as branch batadv/2016.2.x at https://github.com/FreifunkVogtland/gluon/tree/batadv/2016.2.x.

I think it is reasonable to just use kvm for the tests as wifi mesh neighbors don't seem to be relevant here ("$" is on the host system, "#" is in OpenWrt/LEDE).

$ sudo iptables -D INPUT -p udp -j DROP
$ qemu-system-x86_64 -enable-kvm -snapshot -nographic -nic user,mac=02:11:22:33:44:01 gluon-*-x86-64.vmdk
wait 8 minutes for the node to be "stable" (check `batctl gwl`)
$ sudo iptables -A INPUT -p udp -j DROP
wait 8 minutes for the node to lose its neighbors in batman-adv and get into the bad state(?)
# /etc/init.d/network restart
wait for 60 seconds to see whether "unregister_netdevice: waiting for mesh0 to become free" appears in the kernel log

You can find more batman-adv versions between https://github.com/FreifunkVogtland/gluon/tree/batadv/2016.2.x and https://github.com/FreifunkVogtland/gluon/tree/batadv/2018.1.

@RalfJung
Contributor Author

RalfJung commented Jun 2, 2018

  • DROP all UDP packets coming from the Gluon node (either at your home router or on the gateway by using the public WAN IP address of the node)
  • wait at least 8 minutes (more to be safe, wasn't always reproducible when waiting time was too short)

Interesting! I guess this is what I "achieved" by putting the laptop running the node (in a VM) into suspend -- effectively no packets can go in or out any more. Just putting it into suspend for a couple seconds didn't do anything, but leaving it in suspend for half an hour reliably triggers the problem.

@ecsv
Contributor

ecsv commented Jun 2, 2018

@rotanid Can you test whether:

  • with 1e7b160 ("Switch to batman-adv 2016.4-0") the problem happens
  • with 91a1dc0 ("Switch to batman-adv 2016.3-0") the problem doesn't happen

I have now also added the branches with all the patches from 2016.4 as separate commits:

@ecsv
Contributor

ecsv commented Jun 2, 2018

@rotanid: I found a big problem in the gateway netlink code from Andrew. Not sure whether this is your problem, but it is at least one that can cause such a problem in gluon 2017.1.x. I will post a patch later today.

@ecsv
Contributor

ecsv commented Jun 2, 2018

The patches were submitted to batman-adv:

The branches for gluon master and 2017.1.x (yes, they need slightly different patches) can be found at:

@rotanid
Member

rotanid commented Jun 3, 2018

@ecsv thanks for your work!
I tested this by using your branches:
https://github.com/FreifunkVogtland/gluon/tree/batadv/gw_refcnt_fix
https://github.com/FreifunkVogtland/gluon/tree/batadv/gw_refcnt_fix-2017.1.x
I couldn't reproduce my issue anymore with these branches, while another device with unpatched v2017.1.x running the test at the same time still showed the issue.

@RalfJung please also test your case with at least one of those two branches

@neocturne
Member

Merged for both master and v2017.1.x, thanks to everyone who helped get to the root of this and fix it!

@rotanid
Member

rotanid commented Jun 3, 2018

And thanks to the UPS vendor used by Hetzner for the failing UPS on 2018-05-24, which led to exactly the length of downtime that triggered the bug for us ;)

@RalfJung
Contributor Author

RalfJung commented Jun 4, 2018

I tried the 2017.1.x branch and was unable to reproduce.

@smoe
Contributor

smoe commented Jun 20, 2018

I get it reproducibly on a regular Linux machine:

root@gw1:/home/moeller# systemctl stop fastd@$(systemd-escape ffoh-mesh-vpn)
Message from syslogd@gw1 at Jun 20 15:42:02 ...
kernel:[140542.561333] unregister_netdevice: waiting for ffoh-mesh-vpn to become free. Usage count = 1

This is Debian buster, using only packages from the distribution, on kernel

Linux version 4.16.0-2-amd64 (debian-kernel@lists.debian.org) (gcc version 7.3.0 (Debian 7.3.0-19)) #1 SMP Debian 4.16.12-1 (2018-05-27)

Cheers,
Steffen

@rotanid
Member

rotanid commented Jun 20, 2018

@smoe Kernel 4.16 doesn't have the most recent batman-adv fixes, and it might take a while until it gets them, if ever.
If you want to fix this, you would have to build the latest batman-adv maint branch yourself for Debian Buster; I have only built packages for Jessie so far.

@smoe
Contributor

smoe commented Jun 21, 2018

@rotanid, I was not completely sure how to interpret your reply.

Is there something that could be done to shorten the time from batman-adv development to an installable package for Debian unstable/testing? It should be possible for me to help with something along those lines, aiming at Debian experimental. I would like to consult you and the Debian kernel folks about the best workflow first, though.

Concerning the exact issue at hand, I should not attempt to fix it myself; I would not know where to start. I was only meaning to help with an easy way to get it reproduced.

@rotanid
Member

rotanid commented Jun 21, 2018

@smoe for official Debian repositories, you will have to wait until there is a new official kernel including a new batman-adv release. There's little we can do to accelerate this.

The alternative is to build a Debian package with the latest batman-adv kernel module yourself, which I did for Debian Jessie with kernel 4.9 from backports.
Doing this after every batman-adv bugfix is enough work already; I really don't want to also have to do it for more than one Debian version, and, especially for testing, after each Debian kernel update...
Ideally this would be automated and the packages put into a repository, but that hasn't been done yet.
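As a very rough sketch of the manual route (a plain out-of-tree module build against the running kernel, no proper Debian packaging; package names and install behaviour are assumptions):

$ sudo apt install build-essential linux-headers-$(uname -r)
$ git clone https://git.open-mesh.org/batman-adv.git
$ cd batman-adv && git checkout maint
$ make                   # builds batman-adv.ko against the running kernel's headers
$ sudo make install      # installs the module so it takes precedence over the in-tree one
$ sudo depmod -a         # then reload the batman_adv module or reboot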

@smoe
Contributor

smoe commented Jun 22, 2018

@rotanid Well, yes, if that could be automated then I offer to upload to Debian experimental (or sign and upload elsewhere, but since we have all other bits of the Freifunk gateway infrastructure in the distribution, I would like to stay close). I don't have any interaction with the Debian Kernel Team; they may have additional ideas, and I would want to consult them about whatever plan we/you/I come up with. For instance, if Debian got the batman-adv module back as a separate package, then there could also be a batman-adv package distributed via the experimental section. Feel free to contact me via moeller@d.o to discuss things further.

@mweinelt
Contributor

mweinelt commented Jun 22, 2018

@ecsv probably knows best, but since batadv became part of the upstream kernel, the dkms package for Debian was discontinued around 2011.

FWIW: we would appreciate having a recent version packaged through dkms.

@kaechele
Contributor

This is probably the fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=27d53323664c549b5bb2dfaaf6f7ad6e0376a64e
