Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash in nf_ct_del_from_dying_or_unconfirmed_list: "Kernel bug detected [...] #1468

Closed
T-X opened this issue Jul 10, 2018 · 10 comments
Closed
Labels
0. type: bug This is a bug

Comments

@T-X
Copy link
Contributor

T-X commented Jul 10, 2018

After some minutes a Ubiquiti Bullet M2, running Gluon master (basically v2018.1) with the multicast related patches in #1357 and a site config from Freifunk Hannover crashes for me in the conntrack code:

Modules:	ath9k@80de0000+1aa68	ath9k_common@80c88000+562e	xt_CT@80d2b000+a10	mac80211@80e00000+65c82	ath9k_hw@80d80000+57c8c	ath@80d38000+47d3	batman_adv@80d00000+2a013	cfg80211@80cc0000+394b8	compat@80c30000+2bed	ohci_platform@81b8c000+e50	ohci_hcd@81b70000+56df	ehci_platform@81bbe000+1150	ehci_hcd@81b68000+7f37	gpio_button_hotplug@81b60000+1870	usbcore@81ba0000+1e000	nls_base@81b62000+1380	usb_common@81b5d000+8ad
<7>[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
<7>[    0.000000] pcpu-alloc: [0] 0 
<6>[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 8128
<5>[    0.000000] Kernel command line:  board=UBNT-BM mtdparts=spi0.0:256k(u-boot)ro,64k(u-boot-env)ro,7552k(firmware),256k(cfg)ro,64k(EEPROM)ro console=ttyS0,115200 rootfstype=squashfs,jffs2 noinitrd
<6>[    0.000000] PID hash table entries: 128 (order: -3, 512 bytes)
<6>[    0.000000] Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
<6>[    0.000000] Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
<6>[    0.000000] Writing ErrCtl register=00000000
<6>[    0.000000] Readback ErrCtl register=00000000
<6>[    0.000000] Memory: 27220K/32768K available (3307K kernel code, 177K rwdata, 824K rodata, 292K init, 210K bss, 5548K reserved, 0K cma-reserved)
<6>[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
<6>[    0.000000] NR_IRQS:51
<6>[    0.000000] Clocks: CPU:390.000MHz, DDR:390.000MHz, AHB:195.000MHz, Ref:40.000MHz
<6>[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 9801335621 ns
<6>[    0.000013] sched_clock: 32 bits at 195MHz, resolution 5ns, wraps every 11012737021ns
<6>[    0.007873] Calibrating delay loop... 259.27 BogoMIPS (lpj=1296384)
<6>[    0.080536] pid_max: default: 32768 minimum: 301
<6>[    0.085317] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.091912] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.102900] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
<6>[    0.112745] futex hash table entries: 256 (order: -1, 3072 bytes)
<6>[    0.120683] NET: Registered protocol family 16
<6>[    0.127364] MIPS: machine is Ubiquiti Bullet M
<4>[    0.133921] registering PCI controller with io_map_base unset
<6>[    0.374320] PCI host bridge to bus 0000:00
<6>[    0.378426] pci_bus 0000:00: root bus resource [mem 0x10000000-0x13ffffff]
<6>[    0.385326] pci_bus 0000:00: root bus resource [io  0x0000]
<6>[    0.390874] pci_bus 0000:00: root bus resource [??? 0x00000000 flags 0x0]
<6>[    0.397636] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
<7>[    0.405572] pci 0000:00:00.0: [168c:002a] type 00 class 0x028000
<6>[    0.405616] pci 0000:00:00.0: fixup device configuration
<7>[    0.412165] pci 0000:00:00.0: reg 0x10: [mem 0x10000000-0x1000ffff 64bit]
<7>[    0.412302] pci 0000:00:00.0: supports D1
<7>[    0.412326] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
<7>[    0.412648] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00
<6>[    0.412698] pci 0000:00:00.0: BAR 0: assigned [mem 0x10000000-0x1000ffff 64bit]
<6>[    0.419991] pci 0000:00:00.0: using irq 40 for pin 1
<6>[    0.426253] clocksource: Switched to clocksource MIPS
<6>[    0.433176] NET: Registered protocol family 2
<6>[    0.438970] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.445925] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.452334] TCP: Hash tables configured (established 1024 bind 1024)
<6>[    0.458823] UDP hash table entries: 256 (order: 0, 4096 bytes)
<6>[    0.464657] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
<6>[    0.471360] NET: Registered protocol family 1
<7>[    0.475826] PCI: CLS 0 bytes, default 32
<4>[    0.482838] Crashlog allocated RAM at address 0x1f00000
<6>[    0.510414] squashfs: version 4.0 (2009/01/31) Phillip Lougher
<6>[    0.516297] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
<6>[    0.529975] io scheduler noop registered
<6>[    0.533901] io scheduler deadline registered (default)
<6>[    0.539619] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
<6>[    0.550171] console [ttyS0] disabled
<6>[    0.573818] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 12187500) is a 16550A
<6>[    0.582496] console [ttyS0] enabled
<6>[    0.589504] bootconsole [early0] disabled
<4>[    0.604861] m25p80 spi0.0: found s25fl064k, expected m25p80
<6>[    0.622341] m25p80 spi0.0: s25fl064k (8192 Kbytes)
<5>[    0.627215] 5 cmdlinepart partitions found on MTD device spi0.0
<5>[    0.633153] Creating 5 MTD partitions on "spi0.0":
<5>[    0.637978] 0x000000000000-0x000000040000 : "u-boot"
<5>[    0.645029] 0x000000040000-0x000000050000 : "u-boot-env"
<5>[    0.652688] 0x000000050000-0x0000007b0000 : "firmware"
<5>[    0.670334] 2 uimage-fw partitions found on MTD device firmware
<5>[    0.676360] 0x000000050000-0x0000001c0000 : "kernel"
<5>[    0.683277] 0x0000001c0000-0x0000007b0000 : "rootfs"
<5>[    0.690619] mtd: device 4 (rootfs) set to be root filesystem
<5>[    0.696411] 1 squashfs-split partitions found on MTD device rootfs
<5>[    0.702623] 0x0000003d0000-0x0000007b0000 : "rootfs_data"
<5>[    0.710378] 0x0000007b0000-0x0000007f0000 : "cfg"
<5>[    0.717493] 0x0000007f0000-0x000000800000 : "EEPROM"
<6>[    0.729897] tun: Universal TUN/TAP device driver, 1.6
<6>[    0.734974] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
<6>[    0.750673] libphy: ag71xx_mdio: probed
<6>[    1.337955] ag71xx ag71xx.0: connected to PHY at ag71xx-mdio.1:04 [uid=004dd041, driver=Generic PHY]
<6>[    1.348131] eth0: Atheros AG71xx at 0xb9000000, irq 4, mode:MII
<6>[    1.355502] nf_conntrack version 0.5.0 (425 buckets, 1700 max)
<6>[    1.362342] xt_time: kernel timezone is -0000
<6>[    1.367650] ip_tables: (C) 2000-2006 Netfilter Core Team
<6>[    1.374505] NET: Registered protocol family 10
<6>[    1.384483] ip6_tables: (C) 2000-2006 Netfilter Core Team
<6>[    1.390451] NET: Registered protocol family 17
<6>[    1.395018] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
<6>[    1.407722] Ebtables v2.0 registered
<6>[    1.412154] 8021q: 802.1Q VLAN Support v1.8
<6>[    1.424885] VFS: Mounted root (squashfs filesystem) readonly on device 31:4.
<6>[    1.434473] Freeing unused kernel memory: 292K
<14>[    2.703009] init: Console is alive
<14>[    2.706813] init: - watchdog -
<14>[    3.857085] kmodloader: loading kernel modules from /etc/modules-boot.d/*
<6>[    3.979332] usbcore: registered new interface driver usbfs
<6>[    3.984983] usbcore: registered new interface driver hub
<6>[    3.990553] usbcore: registered new device driver usb
<6>[    4.003427] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
<6>[    4.012157] ehci-platform: EHCI generic platform driver
<6>[    4.021500] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
<6>[    4.029623] ohci-platform: OHCI generic platform driver
<14>[    4.035728] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
<14>[    4.053573] init: - preinit -
<6>[    4.824994] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
<5>[    4.859163] random: procd: uninitialized urandom read (4 bytes read, 7 bits of entropy available)
<6>[    7.387881] eth0: link up (100Mbps/Full duplex)
<6>[    7.392526] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
<5>[    8.188197] jffs2: notice: (390) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
<14>[    8.206134] mount_root: switching to jffs2 overlay
<12>[    8.220360] urandom-seed: Seeding with /etc/urandom.seed
<6>[    8.401767] eth0: link down
<14>[    8.420282] procd: - early -
<14>[    8.423306] procd: - watchdog -
<14>[    9.114355] procd: - watchdog -
<14>[    9.117966] procd: - ubus -
<5>[    9.173033] random: ubusd: uninitialized urandom read (4 bytes read, 18 bits of entropy available)
<5>[    9.182723] random: ubusd: uninitialized urandom read (4 bytes read, 18 bits of entropy available)
<5>[    9.191876] random: ubusd: uninitialized urandom read (4 bytes read, 18 bits of entropy available)
<5>[    9.201703] random: ubusd: uninitialized urandom read (4 bytes read, 18 bits of entropy available)
<5>[    9.210821] random: ubusd: uninitialized urandom read (4 bytes read, 18 bits of entropy available)
<5>[    9.220129] random: ubusd: uninitialized urandom read (4 bytes read, 18 bits of entropy available)
<5>[    9.229418] random: ubusd: uninitialized urandom read (4 bytes read, 18 bits of entropy available)
<14>[    9.238981] procd: - init -
<14>[    9.692717] kmodloader: loading kernel modules from /etc/modules.d/*
<6>[    9.704232] Loading modules backported from Linux version wt-2017-01-31-0-ge882dff19e7f
<6>[    9.712347] Backport generated by backports.git backports-20160324-13-g24da7d3c
<6>[    9.762287] batman_adv: B.A.T.M.A.N. advanced 2018.1 (compatibility version 15) loaded
<4>[    9.857736] PCI: Enabling device 0000:00:00.0 (0000 -> 0002)
<6>[    9.868819] ath: phy0: Ignoring endianness difference in EEPROM magic bytes.
<7>[    9.877466] ath: EEPROM regdomain: 0x0
<7>[    9.877482] ath: EEPROM indicates default country code should be used
<7>[    9.877496] ath: doing EEPROM country->regdmn map search
<7>[    9.877523] ath: country maps to regdmn code: 0x3a
<7>[    9.877538] ath: Country alpha2 being used: US
<7>[    9.877552] ath: Regpair used: 0x3a
<7>[    9.890672] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
<6>[    9.899336] ieee80211 phy0: Atheros AR9280 Rev:2 mem=0xb0000000, irq=40
<14>[   10.052256] kmodloader: done loading kernel modules from /etc/modules.d/*
<5>[   12.032421] random: jshn: uninitialized urandom read (4 bytes read, 23 bits of entropy available)
<5>[   12.094216] random: ubusd: uninitialized urandom read (4 bytes read, 23 bits of entropy available)
<6>[   24.398800] device eth0 entered promiscuous mode
<6>[   24.420567] IPv6: ADDRCONF(NETDEV_UP): br-wan: link is not ready
<6>[   24.493526] IPv6: ADDRCONF(NETDEV_UP): local-node: link is not ready
<6>[   24.655202] IPv6: ADDRCONF(NETDEV_UP): br-client: link is not ready
<6>[   24.750938] IPv6: ADDRCONF(NETDEV_CHANGE): local-node: link becomes ready
<6>[   25.032745] device local-port entered promiscuous mode
<6>[   25.058569] br-client: port 1(local-port) entered forwarding state
<6>[   25.064861] br-client: port 1(local-port) entered forwarding state
<6>[   25.071335] IPv6: ADDRCONF(NETDEV_CHANGE): br-client: link becomes ready
<6>[   26.561432] eth0: link up (100Mbps/Full duplex)
<6>[   26.566049] br-wan: port 1(eth0) entered forwarding state
<6>[   26.571599] br-wan: port 1(eth0) entered forwarding state
<6>[   26.736549] IPv6: ADDRCONF(NETDEV_CHANGE): br-wan: link becomes ready
<6>[   27.056342] br-client: port 1(local-port) entered forwarding state
<7>[   27.497043] ath: EEPROM regdomain: 0x8114
<7>[   27.497072] ath: EEPROM indicates we should expect a country code
<7>[   27.497089] ath: doing EEPROM country->regdmn map search
<7>[   27.497105] ath: country maps to regdmn code: 0x37
<7>[   27.497122] ath: Country alpha2 being used: DE
<7>[   27.497136] ath: Regpair used: 0x37
<7>[   27.497155] ath: regdomain 0x8114 dynamically updated by user
<6>[   28.566327] br-wan: port 1(eth0) entered forwarding state
<6>[   30.158715] batman_adv: bat0: Adding interface: primary0
<6>[   30.164091] batman_adv: bat0: Interface activated: primary0
<6>[   30.172943] 8021q: adding VLAN 0 to HW filter on device bat0
<6>[   30.221045] device bat0 entered promiscuous mode
<6>[   30.225801] br-client: port 2(bat0) entered forwarding state
<6>[   30.231617] br-client: port 2(bat0) entered forwarding state
<6>[   30.555346] batman_adv: bat0: Interface deactivated: primary0
<6>[   30.615065] IPv6: ADDRCONF(NETDEV_UP): primary0: link is not ready
<6>[   30.656375] batman_adv: bat0: No IGMP Querier present - multicast optimizations disabled
<6>[   30.664790] batman_adv: bat0: No MLD Querier present - multicast optimizations disabled
<6>[   30.954745] batman_adv: bat0: Interface activated: primary0
<6>[   32.226324] br-client: port 2(bat0) entered forwarding state
<6>[   32.430109] batman_adv: bat0: Changing gw mode from: off to: client
<6>[   32.475871] batman_adv: bat0: hop_penalty: Changing from: 30 to: 15
<6>[   32.518790] batman_adv: bat0: multicast_fanout: Changing from: 1 to: 16
<6>[   32.540954] batman_adv: bat0: New skb noflood mark: 0x00000004/0x00000004
<6>[   32.576910] batman_adv: bat0: orig_interval: Changing from: 1000 to: 5000
<6>[   33.105957] IPv6: ADDRCONF(NETDEV_UP): client0: link is not ready
<6>[   33.213512] device client0 entered promiscuous mode
<6>[   33.440542] IPv6: ADDRCONF(NETDEV_CHANGE): client0: link becomes ready
<6>[   33.447370] br-client: port 3(client0) entered forwarding state
<6>[   33.453411] br-client: port 3(client0) entered forwarding state
<6>[   34.069779] IPv6: ADDRCONF(NETDEV_UP): mesh0: link is not ready
<6>[   34.330072] IPv6: ADDRCONF(NETDEV_CHANGE): mesh0: link becomes ready
<6>[   34.706393] batman_adv: bat0: IGMP Querier appeared
<6>[   34.711316] batman_adv: bat0: MLD Querier appeared
<6>[   35.446333] br-client: port 3(client0) entered forwarding state
<6>[   37.079878] batman_adv: bat0: Adding interface: mesh0
<6>[   37.085324] batman_adv: bat0: Interface activated: mesh0
<5>[   38.597176] random: nonblocking pool is initialized
<6>[   42.202538] batman_adv: bat0: Adding interface: mesh-vpn
<6>[   42.207995] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1394) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
<6>[   42.232309] batman_adv: bat0: Interface activated: mesh-vpn
<4>[   65.355477] Kernel bug detected[#1]:
<4>[   65.359096] CPU: 0 PID: 1600 Comm: dnsmasq Not tainted 4.4.135 #0
<4>[   65.365211] task: 8093ca20 ti: 81586000 task.ti: 81586000
<4>[   65.370621] $ 0   : 00000000 00000001 00000001 00000000
<4>[   65.375899] $ 4   : 80473af0 00000000 8178320e 00000104
<4>[   65.381178] $ 8   : 81760854 00000000 dfacbf28 deadbf28
<4>[   65.386458] $12   : 00000000 deadbef5 00003333 0000ff00
<4>[   65.391737] $16   : 80473af0 00000000 80fb5c00 80903420
<4>[   65.397016] $20   : 00000000 02080020 8172707c 02080020
<4>[   65.402296] $24   : 00000000 802b6060                  
<4>[   65.407575] $28   : 81586000 81587688 00000000 802b60cc
<4>[   65.412855] Hi    : 0000039c
<4>[   65.415743] Lo    : 0034f474
<4>[   65.418667] epc   : 802b57ec nf_ct_del_from_dying_or_unconfirmed_list+0x8/0x34
<4>[   65.425922] ra    : 802b60cc destroy_conntrack+0x6c/0xd8
<4>[   65.431250] Status: 1000d403	KERNEL EXL IE 
<4>[   65.435479] Cause : 10800034 (ExcCode 0d)
<4>[   65.439498] PrId  : 00019374 (MIPS 24Kc)
<4>[   65.443434] Modules linked in: ath9k ath9k_common xt_CT mac80211 ath9k_hw ath batman_adv cfg80211 compat ohci_platform ohci_hcd ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common
<4>[   65.461274] Process dnsmasq (pid: 1600, threadinfo=81586000, task=8093ca20, tls=772ffd48)
<4>[   65.469465] Stack : 80ac1438 80d21c58 00000042 80fb5c00 80fb5c00 802705f8 80903420 80d173dc
<4>[   65.469465] 	  80fb5c00 80272c68 80fb5c00 802735c8 0000001c 80fab700 80fb5c00 80903420
<4>[   65.469465] 	  0000001b 80d11898 8158772c 80fb5c00 80903420 80903420 80fab700 00000000
<4>[   65.469465] 	  80fb5c00 80fb5c00 00000000 80903000 80903420 00000000 00000056 00000000
<4>[   65.469465] 	  00000001 80d187e0 80f86900 815877fc 80f71000 00000000 00000000 817d3874
<4>[   65.469465] 	  ...
<4>[   65.505378] Call Trace:
<4>[   65.507844] [<802b57ec>] nf_ct_del_from_dying_or_unconfirmed_list+0x8/0x34
<4>[   65.514752] [<802b60cc>] destroy_conntrack+0x6c/0xd8
<4>[   65.519729] 
<4>[   65.521224] 
<4>[   65.521224] Code: 00000000  8c83000c  2c620001 <00020336> 8c820008  30450001  ac620000  14a00002  00000000 
<4>[   65.531330] ---[ end trace 23314c37a52ed39c ]---

===================================
Time: 1528487122.971121

Firmware binaries to reproduce can be found here: https://metameute.de/~tux/Freifunk/firmware/mcast-to-ucast3/

@T-X T-X added the 0. type: bug This is a bug label Jul 10, 2018
@neocturne
Copy link
Member

  • How reproducible is this?
  • Does the issue occur on other hardware?
  • What about an unpatched Gluon?

@neocturne
Copy link
Member

Please also check current v2018.1.x (with update to 4.4.139) and Gluon master.

@mweinelt
Copy link
Contributor

mweinelt commented Jan 26, 2019

I can currently repeatedly reproduce this when I try to upgrade my 1043NDv2. We are running de19cd5 with these patchsets: #1329, #1357, #1616.

root@64289-rocket-science:/tmp# sysupgrade gluon-ffda-1.3~20190126-tp-link-tl-wr1043n-nd-v2-sysupgrade.bin
Image metadata not found
cat: write error: Broken pipe
cat: write error: Broken pipe
cat: write error: Broken pipe
Saving config files...
Commencing upgrade. Closing all shell sessions.
Watchdog handover: fd=3
- watchdog -
killall: telnetd: no process killed
Sending TERM to remaining processes ... uradvd micrond odhcp6c sse-multiplexd udhcpc odhcp6c gluon-radv-filt uhttpd dnsmasq hostapd [  137.808744] device client0 left promiscuous mode
[  137.813703] br-client: port 4(client0) entered disabled state
ntpd respondd ubusd logd [  138.174560] br-client: port 3(bat0) entered disabled state
[  138.180309] br-client: port 2(local-port) entered disabled state
[  138.186489] br-client: port 1(eth1) entered disabled state
haveged haveged: haveged: Stopping due to signal 15
dnsmasq gluon-arp-limit [  138.260576] device bat0 left promiscuous mode
[  138.265139] br-client: port 3(bat0) entered disabled state
netifd 
[  138.323728] device eth1 left promiscuous mode
[  138.328287] br-client: port 1(eth1) entered disabled state
[  138.385064] eth1: link down
[  138.389593] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[  138.443053] device local-port left promiscuous mode
[  138.448234] br-client: port 2(local-port) entered disabled state
[  138.503372] IPv6: ADDRCONF(NETDEV_UP): local-port: link is not ready
[  138.557032] br-client: port 1(bat0) entered blocking state
[  138.562635] br-client: port 1(bat0) entered disabled state
[  138.568591] device bat0 entered promiscuous mode
[  138.638953] br-client: port 2(client0) entered blocking state
[  138.644909] br-client: port 2(client0) entered disabled state
[  138.651057] device client0 entered promiscuous mode
[  138.676212] eth1: link up (1000Mbps/Full duplex)
[  138.689530] device client0 left promiscuous mode
[  138.694353] br-client: port 2(client0) entered disabled state
[  138.701730] br-client: port 2(eth1) entered blocking state
[  138.708918] br-client: port 2(eth1) entered disabled state
[  138.715520] device eth1 entered promiscuous mode
[  138.743372] batman_adv: bat0: Interface deactivated: mesh0
[  138.768330] br-client: port 3(local-port) entered blocking state
[  138.774471] br-client: port 3(local-port) entered disabled state
[  138.786397] device local-port entered promiscuous mode
[  138.792113] batman_adv: bat0: Removing interface: mesh0
[  138.885538] br-client: port 3(local-port) entered blocking state
[  138.891643] br-client: port 3(local-port) entered forwarding state
[  138.897998] br-client: port 2(eth1) entered blocking state
[  138.903576] br-client: port 2(eth1) entered forwarding state
[  138.909336] br-client: port 1(bat0) entered blocking state
[  138.914915] br-client: port 1(bat0) entered forwarding state
[  138.920761] IPv6: ADDRCONF(NETDEV_UP): br-client: link is not ready
[  139.032716] batman_adv: bat0: No IGMP Querier present - multicast optimizations disabled
[  139.040931] batman_adv: bat0: No MLD Querier present - multicast optimizations disabled
[  139.942728] IPv6: ADDRCONF(NETDEV_CHANGE): br-client: link becomes ready
[  140.172888] Kernel bug detected[#1]:
[  140.176533] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.9.146 #0
[  140.182816] Workqueue: ipv6_addrconf addrconf_dad_work
[  140.188028] task: 8382cfa8 task.stack: 83844000
[  140.192618] $ 0   : 00000000 00000001 00000001 00000000
[  140.197952] $ 4   : 804e1948 00000000 0003d94f 83b5c240
[  140.203285] $ 8   : 0003d94e 83b5c240 8322a400 00ff0000
[  140.208617] $12   : 00000000 00000000 00000000 00000000
[  138.689530] device client0 left promiscuous mode
[  138.694353] br-client: port 2(client0) entered disabled state
[  138.701730] br-client: port 2(eth1) entered blocking state
[  138.708918] br-client: port 2(eth1) entered disabled state
[  138.715520] device eth1 entered promiscuous mode
[  138.743372] batman_adv: bat0: Interface deactivated: mesh0
[  138.768330] br-client: port 3(local-port) entered blocking state
[  138.774471] br-client: port 3(local-port) entered disabled state
[  138.786397] device local-port entered promiscuous mode
[  138.792113] batman_adv: bat0: Removing interface: mesh0
[  138.885538] br-client: port 3(local-port) entered blocking state
[  138.891643] br-client: port 3(local-port) entered forwarding state
[  138.897998] br-client: port 2(eth1) entered blocking state
[  138.903576] br-client: port 2(eth1) entered forwarding state
[  138.909336] br-client: port 1(bat0) entered blocking state
[  138.914915] br-client: port 1(bat0) entered forwarding state
[  138.920761] IPv6: ADDRCONF(NETDEV_UP): br-client: link is not ready
[  139.032716] batman_adv: bat0: No IGMP Querier present - multicast optimizations disabled
[  139.040931] batman_adv: bat0: No MLD Querier present - multicast optimizations disabled
[  139.942728] IPv6: ADDRCONF(NETDEV_CHANGE): br-client: link becomes ready
[  140.172888] Kernel bug detected[#1]:
[  140.176533] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.9.146 #0
[  140.182816] Workqueue: ipv6_addrconf addrconf_dad_work
[  140.188028] task: 8382cfa8 task.stack: 83844000
[  140.192618] $ 0   : 00000000 00000001 00000001 00000000
[  140.197952] $ 4   : 804e1948 00000000 0003d94f 83b5c240
[  140.203285] $ 8   : 0003d94e 83b5c240 8322a400 00ff0000
[  140.208617] $12   : 00000000 00000000 00000000 00000000
[  140.213950] $16   : 804e1948 82d6c600 83b5c3c0 82d06c60
[  140.219283] $20   : 00000000 02080020 83b5c240 00000000
[  140.224616] $24   : 00000000 802f3f68                  
[  140.229949] $28   : 83844000 83845948 00000000 802f3fcc
[  140.235282] Hi    : 00000199
[  140.238195] Lo    : 0017d5a6
[  140.241129] epc   : 802f3e20 nf_ct_del_from_dying_or_unconfirmed_list+0x8/0x34
[  140.248452] ra    : 802f3fcc destroy_conntrack+0x64/0xbc
[  140.253833] Status: 1100fc03 KERNEL EXL IE 
[  140.258101] Cause : 00800034 (ExcCode 0d)
[  140.262167] PrId  : 00019750 (MIPS 74Kc)
[  140.266143] Modules linked in: ath9k ath9k_common batman_adv ath9k_hw ath mac80211 cfg80211 xt_CT compat ledtrig_usbport ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common
[  140.283555] Process kworker/0:0 (pid: 4, threadinfo=83844000, task=8382cfa8, tls=00000000)
[  140.291926] Stack : 83b5c240 00000000 83b5c240 830d74ec 83b5c240 802a8a08 8322a478 830e22ac
[  140.300462]         00000082 83b5c240 83b5c240 802a8c28 82d06c60 830d76e4 83b5c3c0 802abb2c
[  140.308998]         83b5c240 802a8c64 00000000 0000001e 82d6c600 83b5c3c0 0000001e 830d1a50
[  140.317534]         00000000 83b5c3c0 83845a04 82d06800 82d6c600 00000000 83b5c3c0 83b5c3c0
[  140.326070]         00000000 82d06800 82d06c60 0000004e 00000000 00000001 00000000 830d8e2c
[  140.334607]         ...
[  140.337097] Call Trace:
[  140.339576] [<802f3e20>] nf_ct_del_from_dying_or_unconfirmed_list+0x8/0x34
[  140.346548] [<802f3fcc>] destroy_conntrack+0x64/0xbc
[  140.351576] Code: 2610fd4c  8c830010  2c620001 <00020336> 8c82000c  30450001  ac620000  14a00002  00000000 
[  140.361526] 
[  140.363067] ---[ end trace 4f1b72475bbc8cfd ]---
[  140.369716] Kernel panic - not syncing: Fatal exception in interrupt
[  140.377275] Rebooting in 3 seconds..

@rotanid
Copy link
Member

rotanid commented Jan 26, 2019

@T-X is the author of the multicast patch, maybe he has an idea

@T-X
Copy link
Contributor Author

T-X commented Jan 26, 2019

Hm, this crash is triggered through a failing assertion in netfliter-conntrack here:

https://elixir.bootlin.com/linux/v4.9.146/source/net/netfilter/nf_conntrack_core.c#L354:

BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));

Going through the "git log net/netfilter/nf_conntrack_core.c" 368982cd ("netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks") got my attention. It says:

In nfqueue, two consecutive skbuffs may race to create the conntrack entry. Hence, the one that loses the race gets dropped due to clash in the insertion into the hashes from the nf_conntrack_confirm() path.

I'm wondering whether that might have something to do with it. The batman-adv multicast-to-multi-unicast patch works quite similar to how broadcast packets are duplicated and transmitted on an interface three times. One notable difference to the broadcast flooding is though, that each transmission is delayed a bit for broadcast packets.

Maybe from a conntrack perspective these duplicate packets get hashed to the same value and since we are queueing them quickly one after the other maybe they might run into this bug / race condition in conntrack?

The mentioned patch was added with Linux 4.18 and was not backported to stable kernels.

@T-X
Copy link
Contributor Author

T-X commented Jan 26, 2019

On the other hand, I'm wondering why conntrack would want to look at batman-adv frames at all. I'm wondering whether we are missing to reset the protocol type somewhere in batman-adv or something like this. At least an "skb->protocol = htons(ETH_P_BATMAN);" is there in batadv_send_skb_packet().

Or maybe this bug might not directly have something to do with batman-adv.

@mweinelt
Copy link
Contributor

The sysupgrade worked just fine after I disabled the batman-adv multicast mode (batctl mm 0).

@T-X
Copy link
Contributor Author

T-X commented Jan 29, 2019

Killing netifd with a node connected to a mesh of reasonable size also triggered this issue reliably.

Problem is found, was a bug in the batman-adv multicast-to-multi-unicast patch and is fixed in #1357 with v4.

@neocturne
Copy link
Member

Is there still a need to keep this issue open, or can it be tracked as part of #1357?

@T-X
Copy link
Contributor Author

T-X commented Feb 6, 2019

It's fixed and noted in the changelog v5 of #1357. So yes, can be closed here.

@T-X T-X closed this as completed Feb 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0. type: bug This is a bug
Projects
None yet
Development

No branches or pull requests

4 participants