High load on some devices after v2017.1.x update #1243

Open
mweinelt opened this Issue Oct 22, 2017 · 103 comments

@mweinelt
Contributor

mweinelt commented Oct 22, 2017

The devices were fine before, but with 2017.1.x high loads appeared. The load seems to originate somewhere in kernel space.

Only some models are affected, but then all devices of that model experience this issue:

  • TP-Link WR842ND v2
  • TP-Link WR1043ND v1
  • Ubiquiti Nanostation Loco M2

Probably more models are affected, but since our Grafana is currently down it's cumbersome to check.

@mweinelt mweinelt changed the title from High load on some devices after 2017.1.x update to High load on some devices after v2017.1.x update Oct 22, 2017

@NeoRaider

Member

NeoRaider commented Oct 22, 2017

I've seen the issue on a WR841ND (I think it was v9 or v10). Possibly all models with 32 MB RAM are affected?

@mweinelt

Contributor

mweinelt commented Oct 22, 2017

Unaffected:

  • TP-Link WR841ND v8
  • TP-Link WR841ND v10

Partially affected:

  • TP-Link WR841ND v9
@rotanid

Member

rotanid commented Oct 22, 2017

we also noticed that devices with more mesh neighbours are more likely to be affected - what a surprise!

@rotanid rotanid added the bug label Oct 22, 2017

@Tarnatos

Contributor

Tarnatos commented Oct 23, 2017

I can confirm this for Freifunk Nord.

@mweinelt

Contributor

mweinelt commented Oct 23, 2017

Device loads by model on the FFDA network where load > 2.0

{
  "TP-Link TL-MR3420 v1": [
    8.63
  ],
  "TP-Link TL-WR841N/ND v11": [
    7.85,
    12.47,
    3.44,
    3.9
  ],
  "TP-Link TL-WR940N v4": [
    4.48
  ],
  "Ubiquiti AirRouter": [
    3.09
  ],
  "TP-Link TL-WR710N v1": [
    6.2
  ],
  "TP-Link TL-WR842N/ND v2": [
    2.28,
    12.43,
    8.98,
    15.45,
    8.98,
    3.47,
    7.22,
    10.07,
    12.14,
    9.2,
    7.25,
    8.48,
    2.91,
    6.22,
    8.91,
    3.14,
    2.13,
    9.9,
    10.92
  ],
  "Ubiquiti PicoStation M2": [
    9.97
  ],
  "Linksys WRT160NL": [
    4.69
  ],
  "TP-Link TL-WR710N v2.1": [
    4.19,
    11.26
  ],
  "TP-Link TL-WR1043N/ND v1": [
    6.98,
    3.57,
    6.25,
    2.23,
    5.62
  ],
  "TP-Link TL-WR841N/ND v9": [
    4.98,
    4.55,
    4.08,
    2.67,
    4.89,
    2.57,
    9.66,
    7.81
  ],
  "TP-Link TL-WA850RE v1": [
    7.71
  ],
  "TP-Link TL-WR841N/ND v10": [
    2.87,
    5.32,
    5.59
  ],
  "TP-Link TL-WA901N/ND v3": [
    7.96
  ],
  "Ubiquiti NanoStation loco M2": [
    4.94
  ],
  "TP-Link TL-WR842N/ND v1": [
    2.49
  ]
}
@blocktrron

Contributor

blocktrron commented Oct 23, 2017

Affected nodes on the FFDA Network grouped by SoC:

AR9341

TP-Link TL-WR842N/ND v2 19
TP-Link TL-WA801N/ND v2 1
TP-Link TL-WA850RE v1 1

QCA9533

TP-Link TL-WR841N/ND v9 7
TP-Link TL-WR841N/ND v10 5
TP-Link TL-WR841N/ND v11 2

AR9132

TP-Link TL-WR1043N/ND v1 4

AR9331

TP-Link TL-WR710N v2.1 2
TP-Link TL-WR710N v1 1

AR7240

Ubiquiti NanoStation loco M2 1
Ubiquiti PicoStation M2 1

AR7241

TP-Link TL-MR3420 v1 1
Ubiquiti AirRouter 1

AR9130

Linksys WRT160NL 1

@A-Kasper

Contributor

A-Kasper commented Oct 25, 2017

Hi!

I don't have any nodes with a load higher than 0.8 in our network.
We had these problems when we first used batman 2017.x, but then a bug with multicast optimization in batman-adv was found. After disabling it in the firmware AND on all gateways the load went down.

Could you please provide the following information:

  1. How many nodes in your network are on old firmware? (Maybe someone can add the first Gluon version without MO, to name the affected versions.)
  2. Which batman-adv version is running on your gateways?
  3. Is multicast optimisation activated at gateway level?

I think this could be related. Maybe you could try to update your batman-adv gateways and disable multicast optimizations with "batctl mm 0" (don't forget your mapserver ;))

If the behaviour is related, you could try to eliminate the full-table request orgy by disabling all VPN tunnels for a minute to make the mismarked packets disappear.
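For reference, a minimal sketch of the suggested check-and-disable sequence on a gateway (assuming a single bat0 instance; batctl syntax may differ slightly between versions):

batctl mm        # show the current multicast_mode setting
batctl mm 0      # disable multicast optimization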

@mweinelt

Contributor

mweinelt commented Oct 25, 2017

  1. ~700 new (v2017.1.3), ~30 old (v2017.1.2 and earlier)
  2. batman-adv v2017.3
  3. yes

Multicast optimizations are still disabled on v2017.1.x, see c6a3afa.

@A-Kasper

Contributor

A-Kasper commented Oct 25, 2017

I think the MO bug is still not addressed. As already mentioned, our load is in the normal range (but we have a smaller network). If you are able to, please check whether the load goes down if you disable MO on all gateways via batctl mm 0. I would like to make sure it's not still this MO thing.

Our network became stable after ALL sources of mismarked packets were eliminated; the gateways had the biggest effect. I'm not sure, but if I remember correctly it was you who had a look at our network dump. If so: can you see these full-table requests?

@mweinelt

Contributor

mweinelt commented Oct 25, 2017

@A-Kasper

Contributor

A-Kasper commented Oct 25, 2017

@mweinelt

Contributor

mweinelt commented Oct 27, 2017

Disabled mm on our gateways.

Looking at this node for example at this time of day (3:00)
https://meshviewer.darmstadt.freifunk.net/#/de/map/30b5c2c2ead4

  • Model 841 v9
  • Load average 4.83
  • RAM 74.7% used
    "memory": {
     "total": 27808,
     "free": 5236,
     "buffers": 992,
     "cached": 1696
    },
  • Clients 0
  • Traffic negligible

5.2 MB of "free" RAM does not seem to be enough if we suspect the issue arises due to high memory fragmentation.
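If fragmentation is the suspicion, two quick checks on the node itself (both are standard kernel interfaces, nothing Gluon-specific):

cat /proc/buddyinfo            # free blocks per order (4 KiB .. 4 MiB); empty high orders indicate fragmentation
echo m > /proc/sysrq-trigger   # dump the same per-zone memory info to the kernel log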

@A-Kasper

Contributor

A-Kasper commented Oct 27, 2017

@rotanid

Member

rotanid commented Oct 27, 2017

@A-Kasper

  1. Please try not to comment if you have nothing new to contribute; it makes the whole issue harder and harder to read. You could have waited with your comment until you had a look at your memory usage.
  2. How many nodes does your network have? What about large groups of wireless-meshing nodes? Maybe your network is simply too small to have this issue.
@CodeFetch

Contributor

CodeFetch commented Oct 31, 2017

@mweinelt Do you have any custom ash-scripts running on your nodes?
I had an ash script running and my node got slower and slower until logging in via SSH took about a minute. The same happened when a monitoring script logged into the router via SSH and just executed some commands. That's why I think it could be a problem with ash/BusyBox. With only Lua scripts running on the router, executed by micrond, everything works as expected.

@mweinelt

Contributor

mweinelt commented Oct 31, 2017

@CodeFetch No, we're not running anything custom.

@rotanid

Member

rotanid commented Nov 6, 2017

perhaps this issue is actually the same as #753 - only that it appears earlier than before

@mweinelt

Contributor

mweinelt commented Nov 6, 2017

Looking at the issue by SoC alone does not seem to yield a clear result.

[
  {
    "family": "AR9531",
    "loadavg": 0.17,
    "devices": {
      "TP-Link TL-WR842N/ND v3": 0.17
    }
  },
  {
    "family": "QCA9558",
    "loadavg": 0.18,
    "devices": {
      "TP-Link TL-WR1043N/ND v3": 0.16,
      "TP-Link Archer C7 v2": 0.19,
      "TP-Link TL-WR1043N/ND v2": 0.2,
      "TP-Link Archer C5 v1": 0.21
    }
  },
  {
    "family": "QCA9563",
    "loadavg": 0.19,
    "devices": {
      "TP-Link TL-WR1043N/ND v4": 0.19
    }
  },
  {
    "family": "AR7240",
    "loadavg": 0.67,
    "devices": {
      "Ubiquiti NanoStation loco M2": 0.59,
      "Ubiquiti NanoStation M2": 1.61,
      "TP-Link TL-WA801N/ND v1": 1.25,
      "TP-Link TL-WR740N/ND v4": 1.18,
      "TP-Link TL-WR740N/ND v1": 0.3,
      "TP-Link TL-WA901N/ND v1": 0.04,
      "TP-Link TL-WA830RE v1": 0.17,
      "TP-Link TL-WR741N/ND v1": 0.41,
      "TP-Link TL-WR841N/ND v5": 0.06
    }
  },
  {
    "family": "QCA9533",
    "loadavg": 1.05,
    "devices": {
      "TP-Link TL-WR841N/ND v11": 0.79,
      "TP-Link TL-WR841N/ND v10": 1.18,
      "TP-Link TL-WR841N/ND v9": 1.1
    }
  },
  {
    "family": "AR2316A",
    "loadavg": 1.49,
    "devices": {
      "Ubiquiti PicoStation M2": 1.49
    }
  },
  {
    "family": "AR9344",
    "loadavg": 0.21,
    "devices": {
      "TP-Link CPE210 v1.1": 0.21,
      "TP-Link TL-WDR4300 v1": 0.2,
      "TP-Link CPE210 v1.0": 0.24,
      "TP-Link TL-WDR3600 v1": 0.17
    }
  },
  {
    "family": "AR9341",
    "loadavg": 2.5,
    "devices": {
      "TP-Link TL-WR842N/ND v2": 3.5,
      "TP-Link TL-WA850RE v1": 1.42,
      "TP-Link TL-WR841N/ND v8": 1.25,
      "TP-Link TL-WA801N/ND v2": 0.18,
      "TP-Link TL-WR941N/ND v5": 0.07,
      "TP-Link TL-WA901N/ND v3": 2.26,
      "TP-Link TL-WA860RE v1": 0.15
    }
  },
  {
    "family": "AR9331",
    "loadavg": 2.76,
    "devices": {
      "TP-Link TL-WR710N v2.1": 6.76,
      "TP-Link TL-MR3020 v1": 1.15,
      "TP-Link TL-WR710N v1": 3.62,
      "TP-Link TL-WR741N/ND v4": 0.21,
      "TP-Link TL-WR710N v2": 0.12,
      "TP-Link TL-WA701N/ND v2": 0.61
    }
  },
  {
    "family": "AR9132",
    "loadavg": 1.95,
    "devices": {
      "TP-Link TL-WR1043N/ND v1": 2.51,
      "TP-Link TL-WR941N/ND v2": 0.18,
      "TP-Link TL-WA901N/ND v2": 0.11
    }
  },
  {
    "family": "AR7241",
    "loadavg": 2.56,
    "devices": {
      "TP-Link TL-MR3420 v1": 2.78,
      "TP-Link TL-WR842N/ND v1": 2.12,
      "Ubiquiti AirRouter": 2.77
    }
  },
  {
    "family": "TP9343",
    "loadavg": 1.06,
    "devices": {
      "TP-Link TL-WR941N/ND v6": 0.44,
      "TP-Link TL-WR940N v4": 2.07,
      "TP-Link TL-WA901N/ND v4": 0.07
    }
  },
  {
    "family": "MT7621AT",
    "loadavg": 0.08,
    "devices": {
      "D-Link DIR-860L B1": 0.08
    }
  },
  {
    "family": "AR7161",
    "loadavg": 0.26,
    "devices": {
      "Buffalo WZR-HP-AG300H/WZR-600DHP": 0.26
    }
  },
  {
    "family": "AR1311",
    "loadavg": 0.21,
    "devices": {
      "D-Link DIR-505 rev. A2": 0.21
    }
  },
  {
    "family": "AR9350",
    "loadavg": 0.15,
    "devices": {
      "TP-Link CPE510 v1.0": 0.15,
      "TP-Link CPE510 v1.1": 0.14
    }
  },
  {
    "family": "AR9130",
    "loadavg": 7.59,
    "devices": {
      "Linksys WRT160NL": 7.59
    }
  },
  {
    "family": "AR9342",
    "loadavg": 0.02,
    "devices": {
      "Ubiquiti Loco M XW": 0.02
    }
  }
]
@edeso

Contributor

edeso commented Nov 10, 2017

hey All,

On an Ubiquiti NanoStation loco M2 the issue seems to go away when the mesh WLAN is deactivated. At least it has looked that way for the past two days.

This is KBU Freifunk, where the wireless mesh config looks like this:

config wifi-iface 'ibss_radio0'
        option ifname 'ibss0'
        option network 'ibss_radio0'
        option device 'radio0'
        option bssid '02:d2:22:01:fc:22'
        option disabled '1'
        option mcast_rate '12000'
        option mode 'adhoc'
        option macaddr '42:84:e7:d9:c1:32'
        option ssid '02:d2:22:01:fc:22'

..ede
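For anyone who wants to try the same experiment, the mesh interface from the config above can presumably be disabled at runtime like this (section name taken from the posted config; Gluon's config generation may write the setting back differently on upgrade):

uci set wireless.ibss_radio0.disabled='1'
uci commit wireless
wifi    # reload the wireless configuration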

@mweinelt

Contributor

mweinelt commented Nov 10, 2017

Does this NSM2 Loco have a VPN connection or is it otherwise connected to the Mesh after disabling the WiFi-Mesh?

How big is the batadv L2 domain? Originators? Transtable (global) size?
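In case it helps, rough numbers for those questions can be pulled from any node in the segment (the line counts include a few header lines, so treat them as approximate):

batctl o  | wc -l     # originators in the batadv domain
batctl tg | wc -l     # global translation table entries
batctl tl | wc -l     # local translation table entries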

@T-X

Contributor

T-X commented Nov 13, 2017

At least the three devices in the original post are all 32MB RAM and 8MB flash devices. Is this ticket a duplicate of #1197 maybe? Or could we somehow separate these two tickets more clearly?

Just a crazy idea... As decently fast microSD cards seem to have gotten quite cheap: I'd be curious whether attaching some flash storage to the USB port of a router and configuring one partition for swap and one for /tmp/ would make a difference. For instance this plus this would cost less than 10€. Maybe there's even a decently fast, usable USB flash stick for less than 5€. Not suggesting this as a fix, but curious whether that'd change anything.
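A rough sketch of that experiment, assuming kmod-usb-storage and the BusyBox swap applets are available and the stick shows up as /dev/sda with a partition prepared for swap:

mkswap /dev/sda1
swapon /dev/sda1
free          # the swap line should now show the added space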

Also, if some people with devices constantly having high loads could recompile with CONFIG_KERNEL_SLABINFO=y and could dump /proc/slabinfo, that could be helpful (I asked for this in #1197, too).
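Once slabinfo is available, something like this should show where the unreclaimable slab memory goes (a sketch using BusyBox awk; columns 3 and 4 of /proc/slabinfo are num_objs and objsize):

awk 'NR>2 { printf "%8d KiB  %s\n", $3*$4/1024, $1 }' /proc/slabinfo | sort -rn | head -n 15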

@mweinelt

Contributor

mweinelt commented Nov 13, 2017

We tested an image for the 1043v1 without the additional USB modules, which are currently loaded unconditionally, and the device seemed to behave fine again.

@blocktrron can maybe tell us more about what he saw during tests.

@blocktrron

Contributor

blocktrron commented Nov 13, 2017

At Freifunk Darmstadt, we were able to observe that 8MB/32MB devices rebooted frequently, most likely due to additional RAM usage of the integrated USB support.

We were also able to recreate the problems (high load/crashing) on an OpenWrt-based Gluon by writing to tmpfs. Crashing/high load only occurred when RAM was filled before the batman global translation table was initialized, i.e. in the case where the router was already connected to the mesh.

When the router was booted without visible neighbours, then RAM was filled and it was connected to the network afterwards, the node was not affected by crashing or high load.
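For anyone who wants to repeat that reproduction, filling tmpfs in small steps while watching free memory is enough (a sketch; adjust the number of steps to the device and watch the free column, since the nodes panic on OOM):

i=0
while [ $i -lt 20 ]; do
    dd if=/dev/zero of=/tmp/fill.$i bs=1M count=1 2>/dev/null
    free
    i=$((i+1))
done
# remove /tmp/fill.* afterwards to release the memory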

@mweinelt

Contributor

mweinelt commented Nov 14, 2017

FYI: We're building images from master with slabinfo enabled and without the additional USB modules tonight. If we can trigger the issue with those, we'll post the slabinfo; otherwise we'll retry with the USB modules installed.

@mweinelt

Contributor

mweinelt commented Nov 14, 2017

From our 1043v1 rebooting in circles:

OOM Reboot

[   90.450092] hotplug-call invoked oom-killer: gfp_mask=0x2420848, order=0, oom_score_adj=0
[   90.458394] CPU: 0 PID: 2327 Comm: hotplug-call Not tainted 4.4.93 #0
[   90.464869] Stack : 803e96e4 00000000 00000001 80440000 807d5764 80434e63 803ca228 00000917
	  804a378c 00001b20 00000040 00000000 00000000 800a787c 00000006 00000000
	  00000000 00000000 803cdd4c 8172199c 804a6542 800a57f8 02420848 00000000
	  00000001 801f9300 00000000 00000000 00000000 00000000 00000000 00000000
	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	  ...
[   90.500943] Call Trace:
[   90.503422] [<80071f1c>] show_stack+0x54/0x88
[   90.507826] [<800d498c>] dump_header.isra.4+0x48/0x130
[   90.513003] [<800d515c>] check_panic_on_oom+0x48/0x84
[   90.518102] [<800d5288>] out_of_memory+0xf0/0x324
[   90.522847] [<800d8da0>] __alloc_pages_nodemask+0x6b8/0x724
[   90.528488] [<800d1b44>] pagecache_get_page+0x154/0x278
[   90.533765] [<80136e94>] __getblk_slow+0x15c/0x374
[   90.538617] [<8015e518>] squashfs_read_data+0x1c8/0x6e8
[   90.543888] [<80162728>] squashfs_readpage_block+0x32c/0x4d8
[   90.549602] [<801603a4>] squashfs_readpage+0x5bc/0x6d0
[   90.554780] [<800dc53c>] __do_page_cache_readahead+0x1f8/0x264
[   90.560673] [<800d393c>] filemap_fault+0x1ac/0x458
[   90.565526] [<800eeb4c>] __do_fault+0x3c/0xa8
[   90.569925] [<800f1d84>] handle_mm_fault+0x478/0xb14
[   90.574934] [<80076be8>] __do_page_fault+0x134/0x470
[   90.579944] [<80060820>] ret_from_exception+0x0/0x10
[   90.584933] 
[   90.586446] Mem-Info:
[   90.588769] active_anon:820 inactive_anon:9 isolated_anon:0
[   90.588769]  active_file:136 inactive_file:154 isolated_file:0
[   90.588769]  unevictable:0 dirty:0 writeback:0 unstable:0
[   90.588769]  slab_reclaimable:211 slab_unreclaimable:3104
[   90.588769]  mapped:59 shmem:29 pagetables:104 bounce:0
[   90.588769]  free:293 free_pcp:0 free_cma:0
[   90.620556] Normal free:1172kB min:1024kB low:1280kB high:1536kB active_anon:3280kB inactive_anon:36kB active_file:544kB inactive_file:616kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:32768kB managed:27776kB mlocked:0kB dirty:0kB writeback:0kB mapped:236kB shmem:116kB slab_reclaimable:844kB slab_unreclaimable:12416kB kernel_stack:472kB pagetables:416kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:6972 all_unreclaimable? yes
[   90.664376] lowmem_reserve[]: 0 0
[   90.667738] Normal: 49*4kB (UME) 80*8kB (UME) 13*16kB (UME) 4*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1172kB
[   90.680287] 319 total pagecache pages
[   90.683970] 0 pages in swap cache
[   90.687314] Swap cache stats: add 0, delete 0, find 0/0
[   90.692565] Free swap  = 0kB
[   90.695472] Total swap = 0kB
[   90.698373] 8192 pages RAM
[   90.701093] 0 pages HighMem/MovableOnly
[   90.704947] 1248 pages reserved
[   90.708117] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   90.716721] [  515]     0   515      297       49       3       0        0             0 ubusd
[   90.725384] [  516]     0   516      296       40       4       0        0             0 ash
[   90.733889] [  807]     0   807      306       68       4       0        0             0 logd
[   90.742477] [  814]     0   814      429      189       4       0        0             0 haveged
[   90.751328] [ 1055]     0  1055      447       82       4       0        0             0 netifd
[   90.760092] [ 1102]     0  1102      264       40       3       0        0             0 dropbear
[   90.769029] [ 1119]     0  1119      225       42       3       0        0             0 uradvd
[   90.777794] [ 1330]     0  1330      296       39       4       0        0             0 udhcpc
[   90.786557] [ 1332]     0  1332      254       44       3       0        0             0 odhcp6c
[   90.795396] [ 1343]     0  1343      254       49       3       0        0             0 odhcp6c
[   90.804250] [ 1487]     0  1487      225       44       4       0        0             0 micrond
[   90.813101] [ 1521]     0  1521      224       39       3       0        0             0 sse-multiplexd
[   90.822562] [ 1685]     0  1685      320       50       3       0        0             0 uhttpd
[   90.831325] [ 1794]     0  1794      383       76       3       0        0             0 hostapd
[   90.840177] [ 1809]   453  1809      353      137       4       0        0             0 dnsmasq
[   90.849033] [ 1830]     0  1830      280       52       4       0        0             0 dnsmasq
[   90.857887] [ 2116]     0  2116      320       63       4       0        0             0 fastd
[   90.866563] [ 2213]     0  2213      517       71       3       0        0             0 respondd
[   90.875502] [ 2223]     0  2223      306       50       3       0        0             0 hotplug-call
[   90.884777] [ 2295]     0  2295      296       40       3       0        0             0 ntpd
[   90.893369] [ 2326]     0  2326      327       75       4       0        0             0 dhcpv6.script
[   90.902742] [ 2327]     0  2327      306       47       3       0        0             0 hotplug-call
[   90.912030] [ 2332]     0  2332      326       72       4       0        0             0 gluon-respondd
[   90.921492] [ 2342]     0  2342      326       71       4       0        0             0 gluon-respondd
[   90.930952] [ 2343]     0  2343      326       71       4       0        0             0 gluon-respondd
[   90.940413] [ 2345]     0  2345      293       62       3       0        0             0 jsonfilter
[   90.949525] [ 2346]     0  2346      212       42       3       0        0             0 ubus
[   90.958114] [ 2349]     0  2349      382       60       4       0        0             0 procd
[   90.966786] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[   90.966786] 
[   90.980773] Rebooting in 3 seconds..

Slabinfo

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
mesh_rmc                0     0     72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
nf-frags                0     0     184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
nf_conntrack_1          7     15    264   15  1  :  tunables  0  0  0  :  slabdata  1    1    0
nf_conntrack_expect     0     0     208   19  1  :  tunables  0  0  0  :  slabdata  0    0    0
fq_flow_cache           0     0     112   36  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_roam_cache    0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_req_cache     0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_change_cache  0     64    64    64  1  :  tunables  0  0  0  :  slabdata  1    1    0
batadv_tt_orig_cache    0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tg_cache         0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tl_cache         2     42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
sd_ext_cdb              2     51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-128              2     15    2112  15  8  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-64               2     15    1088  15  4  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-32               2     14    576   14  2  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-16               2     12    320   12  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-8                2     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
scsi_data_buffer        0     0     64    64  1  :  tunables  0  0  0  :  slabdata  0    0    0
bridge_fdb_cache        11    42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6-frags               0     0     184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
fib6_nodes              21    42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6_dst_cache           44    56    288   14  1  :  tunables  0  0  0  :  slabdata  4    4    0
ip6_mrt_cache           0     0     160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
PINGv6                  0     0     832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
RAWv6                   8     19    832   19  4  :  tunables  0  0  0  :  slabdata  1    1    0
UDPLITEv6               0     0     800   10  2  :  tunables  0  0  0  :  slabdata  0    0    0
UDPv6                   4     10    800   10  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCPv6           0     0     232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCPv6      0     0     280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCPv6                   1     10    1536  10  4  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_xattr_ref         0     0     72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_xattr_datum       0     0     104   39  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_inode_cache       165   168   72    56  1  :  tunables  0  0  0  :  slabdata  3    3    0
jffs2_node_frag         63    112   72    56  1  :  tunables  0  0  0  :  slabdata  2    2    0
jffs2_refblock          96    104   296   13  1  :  tunables  0  0  0  :  slabdata  8    8    0
jffs2_tmp_dnode         0     51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_inode         0     32    128   32  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_dirent        0     42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_full_dnode        131   192   64    64  1  :  tunables  0  0  0  :  slabdata  3    3    0
jffs2_i                 87    88    368   11  1  :  tunables  0  0  0  :  slabdata  8    8    0
squashfs_inode_cache    611   620   384   10  1  :  tunables  0  0  0  :  slabdata  62   62   0
fasync_cache            4     56    72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
posix_timers_cache      0     0     200   20  1  :  tunables  0  0  0  :  slabdata  0    0    0
UNIX                    15    26    608   13  2  :  tunables  0  0  0  :  slabdata  2    2    0
ip4-frags               0     0     168   24  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_mrt_cache            0     0     160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
UDP-Lite                0     0     704   11  2  :  tunables  0  0  0  :  slabdata  0    0    0
tcp_bind_bucket         1     42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
inet_peer_cache         2     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
secpath_cache           0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
flow_cache              0     0     152   26  1  :  tunables  0  0  0  :  slabdata  0    0    0
xfrm_dst_cache          0     0     320   12  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_fib_trie             13    51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_fib_alias            14    51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_dst_cache            1     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
PING                    0     0     672   12  2  :  tunables  0  0  0  :  slabdata  0    0    0
RAW                     2     12    672   12  2  :  tunables  0  0  0  :  slabdata  1    1    0
UDP                     1     11    704   11  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCP             0     0     232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCP        0     0     280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCP                     1     11    1408  11  4  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_pwq           28    51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_epi           28    64    128   32  1  :  tunables  0  0  0  :  slabdata  2    2    0
inotify_inode_mark      0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
blkdev_queue            6     8     976   8   2  :  tunables  0  0  0  :  slabdata  1    1    0
blkdev_requests         24    32    256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
blkdev_ioc              3     39    104   39  1  :  tunables  0  0  0  :  slabdata  1    1    0
bio-0                   14    64    256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
biovec-256              14    20    3136  10  8  :  tunables  0  0  0  :  slabdata  2    2    0
biovec-128              0     0     1600  10  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-64               0     0     832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-16               0     0     256   16  1  :  tunables  0  0  0  :  slabdata  0    0    0
uid_cache               0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
sock_inode_cache        44    60    384   10  1  :  tunables  0  0  0  :  slabdata  6    6    0
skbuff_fclone_cache     0     0     448   9   1  :  tunables  0  0  0  :  slabdata  0    0    0
skbuff_head_cache       258   304   256   16  1  :  tunables  0  0  0  :  slabdata  19   19   0
file_lock_cache         0     24    168   24  1  :  tunables  0  0  0  :  slabdata  1    1    0
file_lock_ctx           19    56    72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
shmem_inode_cache       153   154   360   11  1  :  tunables  0  0  0  :  slabdata  14   14   0
pool_workqueue          6     8     512   8   1  :  tunables  0  0  0  :  slabdata  1    1    0
proc_inode_cache        413   418   360   11  1  :  tunables  0  0  0  :  slabdata  38   38   0
sigqueue                0     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
bdev_cache              4     9     448   9   1  :  tunables  0  0  0  :  slabdata  1    1    0
kernfs_node_cache       9232  9248  128   32  1  :  tunables  0  0  0  :  slabdata  289  289  0
mnt_cache               22    32    256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
filp                    238   294   192   21  1  :  tunables  0  0  0  :  slabdata  14   14   0
inode_cache             1396  1404  328   12  1  :  tunables  0  0  0  :  slabdata  117  117  0
dentry                  3509  3520  184   22  1  :  tunables  0  0  0  :  slabdata  160  160  0
names_cache             3     7     4160  7   8  :  tunables  0  0  0  :  slabdata  1    1    0
buffer_head             2472  2484  112   36  1  :  tunables  0  0  0  :  slabdata  69   69   0
nsproxy                 0     0     72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
vm_area_struct          446   540   136   30  1  :  tunables  0  0  0  :  slabdata  18   18   0
mm_struct               32    57    416   19  2  :  tunables  0  0  0  :  slabdata  3    3    0
fs_cache                30    84    96    42  1  :  tunables  0  0  0  :  slabdata  2    2    0
files_cache             31    64    256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
signal_cache            60    84    576   14  2  :  tunables  0  0  0  :  slabdata  6    6    0
sighand_cache           60    80    3136  10  8  :  tunables  0  0  0  :  slabdata  8    8    0
task_struct             60    72    1336  12  4  :  tunables  0  0  0  :  slabdata  6    6    0
cred_jar                92    125   160   25  1  :  tunables  0  0  0  :  slabdata  5    5    0
anon_vma_chain          310   510   80    51  1  :  tunables  0  0  0  :  slabdata  10   10   0
anon_vma                228   357   80    51  1  :  tunables  0  0  0  :  slabdata  7    7    0
pid                     63    126   96    42  1  :  tunables  0  0  0  :  slabdata  3    3    0
radix_tree_node         206   209   352   11  1  :  tunables  0  0  0  :  slabdata  19   19   0
idr_layer_cache         72    84    1112  14  4  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-8192            10    12    8320  3   8  :  tunables  0  0  0  :  slabdata  4    4    0
kmalloc-4096            543   560   4224  7   8  :  tunables  0  0  0  :  slabdata  80   80   0
kmalloc-2048            84    90    2176  15  8  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-1024            131   140   1152  14  4  :  tunables  0  0  0  :  slabdata  10   10   0
kmalloc-512             447   456   640   12  2  :  tunables  0  0  0  :  slabdata  38   38   0
kmalloc-256             359   370   384   10  1  :  tunables  0  0  0  :  slabdata  37   37   0
kmalloc-128             8162  8192  256   16  1  :  tunables  0  0  0  :  slabdata  512  512  0
kmem_cache_node         113   128   128   32  1  :  tunables  0  0  0  :  slabdata  4    4    0
kmem_cache              113   128   256   16  1  :  tunables  0  0  0  :  slabdata  8    8    0

@mweinelt

Contributor

mweinelt commented Nov 14, 2017

Behaviour does not improve when:

  • removing fq_codel and using pfifo_fast
  • disabling Airtime Fairness

Memory usage on the device looks like this after boot and before connecting to the mesh:

root@64283-ranzload:/# echo m > /proc/sysrq-trigger 
[   60.205101] sysrq: SysRq : Show Memory
[   60.208967] Mem-Info:
[   60.211292] active_anon:641 inactive_anon:8 isolated_anon:0
[   60.211292]  active_file:538 inactive_file:261 isolated_file:0
[   60.211292]  unevictable:0 dirty:0 writeback:0 unstable:0
[   60.211292]  slab_reclaimable:474 slab_unreclaimable:2651
[   60.211292]  mapped:379 shmem:22 pagetables:78 bounce:0
[   60.211292]  free:472 free_pcp:0 free_cma:0
[   60.243091] Normal free:1888kB min:1024kB low:1280kB high:1536kB active_anon:2564kB inactive_anon:32kB active_file:2152kB inactive_file:1044kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:32768kB managed:27776kB mlocked:0kB dirty:0kB writeback:0kB mapped:1516kB shmem:88kB slab_reclaimable:1896kB slab_unreclaimable:10604kB kernel_stack:424kB pagetables:312kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   60.286822] lowmem_reserve[]: 0 0
[   60.290173] Normal: 24*4kB (U) 84*8kB (UM) 70*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1888kB
[   60.301938] 821 total pagecache pages
[   60.305632] 0 pages in swap cache
[   60.308970] Swap cache stats: add 0, delete 0, find 0/0
[   60.314223] Free swap  = 0kB
[   60.317130] Total swap = 0kB
[   60.320022] 8192 pages RAM
[   60.322743] 0 pages HighMem/MovableOnly
[   60.326608] 1248 pages reserved

Slabinfo after bootup; as stated on IRC, it looks like the OOM happens as soon as the device connects to the mesh.

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
mesh_rmc                1023   1064   72    56  1  :  tunables  0  0  0  :  slabdata  19   19   0
nf-frags                0      0      184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
nf_conntrack_1          7      15     264   15  1  :  tunables  0  0  0  :  slabdata  1    1    0
nf_conntrack_expect     0      0      208   19  1  :  tunables  0  0  0  :  slabdata  0    0    0
fq_flow_cache           0      0      112   36  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_roam_cache    0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_req_cache     0      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
batadv_tt_change_cache  0      64     64    64  1  :  tunables  0  0  0  :  slabdata  1    1    0
batadv_tt_orig_cache    0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tg_cache         0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tl_cache         10     42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
sd_ext_cdb              2      51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-128              2      15     2112  15  8  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-64               2      15     1088  15  4  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-32               2      14     576   14  2  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-16               2      12     320   12  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-8                2      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
scsi_data_buffer        0      0      64    64  1  :  tunables  0  0  0  :  slabdata  0    0    0
bridge_fdb_cache        12     42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6-frags               0      0      184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
fib6_nodes              36     42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6_dst_cache           62     84     288   14  1  :  tunables  0  0  0  :  slabdata  6    6    0
ip6_mrt_cache           0      0      160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
PINGv6                  0      0      832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
RAWv6                   8      19     832   19  4  :  tunables  0  0  0  :  slabdata  1    1    0
UDPLITEv6               0      0      800   10  2  :  tunables  0  0  0  :  slabdata  0    0    0
UDPv6                   6      10     800   10  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCPv6           0      0      232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCPv6      0      0      280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCPv6                   4      10     1536  10  4  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_xattr_ref         0      0      72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_xattr_datum       0      0      104   39  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_inode_cache       227    280    72    56  1  :  tunables  0  0  0  :  slabdata  5    5    0
jffs2_node_frag         63     112    72    56  1  :  tunables  0  0  0  :  slabdata  2    2    0
jffs2_refblock          104    104    296   13  1  :  tunables  0  0  0  :  slabdata  8    8    0
jffs2_tmp_dnode         0      51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_inode         0      32     128   32  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_dirent        0      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_full_dnode        132    192    64    64  1  :  tunables  0  0  0  :  slabdata  3    3    0
jffs2_i                 88     88     368   11  1  :  tunables  0  0  0  :  slabdata  8    8    0
squashfs_inode_cache    620    620    384   10  1  :  tunables  0  0  0  :  slabdata  62   62   0
fasync_cache            4      56     72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
posix_timers_cache      0      0      200   20  1  :  tunables  0  0  0  :  slabdata  0    0    0
UNIX                    20     26     608   13  2  :  tunables  0  0  0  :  slabdata  2    2    0
ip4-frags               0      0      168   24  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_mrt_cache            0      0      160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
UDP-Lite                0      0      704   11  2  :  tunables  0  0  0  :  slabdata  0    0    0
tcp_bind_bucket         4      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
inet_peer_cache         1      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
secpath_cache           0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
flow_cache              0      0      152   26  1  :  tunables  0  0  0  :  slabdata  0    0    0
xfrm_dst_cache          0      0      320   12  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_fib_trie             13     51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_fib_alias            14     51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_dst_cache            1      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
PING                    0      0      672   12  2  :  tunables  0  0  0  :  slabdata  0    0    0
RAW                     2      12     672   12  2  :  tunables  0  0  0  :  slabdata  1    1    0
UDP                     4      11     704   11  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCP             0      0      232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCP        0      0      280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCP                     4      11     1408  11  4  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_pwq           30     51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_epi           33     64     128   32  1  :  tunables  0  0  0  :  slabdata  2    2    0
inotify_inode_mark      2      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
blkdev_queue            6      8      976   8   2  :  tunables  0  0  0  :  slabdata  1    1    0
blkdev_requests         24     32     256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
blkdev_ioc              2      39     104   39  1  :  tunables  0  0  0  :  slabdata  1    1    0
bio-0                   14     64     256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
biovec-256              14     20     3136  10  8  :  tunables  0  0  0  :  slabdata  2    2    0
biovec-128              0      0      1600  10  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-64               0      0      832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-16               0      0      256   16  1  :  tunables  0  0  0  :  slabdata  0    0    0
uid_cache               1      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
sock_inode_cache        69     80     384   10  1  :  tunables  0  0  0  :  slabdata  8    8    0
skbuff_fclone_cache     0      0      448   9   1  :  tunables  0  0  0  :  slabdata  0    0    0
skbuff_head_cache       622    720    256   16  1  :  tunables  0  0  0  :  slabdata  45   45   0
file_lock_cache         1      24     168   24  1  :  tunables  0  0  0  :  slabdata  1    1    0
file_lock_ctx           19     56     72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
shmem_inode_cache       162    165    360   11  1  :  tunables  0  0  0  :  slabdata  15   15   0
pool_workqueue          6      8      512   8   1  :  tunables  0  0  0  :  slabdata  1    1    0
proc_inode_cache        7      44     360   11  1  :  tunables  0  0  0  :  slabdata  4    4    0
sigqueue                0      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
bdev_cache              4      9      448   9   1  :  tunables  0  0  0  :  slabdata  1    1    0
kernfs_node_cache       9267   9280   128   32  1  :  tunables  0  0  0  :  slabdata  290  290  0
mnt_cache               22     32     256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
filp                    287    420    192   21  1  :  tunables  0  0  0  :  slabdata  20   20   0
inode_cache             796    1032   328   12  1  :  tunables  0  0  0  :  slabdata  86   86   0
dentry                  1966   3432   184   22  1  :  tunables  0  0  0  :  slabdata  156  156  0
names_cache             0      7      4160  7   8  :  tunables  0  0  0  :  slabdata  1    1    0
buffer_head             1008   1008   112   36  1  :  tunables  0  0  0  :  slabdata  28   28   0
nsproxy                 0      0      72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
vm_area_struct          484    540    136   30  1  :  tunables  0  0  0  :  slabdata  18   18   0
mm_struct               30     57     416   19  2  :  tunables  0  0  0  :  slabdata  3    3    0
fs_cache                29     84     96    42  1  :  tunables  0  0  0  :  slabdata  2    2    0
files_cache             30     64     256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
signal_cache            59     70     576   14  2  :  tunables  0  0  0  :  slabdata  5    5    0
sighand_cache           59     70     3136  10  8  :  tunables  0  0  0  :  slabdata  7    7    0
task_struct             59     72     1336  12  4  :  tunables  0  0  0  :  slabdata  6    6    0
cred_jar                111    150    160   25  1  :  tunables  0  0  0  :  slabdata  6    6    0
anon_vma_chain          362    459    80    51  1  :  tunables  0  0  0  :  slabdata  9    9    0
anon_vma                269    357    80    51  1  :  tunables  0  0  0  :  slabdata  7    7    0
pid                     64     126    96    42  1  :  tunables  0  0  0  :  slabdata  3    3    0
radix_tree_node         210    220    352   11  1  :  tunables  0  0  0  :  slabdata  20   20   0
idr_layer_cache         82     84     1112  14  4  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-8192            13     15     8320  3   8  :  tunables  0  0  0  :  slabdata  5    5    0
kmalloc-4096            649    716    4224  7   8  :  tunables  0  0  0  :  slabdata  134  134  0
kmalloc-2048            84     90     2176  15  8  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-1024            144    168    1152  14  4  :  tunables  0  0  0  :  slabdata  12   12   0
kmalloc-512             452    480    640   12  2  :  tunables  0  0  0  :  slabdata  40   40   0
kmalloc-256             1046   1050   384   10  1  :  tunables  0  0  0  :  slabdata  105  105  0
kmalloc-128             12701  12848  256   16  1  :  tunables  0  0  0  :  slabdata  803  803  0
kmem_cache_node         113    128    128   32  1  :  tunables  0  0  0  :  slabdata  4    4    0
kmem_cache              113    128    256   16  1  :  tunables  0  0  0  :  slabdata  8    8    0

@Adorfer

Contributor

Adorfer commented Nov 15, 2017

Testing with a 1043v1 sounds a bit like "trying to fix two issues with one shot", since (at least everybody seems to know for sure) the 1043v1 is unstable by design, even in CC and BB (even though there it's just a hanging wifi, not high load or a reboot).

@edeso

Contributor

edeso commented May 16, 2018

@TomSiener @depressivum I wanted to check the speed of my Loco M2 and found that it, too, didn't hand out IPs anymore. Connecting was fine, but no IP. Raising the limit as described by @TomSiener resolved this and also had the mentioned effect on throughput (using 512000 now, which is at least half of the default setting; waiting for the load to come back, then I will decrease it further).

Funny though: I am sure that when I applied the workaround at the end of last year it worked fine with a limit of 200. Something must have changed in the network, maybe the batman versions on the uplink nodes? ..ede

@depressivum

depressivum commented May 16, 2018

No luck. After ~18 hours with very little load, it increased to >1.5. Lowered the limit from 20000 to 10000.

@H4ndl3

Contributor

H4ndl3 commented May 16, 2018

This bug really seems to be triggered by the WAN or LAN side. The test node of @MPW1412 ran fine for 4 days with the current master image and one wireless mesh node. As soon as he connected another node (running a master image too) by cable (so mesh on LAN), the load bug appeared shortly after.

@MPW1412

Contributor

MPW1412 commented May 19, 2018

Summary of what @NeoRaider and @H4ndl3 tested on the setup I provided: the best guess so far is that there is a bug in the caching and paging code of the newer kernel version, which might not reveal itself in standard OpenWrt or LEDE because the standard use case as a router has a much lower workload than handling batman and VPN in the typical Freifunk setup.

Thanks for your work on this!

@omniuwo

omniuwo commented May 25, 2018

Hi, we (pjodd.se) are experiencing this as well.

Most of our nodes are running older gluon, based on 2016.1.x releases, but we're beta testing later gluon (now from the master branch, commit f51eac7) on two nodes.

The two nodes are both TL-WR841ND, but one is a v11 and the other a v8.4. The v8.4 does not suffer from this issue, but it also has no WLAN mesh neighbours; it is connected to the mesh via one of our gateways over fastd and usually has up to 6 WLAN clients.

The v11 has lots of neighbours, 6-7 direct WLAN mesh neighbours and more in the area around it. It usually has a load of up to 0.4 for about 7 hours, then the load starts to climb for one to several hours until it reboots and starts over.

The pattern here is ~7 hours of reasonable load before it starts to climb, even when there is little to no client traffic in the mesh, and only on the node with WLAN mesh neighbours.

I tried setting fq_memory_limit to lower values, 2048 and 4096, but that made it unusable for clients, 131072 gave significantly lower throughput, so I set it back to the default of 262144.

@mweinelt

Contributor

mweinelt commented Jun 5, 2018

@omniuwo How high does the load climb? Are you sure the devices are not running out of memory?

@rubo77

Contributor

rubo77 commented Jun 7, 2018

There is a bounty collection on the way for this bug:

https://forum.freifunk.net/t/geld-sammeln-belohnung-fuer-das-loesen-der-auslastungsproblematik-im-gluon-issue-1243/18988

From the gluon mailing list:

probably everybody on this list knows about issue #1243, the high load on many devices with firmware based on v2017.1.* or LEDE in general.

Progress seems to be stuck and probably implementing new features for
Gluon is much more fun than hunting this annoying bug.

But this bug is a real blocker for the development of our Freifunknetz
here in Münsterland, North Rhine-Westphalia. For more than a year, we
couldn't roll out a stable release.

As Gluon is FOSS it is of course in everybody's hand to enhance the
software continuously and so we tried to support Neoraider by setting up
a test system. This brought some new ideas, but no real breakthrough so
far. The only other way I can think of to support without having the
direct knowledge to hack the code, is giving financial support.

So we, the Förderverein freie Infrastruktur e. V., the incorporated
association for Freifunk in Münsterland, were thinking about putting out
a bounty on this issue. But we're reluctant to do so, as we're unsure
how the main developer and maintainer community would react to such a
move. In other words, we just don't want to step into this and affront
this well working community.

In the #gluon-irc someone said, that a winner-takes-it-all approach is
probably not the best way. So I was thinking about splitting the bounty
percentaged into three parts:

  • 30% for implementing or fixing the dynamic tracing in the linux
    kernel for the MIPS architecture: As far as I understood Neoraider, this
    missing tool is the main obstacle to hunt this bug down.
  • 30% for actually finding the bug
  • 40% for fixing it or other obstacles that come along the way

I will propose to the members of the association to provide 250 € as a
start and maybe other Freifunk associations will follow, so that we
might raise 1.500 to 2.000 €. If more money made a difference, we could
fill out an application for support for more funds on this at the
Staatskanzlei NRW, but that shouldn't be the first step in this new
approach.

If the money could be raised, maybe someone is willing to fix the
dynamic tracing for MIPS for 450 to 600 €. Maybe that is illusionary,
maybe not. I don't know.

To attract external developers, I was thinking about putting it up on
bountysource.com. But we'd be open to alternative suggestions.

Please give your thoughts about this.

Regards,
Matthias

@freifunk-gluon freifunk-gluon deleted a comment from MPW1412 Jun 7, 2018

@sumpfralle

sumpfralle commented Jun 9, 2018

I am hesitant to throw possibly unrelated issues into this discussion, but maybe OpenWrt's issue #1544 could be related?

Quick summary:

  • affected device: Ubiquiti Nanostation M5 HP XM (this model is comparable to Nanostation M2)
    • our community does not use any of the other models (TP-Link, ....) mentioned above, thus I cannot comment on those
  • the problem started with LEDE 17.01 (Gluon v2017.1)
    • it did not happen with Chaos Calmer (Gluon v2016.x)
  • load climbs to (and persists at) 8 or higher
  • top shows multiple megabytes of free memory during the problem (out of 32 MB)
  • usage of the WiFi connection and/or the second ethernet port increases the probability of the problem
  • it looks like workingset_refault (see /proc/vmstat) could be related
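A simple way to watch those counters over time from an SSH session while the load climbs (plain BusyBox tools; steadily growing workingset_refault and pgmajfault point at thrashing rather than CPU work):

while sleep 30; do
    date
    grep -E 'workingset_refault|workingset_activate|pgmajfault' /proc/vmstat
done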
@rubo77

Contributor

rubo77 commented Jun 18, 2018

Does this patch finally solve the issue?

openwrt/openwrt#1024

@mweinelt

Contributor

mweinelt commented Jun 18, 2018

No, it only fixes the tooling needed to analyze the load issue on the MIPS architecture.

@rotanid

Member

rotanid commented Jun 19, 2018

@rubo77
additionally I'll repeat what I wrote elsewhere:
there is not THE one and only issue!
"High load" can be the result of several issues, and some were already fixed, which solved the problem for several communities running the latest Gluon master branch and the latest batman-adv on their gateways.
So if you also have problems, maybe they would already be gone after updating nodes and gateways.

@ecsv

Contributor

ecsv commented Jul 7, 2018

We just had an interesting (not reproducible) problem with a single Nanostation M2. Maybe helpful to demonstrate @rotanid's statement "there is not THE one and only issue!" even when everything looks the same:

  • no clients attached
  • didn't seem to see any other device via ibss0
  • only Mesh-on-WAN active
  • load was 3/4-14 (caused by respondd, multiple odhcp6c instances and the dhcpv6 scripts, the radv filter, ...)
  • software reboot (done multiple times) didn't change the behavior
  • removing client0 and ibss0 seemed to make the device more responsive (but this was not really measured - just noticed that I could enter commands via SSH better than before)
  • installing a 2018.1 image with reduced squashfs block size reduced the load to ~1-2 (still more than expected, and it was only measured for a couple of minutes, so it might well have reached 14 again)

The solution was: remove power from the device, wait a couple of seconds and reattach it. I still don't know what caused it. Maybe a lot of ath9k HW resets caused by a currently unknown and insufficiently handled HW problem (one which survives reboots)? Anyway, I can't turn back time, so I cannot do a cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset on the broken device.

@CodeFetch

This comment has been minimized.

Contributor

CodeFetch commented Jul 8, 2018

The load bug is a thrashing issue. When a node is in a low-memory situation it constantly needs to reread blocks from flash, which then need to be decompressed. Using perf I tried to rule out other causes and found that the LZMA decompression of SquashFS definitely causes the high load.

Thus the question arises what causes the low memory situation.

@ecsv I thought there was an ath9k issue, too, but I found out that when the load bug occurs the beacons just get stuck as they don't get out in time, and the chip is reset again and again. That's also the reason why the SSIDs of APs with the load bug sometimes disappear for a while.
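If thrashing is the mechanism, the major page fault counters should confirm it. A rough sketch that lists the processes with the most major faults (field 12 of /proc/<pid>/stat; the field offset is off for process names containing spaces, which is rare on these nodes):

for s in /proc/[0-9]*/stat; do
    awk '{ printf "%10d %s\n", $12, $2 }' "$s" 2>/dev/null
done | sort -rn | head -n 10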

@rotanid rotanid modified the milestones: 2018.1, 2018.2 Jul 9, 2018

@blocktrron

Contributor

blocktrron commented Aug 17, 2018

I've noticed the kernel for ar71xx-tiny is currently compiled with USB support in place while we already exclude kmod packages for USB support. So it might be beneficial to remove USB support from the kernel altogether?
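To see what is currently built in, one could grep the target kernel config in the build tree (paths as in the LEDE 17.01 ar71xx layout; they may differ between branches, and the tiny subtarget override may or may not exist):

grep 'CONFIG_USB' target/linux/ar71xx/config-4.4 | grep -v 'is not set'
grep 'CONFIG_USB' target/linux/ar71xx/tiny/config-default 2>/dev/null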

@Adorfer

Contributor

Adorfer commented Aug 17, 2018

I can reliably reproduce this on 2018.1 (compared to 2016.2.x):

  • 841v9 on L2TP WAN, WiFi completely off, MoW disabled, MoL enabled, demand of >30 MBit/s: load nearly constantly >2, uptime before crash hardly more than 30 minutes to 1 hour.
  • 841v9 on MoW, 2-3 AP clients, but 5-6 WiFi mesh links: load exceeds 2 two or three times a day, then uptime before crash is hardly more than 30 minutes to 1 hour.

On a 64 MB RAM device nearly exactly the same setup runs flawlessly. I am really considering upgrading old devices with bigger RAM (since I understood that the bootloader detects the larger RAM automatically).

@omniuwo

omniuwo commented Sep 18, 2018

@mweinelt Sorry that I didn't reply earlier. I don't remember exactly, but way above 1 (20-40 reported before crash/reboot).

We recently upgraded one of our gateways from Debian Jessie, and thus from batman v2014.x, and tried a v2018.1.x build on one of the nodes with the most frequent reboots (it was one of two running v2017.1.x) and didn't see this issue anymore. We ran the new build on those two nodes for a week and then pushed new stable images that most of our network now runs.

@TomSiener

TomSiener commented Sep 24, 2018

After we upgraded our gateways to batman 2018.2 a few weeks ago, I repeated the tests with a WR841v9.
The phenomenon is easier to reproduce with a client connected to one of the LAN ports (e.g. a Nanostation or UniFi with stock firmware). Then the CPU load rises within minutes or hours.
Without this, the node can run several days before the CPU load rises (or on some nodes it never does).

with gluon 2018.1 and WiFI enabled:
image
with gluon 2018.1 and WiFi disabled:
image
with gluon 2016.2.7 and WiFI enabled:
image

So, the high load bug is still there, even with batman 2018.2 on the gateways and gluon 2018.1 on the node.

@mweinelt

Contributor

mweinelt commented Sep 24, 2018

We recently moved our first nodes into smaller domains and that resolved this issue on many devices as well. It's obviously a composite issue that's pretty hard to fix.

Both load and memory usage significantly drop:
Load/Memory

The CPU is less busy because it sees far fewer packets:
Management
Traffic

Airtime gets freed up because less noise needs to be forwarded:
Airtime

I have many of these examples. The gist is:

  • don't let your mesh get too large
  • drop unnecessary noise in your network

Besides this it's a lot of guesswork, like for example reducing the squashfs block size (2b20864).
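For context, the squashfs block size is a build-time image option in OpenWrt (CONFIG_TARGET_SQUASHFS_BLOCK_SIZE, value in KiB); smaller blocks mean less RAM per decompressed block at the cost of image size and compression ratio. The value below is purely illustrative, not necessarily what 2b20864 changed:

# in the OpenWrt/LEDE build directory
echo 'CONFIG_TARGET_SQUASHFS_BLOCK_SIZE=64' >> .config
make defconfig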

In general I think we profit far more from tests of the master branch.

@rotanid

Member

rotanid commented Sep 24, 2018

@TomSiener thanks for the update.

"So, the high load bug is still there, even with batman 2018.2 on the gateways and gluon 2018.1 on the node."

Sure, that's why we didn't close this issue/ticket...

@mweinelt also thanks for the information.

maybe this can be improved further if someone with deep knowledge of the systems involved uses the work done by @CodeFetch (dynamic ftrace etc) to get an insight into the "why"

@blocktrron

Contributor

blocktrron commented Sep 30, 2018

As Freifunk Darmstadt has now completed the migration of its network, we now have domains with at most 70 nodes per domain.

We already see the problems regarding high load greatly improved, if not gone completely. Stats

Surely, this is not a fix. I would even go as far as to say that there is probably no real fix. We should probably accept that those devices just do not have enough RAM to fulfil their task (and even the domain split is probably only a temporary improvement).

Another example of a very problematic node: Stats

@hauetaler

hauetaler commented Oct 29, 2018

Same issue here on a Nanostation M2 (XM) with a webcam connected to the second ethernet port. Without PoE passthrough enabled, the device runs fine; with PoE passthrough activated, the error occurs. The effect was previously reproducible at any time.
(screenshot: poe-p)
