High load on some devices after v2017.1.x update #1243

Closed

mweinelt opened this issue Oct 22, 2017 · 124 comments
Labels
0. type: bug This is a bug 9. meta: known issue Known issue which should be mentioned in release notes

Comments

@mweinelt
Contributor

The devices were fine before, but with v2017.1.x high loads appeared. They seem to originate somewhere in kernel space.

Only some models are affected, but then all devices of that model experience this issue:

  • TP-Link WR842ND v2
  • TP-Link WR1043ND v1
  • Ubiquiti Nanostation Loco M2

Probably more models are affected, but since our Grafana is currently down it's cumbersome to find out.

@mweinelt mweinelt changed the title High load on some devices after 2017.1.x update High load on some devices after v2017.1.x update Oct 22, 2017
@neocturne
Member

I've seen the issue on a WR841ND (I think it was v9 or v10). Possibly all models with 32 MB of RAM are affected?

@mweinelt
Contributor Author

mweinelt commented Oct 22, 2017

Unaffected:

  • TP-Link WR841ND v8
  • TP-Link WR841ND v10

Partially affected:

  • TP-Link WR841ND v9

@rotanid
Member

rotanid commented Oct 22, 2017

We also noticed that devices with more mesh neighbours are more likely to be affected - what a surprise!

@rotanid rotanid added the 0. type: bug This is a bug label Oct 22, 2017
@Tarnatos
Contributor

I can confirm that for Freifunk Nord.

@mweinelt
Contributor Author

mweinelt commented Oct 23, 2017

Device loads by model on the FFDA network, where load > 2.0:

{
  "TP-Link TL-MR3420 v1": [
    8.63
  ],
  "TP-Link TL-WR841N/ND v11": [
    7.85,
    12.47,
    3.44,
    3.9
  ],
  "TP-Link TL-WR940N v4": [
    4.48
  ],
  "Ubiquiti AirRouter": [
    3.09
  ],
  "TP-Link TL-WR710N v1": [
    6.2
  ],
  "TP-Link TL-WR842N/ND v2": [
    2.28,
    12.43,
    8.98,
    15.45,
    8.98,
    3.47,
    7.22,
    10.07,
    12.14,
    9.2,
    7.25,
    8.48,
    2.91,
    6.22,
    8.91,
    3.14,
    2.13,
    9.9,
    10.92
  ],
  "Ubiquiti PicoStation M2": [
    9.97
  ],
  "Linksys WRT160NL": [
    4.69
  ],
  "TP-Link TL-WR710N v2.1": [
    4.19,
    11.26
  ],
  "TP-Link TL-WR1043N/ND v1": [
    6.98,
    3.57,
    6.25,
    2.23,
    5.62
  ],
  "TP-Link TL-WR841N/ND v9": [
    4.98,
    4.55,
    4.08,
    2.67,
    4.89,
    2.57,
    9.66,
    7.81
  ],
  "TP-Link TL-WA850RE v1": [
    7.71
  ],
  "TP-Link TL-WR841N/ND v10": [
    2.87,
    5.32,
    5.59
  ],
  "TP-Link TL-WA901N/ND v3": [
    7.96
  ],
  "Ubiquiti NanoStation loco M2": [
    4.94
  ],
  "TP-Link TL-WR842N/ND v1": [
    2.49
  ]
}

@blocktrron
Member

Affected nodes on the FFDA Network grouped by SoC:

AR9341

TP-Link TL-WR842N/ND v2 19
TP-Link TL-WA801N/ND v2 1
TP-Link TL-WA850RE v1 1

QCA9533

TP-Link TL-WR841N/ND v9 7
TP-Link TL-WR841N/ND v10 5
TP-Link TL-WR841N/ND v11 2

AR9132

TP-Link TL-WR1043N/ND v1 4

AR9331

TP-Link TL-WR710N v2.1 2
TP-Link TL-WR710N v1 1

AR7240

Ubiquiti NanoStation loco M2 1
Ubiquiti PicoStation M2 1

AR7241

TP-Link TL-MR3420 v1 1
Ubiquiti AirRouter 1

AR9130

Linksys WRT160NL 1

@A-Kasper
Contributor

Hi!

I don't have any nodes with a load higher than 0.8 in our network.
We had these problems when we first used batman-adv 2017.x, but then a bug with multicast optimization (MO) in batman-adv was found. After disabling it in the firmware AND on all gateways, the load went down.

Could you please provide the following information:

  1. How many nodes in your network are on old firmware? (Maybe someone can name the first Gluon version without MO, to pin down the affected versions.)
  2. Which batman-adv version is running on your gateways?
  3. Is multicast optimization activated at the gateway level?

I think this could be related. Maybe you could try to update your batman-adv gateways and disable multicast optimization with "batctl mm 0" (don't forget your map server ;)) - a sketch follows below.

If the behaviour is related, you could try to eliminate the full-table request orgy by disabling all VPN tunnels for a minute, so the mismarked packets disappear.
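For anyone who wants to try this: a minimal sketch of disabling MO on a gateway, assuming the batman-adv mesh interface is named bat0 (adjust to your setup; the setting does not survive a reboot):

# disable batman-adv multicast optimization on this gateway
batctl -m bat0 mm 0
# verify the current setting
batctl -m bat0 mm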

@mweinelt
Contributor Author

  1. ~700 new (v2017.1.3), ~30 old (v2017.1.2 and earlier)
  2. batman-adv v2017.3
  3. yes

Multicast optimizations are still disabled on v2017.1.x, see c6a3afa.

@A-Kasper
Contributor

I think the MO bug is still not addressed. As already mentioned, our load is in the normal range (but we have a smaller network). If you are able to, please check whether the load goes down when you disable MO on all gateways via batctl mm 0. I would like to make sure it's not still this MO thing.

Our network became stable after ALL sources of mismarked packets were eliminated; the gateways had the biggest effect. I'm not sure, but if I remember correctly it was you who had a look at our network dump... if so: can you see these full-table requests?

@mweinelt
Contributor Author

@A-Kasper
Contributor

A-Kasper commented Oct 25, 2017 via email

@mweinelt
Contributor Author

mweinelt commented Oct 27, 2017

Disabled mm on our gateways.

Looking at this node, for example, at this time of day (3:00):
https://meshviewer.darmstadt.freifunk.net/#/de/map/30b5c2c2ead4

  • Model 841 v9
  • Load average 4.83
  • RAM 74.7% used
    "memory": {
     "total": 27808,
     "free": 5236,
     "buffers": 992,
     "cached": 1696
    },
  • Clients 0
  • Traffic negligible

5.2 MB of "free" RAM does not seem to be enough if we suspect the issue arises from high memory fragmentation.
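If memory fragmentation is the suspicion, the buddy allocator statistics give a quick read on it without any extra tooling; a minimal check:

# one row per zone; columns are counts of free blocks of order 0..10
# (4 KiB up to 4 MiB). Plenty of order-0 blocks but zeros in the high
# orders means free memory exists only in small fragments.
cat /proc/buddyinfo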

@A-Kasper
Contributor

A-Kasper commented Oct 27, 2017 via email

@rotanid
Member

rotanid commented Oct 27, 2017

@A-Kasper

  1. Please try not to comment if you have nothing new to contribute; it makes the whole issue harder and harder to read. You could have waited with your comment until you had a look at your memory usage.
  2. How many nodes does your network have? What about large groups of wireless-meshing nodes? Maybe your network is simply too small to have this issue.

@CodeFetch
Contributor

@mweinelt Do you have any custom ash scripts running on your nodes?
I had an ash script running and my node got slower and slower until logging in via SSH took about a minute. The same happened when a monitoring script logged into the router via SSH and just executed some commands. That's why I think it could be a problem with ash/BusyBox. With only Lua scripts running on the router, executed by micrond, everything works as expected.

@mweinelt
Contributor Author

@CodeFetch No, we're not running anything custom.

@rotanid
Member

rotanid commented Nov 6, 2017

Perhaps this issue is actually the same as #753 - only that it appears earlier than before.

@mweinelt
Contributor Author

mweinelt commented Nov 6, 2017

Looking at the issue by SoC alone does not seem to yield a clear result.

[
  {
    "family": "AR9531",
    "loadavg": 0.17,
    "devices": {
      "TP-Link TL-WR842N/ND v3": 0.17
    }
  },
  {
    "family": "QCA9558",
    "loadavg": 0.18,
    "devices": {
      "TP-Link TL-WR1043N/ND v3": 0.16,
      "TP-Link Archer C7 v2": 0.19,
      "TP-Link TL-WR1043N/ND v2": 0.2,
      "TP-Link Archer C5 v1": 0.21
    }
  },
  {
    "family": "QCA9563",
    "loadavg": 0.19,
    "devices": {
      "TP-Link TL-WR1043N/ND v4": 0.19
    }
  },
  {
    "family": "AR7240",
    "loadavg": 0.67,
    "devices": {
      "Ubiquiti NanoStation loco M2": 0.59,
      "Ubiquiti NanoStation M2": 1.61,
      "TP-Link TL-WA801N/ND v1": 1.25,
      "TP-Link TL-WR740N/ND v4": 1.18,
      "TP-Link TL-WR740N/ND v1": 0.3,
      "TP-Link TL-WA901N/ND v1": 0.04,
      "TP-Link TL-WA830RE v1": 0.17,
      "TP-Link TL-WR741N/ND v1": 0.41,
      "TP-Link TL-WR841N/ND v5": 0.06
    }
  },
  {
    "family": "QCA9533",
    "loadavg": 1.05,
    "devices": {
      "TP-Link TL-WR841N/ND v11": 0.79,
      "TP-Link TL-WR841N/ND v10": 1.18,
      "TP-Link TL-WR841N/ND v9": 1.1
    }
  },
  {
    "family": "AR2316A",
    "loadavg": 1.49,
    "devices": {
      "Ubiquiti PicoStation M2": 1.49
    }
  },
  {
    "family": "AR9344",
    "loadavg": 0.21,
    "devices": {
      "TP-Link CPE210 v1.1": 0.21,
      "TP-Link TL-WDR4300 v1": 0.2,
      "TP-Link CPE210 v1.0": 0.24,
      "TP-Link TL-WDR3600 v1": 0.17
    }
  },
  {
    "family": "AR9341",
    "loadavg": 2.5,
    "devices": {
      "TP-Link TL-WR842N/ND v2": 3.5,
      "TP-Link TL-WA850RE v1": 1.42,
      "TP-Link TL-WR841N/ND v8": 1.25,
      "TP-Link TL-WA801N/ND v2": 0.18,
      "TP-Link TL-WR941N/ND v5": 0.07,
      "TP-Link TL-WA901N/ND v3": 2.26,
      "TP-Link TL-WA860RE v1": 0.15
    }
  },
  {
    "family": "AR9331",
    "loadavg": 2.76,
    "devices": {
      "TP-Link TL-WR710N v2.1": 6.76,
      "TP-Link TL-MR3020 v1": 1.15,
      "TP-Link TL-WR710N v1": 3.62,
      "TP-Link TL-WR741N/ND v4": 0.21,
      "TP-Link TL-WR710N v2": 0.12,
      "TP-Link TL-WA701N/ND v2": 0.61
    }
  },
  {
    "family": "AR9132",
    "loadavg": 1.95,
    "devices": {
      "TP-Link TL-WR1043N/ND v1": 2.51,
      "TP-Link TL-WR941N/ND v2": 0.18,
      "TP-Link TL-WA901N/ND v2": 0.11
    }
  },
  {
    "family": "AR7241",
    "loadavg": 2.56,
    "devices": {
      "TP-Link TL-MR3420 v1": 2.78,
      "TP-Link TL-WR842N/ND v1": 2.12,
      "Ubiquiti AirRouter": 2.77
    }
  },
  {
    "family": "TP9343",
    "loadavg": 1.06,
    "devices": {
      "TP-Link TL-WR941N/ND v6": 0.44,
      "TP-Link TL-WR940N v4": 2.07,
      "TP-Link TL-WA901N/ND v4": 0.07
    }
  },
  {
    "family": "MT7621AT",
    "loadavg": 0.08,
    "devices": {
      "D-Link DIR-860L B1": 0.08
    }
  },
  {
    "family": "AR7161",
    "loadavg": 0.26,
    "devices": {
      "Buffalo WZR-HP-AG300H/WZR-600DHP": 0.26
    }
  },
  {
    "family": "AR1311",
    "loadavg": 0.21,
    "devices": {
      "D-Link DIR-505 rev. A2": 0.21
    }
  },
  {
    "family": "AR9350",
    "loadavg": 0.15,
    "devices": {
      "TP-Link CPE510 v1.0": 0.15,
      "TP-Link CPE510 v1.1": 0.14
    }
  },
  {
    "family": "AR9130",
    "loadavg": 7.59,
    "devices": {
      "Linksys WRT160NL": 7.59
    }
  },
  {
    "family": "AR9342",
    "loadavg": 0.02,
    "devices": {
      "Ubiquiti Loco M XW": 0.02
    }
  }
]

@edeso
Contributor

edeso commented Nov 10, 2017

Hey all,

on an Ubiquiti NanoStation loco M2 the issue seems to go away when the mesh WLAN is deactivated - at least it has looked that way for the past two days.

This is Freifunk KBU, where the wireless mesh config looks like this:

config wifi-iface 'ibss_radio0'
        option ifname 'ibss0'
        option network 'ibss_radio0'
        option device 'radio0'
        option bssid '02:d2:22:01:fc:22'
        option disabled '1'
        option mcast_rate '12000'
        option mode 'adhoc'
        option macaddr '42:84:e7:d9:c1:32'
        option ssid '02:d2:22:01:fc:22'

..ede

@mweinelt
Contributor Author

Does this NanoStation loco M2 have a VPN connection, or is it otherwise connected to the mesh after disabling the Wi-Fi mesh?

How big is the batadv L2 domain? How many originators? What is the (global) translation table size?
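For reference, both numbers can be approximated with batctl on any node in the segment; a rough sketch, assuming the mesh interface is bat0 (the counts include a few header lines, and output formats differ between batctl versions):

# originators = roughly the number of nodes in the batadv L2 domain
batctl -m bat0 o | wc -l
# global translation table = roughly the number of client MACs in the domain
batctl -m bat0 tg | wc -l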

@T-X
Contributor

T-X commented Nov 13, 2017

At least the three devices in the original post are all devices with 32 MB RAM and 8 MB flash. Is this ticket maybe a duplicate of #1197? Or could we somehow separate these two tickets more clearly?

Just a crazy idea... As decently fast microSD cards seem to have gotten quite cheap: I'd be curious whether attaching some flash storage to the USB port of a router and configuring one partition for swap and one for /tmp would make a difference. For instance, this plus this would cost less than 10€. Maybe there's even a decently fast, usable USB flash stick for less than 5€. I'm not suggesting this as a fix, but I'm curious whether it would change anything.
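A rough sketch of what that experiment could look like, assuming the stick shows up as /dev/sda with two pre-made partitions and that the needed kernel modules and tools (kmod-usb-storage, kmod-fs-ext4, mkswap/swapon, mkfs.ext4) are present, which they are not in default Gluon images:

# partition 1 as swap
mkswap /dev/sda1
swapon /dev/sda1
# partition 2 as ext4, mounted over the RAM-backed tmpfs on /tmp
# (this hides whatever is currently in /tmp, so best done right after boot)
mkfs.ext4 /dev/sda2
mount /dev/sda2 /tmp
# confirm the swap is active
cat /proc/swaps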

Also, if some people with devices that constantly have high load could recompile with CONFIG_KERNEL_SLABINFO=y and dump /proc/slabinfo, that could be helpful (I asked for this in #1197, too).
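Once the kernel option is set, BusyBox is enough to spot the biggest slab consumers; a small helper (columns 3 and 4 of /proc/slabinfo are num_objs and objsize):

# skip the two header lines, print approximate footprint per cache in bytes
awk 'NR>2 {print $3*$4, $1}' /proc/slabinfo | sort -rn | head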

@mweinelt
Contributor Author

We tested an image for the 1043v1 without the additional USB modules that are currently loaded unconditionally, and the device seemed to behave fine again.

@blocktrron can maybe tell us more about what he saw during tests.

@blocktrron
Member

blocktrron commented Nov 13, 2017

At Freifunk Darmstadt, we were able to observe that 8 MB/32 MB devices rebooted frequently, most likely due to the additional RAM usage of the integrated USB support.

We were also able to recreate the problems (high load/crashing) on an OpenWrt-based Gluon by writing to tmpfs. Crashing/high load only occurred when RAM was filled before the batman-adv global translation table was initialized, i.e. when the router was already connected to the mesh.

When the router was booted without visible neighbours, and RAM was filled before connecting to the network, the node was not affected by crashing or high load.
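For anyone wanting to repeat this, a sketch of the reproduction on a node where /tmp is a tmpfs (i.e. RAM-backed, as in Gluon) - be warned that on an affected 32 MB device this can trigger the OOM reboot described above:

# fill /tmp in 1 MiB steps until allocation fails, watching free memory shrink
i=0
while dd if=/dev/zero of=/tmp/fill.$i bs=1024 count=1024 2>/dev/null; do
    i=$((i+1))
    grep MemFree /proc/meminfo
done
# clean up afterwards
rm -f /tmp/fill.*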

@mweinelt
Contributor Author

FYI: we're building images from master with slabinfo enabled and without the additional USB modules tonight. If we can trigger the issue with those, we'll post the slabinfo; otherwise we'll retry with the USB modules installed.

@mweinelt
Contributor Author

mweinelt commented Nov 14, 2017

From our 1043v1, which keeps rebooting in a loop:

OOM Reboot

[   90.450092] hotplug-call invoked oom-killer: gfp_mask=0x2420848, order=0, oom_score_adj=0
[   90.458394] CPU: 0 PID: 2327 Comm: hotplug-call Not tainted 4.4.93 #0
[   90.464869] Stack : 803e96e4 00000000 00000001 80440000 807d5764 80434e63 803ca228 00000917
	  804a378c 00001b20 00000040 00000000 00000000 800a787c 00000006 00000000
	  00000000 00000000 803cdd4c 8172199c 804a6542 800a57f8 02420848 00000000
	  00000001 801f9300 00000000 00000000 00000000 00000000 00000000 00000000
	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	  ...
[   90.500943] Call Trace:
[   90.503422] [<80071f1c>] show_stack+0x54/0x88
[   90.507826] [<800d498c>] dump_header.isra.4+0x48/0x130
[   90.513003] [<800d515c>] check_panic_on_oom+0x48/0x84
[   90.518102] [<800d5288>] out_of_memory+0xf0/0x324
[   90.522847] [<800d8da0>] __alloc_pages_nodemask+0x6b8/0x724
[   90.528488] [<800d1b44>] pagecache_get_page+0x154/0x278
[   90.533765] [<80136e94>] __getblk_slow+0x15c/0x374
[   90.538617] [<8015e518>] squashfs_read_data+0x1c8/0x6e8
[   90.543888] [<80162728>] squashfs_readpage_block+0x32c/0x4d8
[   90.549602] [<801603a4>] squashfs_readpage+0x5bc/0x6d0
[   90.554780] [<800dc53c>] __do_page_cache_readahead+0x1f8/0x264
[   90.560673] [<800d393c>] filemap_fault+0x1ac/0x458
[   90.565526] [<800eeb4c>] __do_fault+0x3c/0xa8
[   90.569925] [<800f1d84>] handle_mm_fault+0x478/0xb14
[   90.574934] [<80076be8>] __do_page_fault+0x134/0x470
[   90.579944] [<80060820>] ret_from_exception+0x0/0x10
[   90.584933] 
[   90.586446] Mem-Info:
[   90.588769] active_anon:820 inactive_anon:9 isolated_anon:0
[   90.588769]  active_file:136 inactive_file:154 isolated_file:0
[   90.588769]  unevictable:0 dirty:0 writeback:0 unstable:0
[   90.588769]  slab_reclaimable:211 slab_unreclaimable:3104
[   90.588769]  mapped:59 shmem:29 pagetables:104 bounce:0
[   90.588769]  free:293 free_pcp:0 free_cma:0
[   90.620556] Normal free:1172kB min:1024kB low:1280kB high:1536kB active_anon:3280kB inactive_anon:36kB active_file:544kB inactive_file:616kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:32768kB managed:27776kB mlocked:0kB dirty:0kB writeback:0kB mapped:236kB shmem:116kB slab_reclaimable:844kB slab_unreclaimable:12416kB kernel_stack:472kB pagetables:416kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:6972 all_unreclaimable? yes
[   90.664376] lowmem_reserve[]: 0 0
[   90.667738] Normal: 49*4kB (UME) 80*8kB (UME) 13*16kB (UME) 4*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1172kB
[   90.680287] 319 total pagecache pages
[   90.683970] 0 pages in swap cache
[   90.687314] Swap cache stats: add 0, delete 0, find 0/0
[   90.692565] Free swap  = 0kB
[   90.695472] Total swap = 0kB
[   90.698373] 8192 pages RAM
[   90.701093] 0 pages HighMem/MovableOnly
[   90.704947] 1248 pages reserved
[   90.708117] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   90.716721] [  515]     0   515      297       49       3       0        0             0 ubusd
[   90.725384] [  516]     0   516      296       40       4       0        0             0 ash
[   90.733889] [  807]     0   807      306       68       4       0        0             0 logd
[   90.742477] [  814]     0   814      429      189       4       0        0             0 haveged
[   90.751328] [ 1055]     0  1055      447       82       4       0        0             0 netifd
[   90.760092] [ 1102]     0  1102      264       40       3       0        0             0 dropbear
[   90.769029] [ 1119]     0  1119      225       42       3       0        0             0 uradvd
[   90.777794] [ 1330]     0  1330      296       39       4       0        0             0 udhcpc
[   90.786557] [ 1332]     0  1332      254       44       3       0        0             0 odhcp6c
[   90.795396] [ 1343]     0  1343      254       49       3       0        0             0 odhcp6c
[   90.804250] [ 1487]     0  1487      225       44       4       0        0             0 micrond
[   90.813101] [ 1521]     0  1521      224       39       3       0        0             0 sse-multiplexd
[   90.822562] [ 1685]     0  1685      320       50       3       0        0             0 uhttpd
[   90.831325] [ 1794]     0  1794      383       76       3       0        0             0 hostapd
[   90.840177] [ 1809]   453  1809      353      137       4       0        0             0 dnsmasq
[   90.849033] [ 1830]     0  1830      280       52       4       0        0             0 dnsmasq
[   90.857887] [ 2116]     0  2116      320       63       4       0        0             0 fastd
[   90.866563] [ 2213]     0  2213      517       71       3       0        0             0 respondd
[   90.875502] [ 2223]     0  2223      306       50       3       0        0             0 hotplug-call
[   90.884777] [ 2295]     0  2295      296       40       3       0        0             0 ntpd
[   90.893369] [ 2326]     0  2326      327       75       4       0        0             0 dhcpv6.script
[   90.902742] [ 2327]     0  2327      306       47       3       0        0             0 hotplug-call
[   90.912030] [ 2332]     0  2332      326       72       4       0        0             0 gluon-respondd
[   90.921492] [ 2342]     0  2342      326       71       4       0        0             0 gluon-respondd
[   90.930952] [ 2343]     0  2343      326       71       4       0        0             0 gluon-respondd
[   90.940413] [ 2345]     0  2345      293       62       3       0        0             0 jsonfilter
[   90.949525] [ 2346]     0  2346      212       42       3       0        0             0 ubus
[   90.958114] [ 2349]     0  2349      382       60       4       0        0             0 procd
[   90.966786] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[   90.966786] 
[   90.980773] Rebooting in 3 seconds..

Slabinfo

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
mesh_rmc                0     0     72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
nf-frags                0     0     184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
nf_conntrack_1          7     15    264   15  1  :  tunables  0  0  0  :  slabdata  1    1    0
nf_conntrack_expect     0     0     208   19  1  :  tunables  0  0  0  :  slabdata  0    0    0
fq_flow_cache           0     0     112   36  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_roam_cache    0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_req_cache     0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_change_cache  0     64    64    64  1  :  tunables  0  0  0  :  slabdata  1    1    0
batadv_tt_orig_cache    0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tg_cache         0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tl_cache         2     42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
sd_ext_cdb              2     51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-128              2     15    2112  15  8  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-64               2     15    1088  15  4  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-32               2     14    576   14  2  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-16               2     12    320   12  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-8                2     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
scsi_data_buffer        0     0     64    64  1  :  tunables  0  0  0  :  slabdata  0    0    0
bridge_fdb_cache        11    42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6-frags               0     0     184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
fib6_nodes              21    42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6_dst_cache           44    56    288   14  1  :  tunables  0  0  0  :  slabdata  4    4    0
ip6_mrt_cache           0     0     160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
PINGv6                  0     0     832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
RAWv6                   8     19    832   19  4  :  tunables  0  0  0  :  slabdata  1    1    0
UDPLITEv6               0     0     800   10  2  :  tunables  0  0  0  :  slabdata  0    0    0
UDPv6                   4     10    800   10  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCPv6           0     0     232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCPv6      0     0     280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCPv6                   1     10    1536  10  4  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_xattr_ref         0     0     72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_xattr_datum       0     0     104   39  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_inode_cache       165   168   72    56  1  :  tunables  0  0  0  :  slabdata  3    3    0
jffs2_node_frag         63    112   72    56  1  :  tunables  0  0  0  :  slabdata  2    2    0
jffs2_refblock          96    104   296   13  1  :  tunables  0  0  0  :  slabdata  8    8    0
jffs2_tmp_dnode         0     51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_inode         0     32    128   32  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_dirent        0     42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_full_dnode        131   192   64    64  1  :  tunables  0  0  0  :  slabdata  3    3    0
jffs2_i                 87    88    368   11  1  :  tunables  0  0  0  :  slabdata  8    8    0
squashfs_inode_cache    611   620   384   10  1  :  tunables  0  0  0  :  slabdata  62   62   0
fasync_cache            4     56    72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
posix_timers_cache      0     0     200   20  1  :  tunables  0  0  0  :  slabdata  0    0    0
UNIX                    15    26    608   13  2  :  tunables  0  0  0  :  slabdata  2    2    0
ip4-frags               0     0     168   24  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_mrt_cache            0     0     160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
UDP-Lite                0     0     704   11  2  :  tunables  0  0  0  :  slabdata  0    0    0
tcp_bind_bucket         1     42    96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
inet_peer_cache         2     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
secpath_cache           0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
flow_cache              0     0     152   26  1  :  tunables  0  0  0  :  slabdata  0    0    0
xfrm_dst_cache          0     0     320   12  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_fib_trie             13    51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_fib_alias            14    51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_dst_cache            1     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
PING                    0     0     672   12  2  :  tunables  0  0  0  :  slabdata  0    0    0
RAW                     2     12    672   12  2  :  tunables  0  0  0  :  slabdata  1    1    0
UDP                     1     11    704   11  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCP             0     0     232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCP        0     0     280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCP                     1     11    1408  11  4  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_pwq           28    51    80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_epi           28    64    128   32  1  :  tunables  0  0  0  :  slabdata  2    2    0
inotify_inode_mark      0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
blkdev_queue            6     8     976   8   2  :  tunables  0  0  0  :  slabdata  1    1    0
blkdev_requests         24    32    256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
blkdev_ioc              3     39    104   39  1  :  tunables  0  0  0  :  slabdata  1    1    0
bio-0                   14    64    256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
biovec-256              14    20    3136  10  8  :  tunables  0  0  0  :  slabdata  2    2    0
biovec-128              0     0     1600  10  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-64               0     0     832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-16               0     0     256   16  1  :  tunables  0  0  0  :  slabdata  0    0    0
uid_cache               0     0     96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
sock_inode_cache        44    60    384   10  1  :  tunables  0  0  0  :  slabdata  6    6    0
skbuff_fclone_cache     0     0     448   9   1  :  tunables  0  0  0  :  slabdata  0    0    0
skbuff_head_cache       258   304   256   16  1  :  tunables  0  0  0  :  slabdata  19   19   0
file_lock_cache         0     24    168   24  1  :  tunables  0  0  0  :  slabdata  1    1    0
file_lock_ctx           19    56    72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
shmem_inode_cache       153   154   360   11  1  :  tunables  0  0  0  :  slabdata  14   14   0
pool_workqueue          6     8     512   8   1  :  tunables  0  0  0  :  slabdata  1    1    0
proc_inode_cache        413   418   360   11  1  :  tunables  0  0  0  :  slabdata  38   38   0
sigqueue                0     21    192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
bdev_cache              4     9     448   9   1  :  tunables  0  0  0  :  slabdata  1    1    0
kernfs_node_cache       9232  9248  128   32  1  :  tunables  0  0  0  :  slabdata  289  289  0
mnt_cache               22    32    256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
filp                    238   294   192   21  1  :  tunables  0  0  0  :  slabdata  14   14   0
inode_cache             1396  1404  328   12  1  :  tunables  0  0  0  :  slabdata  117  117  0
dentry                  3509  3520  184   22  1  :  tunables  0  0  0  :  slabdata  160  160  0
names_cache             3     7     4160  7   8  :  tunables  0  0  0  :  slabdata  1    1    0
buffer_head             2472  2484  112   36  1  :  tunables  0  0  0  :  slabdata  69   69   0
nsproxy                 0     0     72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
vm_area_struct          446   540   136   30  1  :  tunables  0  0  0  :  slabdata  18   18   0
mm_struct               32    57    416   19  2  :  tunables  0  0  0  :  slabdata  3    3    0
fs_cache                30    84    96    42  1  :  tunables  0  0  0  :  slabdata  2    2    0
files_cache             31    64    256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
signal_cache            60    84    576   14  2  :  tunables  0  0  0  :  slabdata  6    6    0
sighand_cache           60    80    3136  10  8  :  tunables  0  0  0  :  slabdata  8    8    0
task_struct             60    72    1336  12  4  :  tunables  0  0  0  :  slabdata  6    6    0
cred_jar                92    125   160   25  1  :  tunables  0  0  0  :  slabdata  5    5    0
anon_vma_chain          310   510   80    51  1  :  tunables  0  0  0  :  slabdata  10   10   0
anon_vma                228   357   80    51  1  :  tunables  0  0  0  :  slabdata  7    7    0
pid                     63    126   96    42  1  :  tunables  0  0  0  :  slabdata  3    3    0
radix_tree_node         206   209   352   11  1  :  tunables  0  0  0  :  slabdata  19   19   0
idr_layer_cache         72    84    1112  14  4  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-8192            10    12    8320  3   8  :  tunables  0  0  0  :  slabdata  4    4    0
kmalloc-4096            543   560   4224  7   8  :  tunables  0  0  0  :  slabdata  80   80   0
kmalloc-2048            84    90    2176  15  8  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-1024            131   140   1152  14  4  :  tunables  0  0  0  :  slabdata  10   10   0
kmalloc-512             447   456   640   12  2  :  tunables  0  0  0  :  slabdata  38   38   0
kmalloc-256             359   370   384   10  1  :  tunables  0  0  0  :  slabdata  37   37   0
kmalloc-128             8162  8192  256   16  1  :  tunables  0  0  0  :  slabdata  512  512  0
kmem_cache_node         113   128   128   32  1  :  tunables  0  0  0  :  slabdata  4    4    0
kmem_cache              113   128   256   16  1  :  tunables  0  0  0  :  slabdata  8    8    0

@mweinelt
Contributor Author

mweinelt commented Nov 14, 2017

Behaviour does not improve when:

  • removing fq_codel and using pfifo_fast
  • disabling Airtime Fairness

Memory usage on the device looks like this after boot and before connecting to the mesh:

root@64283-ranzload:/# echo m > /proc/sysrq-trigger 
[   60.205101] sysrq: SysRq : Show Memory
[   60.208967] Mem-Info:
[   60.211292] active_anon:641 inactive_anon:8 isolated_anon:0
[   60.211292]  active_file:538 inactive_file:261 isolated_file:0
[   60.211292]  unevictable:0 dirty:0 writeback:0 unstable:0
[   60.211292]  slab_reclaimable:474 slab_unreclaimable:2651
[   60.211292]  mapped:379 shmem:22 pagetables:78 bounce:0
[   60.211292]  free:472 free_pcp:0 free_cma:0
[   60.243091] Normal free:1888kB min:1024kB low:1280kB high:1536kB active_anon:2564kB inactive_anon:32kB active_file:2152kB inactive_file:1044kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:32768kB managed:27776kB mlocked:0kB dirty:0kB writeback:0kB mapped:1516kB shmem:88kB slab_reclaimable:1896kB slab_unreclaimable:10604kB kernel_stack:424kB pagetables:312kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   60.286822] lowmem_reserve[]: 0 0
[   60.290173] Normal: 24*4kB (U) 84*8kB (UM) 70*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1888kB
[   60.301938] 821 total pagecache pages
[   60.305632] 0 pages in swap cache
[   60.308970] Swap cache stats: add 0, delete 0, find 0/0
[   60.314223] Free swap  = 0kB
[   60.317130] Total swap = 0kB
[   60.320022] 8192 pages RAM
[   60.322743] 0 pages HighMem/MovableOnly
[   60.326608] 1248 pages reserved

Slabinfo after bootup; as stated on IRC, it looks like the OOM happens as soon as the device connects to the mesh.

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
mesh_rmc                1023   1064   72    56  1  :  tunables  0  0  0  :  slabdata  19   19   0
nf-frags                0      0      184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
nf_conntrack_1          7      15     264   15  1  :  tunables  0  0  0  :  slabdata  1    1    0
nf_conntrack_expect     0      0      208   19  1  :  tunables  0  0  0  :  slabdata  0    0    0
fq_flow_cache           0      0      112   36  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_roam_cache    0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tt_req_cache     0      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
batadv_tt_change_cache  0      64     64    64  1  :  tunables  0  0  0  :  slabdata  1    1    0
batadv_tt_orig_cache    0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tg_cache         0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
batadv_tl_cache         10     42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
sd_ext_cdb              2      51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-128              2      15     2112  15  8  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-64               2      15     1088  15  4  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-32               2      14     576   14  2  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-16               2      12     320   12  1  :  tunables  0  0  0  :  slabdata  1    1    0
sgpool-8                2      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
scsi_data_buffer        0      0      64    64  1  :  tunables  0  0  0  :  slabdata  0    0    0
bridge_fdb_cache        12     42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6-frags               0      0      184   22  1  :  tunables  0  0  0  :  slabdata  0    0    0
fib6_nodes              36     42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip6_dst_cache           62     84     288   14  1  :  tunables  0  0  0  :  slabdata  6    6    0
ip6_mrt_cache           0      0      160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
PINGv6                  0      0      832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
RAWv6                   8      19     832   19  4  :  tunables  0  0  0  :  slabdata  1    1    0
UDPLITEv6               0      0      800   10  2  :  tunables  0  0  0  :  slabdata  0    0    0
UDPv6                   6      10     800   10  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCPv6           0      0      232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCPv6      0      0      280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCPv6                   4      10     1536  10  4  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_xattr_ref         0      0      72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_xattr_datum       0      0      104   39  1  :  tunables  0  0  0  :  slabdata  0    0    0
jffs2_inode_cache       227    280    72    56  1  :  tunables  0  0  0  :  slabdata  5    5    0
jffs2_node_frag         63     112    72    56  1  :  tunables  0  0  0  :  slabdata  2    2    0
jffs2_refblock          104    104    296   13  1  :  tunables  0  0  0  :  slabdata  8    8    0
jffs2_tmp_dnode         0      51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_inode         0      32     128   32  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_raw_dirent        0      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
jffs2_full_dnode        132    192    64    64  1  :  tunables  0  0  0  :  slabdata  3    3    0
jffs2_i                 88     88     368   11  1  :  tunables  0  0  0  :  slabdata  8    8    0
squashfs_inode_cache    620    620    384   10  1  :  tunables  0  0  0  :  slabdata  62   62   0
fasync_cache            4      56     72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
posix_timers_cache      0      0      200   20  1  :  tunables  0  0  0  :  slabdata  0    0    0
UNIX                    20     26     608   13  2  :  tunables  0  0  0  :  slabdata  2    2    0
ip4-frags               0      0      168   24  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_mrt_cache            0      0      160   25  1  :  tunables  0  0  0  :  slabdata  0    0    0
UDP-Lite                0      0      704   11  2  :  tunables  0  0  0  :  slabdata  0    0    0
tcp_bind_bucket         4      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
inet_peer_cache         1      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
secpath_cache           0      0      96    42  1  :  tunables  0  0  0  :  slabdata  0    0    0
flow_cache              0      0      152   26  1  :  tunables  0  0  0  :  slabdata  0    0    0
xfrm_dst_cache          0      0      320   12  1  :  tunables  0  0  0  :  slabdata  0    0    0
ip_fib_trie             13     51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_fib_alias            14     51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
ip_dst_cache            1      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
PING                    0      0      672   12  2  :  tunables  0  0  0  :  slabdata  0    0    0
RAW                     2      12     672   12  2  :  tunables  0  0  0  :  slabdata  1    1    0
UDP                     4      11     704   11  2  :  tunables  0  0  0  :  slabdata  1    1    0
tw_sock_TCP             0      0      232   17  1  :  tunables  0  0  0  :  slabdata  0    0    0
request_sock_TCP        0      0      280   14  1  :  tunables  0  0  0  :  slabdata  0    0    0
TCP                     4      11     1408  11  4  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_pwq           30     51     80    51  1  :  tunables  0  0  0  :  slabdata  1    1    0
eventpoll_epi           33     64     128   32  1  :  tunables  0  0  0  :  slabdata  2    2    0
inotify_inode_mark      2      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
blkdev_queue            6      8      976   8   2  :  tunables  0  0  0  :  slabdata  1    1    0
blkdev_requests         24     32     256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
blkdev_ioc              2      39     104   39  1  :  tunables  0  0  0  :  slabdata  1    1    0
bio-0                   14     64     256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
biovec-256              14     20     3136  10  8  :  tunables  0  0  0  :  slabdata  2    2    0
biovec-128              0      0      1600  10  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-64               0      0      832   19  4  :  tunables  0  0  0  :  slabdata  0    0    0
biovec-16               0      0      256   16  1  :  tunables  0  0  0  :  slabdata  0    0    0
uid_cache               1      42     96    42  1  :  tunables  0  0  0  :  slabdata  1    1    0
sock_inode_cache        69     80     384   10  1  :  tunables  0  0  0  :  slabdata  8    8    0
skbuff_fclone_cache     0      0      448   9   1  :  tunables  0  0  0  :  slabdata  0    0    0
skbuff_head_cache       622    720    256   16  1  :  tunables  0  0  0  :  slabdata  45   45   0
file_lock_cache         1      24     168   24  1  :  tunables  0  0  0  :  slabdata  1    1    0
file_lock_ctx           19     56     72    56  1  :  tunables  0  0  0  :  slabdata  1    1    0
shmem_inode_cache       162    165    360   11  1  :  tunables  0  0  0  :  slabdata  15   15   0
pool_workqueue          6      8      512   8   1  :  tunables  0  0  0  :  slabdata  1    1    0
proc_inode_cache        7      44     360   11  1  :  tunables  0  0  0  :  slabdata  4    4    0
sigqueue                0      21     192   21  1  :  tunables  0  0  0  :  slabdata  1    1    0
bdev_cache              4      9      448   9   1  :  tunables  0  0  0  :  slabdata  1    1    0
kernfs_node_cache       9267   9280   128   32  1  :  tunables  0  0  0  :  slabdata  290  290  0
mnt_cache               22     32     256   16  1  :  tunables  0  0  0  :  slabdata  2    2    0
filp                    287    420    192   21  1  :  tunables  0  0  0  :  slabdata  20   20   0
inode_cache             796    1032   328   12  1  :  tunables  0  0  0  :  slabdata  86   86   0
dentry                  1966   3432   184   22  1  :  tunables  0  0  0  :  slabdata  156  156  0
names_cache             0      7      4160  7   8  :  tunables  0  0  0  :  slabdata  1    1    0
buffer_head             1008   1008   112   36  1  :  tunables  0  0  0  :  slabdata  28   28   0
nsproxy                 0      0      72    56  1  :  tunables  0  0  0  :  slabdata  0    0    0
vm_area_struct          484    540    136   30  1  :  tunables  0  0  0  :  slabdata  18   18   0
mm_struct               30     57     416   19  2  :  tunables  0  0  0  :  slabdata  3    3    0
fs_cache                29     84     96    42  1  :  tunables  0  0  0  :  slabdata  2    2    0
files_cache             30     64     256   16  1  :  tunables  0  0  0  :  slabdata  4    4    0
signal_cache            59     70     576   14  2  :  tunables  0  0  0  :  slabdata  5    5    0
sighand_cache           59     70     3136  10  8  :  tunables  0  0  0  :  slabdata  7    7    0
task_struct             59     72     1336  12  4  :  tunables  0  0  0  :  slabdata  6    6    0
cred_jar                111    150    160   25  1  :  tunables  0  0  0  :  slabdata  6    6    0
anon_vma_chain          362    459    80    51  1  :  tunables  0  0  0  :  slabdata  9    9    0
anon_vma                269    357    80    51  1  :  tunables  0  0  0  :  slabdata  7    7    0
pid                     64     126    96    42  1  :  tunables  0  0  0  :  slabdata  3    3    0
radix_tree_node         210    220    352   11  1  :  tunables  0  0  0  :  slabdata  20   20   0
idr_layer_cache         82     84     1112  14  4  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-8192            13     15     8320  3   8  :  tunables  0  0  0  :  slabdata  5    5    0
kmalloc-4096            649    716    4224  7   8  :  tunables  0  0  0  :  slabdata  134  134  0
kmalloc-2048            84     90     2176  15  8  :  tunables  0  0  0  :  slabdata  6    6    0
kmalloc-1024            144    168    1152  14  4  :  tunables  0  0  0  :  slabdata  12   12   0
kmalloc-512             452    480    640   12  2  :  tunables  0  0  0  :  slabdata  40   40   0
kmalloc-256             1046   1050   384   10  1  :  tunables  0  0  0  :  slabdata  105  105  0
kmalloc-128             12701  12848  256   16  1  :  tunables  0  0  0  :  slabdata  803  803  0
kmem_cache_node         113    128    128   32  1  :  tunables  0  0  0  :  slabdata  4    4    0
kmem_cache              113    128    256   16  1  :  tunables  0  0  0  :  slabdata  8    8    0

@Adorfer
Contributor

Adorfer commented Nov 15, 2017

Testing with a 1043v1 sounds a bit like trying to fix two issues with one shot, since (at least everybody seems to know for sure) the 1043v1 is unstable by design, even in CC and BB (even though that is just a hanging Wi-Fi, not high load or a reboot).

@blocktrron
Member

blocktrron commented Sep 30, 2018

As Freifunk Darmstadt has now completed the migration of its network, we now have domains with a maximum of 70 nodes per domain.

We already see the high-load problems greatly improved, if not gone completely. Stats

Surely, this is not a fix. I would even go as far as to say that there is probably no real fix. We should probably accept that those devices just do not have enough RAM to fulfil their task (and even the split is probably only a temporary improvement).

Another example of a very problematic node: Stats

@hauetaler

hauetaler commented Oct 29, 2018

Same issue here on a NanoStation M2 (XM) with a webcam connected to the second Ethernet port. Without PoE passthrough enabled, the device runs fine; with PoE passthrough activated, the error occurs. The effect was previously reproducible at any time.
[graph: poe-p]

@neocturne neocturne modified the milestones: 2018.2, 2019.1 Dec 26, 2018
@T-X
Contributor

T-X commented Jan 17, 2019

@hauetaler: Can you try whether the same issue happens with PoE passthrough disabled and a PoE injector powering the webcam instead? Does the same happen with PoE passthrough enabled but no webcam connected?

I'm wondering whether this is really an issue of PoE. Or whether this could be caused by the traffic the webcam generates instead.

Thirdly, do you have a scale for the y-axis?

@Adorfer
Contributor

Adorfer commented Jan 17, 2019

@hauetaler could you try to disable as many ebtables rules as possible for a test?

@hauetaler

I'm very sorry, but there's no PoE injector available at the moment. Since Gluon 2018.1.3, the NanoStation has been working again without any problems.

@hauetaler

@T-X OK, today the problem occurred again. PoE is disabled now, so you're right, it's not a PoE issue.
Forty minutes after connecting a Raspberry Pi to the second Ethernet port (eth1) as a Freifunk client, the load increases from 0.31 to 3.0 and higher, and memory usage increases by 15-25 percent. Disconnecting the Raspberry Pi has no effect at that point. Without a device connected to the second port, the problem doesn't occur.

@Adorfer next time I'll try to disable ebtables rules

@CodeFetch
Contributor

@hauetaler Wow, a 15-25% memory increase is really a lot. Can you please give me access to the router? My public key is ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDM8uhJ9Qin1Bnt1gVkhQEocIK+ziP4Ht0uCP1QPaTPza8hXxLrf5pizAxWpjM7Jnr3UFc/IpOMUII7B67MPlcUvlryQGESNQqGUDEoDbTww1wh79G86x4Q7xMS1q35H6E9KX0WUGMhdcHCOn4XQbIeNB6BY1NL27JgNE4I84oMhWbDdUnR36ZPCWvkm+7PKr92MacCZU/z7lBRHcW0zfCug4YuO3vOqtv1UQl3z2dsgK1VkyuDxyNXSeRufKyJJveqURzx1A5wZVQ3Qc7nIj00yx3GVsYMZH3oX6PuPiu+fu4nzvwiiWaqf/PFqa9Rfof1hQJy29Be8ggfbKZwEF4dCBGhydTF66hm729OzWry7XN49aZAmjHEe84ivDL16SjQjGWPFygMQdpQSovIT8t0vzfuNKRElhMEBAM4BxvLiWtaKFOhxXhMlK7rTmGBzouarFcR5ka1OFYD36z1rv8REEviUMv1QbFtIx1TD3HrliNt18lJE5d5AyDxadWy6Lf7WlPpVZnxydTneyE7UwtSt9vwx2zdNEOG6ygxOjY9JbiO12/kkyLeTyMq7+o0uY5oV2xo+I3aVYVS0jv3VHrTqtb/1nDWTb7Y9TTe8b0nOZOkOnnzOxWBvSms7MOh0NOA2I3ZpkIhKcWqdCvyKFfeUaita4sYKOrIwelYhyGQmQ== user@management.

@hauetaler

@CodeFetch ...done - https://hannover.freifunk.net/karte/#/de/map/68725124e2fa

@T-X
Contributor

T-X commented Jan 24, 2019

@hauetaler: Could you check whether the issue also occurs if you swap the Pi with a plain, simple switch with nothing else connected to it?

@hauetaler

@T-X Just tested it, no problems at all. It seems there must be traffic for the error to occur.

@CodeFetch
Contributor

@hauetaler Can you flash the router manually in case it gets unresponsive? I'd like to test our nightly firmware, as it uses OpenWrt, and afterwards a firmware with tracing and profiling support. I had installed vH11, but it seems you have downgraded the router to vH10 again?!

The strange thing about your router is that I freed almost 5 MB of RAM and the load bug still occurred. BTW, I moved to Freifunk Hannover a few months ago...

@hauetaler

@CodeFetch Flashing this router manually is no problem. Should I reconnect the Raspberry Pi?

@CodeFetch
Contributor

@hauetaler Sorry for my late reply. I'd like to test it on Sunday. It would be nice if you could plug in the Raspberry Pi then.

@CodeFetch
Contributor

Using the Gluon master, the load decreased from an average of 5 to 0.5 - which is still high, as the node did nearly nothing, while routers with more RAM that actively serve clients have an average load of < 0.1. Between 22:10 and 22:30 you can see what happens when I slowly fill the RAM (up to approximately 1 MB). The load went up to 3 and then the router rebooted due to an OOM.

https://stats.ffh.zone/d/000000021/router-fur-meshviewer?var-node=68725124e2fa&from=1549223160441&to=1549256313791&orgId=1

The next step for me is to watch the inodes being decompressed, to find out which files are repeatedly read and cause the high load. The 32 MB RAM routers are definitely OOM. Whether the load bug appears or not is a matter of a few hundred KB. Once it has appeared, it is hard to get rid of even if you free a lot of RAM - something I don't have an explanation for. A monitoring sketch follows below.
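To catch the moment the bug appears, a simple BusyBox-compatible watcher that logs load and free memory in lockstep may help; a sketch (run it over SSH and log on the far side, so the log itself doesn't eat the node's RAM):

# print a timestamped load/memory sample every 10 seconds
while true; do
    printf '%s load=%s free_kB=%s\n' \
        "$(date '+%H:%M:%S')" \
        "$(cut -d' ' -f1 /proc/loadavg)" \
        "$(awk '/MemFree/ {print $2}' /proc/meminfo)"
    sleep 10
done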

@hauetaler I've just flashed a firmware with SquashFS debug messages enabled. Unfortunately, the router has not been reachable since then; I suspect it generates too many messages. Sorry, but you'll need to flash it manually now :(... Please use our nightly firmware: http://build.ffh.zone/job/gluon-nightly/ws/download/images/sysupgrade/

@flobeier
Contributor

flobeier commented Feb 4, 2019

@CodeFetch thanks for your time and effort in further investigating this bug!

@hauetaler

I'll have to unbrick it first via TFTP recovery; it's impossible to flash new firmware at the moment. Hopefully the router will be back online in a few minutes.

@CodeFetch
Contributor

@hauetaler The node looks very good now: an average load of 0.1, like the 64 MB devices, and only 62% memory consumption. Did you unplug the Raspberry Pi? If not, please try to generate some traffic over LAN and then over Wi-Fi. I've built a setup at home with which I can now reproduce the load issue for further investigation. Thank you very much for your help. We will release a firmware for Freifunk Hannover based on 2018.2 after we have checked whether the 4 MB devices run as smoothly as yours, or whether the SquashFS block size needs to be reduced for them, too.

@hauetaler

@CodeFetch Load average seems to be OK at the moment, but memory consumption increases after connecting the node to the VPN mesh again.

[graph: memory consumption after reconnecting to the mesh]

@MPW1412
Contributor

MPW1412 commented Feb 10, 2019

The problem seems to be gone, or at least mitigated, for us (FF Münsterland) in 2018.2.x. Maybe even earlier; we never used 2018.1.*.

https://karte.freifunk-muensterland.de/map04/#!v:m;n:a42bb0d21ba4

I explicitly tested the wired-mesh case, in which the problem occurred very often on 2017-based Gluon versions.

@CodeFetch
Contributor

@MPW1412 With 2018.1.4 the bug is easily reproducible. With 2018.2 (the switch to OpenWrt) the router seems to reboot directly after the load begins to increase to about 3, but that only happens when I manually fill the memory. I'm happy to see that your 4 MB flash device also seems to have less memory pressure. Can you please post a dump of /proc/meminfo and /proc/slabinfo with the old and the new firmware?

@CodeFetch
Contributor

This is the thread I found on our issue:
https://lkml.org/lkml/2017/9/14/646

The question is: did the thrashing detection really improve, or do we just have more memory available due to more efficient packet handling by ath9k, different SquashFS cache sizes, etc.?
For that, we need comparable dumps of /proc/meminfo and /proc/slabinfo under 2018.1.4 and 2018.2 from an affected 4 MB flash node (see the sketch below).
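Something like the following, run on the same node shortly after boot and again once it is meshed, would make the dumps comparable across the two firmware versions; a sketch (/proc/slabinfo only exists with CONFIG_KERNEL_SLABINFO=y, and the snapshots should be copied off the node, since /tmp is RAM-backed):

# collect a labelled snapshot; run once per firmware/state of interest
D=/tmp/memdump-$(date +%s)
mkdir -p "$D"
cat /proc/meminfo   > "$D/meminfo"
cat /proc/slabinfo  > "$D/slabinfo"
cat /proc/buddyinfo > "$D/buddyinfo"
echo "snapshot stored in $D"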

I've looked through all the Linux commits I could find that relate to thrashing and memory handling. Many of them could possibly have improved the situation, but I couldn't be sure about any of them. I'd like to do some tests with torvalds/linux@b1d29ba and torvalds/linux@95f9ab2.

We should have a look at some of these commits, as they might detect the SquashFS thrashing state, or make it worse or better (these are all I found that come into question):
torvalds/linux@1899ad1
torvalds/linux@a76cf1a
torvalds/linux@172b06c
torvalds/linux@c55e8d0
torvalds/linux@2a2e488

Please help me exclude some of them. I'm not that deep into kernel page-cache handling, and some of them might be obviously irrelevant. We should find out whether the load bug was just a cosmetic issue or whether we are genuinely close to OOM.

@TomSiener

TomSiener commented Feb 11, 2019

My high-load scenario is still reproducible with 2018.2, or even with a build from master (7/2/2019), on a 4/32 MB node with some traffic on the LAN port:
[graph: load over time]

After 2-3 hours the load rises.
If someone wants to run some tests on this node, it's no problem to add their key.
The node resides in a guest LAN.
I would be glad if I can help fix this bug.

@mweinelt
Contributor Author

mweinelt commented Mar 3, 2020

Well, everything here is in the green. If anybody still sees this on v2019.1.x or newer, please speak up.

@mweinelt mweinelt closed this as completed Mar 3, 2020