System instability above specific point #753

Closed
leahoswald opened this issue May 4, 2016 · 82 comments
Labels
0. type: bug This is a bug

Comments

@leahoswald

leahoswald commented May 4, 2016

As I've reported on IRC, we at FFRN observe a higher load and frequent reboots on nearly 30% of our nodes. This happens once we reach a specific number of nodes and clients in our network. We have been debugging this issue for over a month now and think we have narrowed down the possible causes.

Let's start with our observations. The first time we became really aware of this problem was when we reached 1500 clients in our network, spread over nearly 700 nodes. This happened around the first of April this year. But after analyzing the problem we think it started even earlier, with some "random" reboots we were already investigating.

The first thing we observed was that the majority of the affected nodes are the small TL-WR841 devices. This does not mean that bigger nodes like a TL-WR1043 are not affected; the problem just doesn't have a big enough impact on them. Interestingly, not all of these small nodes are affected: only about 30% show all the characteristics of this problem. All other nodes are running without any interruptions.

On an affected node we can see the following: once we exceed roughly 3000 entries in the global translation table (TG table) the problems start, and when the number falls below this mark the problems are mostly gone. Such a node shows an increased average load of around 0.45 to 0.9, compared with 0.2 to 0.25 on an unaffected node. While we are above the mark, the load on the problematic nodes also peaks to values of 2-4, and the node sometimes reboots, every few hours or even every few minutes.
Another interesting observation is that affected nodes gain a lot more free RAM when the problems start. The RAM usage decreases from the healthy default of around 85% (on an 841) to 75%-80%.
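
For reference, these numbers can be read directly on a node; a minimal sketch using batctl and procfs (the exact grep patterns are just examples):

batctl tg | wc -l                                            # entries in the global translation table
cat /proc/loadavg                                            # load averages
grep -E '^(MemTotal|MemFree|Buffers|Cached)' /proc/meminfo   # memory situation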

On a TL-1043 it looks like this:

[RAM usage graph]

On a TL-841v9 it looks like this:
[RAM usage graph]

At no point could we see a single process causing problems or using more RAM than usual; only the load and the system CPU utilization showed that something was wrong. So we thought that the problem has to be in the kernel, or in its interaction with the RAM.

So we started to debug the problem, and first we tried to locate a pattern in our statistics to narrow down the number of possible causes. We tested a lot of other ideas, but all with nearly no effect, so the most promising lead remained the TG table. It is not the number of entries itself, though, because we can't find any limit near this number in the sources, and some other observations speak against it as well. So the problem has to be in the processing of the entries, or somewhere else.

After that we found out that something in combination with the TG table, we think the writing of the table to tmpfs, was causing new page allocations. These page allocations couldn't be satisfied by the available RAM, so parts of the page cache were dropped. This cache holds the frequently running scripts of the system, so afterwards the system has to reread them from flash. And here the first problem starts: the system rereads the disk without end. I've attached a log file for an affected and an unaffected node.
notbroken.log.txt
broken.log.txt

If this continues for a while and we then try to write our TG table again, we may run into the vm.dirty_ratio limit, which blocks the IO of all processes and makes everything even worse.
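
Whether a node is in this state can be watched with the standard procfs counters; a rough sketch (the counter names are standard, the interpretation is ours):

grep -E '^(nr_dirty|nr_writeback|pgmajfault)' /proc/vmstat   # dirty pages and major faults (rereads from flash)
grep -E '^(Dirty|Writeback):' /proc/meminfo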

To mitigate this we started tuning a number of sysctl parameters. Here is a list of all the additional options we currently set.

net.ipv6.neigh.default.gc_interval=60
net.ipv6.neigh.default.gc_stale_time=120
net.ipv6.neigh.default.gc_thresh1=64
net.ipv6.neigh.default.gc_thresh2=128
net.ipv6.neigh.default.gc_thresh3=512

net.ipv4.neigh.default.gc_interval=60
net.ipv4.neigh.default.gc_stale_time=120
net.ipv4.neigh.default.gc_thresh1=64
net.ipv4.neigh.default.gc_thresh2=128
net.ipv4.neigh.default.gc_thresh3=512

vm.min_free_kbytes=1024
vm.dirty_background_ratio=5
vm.dirty_ratio=30
vm.dirty_expire_centisecs=0

Here we save some RAM by using smaller neighbour tables; we increased the min_free_kbytes value to have a bigger reserve for allocation problems; we lowered dirty_background_ratio so the system starts writing dirty data back in the background earlier (not a problem, since we write to a ramdisk); we raised dirty_ratio to prevent a complete IO lock; and dirty_expire_centisecs=0 means data is only written back when the background ratio is reached, not after a time limit, to prevent useless writes.
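
On an OpenWrt-based node these values can be made persistent and applied like this; a sketch, assuming the usual /etc/sysctl.conf location (adjust to your image):

cat >> /etc/sysctl.conf <<'EOF'
vm.min_free_kbytes=1024
vm.dirty_background_ratio=5
vm.dirty_ratio=30
vm.dirty_expire_centisecs=0
EOF
sysctl -p /etc/sysctl.conf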

With these changes we could increase the performance; we decreased the load of affected nodes to even below the average of unaffected nodes. So maybe some of these options are also useful independent of this issue. To get even more free RAM, some people disabled haveged, which also improves stability simply because more RAM is available.
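
Disabling haveged can be done with the usual OpenWrt init script calls; a sketch, assuming the package ships the standard /etc/init.d/haveged script (note that this reduces the available entropy):

/etc/init.d/haveged stop
/etc/init.d/haveged disable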

Then we saw that the community in Hamburg has a bigger network with a lot more clients and nodes, but does not appear to be affected by the problem. So we analysed the site.conf and found out that the mesh VLAN feature we are using (Hamburg is not) causes duplicate TT entries for every node: one with VID -1 and one with VID 0. This isn't great.

Then we flashed a test node with the firmware from Hamburg. The first difference is that our firmware is based on Gluon 2016.1.3, while the one from Hamburg is based on 2015.1.2, so there are a few differences.

But back to the TG table: the TG table in Hamburg was around 3700 entries long without the problem occurring. So the cause must be something that changed between the versions. As the 2016 versions are based on OpenWrt CC and not on BB like the 2015 ones, that could be a lot of things, but we think it is not something in the OpenWrt base system; it has to be something more Freifunk-specific.
So we looked again at the process list of a node with our firmware and of one with the firmware from Hamburg.
Here we found the following differences (first value for FFRN, second for FFHH):

/sbin/procd 1408 vs 1388
/sbin/ubusd 896 vs 888
/sbin/logd -S 16 1044 vs 1036
/sbin/netifd 1568 vs 1608
/usr/sbin/sse-multiplexd 780 vs 0 (doesn't exist)
radvd 1108 vs 1104
/usr/sbin/uhttpd 1132 vs 1140
/usr/sbin/dnsmasq 1076 vs 916
/usr/bin/fastd 3316 vs 3300
odhcp6c 800 vs 812
/usr/sbin/batman-adv-visdata 784 vs 0 (doesn't exist)
/usr/sbin/dnsmasq 932 vs 924
respondd 2000 vs 2152

difference sum: 16844 vs 14728 = 2116

This means over 2MB more RAM usage with the newer firmware. But this can't be the only cause either, because some nodes do not show the problem.

Then we started thinking about what we had, and also started writing some documentation of the work for the community. Here we got the idea that the sudden decrease of RAM usage could be caused by the OOM killer. The shape of the RAM graph also shows some characteristics of a memory leak. But again, this can't be the only problem. So we thought a bit further, and now believe it is a combination of a memory leak and memory corruption causing the endless rereading of the flash storage. With all this information, we think the only service that is really close to the problem is batman-adv-visdata, so this would be the first place to dig deeper. But here we reach a limit in resources and knowledge about the system, and we hope to find someone who can help us solve this problem.
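
Whether the OOM killer really fired can be checked in the kernel log; a quick sketch (the exact message wording varies between kernel versions):

dmesg | grep -i -E 'out of memory|oom-killer|killed process'
logread | grep -i oom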

We know this is a lot of information, and probably some information is still missing. Please ask if you need anything.

You can find a German version including the discussions here: https://forum.ffrn.de/t/workaround-841n-load-neustart-problem/1167/29?u=ben

@RubenKelevra

Which version of batman-adv do you use?

@leahoswald
Author

We use the batman-adv-15 package from gluon. So it is batman-adv 2015.1

@rotanid
Member

rotanid commented May 6, 2016

Freifunk München also had problems like yours in 2015 with a similar node count, so they split their network into three segments. AFAIK they didn't do as detailed an analysis as you did, and I was told today that the problems are starting again because of the growth of the segments.

@Adorfer
Contributor

Adorfer commented May 7, 2016

Thanks for all the work.
But for me this is just another confirmation that the existing batman-adv does not scale properly for networks of more than 300 nodes.

@leahoswald
Author

leahoswald commented May 7, 2016

Hey @rotanid, thanks for this information. It makes it even more important to find the bug. The problem is that splitting the network is only a workaround for an important problem. So I think we should find and fix this bug.

@Adorfer I know your point, but please don't repeat it every time; that is not how new solutions are found. We try to find a solution for problems instead of running from one workaround to the next. We also want to experiment with the technology, hit such limitations, and find new ways to handle them. And yes, we all know about the limitations of a batman-adv layer 2 network; that is the reason why a lot of people are experimenting with new solutions. But those need time to become good and stable. So please let us discuss the problem here without your focus on small community networks. And hey, it looks like a software bug, and such bugs can be fixed.

@bitboy0

bitboy0 commented May 13, 2016

Even if the workaround helps a lot ... a real solution would be great!

@mmalte
Contributor

mmalte commented May 13, 2016

That sounds plausible to me. We've seen similar behaviour in the Regio Aachen network when we reached about 900 nodes with 3000 clients.
At a very specific time of the day the load started to increase significantly, and in the evening, as clients left the network, the load went back to a normal value.

[load graphs: load-haag, load-haag2]

We thought that the mesh table had simply become too big for the little routers. A remarkable point was that a strong offloader in front of these little devices protected them, maybe because the table got simpler.
With this in mind I tried to remove some of the mesh links in the core network, changing the connection between the four gateways from a full mesh to a ring. This seemed to help a little as well.
But a much bigger impact came from adding two more gateways and additional fastd instances on the gateways (one for incoming IPv6 and one for IPv4), resulting in a load of ~0.5 per core on the gateways.
Fast gateways with nearly no packet loss -> lower load on the mesh nodes.

This got us to around 1,100 nodes with 3,500 clients, but network performance was dropping.

(A few months ago we finally split our network into many small networks using multiple fastd instances attached to different batman devices. One firmware, multiple whitelists for fastd.)

@mmalte
Contributor

mmalte commented May 13, 2016

By the way, we are using gluon-mesh-batman-adv-14.

@T-X
Contributor

T-X commented May 13, 2016

Would it be possible to get a dump from dmesg or /proc/vmstat once the issues start to occur on a node? A /proc/slabinfo would also be great, but it seems that's not available on Gluon images by default. Finally, just to verify that it's a memory issue, the output of /sys/kernel/debug/crashlog from a node that just crashed and rebooted would be great, too.
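
For anyone collecting this, a rough sketch (only the paths named above; /tmp is just an example destination):

dmesg > /tmp/dmesg-$(date +%s).log
cp /proc/vmstat /tmp/vmstat-$(date +%s).txt
# after a crash and a non-power-cycled reboot:
cat /sys/kernel/debug/crashlog > /tmp/crashlog.txt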

@nazco: Thanks for the very thorough, analytical report! By the way, one more thing which, next to batman-adv and the IP neighbor caches, needs memory proportional to the number of clients is the bridge. Its forwarding database (fdb for short) keeps a list of MACs behind ports, too.

Speaking of the bridge, I noticed that the bridge kernel code does not use kmalloc() for its fdb entries, but kmem_cache_*() function calls. Maybe we are having similar issues as we had with the debugfs output until the fallback from kmalloc() to vmalloc() was added, namely very fragmented RAM. It could be interesting whether kmem_cache_alloc() might not just speed up memory allocation but also help to get a less fragmented RAM (if that's the issue here).

Regarding the VLANs, that indeed sounds odd. I queried ordex, the guy behind the TT and its VLAN support, on IRC. Btw, I just checked in a VM with one isolated node, and no matter whether with or without VLANs, I'm seeing a weird additional local TT entry with VID 0 which has the MAC address of bat0. Do you have VID 0 entries without the P flag? How many have VID 0 and how many VID -1 exactly?

@leahoswald
Author

Hey, thanks for the reply. I'll try to get this info for you.

@T-X
Contributor

T-X commented May 13, 2016

And one more thing which would be interesting: running wirerrd to see whether something weird happens on the network when the load is high.

Currently I'm running this for Freifunk Hamburg and Freifunk Lübeck, and it's usually one of the first places we look when something behaves oddly.

@T-X
Contributor

T-X commented May 14, 2016

Regarding the process table, I'm currently wondering about two things:

  • Isn't haveged missing?
  • Is this really a 2016.1.3 device? That version should have the C-rewrite of respondd and I'm a little astonished that it allegedly still takes about 2MB of RAM.

@leahoswald
Author

Isn't haveged missing?

The process table in my initial post only shows the diffs to the 2015.2 firmware of Hamburg.

Is this really a 2016.1.3 device?

Yes, the device is running 2016.1.3

@neocturne
Member

@T-X, I think these numbers are virtual memory (as that is the number shown by ps). The new respondd still uses about 2 MB of virtual memory, as it uses dlopen a lot (and at least uClibc will use a lot of virtual memory per dlopened object).

@nazco, if the numbers you compared are virtual memory, they are meaningless, as virtual memory is often never actually allocated. AFAIK, the VmRSS value in /proc/$PID/status is the most relevant for the actual RAM usage of a process.
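
A quick way to compare the actual resident memory per process is to read exactly that VmRSS field; a sketch, assuming busybox awk and sort are available:

for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {n=$2} /^VmRSS:/ {r=$2} END {if (r) print n, r}' "$f"
done | sort -n -k2     # prints "name rss_kB", smallest first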

I don't think any of the processes makes much difference; the most important change from Barrier Breaker to Chaos Calmer is the newer kernel. I think the new kernel might work a bit worse under memory pressure, although it's hard to tell for sure.

@T-X
Contributor

T-X commented May 15, 2016

Unfortunately, kmem_cache_alloc() isn't really documented in the kernel, so we are unsure whether it'd help in any way with this problem. From looking at other parts of the kernel, it seems common to use dedicated caches for larger numbers of objects which change frequently.

Would anyone be willing to give this patch a try to see whether it makes any difference?

https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/2016-May/015368.html

@T-X
Contributor

T-X commented May 15, 2016

Btw., regarding the fiddling with the neighbor tables in the first post: @ohrensessel and I noticed yesterday, after applying #674 and #688, that the multicast snooping takes place before any additions to the bridge forwarding database. For instance, "$ bridge fdb show" had no more entries towards bat0, while before it had one MAC entry for nearly every client in the mesh. For the IP neighbor tables it should be similar.

@ohrensessel wanted to test and observe further whether having these two patches makes any difference for the load peaks at Freifunk Hamburg in the evening. He'll probably report back later.

@bitboy0

bitboy0 commented May 19, 2016

How could that be an explanation for the strict limit of 3000 entries in the global translation table?
Below that limit nothing happens... above it, the problems are there. I'm just interested in how that can be.

@T-X
Contributor

T-X commented May 19, 2016

@bitboy0: There is no strict limit for the global translation table. There is just a limit for the local translation table of a node (= the number of clients a node can serve; ~120 with batman-adv 2013.4, 16x that much with a recent version of batman-adv / since fragmentation v2 ).

The reports so far, backed by the observation that only 32MB devices are affected, seem to point to a simple out-of-memory problem on such devices (though I'm still waiting for a /sys/kernel/debug/crashlog or dmesg output from someone to confirm this). When a device starts to run low on memory, the Linux kernel memory allocator will have more and more trouble serving requests and might even need to move some objects around to get consecutive, spare memory areas available again, resulting in high load first and at some point even a reboot.
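
Memory pressure and fragmentation of this kind can be observed on a running node; a small sketch using standard procfs files (nothing Gluon-specific):

cat /proc/buddyinfo                                     # free areas per allocation order; zeros on the right mean no large consecutive blocks left
grep -E '^(MemFree|Buffers|Cached|Slab)' /proc/meminfo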

In my x86/amd64 VMs with many kernel debugging options enabled, a global TT entry had about 200 bytes allocated. Sven has mentioned a raw size of 48 bytes on OpenWRT ar71xx, which will probably be aligned to 64 bytes. So 4000 entries times 64 bytes would result in about 250KB of RAM usage, which doesn't seem like much.

Of course, if the RAM is already filled by the kernel and userspace programs, then even a few additional hundred KB in the afternoon/evening for the batman-adv global TT, the bridge forwarding database or the IP neighbor tables might be the straw that breaks the camel's back.

@T-X
Contributor

T-X commented May 19, 2016

Regarding that hypothesis, it might further be interesting whether:

  • devices are affected immediately or only after a certain uptime
  • disabling some/any userspace service has a positive effect
    • does anyone have an affected 32MB node without fastd or haveged running?

@Nurtic-Vibe

@T-X From the uptime of my little WA701ND I can see that devices are affected immediately (<10 min after restart).
As mentioned in the workaround, some people disabled haveged, which results in better behaviour due to lower total RAM usage - but nevertheless this is only a workaround.

I added a serial connection to some nodes and am trying to capture the remaining logs before a crash.

@T-X
Contributor

T-X commented May 20, 2016

Thanks @Nurtic-Vibe! What exactly do you mean by "better behaviour"? No more out-of-memory crashes, or just less frequent ones? And what do you mean by "this is only a workaround"? If a high static memory footprint (85% was mentioned in the initial post) were the issue, then reducing that would be a valid fix, wouldn't it?

Btw., it's probably not that well known, but OpenWRT has a great feature to preserve crashlogs over a non-power-cycled reboot. After a crash & reboot you should have a new file in /sys/kernel/debug/crashlog. So it would be great if anyone, even without serial-console access, could have a look at that after a crash.

@T-X
Contributor

T-X commented May 20, 2016

And one more question @nazco: for the 85% you mentioned, what does the graph show if you run the same node just like that, but cut the uplink? What is the memory footprint without this node seeing the rest of the network?

It'd be interesting to test whether it stays relatively high even without any other mesh participants. That could back or dismiss the too-much-static-memory-usage theory.

@Nurtic-Vibe

@T-X with haveged disabled we get OOM crashes less often, but they still occur regularly.

@T-X
Contributor

T-X commented May 20, 2016

Looking at @nazco's broken.log.txt again, @NeoRaider, do you know whether the vmalloc fallback for debugfs access has made its way into Gluon yet? It seems that batman-adv-visd accesses the global translation table via debugfs first, to translate the alfred server's MAC address to an originator address, which then results in yet another debugfs originator table lookup to check the TQ and determine the best alfred server.

To inform others (@NeoRaider found this issue a while ago): without the vmalloc fallback, accessing a debugfs file requires a large consecutive memory area to be allocated. The allocation size is chosen naively: first try x bytes, and if that turns out to be insufficient while copying, double it and copy again. That could explain why a certain threshold of global TT entries might cause a jump in load.

If all that were the case, then it'd be a mixture of high static memory usage and many small, scattered allocations in the remaining memory, which together make trouble for the large, consecutive allocation needed for debugfs access.

@neocturne
Member

@T-X, the vmalloc patch is included since Gluon v2016.1.

@leahoswald
Author

@T-X one of my nodes just crashed but there is no file like /sys/kernel/debug/crashlog

@bitboy0

bitboy0 commented May 20, 2016

@T-X No, the bigger nodes have the same RAM-eating problem; because they have more RAM, they just don't care. But the bug itself is the same. With "strict limit" I don't mean that there is a visible limit like a maximal table size, but the problems occur once this specific number of entries is in the list. Maybe better to say: the bug only becomes visible once the TG table has 3000 entries.

And the problem starts immediately on all nodes in the network at the same time when the "limit" is reached. Some nodes can't even get back to work; they restart again and again, directly after each reboot.
If the number of entries in the TG table falls below 3000 again, the nodes suddenly work properly again.

@bitboy0

bitboy0 commented May 20, 2016

@T-X "better behaviour" means: the additional space gained by disabling and stopping haveged gives slightly more room for allocations. And because the sysctl changes keep the kernel from writing dirtied blocks with high priority, it can handle the lack of memory more smoothly. This doesn't stop the problem, but the kernel can cope longer before the OOM killer triggers a panic.

@T-X
Contributor

T-X commented May 20, 2016

@nazco: hm, okay, thanks. And you didn't power-cycle the device, right? Then maybe crashlog is unreliable in some OOM cases :(. Keep looking out for it though :).

Btw., you can easily check whether your OpenWRT image supports crashlog by triggering a crash through "echo c > /proc/sysrq-trigger". The device should then reboot and there should be a new file in /sys/kernel/debug/crashlog (until you reboot again or power-cycle it).

I also just tried simply doing a "dd if=/dev/urandom of=/tmp/foo.bin" and after a few seconds the NanoStationM2 with a Freifunk Hamburg image rebooted here. Then I had a nice Out-of-Memory trace in /sys/kernel/debug/crashlog.

Here's the crash before any uplink connectivity: crashlog-841-no-uplink.txt.
And here after: crashlog-841-with-uplink.txt.

Though the userspace programs do not seem to show any suspiciously high memory usage, at least at that point in time (taken between 19:00 and 20:00).

@T-X
Contributor

T-X commented May 20, 2016

Interesting: for a Freifunk Hamburg node with currently 3370 clients (batctl tg | wc -l), the byte count is currently 259407 (batctl tg | wc -c), which is very close to 2^18. Not sure whether that might still be a relevant number with the vmalloc patch for debugfs.

@T-X
Contributor

T-X commented Aug 19, 2016

@bitboy: Various changes were made during these four weeks to reduce memory usage on the kernel side, which will hopefully trickle into Gluon soon:

Until this lands in a Gluon release, @bitboy0, would it be possible for you to give a recent batman-adv/batctl/alfred master branch and #780 a try and report back your new limits?

PS: Also, I'm still a little suspicious of the new FQ-CoDel. That's one more change that came with the more recent Gluon versions. And FQ-CoDel is about queueing, which means it is about memory. Maybe it needs more memory in order to achieve its impressive performance/latency improvements. (There seems to be a /proc/sys/net/core/default_qdisc, but I'm not sure right now what its value was prior to fq_codel.)
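
What is actually in use can be checked directly on a node; a sketch (tc may need to be installed separately, and the interface names are only examples):

cat /proc/sys/net/core/default_qdisc
tc -s qdisc show dev eth0
tc -s qdisc show dev br-client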

@rotanid rotanid added in progress 2. status: rfc request for comments labels Aug 22, 2016
@rotanid rotanid added 0. type: bug This is a bug and removed 2. status: rfc request for comments labels Aug 22, 2016
@T-X
Contributor

T-X commented Aug 22, 2016

And there is still this ticket on OpenWRT: https://dev.openwrt.org/ticket/22349

Can someone with an affected device try the patch mentioned there, namely "fq_codel: add batch ability to fq_codel_drop()"?

It also looks like it is possible to play with FQ-CoDel parameters via tc (e.g. the "flows" and "limit" parameters): https://lists.openwrt.org/pipermail/openwrt-devel/2016-May/041445.html
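
For experiments, the queue can be shrunk directly on a running node; a sketch with placeholder values (the defaults are limit 10240 and flows 1024, the device name is only an example):

tc qdisc replace dev eth0 root fq_codel limit 1024 flows 128
tc -s -d qdisc show dev eth0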

@bitboy0

bitboy0 commented Aug 24, 2016

@T-X I have never compiled Gluon myself until now. I will try my very best, and thanks for the information!
If I get this done, I will of course tell you the results!

@viisauksena
Contributor

viisauksena commented Sep 20, 2016

I haven't looked much into fq_codel yet - in Freiburg we also have this issue (though with much lower numbers of nodes (330++) and clients (900++), but a complex, interconnected network of 10 supernodes).
I am now trying some suggestions from the ffrn-forum - here

which resulted in this script
or this package for gluon ...
just in case somebody else wants to play with this - I've also written to the ffrn-forum

edit: which is basically the same patch as FFRN used (couldn't find it before)
https://github.com/Freifunk-Rhein-Neckar/ffrn-packages/tree/master/ffrn-lowmem-patches

@leahoswald
Author

Hey, do you have further information about the characteristics you observed that led you to the conclusion that it could be the same bug?

@viisauksena
Contributor

viisauksena commented Sep 20, 2016

Not really, but I would love to.

The fact is that we have observed, for a long time, an increase in reboots (up to several times a day) on 841s (or similarly weak devices), while other nodes (the same weak class of device) seem unaffected, then some days later are affected, and then not again...
Everything we looked into gave us no real information (statistical data from monitoring on how many mesh participants are in the network at all, how many supernodes and nodes there are, how high the bandwidth was, or how many clients were on the specific nodes... we even tried to nail it down to specific routers from specific vendors (like some strange electricity failure) or to specific hardware revisions of the 841 - all of it, nothing).
(I left some SSH connections open and monitored dmesg and logread: nothing.)

edit: only a minor thing, but we have a test in one mesh cloud with a bigger mcast rate - there the routers reboot very often. The local router density is high.

We don't have detailed RAM usage over time, or load over time, just the observation that there is nothing out of the ordinary and some minutes later a node is rebooting again.
(We can access 100++ of our 400++ deployed routers in the network.)

We have a rather complex backbone with many bridged interfaces on the supernodes, resulting in big originator tables on the supernodes (while the nodes can be reached equally well from all bridged supernodes)... this should (so I think) have no effect on the routers, since there is nothing like that on them.
(compare, on a TP-841: # batctl o | wc -> 354 3644 40904
on a supernode: # batctl o | wc -> 408 33462 364424)

Now I want to test this on some routers around the city and see whether they reach an uptime of several days or not.
edit: I made myself a helper list (out of jq on nodes.json) which tells me which nodes of a specific type are offline and how long the others have been online. Based on this list, I conclude that mostly nodes with mesh-vpn (uplink routers) work fine, while ibss0 (meshing) routers tend to fail. It's not 100%, but very obvious.

edit2: the script does not help at all. Some of our group think it could be some issue with the network code and a bunch of unaligned memory accesses (there are plenty on these routers... apart from having a vague idea of it, this is beyond my C/assembler knowledge). Watch this number rise into the millions: cat /sys/kernel/debug/mips/unaligned_instructions

@ecsv
Contributor

ecsv commented Sep 24, 2016

To the comment from #753 (comment):

My gluon 2016.1.6-based repository now has:

  • added batman-adv 2016.4
    • not yet released
    • includes batadv netlink in alfred+batctl+batman-adv and kmem_cache for TT
  • removed batman-adv-visdata
  • converted gluon-mesh-batman-adv-core+gluon-status-page-api to batadv netlink

Interested people can just try https://github.com/FreifunkVogtland/gluon/tree/v2016.1.6-1 when they think that the memory usage caused by debugfs is the culprit behind this problem.

@ecsv
Contributor

ecsv commented Sep 29, 2016

@T-X, the fq_codel stuff can really take a lot of memory. We should check out the following patches for 2016.2 to reduce the impact with the new wifi driver:

These things were used to fix OOM problems in a test I did with Toke (using a 32MB device and 30 clients). Maybe reducing the limit for the qdisc is also a possibility worth testing, because this is the part which is already in 2016.1.x. For example, right now LEDE is using only 4Mb per qdisc:

tc -s -d qd sh dev eth1
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0

(The output was generated with the include/linux/pkt_sched.h part of https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/patch/?id=31ce6e010195d049ec3f8415e03d2951f494bf1d + https://patchwork.ozlabs.org/patch/628682/raw/ applied on iproute2.)

But OpenWrt CC doesn't yet have this memory_limit implementation because it was first introduced in 95b58430abe7 ("fq_codel: add memory limitation per queue"). So backporting the patches from LEDE (033-fq_codel-add-memory-limitation-per-queue.patch + 660-fq_codel_defaults.patch) could also be a good idea.
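
Once memory_limit is available (after such a backport, or on LEDE, and with a new enough iproute2), it can be set per qdisc; a sketch with an arbitrary example value:

tc qdisc replace dev eth1 root fq_codel memory_limit 2Mb
tc -s -d qdisc show dev eth1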

@neocturne
Member

I think it would be better not to use fq_codel as long as we're on CC; there are too many fixes that would need to be backported...

I've avoided backporting https://git.lede-project.org/?p=source.git;a=commitdiff;h=c4bfb119d85bcd5faf569f9cc83628ba19f58a1f , so fq shouldn't be effective for mac80211 anyway; I have no idea, though, whether this shortcut in fq_flow_classify() will prevent it from buffering (too many) packets.

@ecsv
Contributor

ecsv commented Oct 1, 2016

OK, I didn't check whether fq_codel was active for the mac80211 queueing with 2016.2. So we can forget the point about the wifi driver and its internal queueing.

But just for clarification: fq_codel is still used as the default queueing discipline on OpenWrt CC (and thus it is most likely also used by Gluon 2016.1.x/2016.2.x). So the patches 033-fq_codel-add-memory-limitation-per-queue.patch + 660-fq_codel_defaults.patch may still be interesting for OpenWrt CC (2016.1.x and 2016.2.x) to reduce the chance that the normal qdiscs take up too much memory.

@FFS-Roland

There are existing patches at "https://github.com/Freifunk-Rhein-Neckar/ffrn-packages/tree/master/ffrn-lowmem-patches" which help nodes with small RAM to stay stable. While discussing whether to include these patches in our FFS firmware, we are wondering why they are not part of the official Gluon code base, given the good experience at FFRN. Are there specific reasons?

@rotanid
Member

rotanid commented Nov 10, 2016

@FFS-Roland the simplest reason might be: no one created a pull request to include them.

@leahoswald
Author

Well, we developed them as a workaround for some problems we see in our (big) network, so we are not aware of the side effects these options might have in other setups. This is the main reason why we haven't created a PR so far. If you see good results in Stuttgart too, then I think we can talk about a regular PR with these patches.

@FFS-Roland

Meanwhile we have tested a patched Gluon 2016.2.1 on the WR841N and found some side effects of the sysctl modifications: nodes (not clients) cannot be reached reliably via IPv6, and the CPU load rises. Therefore we will not use the complete patch in our build, but only the haveged-related part.

@jplitza
Member

jplitza commented Feb 11, 2017

I'm surprised the neighbor table garbage collection in that patch set helps at all, because nodes shouldn't have to manage that many neighbor entries anyway. My node currently has 25 entries for IPv6 and 3 for IPv4, in a mesh with 650 nodes and 1000 clients.
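
For comparison, the current neighbor table sizes on a node can be counted like this (a sketch; stale entries are included in the count):

ip -6 neigh show | wc -l
ip -4 neigh show | wc -l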

@FFS-Roland

Our last discussions in Stuttgart resulted in not using the patch at all, because disabling haveged would limit the entropy on the nodes significantly. So we will not profit from gaining 1 MB of RAM, but with our reduced subnet sizes we expect not to run into trouble.

@rotanid
Member

rotanid commented Nov 6, 2017

@Nurtic-Vibe reported on IRC that even FFRN doesn't use the lowmem package anymore, as it doesn't help much.
It also looks like the issue is even more pressing when running Gluon v2017.1.x.

@rotanid
Member

rotanid commented Jun 5, 2018

Closing in favor of #1243, although this issue also describes problems of which some are already solved.
If you still have similar issues, please open a new issue with detailed information while running a current Gluon master branch build. The master branch has more fixes that can't be backported to older releases like v2017.1.x or v2016.2.x.

@rotanid rotanid closed this as completed Jun 5, 2018