Very high load accumulates on hAP lite after a few days for the past couple nightly builds #709

Closed
k6ccc opened this issue Feb 10, 2023 · 7 comments

Comments


k6ccc commented Feb 10, 2023

For the past few nightly builds, I have been noticing that my hAP lite shows high load numbers after a few days. Performance gets very bad when the numbers go up (as expected). Since I load almost every nightly, it seldom has an uptime of more than 3 or 4 days. Today it had an uptime of 5 days and the load was over 20. Performance was so bad that it took about 5 minutes to load any page over a LAN connection, and access was largely impossible remotely. I am currently on NB 2283, but I have been seeing this for the past few builds.
I was able to capture a screenshot of the status page and download a support file. I then rebooted and captured another screenshot of the status page. I have uploaded all of those along with an annotated drawing of the AREDN nodes currently at my house. On that drawing, the green links are RF and the red links are DtD. For all practical purposes, none of these nodes has any traffic except K6CCC-hAP-at-Home, which generally has about a half dozen tunnels connected (a mixture of server and client). There is usually a lot of tunnel traffic through the hAP: a couple of Mb/s up to the mid-teens of Mb/s.
supportdata-K6CCC-hAP-at-Home-202302101104.gz
Home_AREDN
hAP-at-Home_before reboot
hAP-at-Home_after reboot

Contributor

VA2XJM commented Feb 10, 2023

I've seen that exact behavior twice since the last update. I tried to track it down in the logs, but could hardly find anything helpful.

Mine is the fenced device that hosts a server, so that I can reach the services from both the mesh and the Internet. The mesh link goes up and down (whether using channel -2 or a cable). The service remains available when requests come from the Internet, but I sometimes need to force a web page reload...

The only thing I can find in the logs is the line below, which I had not seen in previous versions. Over time it shows up more and more often, and I can see the load going up and ping loss rising as well:
kern.warn kernel: [106805.197646] nf_conntrack: nf_conntrack: table full, dropping packet
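
For anyone who wants to check how close the connection-tracking table is to its limit, here is a minimal sketch. It assumes a Linux system that exposes the usual `nf_conntrack_count` and `nf_conntrack_max` entries under `/proc/sys/net/netfilter/`, and it is written in Python for illustration only (a stock node may not have Python, so the same two files can simply be read by hand). The function names are hypothetical and not part of AREDN.

```python
from pathlib import Path

# Standard Linux procfs locations for the connection-tracking counters.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Return (entries_in_use, table_limit, fraction_used)."""
    count = int(Path(count_path).read_text().strip())
    limit = int(Path(max_path).read_text().strip())
    return count, limit, count / limit


def count_table_full(log_lines):
    """Count kernel log lines that report a full conntrack table."""
    return sum("nf_conntrack: table full" in line for line in log_lines)


if __name__ == "__main__":
    try:
        used, limit, frac = conntrack_usage()
        print(f"conntrack: {used}/{limit} entries ({frac:.0%} used)")
    except OSError as exc:
        # The counters are absent when conntrack is not loaded.
        print(f"conntrack counters not readable: {exc}")
```

Sampling `conntrack_usage()` every few minutes and watching whether the count climbs steadily toward the limit would show whether entries are accumulating over the uptime of the node. `count_table_full()` can be fed saved log lines to see how the warning rate changes over time.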


KG7GDB commented Feb 13, 2023

I see from your map above that this server, "K6CCC-hAP-at-Home", is a 2.4 GHz device which is connected to other nearby 2.4 GHz devices by RF and is a tunnel provider. That is a huge SNR, meaning it is very close to other 2.4 GHz radios.
Look first at your mesh status page for connections. Are any of the DtD-connected devices on the hAP also connected to you over RF? This is a common problem with 2.4 GHz radios all on Ch -2, and if you are powering a 2.4 GHz device from port 5, it will connect back and create a local loop.

Next, look at your tunnel connections. Think about how OLSR routes to your hAP: does it go through RF or DtD?
Is your network flapping? Try the "mtr" trace route command in a Linux terminal to other parts of the mesh. It should take the same route every time with no significant delays.
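
The "same route every time" check can be sketched as below. This is only an illustration: it assumes the hop lists from repeated traces to one destination have already been collected by hand (for example, copied from several mtr runs), and the node names in the sample data are made up.

```python
from collections import Counter


def summarize_routes(traces):
    """Count how often each distinct hop sequence was observed."""
    return Counter(tuple(hops) for hops in traces)


def is_flapping(traces):
    """True when repeated traces to one destination took more than one path."""
    return len(summarize_routes(traces)) > 1


if __name__ == "__main__":
    # Hypothetical hop lists; the node names are invented for illustration.
    traces = [
        ["node-a", "node-b", "node-d"],
        ["node-a", "node-b", "node-d"],
        ["node-a", "node-c", "node-d"],
    ]
    for path, seen in summarize_routes(traces).most_common():
        print(f"{seen}x  {' -> '.join(path)}")
    print("flapping:", is_flapping(traces))
```

A single path seen on every run suggests stable routing. Several different paths to the same destination within a short period is the flapping described above.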

If you are connected to other clients or servers that are already connected to the mesh, you are probably also creating a "pass-through" route, which drives up the traffic. The entire mesh may route traffic through this hAP.

Suggestions: use the tunnel server hAP without RF. Turn off LQM for local devices and tunnel servers. It is really designed for those hilltop APs with lots of poor or distant connections.
Do not connect your server as a client to another server. Only connect to clients who are not connected to the mesh in any other way. Check this by going to each client's mesh status page. Also, ask each of your clients to enable only one tunnel server at a time.
Ideally, tunnel server to client should be point to multi-point (like a star network).

A rule of thumb is "each mesh device gets to use only one connection to the greater mesh."
There can be connections to other devices to provide redundancy. If you want a redundant tunnel, put that tunnel client on a separate device.
73, Brett

Author

k6ccc commented Feb 13, 2023

Brett, the only node that has ANY outside connection is the hAP, and that is via several tunnels. Depending on which tunnels are connected, there can be quite a bit of traffic passing through the hAP. Everything else at the house is essentially in storage for either future use or "in case I need it for some event". I keep them all powered up mainly so I can update them with almost every nightly build. The four devices on 2 GHz are all within 6 feet of each other, but the hAP, USB150, and AR750 have damn near no antenna. As for the other RF links, RocM3-1 (a Rocket M3 with dummy loads for antennas) connects to RocM3-3 over a distance of about 3 feet. RocM3-3 has a pair of coaxial dipoles for antennas. It communicates about 40 feet to my garage, where the Rocket M3 has only the feed portion of a UBNT Rocketdish that is generally pointed back towards the house. Other than when I am messing with stuff, none of those nodes has significant traffic.
The network is pretty stable and there is little or no route flapping.
The network arrangement has not appreciably changed, and load on the hAP was never a problem until a few weeks ago. Sorry, I don't remember exactly which nightly had been loaded when I noticed the problem.

Contributor

aanon4 commented Feb 13, 2023

If someone has a node in this state, can you log into it and send me the output of netstat -a? Thanks.
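
The thread does not say what the netstat output was expected to show. As one possible way to read it, the sketch below tallies TCP sockets by state from saved `netstat -a` text, which makes a pile-up in any one state easy to spot. It assumes the common column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and the function name is hypothetical.

```python
from collections import Counter


def tcp_state_counts(netstat_output):
    """Tally TCP sockets by state from `netstat -a` text output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: netstat -a > netstat.txt, then pipe the file into this script.
    for state, n in tcp_state_counts(sys.stdin.read()).most_common():
        print(f"{n:6d}  {state}")
```

Header lines, UDP rows (which normally carry no state), and UNIX-socket rows are skipped because they do not start with "tcp" or have fewer than six fields.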

Contributor

aanon4 commented Feb 18, 2023

I think this is the culprit: #719

Author

k6ccc commented Feb 18, 2023

Good bet! Thanks for finding it.

Author

k6ccc commented Mar 5, 2023

Assuming fixed, but the hAP lite is no longer being used heavily as it was replaced with a hAP ac3.

k6ccc closed this as completed Mar 5, 2023