Node goes "deaf" after a few days/weeks. #460

Closed
AJ6GZ opened this issue Jul 29, 2022 · 41 comments

Comments

@AJ6GZ

AJ6GZ commented Jul 29, 2022

On a two-month-old SXTsq2, I am seeing the node go deaf: the remote side sees it with 100% LQ / 0% NLQ and reports full-strength signal/SNR, whilst the local side shows no connections. The chart is flat-lined. A reboot or power cycle restores it, and the remote side comes back up without intervention. I've seen it 3 times now: twice it took about two weeks, and this last time only 3 days after a power cycle. 24V power is stable. The node doesn't crash or lose other functionality; it seems only the receiver quits. The support file below was taken during a period of "deafness".
supportdata-AJ6GZ-1-SXT2-202207290112.zip

We also have a Rocket M5 XW in town that has been doing the exact same thing for over a year now, again about every 1-2 weeks (ai6bx-rm5-sector-llva.local.mesh). That one is still in the broken state as of this writing.

@ab7pa
Contributor

ab7pa commented Jul 29, 2022

Ian, could you check the Channel Width setting on that node? In one place it looks like you want to use a 10 MHz channel width on channel -2, but the iw info in the support dump looks like it's using 20 MHz width. If the channel widths are mismatched, then the two nodes will not link. Let us know what you find. THX

@AJ6GZ
Author

AJ6GZ commented Jul 29, 2022

It's always been 10 MHz on this one. Interestingly, when I run that command on other nodes I also see 20 MHz everywhere, even though they are all set to 5 or 10 MHz.

@ab7pa
Contributor

ab7pa commented Jul 29, 2022

Cool. I assume your SSIDs also match on both ends. Are you running Link Quality Manager (LQM) on either node? Just making some guesses.

@AJ6GZ
Author

AJ6GZ commented Jul 30, 2022

Yeah, all standard SSIDs, no other funky configs. LQM is enabled on both, but the Rocket's problem pre-dates LQM. On that one it (correctly) reports "Currently no RF neighbor data available".

@ae6xe
Contributor

ae6xe commented Jul 31, 2022

iw info commands always show the node at 20MHz regardless. iw was not enhanced (upstream by openwrt and other groups) to recognize when the wireless chip has been put to 10MHz or 5MHz channel widths.

On the next occurrence of these symptoms, only do a wifi scan, then check if the link is back. Also, grab a support data file before and after.

@bwarden
Contributor

bwarden commented Aug 3, 2022

I've been having a similar experience on my rb-lhg-2nd-xl. Added this to trigger a new scan that usually brings it back up:
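/etc/cron.hourly/wifi-scan-if-needed

#!/bin/sh
# Trigger a passive scan when the OLSR hosts file looks too short.

# Reading via stdin makes wc print only the count, so no cut is needed.
NUM_LINES=$(wc -l < /var/run/hosts_olsr)

# Use POSIX [ ] rather than [[ ]], which plain sh does not guarantee.
if [ "$NUM_LINES" -lt 15 ]; then
        iw dev "$(uci get network.wifi.ifname)" scan passive
fi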

I'll try to capture supportdata after it happens next time.

@aanon4
Contributor

aanon4 commented Aug 3, 2022

That's a really effective but simple way to detect the problem - lovely.

[Later]

Actually, I guess this doesn't work well, as /var/run/hosts_olsr could contain lots of entries from a DtD connection even if the wifi is deaf.
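One way around the DtD problem noted above is to count the stations the radio itself reports rather than OLSR host entries. A minimal sketch, assuming `iw dev <ifname> station dump` prints one line beginning with `Station` per associated peer (its usual format); `count_stations` is an illustrative name, not part of the AREDN codebase.

```sh
#!/bin/sh
# count_stations: reads `iw dev <ifname> station dump` output on stdin
# and prints how many peers the radio currently reports.
# Hypothetical helper for illustration only.
count_stations() {
    # grep -c still prints 0 when nothing matches but exits non-zero,
    # so the exit status is normalised here.
    grep -c '^Station ' || true
}

# Usage on a node (interface lookup borrowed from the cron script above):
#   iw dev "$(uci get network.wifi.ifname)" station dump | count_stations
```

A count of zero here reflects only the RF side, so DtD or tunnel neighbors would not mask a deaf radio.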

@ae6xe
Contributor

ae6xe commented Aug 3, 2022

@AJ6GZ, can you describe the environment this node is in? Are there any single-antenna (non-MIMO) devices nearby? Are there weak-signal neighbors? We had removed code to detect and recover from this situation because no instances had been reported; at least, no one could find a log of this code recovering a node in this situation. It appears the wireless driver is still prone to this condition. We'll need to determine if the code should be put back in a release to continue to address the situation. (@aanon4, did you by chance convert this RSSI check code to Lua? Is it available?)

@aanon4
Contributor

aanon4 commented Aug 3, 2022

This code was rewritten in Lua. It hasn't been removed and is still running on every node. You'll still see the same rssi.log and rssi.dat files if you look in /tmp.

@ae6xe
Contributor

ae6xe commented Aug 3, 2022

Ok, my bad, thought we had taken that out. I must be thinking of something else. This code was intended to detect and recover. I see there is a log file /tmp/rssi.log in the support data and that it is triggering passive scans, but apparently, the node is going deaf too slowly and there's never a large enough SNR drop to trigger the passive scan.

Some possible options: 1) a bug in /usr/local/bin/mgr/rssi_monitor.lua; 2) the tolerance checks could be tighter and trigger passive scans more often; 3) the technique to determine a deaf node could be changed if something better is available.

@AJ6GZ
Author

AJ6GZ commented Aug 5, 2022

One solid neighbor (75 ft). The SXT2 is low-level, just to connect two buildings on-site. The other neighbor is down the street and is normally dropped by LQM on the SXT2, but LQM is off for now. I thought LQM might be the cause, but the node still went deaf. The other neighbors are both NSM2s. Still waiting for mine to go deaf, but the other 5 GHz Rocket in town did it again after only 4 days. It's on 3.22.6.0.

@aanon4
Contributor

aanon4 commented Aug 5, 2022

LQM doesn't affect what connects, just what's listened to.

@ae6xe
Contributor

ae6xe commented Aug 5, 2022

To elaborate on @aanon4's comment: LQM sits above the 802.11n protocols, meaning that any AREDN neighbor with the same SSID, channel width, and channel will still make an 802.11n ad-hoc connection if able. This issue of the node going deaf is down at the 802.11 physical layer (layer 1). Code in the driver considers various inputs, e.g. noise levels and decoding error rates, and attempts to raise the noise floor to screen out what it believes to be ambient noise, typically in the presence of a very strong signal. Decoding data from this strong signal can improve when the noise (in this case a weaker-signal neighbor) is screened out. This is the situation where a stronger SNR that includes the weak neighbor's signal as noise yields lower throughput than a weaker SNR without it.

In your situation, you might try lowering the power on this nearby neighbor to see if the node stops going deaf. But maybe you need this power on both nodes to connect with other distant neighbors? If so, and if cost were no object, you would put these 2 nodes that are 75' apart on 2 different frequencies, then add a P2P link between them on a 3rd frequency (4 nodes total). The coverage areas would have significantly higher throughput with this architecture.

@bwarden
Contributor

bwarden commented Aug 8, 2022

Adding my support data after my last auto-recovery in case it's useful:
supportdata.gz
Sounds like you understand the conditions pretty well though. I have a directional dish (RBLHG-2nD-XL) pointed to my next hop, and a GL-AR750 in my office for a go-box, both on channel -2.

Just throwing ideas out there, it would be interesting to be able to configure blocking the RF link if there's already a DtD or tunnel link between the two nodes, if that would help here.

@mathisono

mathisono commented Aug 10, 2022

I've seen the same problem, though not in such a regular manner. I've seen it a few times, but I'm not quite sure how to report it. It's happened to a few nodes here in the Bay Area. I was able to trigger it by saturating the link, which leads me to think it's the wireless driver.

I do have support data from the deaf nodes!

@AJ6GZ
Author

AJ6GZ commented Sep 5, 2022

Found the SXT2 deaf today. Was going to try a wifi-scan as requested above, but right after I did the support data download, the node recovered. Maybe something in the support dump triggered a reset of something?

supportdata-AJ6GZ-1-SXT2-202209050527 down state.gz
supportdata-AJ6GZ-1-SXT2-202209050530 after support download.gz

@ae6xe
Contributor

ae6xe commented Sep 5, 2022

Yes, the support download triggers a passive scan, which recovers from the deaf state.

The good news is the log file has interesting data. The code written to deal with deaf nodes was triggering these passive scans to reset the receiver in the period leading up to the node going deaf. Looking at the signal strengths before and after resetting the receive radio, we can see as much as a 20 dB change in SNR on one of the polarities. To interpret the log data (H = horizontal and V = vertical polarity):

neighbor node MAC address, [current H SNR], H rolling average SNR, H standard deviation of SNR, [current V SNR], V rolling average SNR, V standard deviation of SNR

The node is determined to be deaf if either the H or V SNR changes (between 1-minute samples) by more than 3 standard deviations + 0.5 from the rolling average.

-- Excerpt (H polarity): compare the latest reading with the rolling average.
local sdh3 = math.floor(rssih.sd_h * 3 + 0.5)
if math.abs(rssih.ave_h - info.Hrssi) > sdh3 then
    hit = hit + 1
end

"vi /usr/local/bin/mgr/rssi_monitor.lua" on the node to edit this code.

There are a couple of places in the code (V and H) like this to edit. For whatever reason, your site is going deaf too slowly for this to catch and trigger a reset. If able, you could remove the "+ 0.5" to see if this triggers more resets, such that the symptoms go away: change from 3.5 to 3.0 as the test. If it still goes deaf, change to 2.5. The goal is to find the right balance, so that the radio is kept from going deaf in your environment without triggering unnecessary resets.
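To make the threshold concrete, here is a small sketch of the same comparison applied to one entry from the log below. It assumes the Lua snippet above is quoted accurately; `is_suspect` is a hypothetical helper, not a function in rssi_monitor.lua. Read literally, the `+ 0.5` sits inside `math.floor`, so it rounds three standard deviations to the nearest whole number rather than adding half a standard deviation, and removing it would lower the limit by at most 1.

```sh
#!/bin/sh
# is_suspect CURRENT AVERAGE SD
# Succeeds (exit 0) when |AVERAGE - CURRENT| exceeds floor(SD * 3 + 0.5),
# mirroring the quoted Lua comparison. Hypothetical helper for illustration.
is_suspect() {
    awk -v cur="$1" -v ave="$2" -v sd="$3" 'BEGIN {
        limit = int(sd * 3 + 0.5)   # sd is non-negative, so int() acts as floor
        diff = ave - cur
        if (diff < 0) diff = -diff
        exit ((diff > limit) ? 0 : 1)
    }'
}

# Values from the 09/04 21:09:07 "Attenuated Suspect" log entry.
# V: limit is 13 and the difference is about 14.8, so the check fires.
is_suspect -66 -51.184541 4.246669 && echo "V reading: suspect"
# H: limit is 7 and the difference is about 0.06, so it does not.
is_suspect -42 -42.059662 2.383831 || echo "H reading: within limits"
```

Trying a different multiplier means changing the `3` in the awk expression, which is the analogue of editing the multiplier in the Lua file.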

Joe AE6XE

09/04 20:34:07: before 68:72:51:70:ac:ed [-42] [-75]
09/04 20:34:07: after 68:72:51:70:ac:ed [-41] [-49]
09/04 21:09:07: Attenuated Suspect 68:72:51:70:ac:ed [-42] -42.059662 2.383831 [-66] -51.184541 4.246669
09/04 21:09:12: before 68:72:51:70:ac:ed [-42] [-66]
09/04 21:09:12: after 68:72:51:70:ac:ed [-40] [-46]
09/04 21:13:13: Attenuated Suspect 68:72:51:70:ac:ed [-38] -42.072641 2.355417 [-75] -51.129107 4.223430
09/04 21:13:18: before 68:72:51:70:ac:ed [-38] [-75]
09/04 21:13:18: after 68:72:51:70:ac:ed [-44] [-56]
09/04 21:30:18: Attenuated Suspect 68:72:51:70:ac:ed [-49] -41.921529 2.319223 [-60] -51.383027 4.400278
09/04 21:30:23: before 68:72:51:70:ac:ed [-49] [-60]
09/04 21:30:23: after 68:72:51:70:ac:ed [-42] [-47]
09/04 21:38:23: Attenuated Suspect 68:72:51:70:ac:ed [-45] -41.990198 2.281000 [-67] -51.343611 4.400690
09/04 21:38:28: before 68:72:51:70:ac:ed [-45] [-67]
09/04 21:38:28: after 68:72:51:70:ac:ed [-47] [-53]
09/04 21:57:30: Attenuated Suspect 68:72:51:70:ac:ed [-44] -41.853118 2.376135 [-71] -50.787067 4.669417
09/04 21:57:35: before 68:72:51:70:ac:ed [-44] [-71]
09/04 21:57:35: after 68:72:51:70:ac:ed [-41] [-52]
09/04 22:25:38: Attenuated Suspect 68:72:51:70:ac:ed [-52] -41.623130 2.378647 [-55] -50.892047 4.168575
09/04 22:25:43: before 68:72:51:70:ac:ed [-52] [-55]
09/04 22:25:43: after 68:72:51:70:ac:ed [-41] [-51]
09/04 22:26:43: Attenuated Suspect 68:72:51:70:ac:ed [-49] -41.623130 2.378647 [-62] -50.892047 4.168575
09/04 22:26:49: before 68:72:51:70:ac:ed [-49] [-62]
09/04 22:26:49: after 68:72:51:70:ac:ed [-43] [-53]
09/04 22:42:50: Attenuated Suspect 68:72:51:70:ac:ed [-54] -42.173061 2.576993 [-58] -51.310279 4.171160
09/04 22:42:55: before 68:72:51:70:ac:ed [-54] [-58]
09/04 22:42:55: after 68:72:51:70:ac:ed [-45] [-68]
09/04 22:44:55: Attenuated Suspect 68:72:51:70:ac:ed [-49] -42.186617 2.557620 [-76] -51.370766 4.163146
09/04 22:45:00: before 68:72:51:70:ac:ed [-49] [-76]
09/04 22:45:00: after 68:72:51:70:ac:ed [-45] [-59]
09/04 22:49:00: Attenuated Suspect 68:72:51:70:ac:ed [-41] -42.144538 2.516227 [-71] -51.483699 4.155183
09/04 22:49:05: before 68:72:51:70:ac:ed [-41] [-71]
09/04 22:49:05: after 68:72:51:70:ac:ed [-41] [-61]
09/04 23:04:06: Attenuated Suspect 68:72:51:70:ac:ed [-53] -42.299953 2.879254 [-79] -52.126876 4.373071
09/04 23:04:11: before 68:72:51:70:ac:ed [-53] [-79]
09/04 23:04:11: after 68:72:51:70:ac:ed [-44] [-56]
09/04 23:07:11: Attenuated Suspect 68:72:51:70:ac:ed [-47] -42.307398 2.902182 [-72] -52.089963 4.308814
09/04 23:07:16: before 68:72:51:70:ac:ed [-47] [-72]
09/04 23:07:16: after 68:72:51:70:ac:ed [-44] [-57]
09/04 23:29:19: Attenuated Suspect 68:72:51:70:ac:ed [-56] -42.686944 3.426518 [-50] -52.757419 4.780162
09/04 23:29:24: before 68:72:51:70:ac:ed [-56] [-50]
09/04 23:29:24: after 68:72:51:70:ac:ed [-43] [-58]
09/04 23:56:25: Attenuated Suspect 68:72:51:70:ac:ed [-77] -43.051961 3.232237 [-70] -52.431855 5.387794
09/04 23:56:30: before 68:72:51:70:ac:ed [-77] [-70]
09/04 23:56:30: after 68:72:51:70:ac:ed [-43] [-58]
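The before/after pairs in the log above can be reduced to per-reset changes with a few lines of awk. A minimal sketch, assuming the field layout seen in this log (date, time, keyword, MAC, then two bracketed readings, H first and V second per the format described earlier); `recovery_deltas` is an illustrative name.

```sh
#!/bin/sh
# recovery_deltas: reads rssi.log-style lines on stdin and, for each
# before/after pair with the same MAC, prints how much each reading changed.
# Hypothetical helper for illustration only.
recovery_deltas() {
    awk '
        # Strip the square brackets and force a numeric value.
        function num(s) { gsub(/\[/, "", s); gsub(/\]/, "", s); return s + 0 }
        $3 == "before" { mac = $4; h = num($5); v = num($6) }
        $3 == "after" && $4 == mac {
            printf "%s %s H %+d V %+d\n", $1, $2, num($5) - h, num($6) - v
            mac = ""
        }
    '
}

# Usage on a node: recovery_deltas < /tmp/rssi.log
```

For the first pair in the log (20:34:07) this gives a change of +1 on H and +26 on V, which makes the one-sided nature of the attenuation easy to see.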

@mathisono

I'm writing this after submitting the most recent data in the post above and now reading what @ae6xe said.

@ae6xe I will report back on your edit to the code.

@aanon4
Contributor

aanon4 commented Sep 9, 2022

Noticed a deaf node today. In the rssi.log I noticed it stopped logging a week or so ago and only restarted after a wifi scan (no daemons were restarted). Usually this thing is very chatty so this seems notable.

@mathisono

That's consistent with what sky reported. A Wi-Fi scan will recover the node from the deaf state.

@aanon4
Contributor

aanon4 commented Sep 10, 2022

Well I now have a node which goes deaf at least once a day - possibly damaged in the hot weather as it was solid before. Should make testing easier at least.

@AJ6GZ
Author

AJ6GZ commented Sep 10, 2022

Updated settings to 3.0. Another waiting game...

@aanon4
Contributor

aanon4 commented Sep 10, 2022

Looking over the Lua code (and the original Perl code), I notice that if a station suddenly disappears from the station list, the stats for it will stop being updated, and this station can never trigger a wifi scan. This seems problematic. However, I'm not sure how you differentiate this from a station legitimately going away (because it was unpowered). Maybe that doesn't matter? Thoughts @ae6xe?

@aanon4
Contributor

aanon4 commented Sep 10, 2022

Could we take a simpler approach to this problem? For example, could we just run the wifi scan command every 5 minutes if there is nothing associated with the wifi? I assume nodes don't go a bit deaf? Looking at various logs people have added here, the nodes appear to lose all their associations, not just some.

@aanon4
Contributor

aanon4 commented Sep 10, 2022

I put a draft pull request up with a relatively simple implementation of the above.
#503

@ae6xe
Contributor

ae6xe commented Sep 10, 2022

The scenario where a node goes deaf, and the reason the Adaptive Noise Immunity (ANI) feature was created in the Linux wireless ath9k driver, comes from the home Wi-Fi environment, the primary target user base for Linux wireless. Think of a Linux desktop with this ath9k driver connecting to a home AP. The AP is relatively strong, and the Linux desktop is hearing all the other clients trying to connect to the AP too. So the ANI feature starts to raise the noise floor and drown out all these other clients. The Linux desktop then hears only the strong AP signal, and the workload of dealing with other clients (that it isn't directly communicating with) goes away. What remains is the one AP, still with a good signal to communicate with.

In the AREDN scenario, we are seeing this play out where there are one or more strong neighbors and one or more weaker neighbors. Some of these weaker neighbors start to drop out, but at least 1 neighbor always remains to communicate with.

Regarding the difference between a neighbor being powered off versus a weak signal coming in and out: once the receiver is reset and the noise floor is back to about -95 dBm, all the weak signals, if they can be received, are detected again and communication is reestablished as far as the environment allows. Starting the statistics over shouldn't be a problem. The statistics are a rolling weighted average, so the few most recent signal strength measurements have high weighting. The issue is more about avoiding being trigger-happy. Lowering this threshold from 3.5 to 3.0, 2.5, and so on will trigger receiver resets more often. The lower the value, the more triggers occur. Many nodes will never have any triggers.
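The thread does not show how the rolling weighted average is computed. Purely as an illustration of why recent samples dominate such a statistic, here is an exponentially weighted mean and standard deviation; the smoothing factor `alpha` and the function name `ewma` are assumptions, not values taken from rssi_monitor.lua.

```sh
#!/bin/sh
# ewma ALPHA: reads one sample per line on stdin and prints the final
# exponentially weighted mean and standard deviation.
# Hypothetical helper; the real monitor may weight samples differently.
ewma() {
    awk -v alpha="$1" '
        NR == 1 { mean = $1; var = 0; next }   # seed with the first sample
        {
            diff = $1 - mean
            incr = alpha * diff
            mean += incr                        # pull the mean toward the new sample
            var = (1 - alpha) * (var + diff * incr)
        }
        END { if (NR > 0) printf "%.2f %.2f\n", mean, sqrt(var) }
    '
}

# Example: a steady -42 followed by a single -60 sample.
printf '%s\n' -42 -42 -42 -60 | ewma 0.25
```

With a scheme like this, one outlying sample immediately widens the standard deviation, which in turn widens the limit used by a "3 standard deviations" check on the next sample.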

We're trying to avoid a race condition with other node functionality. Mesh status or wifi scan could come back with no data. We're just not sure how much of a concern this is, and given the complexity of the current approach, we're taking the conservative view that it may be one.

@aanon4
Contributor

aanon4 commented Sep 10, 2022

I'm not saying the above isn't an issue, but speaking for myself, when a node goes deaf, it doesn't retain any connections to other nodes; it hears nothing. I've looked at the various support tool data dumps people have added here and it looks similar for them, but it would be good to get some confirmations @mathisono @AJ6GZ @bwarden

@ae6xe
Contributor

ae6xe commented Sep 10, 2022

If all the neighbors drop out, then possibly something else is going on. Could there be another signal at the 802.11n level, that has a different SSID on the same channel and channel width?

The feature in the driver that causes a node to go deaf to some neighbors would be dead-on-arrival if the node became deaf to all signals. The original purpose and intent was to improve throughput (primary use case on a client connecting to an AP). Here is one of the few and better write ups of this ANI Feature:

https://wiki.freebsd.org/dev/ath_hal%284%29/AutomaticNoiseImmunity

ANI is tuning various receiver settings to maximize receiving data on a signal. For it to tune that signal out altogether, would be the opposite of what it is tracking to do.

@aanon4
Contributor

aanon4 commented Sep 10, 2022

Here's an alternate approach which just adds a scan on the transition from hearing something to hearing nothing - #505
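The idea of scanning only on the transition from hearing something to hearing nothing can be sketched as a tiny state check. This is not the code from #505; `went_silent`, the state-file path, and the polling arrangement are assumptions for illustration.

```sh
#!/bin/sh
# went_silent STATE_FILE CURRENT_COUNT
# Records the current station count and succeeds (exit 0) only when the
# previous count was above zero and the current one is zero.
# Hypothetical helper for illustration only.
went_silent() {
    state_file=$1
    cur=$2
    prev=0
    if [ -f "$state_file" ]; then
        prev=$(cat "$state_file")
    fi
    echo "$cur" > "$state_file"
    [ "$prev" -gt 0 ] && [ "$cur" -eq 0 ]
}

# Usage sketch, run periodically on a node (paths are hypothetical):
#   n=$(iw dev wlan0 station dump | grep -c '^Station ')
#   went_silent /tmp/wifi_station_count "$n" && iw dev wlan0 scan passive
```

Because the check fires only once per transition, a node whose neighbors are genuinely powered off is scanned a single time rather than on every poll.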

@AJ6GZ
Author

AJ6GZ commented Sep 10, 2022

The node I connect to (my dish to its sector) about 5 miles away always drops all 3 clients when it goes deaf. All are about the same distance/quality from the sector. Rebooting my side or holding it off for some time doesn't clear the remote side.

3 clients LQM on the remote node:
ai6bx-6-nb-m5-pv RF 23/22 4.5 miles 97% active
aj6gz-1-pbm5-west RF 33/36 4.4 miles 100% active
km6ajp-ns-m5-xw RF 22/22 5.5 miles 96% active

@aanon4
Contributor

aanon4 commented Sep 10, 2022

Maybe #505 will get us where we need to be to fix this.

@mathisono

I'm waiting .... on the network now!

It was nice to read about "Known issues" with ANI!

I was certain that I can trigger the node to go deaf with a high traffic load over a link. I'm wondering whether ANI is reacting disproportionately to the error-rate readings. ANI reacts (I don't completely know how), but it's clearly tracking an acceptable error level. As said before, ANI decides to "raise the noise floor and drown out all these other clients" based on link errors. It's counterintuitive, but the ratio of errors to data would need drastically different acceptable limits for low, mid, and high data rates. High data rates need specific signal-to-noise values to support high throughput, and if I'm right in my assumption, ANI would work against that need. Losing more data at a higher data rate is acceptable, compared to a weak signal that needs to search with adjusted AGC settings to reduce the error rate. Is it then reasonable to raise the noise floor to drown out background noise?

@AJ6GZ
Author

AJ6GZ commented Sep 20, 2022

Deaf report: after 5 days on Build 1724. Support attached. Running support download restored receiver as expected.

supportdata-AJ6GZ-1-SXT2-202209200443.gz

@aanon4
Contributor

aanon4 commented Sep 20, 2022

Thanks for this. So the new code correctly detected the deaf node ... but then didn't manage to fix the issue :-( So I will look at that part again.

@aanon4
Contributor

aanon4 commented Sep 20, 2022

#508

@aanon4
Contributor

aanon4 commented Sep 20, 2022

Trying a more generic reset like what the supportool dump does.

@bwarden
Contributor

bwarden commented Sep 26, 2022

FWIW, I just had another incident on my node, and I successfully resolved it with just a passive scan: iw wlan0 scan passive, no frequency specified.
supportdata-N7BTW-QTH-A-202209261341.gz

@mathisono

I need to test the fix on a wider group of nodes. I'm going to use the nightly build to test the "fix" that's been committed.

@aanon4
Contributor

aanon4 commented Oct 18, 2022

If anyone is still seeing this issue in the current nightly, please reopen this.

aanon4 closed this as completed Oct 18, 2022

@mathisono

1819-d581b99 loaded. I will monitor and reopen this if the problem arises.
