Alternate ath9k and ath10k radio reset for deaf nodes #857

Merged
merged 3 commits into from
Jun 2, 2023

Contributor

@aanon4 aanon4 commented Jun 1, 2023

The ath9k chipset has a well-known issue where it will occasionally go deaf. The usual response, when this is detected, is to run a wifi scan, which resets the chip and resolves the problem. However, this takes quite a lot of time (20+ seconds when scanning the 5GHz band), and during this time the radio is unresponsive to normal traffic, causing dropouts for anyone sending traffic over the node and potentially causing the mesh to start reconfiguring itself because OLSR traffic is no longer sent.

While I've been unable to find a real solution to this problem (there are various proposals, but none confirmed as working), it is generally agreed that a wifi chip reset is all that's required, rather than a full wifi scan. Using the included patch, we can now do this from our monitoring daemons when problems are detected. In testing, this proved as effective and much faster.

I've also moved the ath10k monitor to use a similar approach, although I've not seen a problem with this driver in recent nightly builds (a node I have which connects across town had previously had a deaf problem with the ath10k driver).

This PR includes this patch - https://patchwork.kernel.org/project/linux-wireless/patch/20210914192515.9273-2-linus.luessing@c0d3.blue/ - which gives user-level access to resetting the ath9k chip (there is already a way to do this on the ath10k).
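
For readers who want a picture of what a user-level reset might look like, here is a minimal sketch. It is not the code from this PR. It assumes debugfs is mounted at `/sys/kernel/debug`, that the linked patch makes the ath9k `reset` debugfs file writable and that writing `1` to it triggers a chip reset, and that the ath10k `simulate_fw_crash` debugfs file accepts `hw-restart`. The function names and the `DEBUGFS_ROOT` variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not the PR's implementation.
# Assumption: debugfs is mounted here; override DEBUGFS_ROOT to test elsewhere.
DEBUGFS_ROOT="${DEBUGFS_ROOT:-/sys/kernel/debug}"

# Ask the ath9k driver to reset the chip for the given phy (e.g. phy0).
# Assumption: the linked patch is applied and accepts "1" on this file.
reset_ath9k() {
    ath9k_file="$DEBUGFS_ROOT/ieee80211/$1/ath9k/reset"
    if [ ! -w "$ath9k_file" ]; then
        echo "ath9k reset file not writable: $ath9k_file" >&2
        return 1
    fi
    echo 1 > "$ath9k_file"
}

# Ask the ath10k driver to restart the hardware for the given phy.
# Assumption: simulate_fw_crash accepts the "hw-restart" keyword.
reset_ath10k() {
    ath10k_file="$DEBUGFS_ROOT/ieee80211/$1/ath10k/simulate_fw_crash"
    if [ ! -w "$ath10k_file" ]; then
        echo "ath10k crash file not writable: $ath10k_file" >&2
        return 1
    fi
    echo hw-restart > "$ath10k_file"
}
```

A monitor that has detected a deaf radio could then call `reset_ath9k phy0` or `reset_ath10k phy0` instead of starting a scan. Both helpers return non-zero when the debugfs file is missing, so a caller can fall back to the older scan-based reset.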

Experimental test to see if we can reset a deaf radio by leaving
and rejoining the adhoc network rather than using a scan. A scan, especially
if we have to do both active and passive, essentially mutes the radio to
AREDN traffic for 10-20 seconds, which isn't good. If the radio is completely
deaf then it doesn't matter, but particularly on the 9K radios we do this when
things are looking a bit dodgy, though not deaf. This approach is much, much
quicker, but needs to be tested to see if it has the same un-deafing result.
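
The leave-and-rejoin idea in the commit message above can be sketched roughly as follows. This is an illustration only, not the commit's code. It uses the standard `iw dev <iface> ibss leave` and `iw dev <iface> ibss join <SSID> <freq>` commands; the function name, the `IW` override variable, and the example arguments are hypothetical, and a real node would need to pass its own channel width and BSSID options.

```shell
#!/bin/sh
# Hypothetical sketch: leave and rejoin an IBSS (adhoc) network.
# IW can be overridden, e.g. to point at a stub when testing.
IW="${IW:-/usr/sbin/iw}"

# Usage: rejoin_ibss <iface> <ssid> <freq-MHz> [extra "ibss join" options...]
rejoin_ibss() {
    rejoin_iface="$1"
    rejoin_ssid="$2"
    rejoin_freq="$3"
    shift 3
    # Leave first; give up if that fails so we do not join twice.
    "$IW" dev "$rejoin_iface" ibss leave || return 1
    # Rejoin with the same SSID and frequency; any remaining arguments
    # (channel width, fixed BSSID, ...) are passed through unchanged.
    "$IW" dev "$rejoin_iface" ibss join "$rejoin_ssid" "$rejoin_freq" "$@"
}
```

For example, `rejoin_ibss wlan0 "my-mesh-ssid" 2412` would issue the two `iw` calls back to back, where the interface, SSID, and frequency are placeholders.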
@Orv

Orv commented Jun 1, 2023

What monitoring daemon will be required?

@aanon4
Contributor Author

aanon4 commented Jun 1, 2023

They already exist. This just changes how they do the reset.

@Orv

Orv commented Jun 1, 2023

The nodes I currently see issues with are all ath10k radios. But I'll definitely test.

@ae6xe
Contributor

ae6xe commented Jun 1, 2023

A more complicated fix, but it should work. This used to be "/usr/sbin/iw $iface scan freq $freq passive" -- a passive scan on a single channel, possibly quicker and with less risk of further symptoms than a chip reset. Wouldn't this be a simpler way to get the job done?

@VA2XJM
Contributor

VA2XJM commented Jun 1, 2023 via email

@aanon4
Contributor Author

aanon4 commented Jun 1, 2023

A single-frequency scan is not any less risky than this, as the scan does the same chip reset, but this doesn't do anything else, so it won't tie the radio up unnecessarily.

@ae6xe
Contributor

ae6xe commented Jun 2, 2023

Got it: the reset, being the least impactful change with the lowest risk of symptoms, is the best approach. I champion moving this PR forward.

Capturing some history here. I had come to the conclusion that this behavior would not be fixed and would be perpetuated in ath9k/10k/... The logic is that it is desirable behavior for a client of an AP to become deaf to other neighbors (by raising the noise floor artificially), such that the AP's signal has a lower SNR but no longer contends with these neighbor signals when decoding the bits. In this scenario, the client is only attempting to communicate with the AP<->internet. For adhoc or IBSS networks, where we expect to communicate (OLSR) with all neighbors, zeroing in on one strong neighbor is problematic.

Note, another possible option is to turn off this "noise immunity" functionality. The limited testing I did at the time showed that using noise immunity improved AREDN long distance usage, and issues went away. Interestingly, see the DD-WRT notes, which combined with AREDN experience show results all over the board :) :

"its generally recommended to leave this disabled, only enable if you are an advanced user, are diagnosing various wireless issues, or it fixes a specific issue you were having. Especially if you have multiple Qualcomm Atheros routers connected to each other in any way, its highly recommended to have noise immunity enabled, or disabled on all routers, but not mixed. There has been some reports over the years that disabling noise immunity has helped stabilize the WLAN in terms of throughput &/or reducing dropouts, disabling noise immunity could also result in great or unchanged close range performance, but horrible or no throughput whatsoever, at medium ~ far range, so experiment with this setting. There is also some cases where enabling noise immunity gives abnormally low TX/RX rates & throughput, or noise immunity is simply too aggressive even in low noise, in this case, disable the feature."

Contributor

@ae6xe ae6xe left a comment

looks good -- see other comments.

@ae6xe ae6xe merged commit a61dfcd into aredn:main Jun 2, 2023
@mathisono

I'm most interested in what different factors lead to a node going deaf. Tim, can any properties be tracked with Prometheus telemetry to identify the scenarios that cause this? I haven't had the problem lately, but I'm always willing to break things!!!

@aanon4
Contributor Author

aanon4 commented Jun 2, 2023

The Prometheus data contains lots of information which might be relevant, such as neighbor SNR and the like, but what is specifically significant in causing the deafness I don't know (though others might).

Also, I would hope you've not seen the problem recently, as there have been mitigations to deal with the problem for many years. I think I made them more effective but also more disruptive (not great), so I'm hoping this change will improve that. But it would be good to have this up on any node which has had a history of problems so we can validate this some more.

@mathisono

The ptp link that I was running into deafness problems with has not returned! I'm working to get one end of it back up (node died).

Yes, I will work to update nodes to a FW capable of reporting Prometheus.

5 participants