Unstable ath9k WLAN #605

AKA-47 · 2015-12-22T14:20:21Z

The problem is currently being discussed here in the forum:

https://forum.freifunk.net/t/wifi-mesh-probleme-im-2012-2-gluon/9813/14

The setup:

Several 841v10 running 2012.2 in a WiFi mesh
tested with self-built firmware and with experimental builds from another site (i.e. not built by me)
additionally tested with a 1043v2
NO 802.11s!
Only ever one node with VPN

The problem:

As soon as I have 2 routers running 2012.2 in the mesh, the WiFi mesh degrades to the point that hardly any packets get through.

The following tests were run to narrow the problem down:

Own firmware:
- 2-node mesh with 2012.1 on 1043 and 2012.2 on 841v10 --> OK
- 2-node mesh with 2012.2 on 841v10 --> problem occurs
- 2-node mesh with 2012.2 on 1043 and 2012.2 on 841v10 --> problem occurs
- 2-node mesh with 2012.2 on 1043 --> problem occurs
- 3-node mesh with 2012.1 on 1043 and 2012.2 on TWO 841v10 --> problem occurs

Other firmware:
- 2-node mesh with 2012.1 on 1043 and 2012.2 on 841v10 --> OK
- 2-node mesh with 2012.2 on 841v10 --> problem occurs
- 2-node mesh with 2012.2 on 1043 and 2012.2 on 841v10 --> problem occurs
- 2-node mesh with 2012.2 on 1043 --> problem occurs
- 3-node mesh with 2012.1 on 1043 and 2012.2 on TWO 841v10 --> problem occurs

Conclusion: identical test results regardless of the hardware! So I can rule out both the 841v10 and my self-built firmware as the source of the error.
I even switched WiFi channels as a test, but that did not help either.

To be safe, a factory image was ALWAYS flashed via TFTP recovery to make sure the boxes were in a completely fresh state.

Could a developer perhaps comment on this?


neocturne · 2015-12-22T15:45:43Z

Is a direct ping via the link-local address on the ibss0 interface still possible? Please test both unicast (fe80::......%ibss0) and multicast (ff02::1%ibss0).
What does iw dev ibss0 station dump look like? Bitrate? Signal level?
What TQ does batman-adv measure to the neighbouring nodes (batctl o)?

We have not noticed any such problems so far.

AKA-47 · 2015-12-22T15:59:33Z

ping6 fe80::60e6:28ff:febe:18c4
PING fe80::60e6:28ff:febe:18c4 (fe80::60e6:28ff:febe:18c4): 56 data bytes
64 bytes from fe80::60e6:28ff:febe:18c4: seq=0 ttl=64 time=0.573 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=1 ttl=64 time=0.419 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=2 ttl=64 time=0.401 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=3 ttl=64 time=0.480 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=4 ttl=64 time=0.494 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=5 ttl=64 time=0.394 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=6 ttl=64 time=0.397 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=7 ttl=64 time=0.401 ms
64 bytes from fe80::60e6:28ff:febe:18c4: seq=8 ttl=64 time=0.477 ms
^C
--- fe80::60e6:28ff:febe:18c4 ping statistics ---
9 packets transmitted, 9 packets received, 0% packet loss
round-trip min/avg/max = 0.394/0.448/0.573 ms

ping6 ff02::1
PING ff02::1 (ff02::1): 56 data bytes
64 bytes from fe80::62e3:27ff:febe:18c4: seq=0 ttl=64 time=1.045 ms
64 bytes from fe80::62e3:27ff:febe:18c4: seq=1 ttl=64 time=0.664 ms
64 bytes from fe80::62e3:27ff:febe:18c4: seq=2 ttl=64 time=0.643 ms
64 bytes from fe80::62e3:27ff:febe:18c4: seq=3 ttl=64 time=0.722 ms
64 bytes from fe80::62e3:27ff:febe:18c4: seq=4 ttl=64 time=0.581 ms
--> all fine

Node2:
iw dev ibss0 station dump
Station 62:e6:28:c6:d7:ee (on ibss0)
inactive time: 0 ms
rx bytes: 3504727
rx packets: 28048
tx bytes: 70611
tx packets: 795
tx retries: 104
tx failed: 0
signal: -38 [-39, -44] dBm
signal avg: -37 [-39, -43] dBm
tx bitrate: 130.0 MBit/s MCS 15
rx bitrate: 117.0 MBit/s MCS 14
expected throughput: 4.503Mbps
authorized: yes
authenticated: yes
preamble: long
WMM/WME: yes
MFP: no
TDLS peer: no

Node1:
Station 62:e6:28:be:18:c4 (on ibss0)
inactive time: 30 ms
rx bytes: 232907589
rx packets: 1909673
tx bytes: 4257570
tx packets: 25132
tx retries: 8100
tx failed: 47
signal: -37 [-41, -40] dBm
signal avg: -37 [-40, -40] dBm
tx bitrate: 65.0 MBit/s MCS 7
rx bitrate: 1.0 MBit/s
expected throughput: 3.194Mbps
authorized: yes
authenticated: yes
preamble: long
WMM/WME: yes
MFP: no
TDLS peer: no

batctl o | grep ibss
62:e6:28:be:18:c4 0.170s (201) 62:e6:28:be:18:c4 [ ibss0]: 62:e6:28:be:18:c4 (201)

I had already checked all of that, and I could even reproduce the problem at different locations... I'm a bit at a loss.
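For anyone comparing dumps like the two above, here is a minimal Python sketch that turns `iw dev ibss0 station dump` output into dictionaries. It assumes the line layout pasted in this thread (a `Station <mac> (on <iface>)` header followed by `key: value` lines); `parse_station_dump` is a hypothetical helper name, not part of iw or Gluon, and the sample is an abbreviated copy of the Node1 output.

```python
import re

def parse_station_dump(text):
    """Parse `iw dev <iface> station dump` output into a list of dicts."""
    stations = []
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Station ([0-9a-f:]{17}) \(on (\S+)\)", line)
        if header:
            # A new station entry starts here.
            stations.append({"mac": header.group(1), "iface": header.group(2)})
        elif ":" in line and stations:
            # Split at the first colon only, values may contain more text.
            key, _, value = line.partition(":")
            stations[-1][key.strip()] = value.strip()
    return stations

# Abbreviated Node1 output from this thread.
sample = """Station 62:e6:28:be:18:c4 (on ibss0)
tx retries: 8100
tx failed: 47
tx bitrate: 65.0 MBit/s MCS 7
rx bitrate: 1.0 MBit/s
"""

stations = parse_station_dump(sample)
for st in stations:
    print(st["mac"], "tx:", st["tx bitrate"], "rx:", st["rx bitrate"])
```

On the Node1 dump this makes the asymmetry easy to spot: tx is at MCS 7 while the rx bitrate is reported as 1.0 MBit/s.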

AKA-47 · 2015-12-22T16:12:18Z

Oh, right... and the other side:
Node2:
batctl o | grep ibss | grep "62:e6:28:c6:d7:ee "
62:e6:28:c6:d7:ee 0.450s (209) 62:e6:28:c6:d7:ee [ ibss0]: 62:e6:28:c6:d7:ee (209)
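The TQ values in parentheses are on batman-adv's 0 to 255 scale, so 201 and 209 correspond to roughly 79% and 82%. A small sketch of that conversion, assuming the `batctl o` line format pasted in this thread (other batctl versions may print a different layout); `parse_originator` is a made-up helper name.

```python
import re

TQ_MAX = 255  # batman-adv reports TQ on a 0..255 scale

ORIG_LINE = re.compile(
    r"^\s*([0-9a-f:]{17})\s+([\d.]+)s\s+\(\s*(\d+)\)\s+"
    r"([0-9a-f:]{17})\s+\[\s*([^\]\s]+)\]"
)

def parse_originator(line):
    """Parse one `batctl o` line; return None if it does not match."""
    m = ORIG_LINE.match(line)
    if not m:
        return None
    tq = int(m.group(3))
    return {
        "originator": m.group(1),
        "last_seen_s": float(m.group(2)),
        "tq": tq,
        "tq_percent": round(100 * tq / TQ_MAX, 1),
        "next_hop": m.group(4),
        "iface": m.group(5),
    }

node1 = parse_originator(
    "62:e6:28:be:18:c4 0.170s (201) 62:e6:28:be:18:c4 [ ibss0]: 62:e6:28:be:18:c4 (201)"
)
node2 = parse_originator(
    "62:e6:28:c6:d7:ee 0.450s (209) 62:e6:28:c6:d7:ee [ ibss0]: 62:e6:28:c6:d7:ee (209)"
)
print(node1["tq_percent"], node2["tq_percent"])
```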

neocturne · 2016-02-28T04:06:38Z

The Gluon master just got an updated mac80211 backport, which was reported to fix some ath9k issues.

AKA-47 · 2016-03-02T16:06:20Z

Very good! Now everything works as it should! Many thanks.

AKA-47 · 2016-03-11T11:26:03Z

When will this commit make it into stable?

jplitza · 2016-03-11T12:12:49Z

With 2016.2. I think this change is too big for a 2016.1.x release. And 2016.2 will probably still take a while (see milestone).

neocturne · 2016-03-24T21:48:04Z

I'm considering doing a larger backport after v2016.1.3 and releasing it as v2016.1.4 in 1-2 weeks to fix problems like this.

A-Kasper · 2016-03-25T01:22:41Z

On 24.03.2016 at 22:48, Matthias Schiffer wrote:

I'm considering doing a larger backport after v2016.1.3
and releasing it as v2016.1.4 in 1-2 weeks to fix problems like this.

I would find that very, very good and important. I had even considered
going back to the 2015 releases because the wireless side has gotten so
drastically worse. This is a very important piece of functionality.

neocturne · 2016-05-16T12:45:54Z

I have added a new branch v2016.1.x-mac80211-test, which contains a further update of the WLAN drivers. It is based on v2016.1.x so that switching between releases and the test branch is straightforward.

It would be very helpful if the new version were tested on as many routers with WLAN problems as possible.

neocturne · 2016-05-20T12:02:49Z

v2016.1.x-mac80211-test has just received another update, please test.

rotanid · 2016-05-21T14:43:06Z

Rolled the update out to a few nodes last night; one of them already has the bug again :-(
Small sample size, but so far I would say it is worse than without the test branch.

oszilloskop · 2016-05-23T08:47:40Z

I have been running v2016.1.x-mac80211-test for 4 days now. The problem still occurs.
For me it looks as if v2016.1.4 and v2016.1.x-mac80211-test are about equal in how often the problem occurs.

neocturne · 2016-06-07T11:50:28Z

v2016.1.x-mac80211-test now contains a new patch, please test.

neocturne · 2016-06-07T11:52:25Z

Uhm... the new patch doesn't compile yet, one moment...

neocturne · 2016-06-07T12:57:28Z

OK, v2016.1.x-mac80211-test now contains 2 commits on top of v2016.1.x: another mac80211 update and a patch by nbd.

Please test only the update first (i.e. the second-newest commit). If that makes no difference, then test once with the newest commit.

neocturne · 2016-06-12T19:40:47Z

Is there any news here?

oszilloskop · 2016-06-12T19:44:41Z

Yes, I have just finished my tests of d31c1c9.
Unfortunately the problem still occurs.

I will install 548cf1d on some routers later today.

oszilloskop · 2016-06-14T12:04:57Z

After two days of testing 548cf1d, things look clearly better for now than with d31c1c9. Instead of failures several times a day, I have had only two failures in total so far. I don't yet know for sure whether those two failures were caused by the problem. Sometimes quick fingers are at the routers' power switches, and that affects my script for catching the failure.

I tested on meshing CPE210 v1.0 and WR841 v8, v9 and v10, with heavy data load on the WiFi mesh and client networks.

I will keep observing...

oszilloskop · 2016-06-14T18:28:34Z

Hm, it looks like only the frequency has decreased.
I now still have 2 nodes that have a/the problem.

Let's call one problem node KPUTT.
KPUTT has the MAC ea:97:f7:a1:bf:2c

Chekov is a node that connects KPUTT to the cloud via WiFi mesh.

The command iw dev ibss0 station dump on Chekov produces the output listed below.

What stands out:

The name of KPUTT cannot be resolved
The TX bitrate of KPUTT is conspicuously stuck at 1.0 MBit/s four times
KPUTT (ea:97:f7:a1:bf:2c) appears 5 times
On our map, the TX TQ from KPUTT to Chekov is 2-4%
On our map, the TX TQ from Chekov to KPUTT is no longer listed

Output:

root@Chekov:~# iw dev ibss0 station dump
Station ea:97:f7:a1:bf:2c (on ibss0)
    inactive time:  6800 ms
    rx bytes:   782524109
    rx packets: 5289540
    tx bytes:   0
    tx packets: 0
    tx retries: 362968
    tx failed:  7023
    signal:     -53 [-55, -57] dBm
    signal avg: -53 [-55, -58] dBm
    tx bitrate: 1.0 MBit/s
    rx bitrate: 104.0 MBit/s MCS 13
    expected throughput:    43.303Mbps
    authorized: yes
    authenticated:  yes
    preamble:   long
    WMM/WME:    yes
    MFP:        no
    TDLS peer:  no
    connected time: 23890 seconds
Station ea:97:f7:a1:bf:2c (on ibss0)
    inactive time:  6800 ms
    rx bytes:   782486698
    rx packets: 5289265
    tx bytes:   0
    tx packets: 0
    tx retries: 362968
    tx failed:  7023
    signal:     -53 [-55, -57] dBm
    signal avg: -53 [-55, -58] dBm
    tx bitrate: 1.0 MBit/s
    rx bitrate: 104.0 MBit/s MCS 13
    expected throughput:    4.119Mbps
    authorized: yes
    authenticated:  yes
    preamble:   long
    WMM/WME:    no
    MFP:        no
    TDLS peer:  no
    connected time: 23889 seconds
Station ea:97:f7:a1:bf:2c (on ibss0)
    inactive time:  6800 ms
    rx bytes:   782487085
    rx packets: 5289266
    tx bytes:   0
    tx packets: 0
    tx retries: 362968
    tx failed:  7023
    signal:     -53 [-55, -57] dBm
    signal avg: -53 [-55, -58] dBm
    tx bitrate: 1.0 MBit/s
    rx bitrate: 104.0 MBit/s MCS 13
    expected throughput:    4.119Mbps
    authorized: yes
    authenticated:  yes
    preamble:   long
    WMM/WME:    no
    MFP:        no
    TDLS peer:  no
    connected time: 23889 seconds
Station ea:97:f7:a1:bf:2c (on ibss0)
    inactive time:  6800 ms
    rx bytes:   782487085
    rx packets: 5289266
    tx bytes:   0
    tx packets: 0
    tx retries: 362968
    tx failed:  7023
    signal:     -53 [-55, -57] dBm
    signal avg: -53 [-55, -58] dBm
    tx bitrate: 1.0 MBit/s
    rx bitrate: 104.0 MBit/s MCS 13
    expected throughput:    4.119Mbps
    authorized: yes
    authenticated:  yes
    preamble:   long
    WMM/WME:    no
    MFP:        no
    TDLS peer:  no
    connected time: 23889 seconds
Station ea:97:f7:a1:bf:2c (on ibss0)
    inactive time:  30 ms
    rx bytes:   804896639
    rx packets: 5668484
    tx bytes:   2285864334
    tx packets: 2674955
    tx retries: 362968
    tx failed:  7023
    signal:     -54 [-56, -58] dBm
    signal avg: -53 [-55, -57] dBm
    tx bitrate: 117.0 MBit/s MCS 14
    rx bitrate: 104.0 MBit/s MCS 13
    expected throughput:    43.303Mbps
    authorized: yes
    authenticated:  yes
    preamble:   long
    WMM/WME:    yes
    MFP:        no
    TDLS peer:  no
    connected time: 23889 seconds

EDIT:
Output of batctl o | grep ea:97:f7:a1:bf:2c on Chekov

ea:97:f7:a1:bf:2c    4.780s   (  4) ea:97:f7:a1:bf:2c [     ibss0]: ea:97:f7:a1:bf:2c (  4)
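The two oddities in this dump (the same MAC listed several times, and most of those entries sitting at 1.0 MBit/s tx) can be counted mechanically. A rough sketch, assuming the `iw` output layout shown above; `summarize_stations` is a hypothetical helper, and the sample only reproduces the `Station` and `tx bitrate` lines of the five entries. Note that 1.0 MBit/s is also what a freshly associated station shows before any traffic, so a single such entry proves nothing by itself.

```python
import re
from collections import Counter

def summarize_stations(dump_text):
    """Count dump entries per MAC and how many of them report 1.0 MBit/s tx."""
    entries = Counter()
    at_1mbit = Counter()
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        header = re.match(r"Station ([0-9a-f:]{17}) ", line)
        if header:
            current = header.group(1)
            entries[current] += 1
        elif current and line.startswith("tx bitrate:"):
            rate = line.split(":", 1)[1].strip()
            if rate == "1.0 MBit/s":
                at_1mbit[current] += 1
    return {
        mac: {"entries": count, "tx_at_1mbit": at_1mbit[mac]}
        for mac, count in entries.items()
    }

# Reduced version of the Chekov dump: four stale entries, one live entry.
stale = "Station ea:97:f7:a1:bf:2c (on ibss0)\n    tx bitrate: 1.0 MBit/s"
live = "Station ea:97:f7:a1:bf:2c (on ibss0)\n    tx bitrate: 117.0 MBit/s MCS 14"
summary = summarize_stations("\n".join([stale] * 4 + [live]))
print(summary)
```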

oszilloskop · 2016-06-14T18:43:14Z

Me again.
After I issued the wifi command on Chekov (!), the output of
root@Chekov:~# iw dev ibss0 station dump looks as follows:

Station ea:97:f7:a1:bf:2c (on ibss0)
    inactive time:  40 ms
    rx bytes:   157287
    rx packets: 2259
    tx bytes:   345672
    tx packets: 4342
    tx retries: 1050
    tx failed:  54
    signal:     -54 [-55, -58] dBm
    signal avg: -53 [-55, -58] dBm
    tx bitrate: 104.0 MBit/s MCS 13
    rx bitrate: 104.0 MBit/s MCS 13
    expected throughput:    40.283Mbps
    authorized: yes
    authenticated:  yes
    preamble:   long
    WMM/WME:    yes
    MFP:        no
    TDLS peer:  no
    connected time: 143 seconds

batctl o | grep ea:97:f7:a1:bf:2c gives the following:
ea:97:f7:a1:bf:2c 2.330s ( 7) ea:97:f7:a1:bf:2c [ ibss0]: ea:97:f7:a1:bf:2c ( 7)

The whole thing once more (wifi followed by iw), a second time:

root@Chekov:~# iw dev ibss0 station dump
Station ea:97:f7:a1:bf:2c (on ibss0)
    inactive time:  200 ms
    rx bytes:   10066
    rx packets: 145
    tx bytes:   0
    tx packets: 0
    tx retries: 0
    tx failed:  0
    signal:     -56 [-58, -61] dBm
    signal avg: -55 [-57, -60] dBm
    tx bitrate: 1.0 MBit/s
    rx bitrate: 1.0 MBit/s
    authorized: yes
    authenticated:  yes
    preamble:   long
    WMM/WME:    yes
    MFP:        no
    TDLS peer:  no
    connected time: 18 seconds

and batctl o
ea:97:f7:a1:bf:2c 5.410s ( 12) ea:97:f7:a1:bf:2c [ ibss0]: ea:97:f7:a1:bf:2c ( 12)

KPUTT is still unreachable.

Hm, is this a problem of one router, or of the interplay between both/several routers?

neocturne · 2016-06-14T19:57:54Z

OK, a complete solution to all problems would have been too good to be true. 2 further tests would be helpful:

In 548cf1d there is the file openwrt/package/kernel/mac80211/patches/990-test.patch; it contains 2 independent patch chunks. These two chunks should each be tested independently of the other to find out which one affects the problem. After changing the patch file, simply run make package/mac80211/clean once so that it gets rebuilt.
So far I had narrowed the cause of the problem down to a version between compat-wireless 2015-03-09 and 2015-07-21, since bug reports increased after the update to 2015-07-21. What hardware are your two problem nodes? If the hardware was already supported by 6e79982, a comparison with that would also be helpful to find out whether the problem also occurs under 2015-03-09.
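If "chunks" here means the `@@` hunks of a single-file unified diff (an assumption; the actual layout of 990-test.patch is not shown in this thread), the split for the two separate test builds could be sketched like this. `split_hunks` is a hypothetical helper and the sample patch content is invented purely for illustration.

```python
def split_hunks(patch_text):
    """Return one standalone patch per @@ hunk, each with the file header.

    Assumes a unified diff that touches a single file, so everything
    before the first @@ line is treated as the shared header.
    """
    lines = patch_text.splitlines(keepends=True)
    starts = [i for i, line in enumerate(lines) if line.startswith("@@")]
    if not starts:
        return []
    header = lines[:starts[0]]
    bounds = starts + [len(lines)]
    return ["".join(header + lines[a:b]) for a, b in zip(bounds, bounds[1:])]

# Invented two-hunk patch, not the real 990-test.patch.
sample_patch = """--- a/example.c
+++ b/example.c
@@ -1,2 +1,2 @@
 keep
-old one
+new one
@@ -10,2 +10,2 @@
 keep
-old two
+new two
"""

parts = split_hunks(sample_patch)
for number, part in enumerate(parts, start=1):
    print(f"--- variant {number} ---")
    print(part, end="")
```

Each variant would then replace the patch file for one build, followed by the make package/mac80211/clean step mentioned above.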

neocturne · 2016-10-11T01:42:02Z

I have committed a mac80211 downgrade to Gluon master (487922a). Please test; if it works, it will go into v2016.2.x.

kpanic23 · 2016-10-14T14:54:44Z

So I rolled out a firmware at this state to a few nodes. One of the mesh nodes dropped out yesterday but was strangely back online today; it must have somehow restarted by itself. Uptime is now 18 hours, so it must have happened around 23:00 yesterday.
Today another node hung, but miraculously reappeared a little while ago.
I connected to it right away and pulled the log. Strangely, there is absolutely nothing in the logread log for the period in question. dmesg shows:
[18422.850000] ath: phy0: Timeout while waiting for nf to load: AR_PHY_AGC_CONTROL=0xd0dda

Logread: http://pastebin.com/pNyuKxzc

Map link: https://map.freifunk-3laendereck.net/#!v:m;n:14cc2070948a
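Since logread was empty and only dmesg had the line, a small filter for `ath:` kernel messages may help when collecting reports from several nodes. This is only a sketch: `ath_messages` is a made-up helper, and it assumes the `[seconds] ath: phyN: message` layout of the line quoted above.

```python
import re

ATH_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+ath: (phy\d+): (.*)$")

def ath_messages(dmesg_text):
    """Return (seconds_since_boot, phy, message) for each ath driver line."""
    found = []
    for line in dmesg_text.splitlines():
        m = ATH_LINE.match(line.strip())
        if m:
            found.append((float(m.group(1)), m.group(2), m.group(3)))
    return found

# The first line is the one from this report; the second is unrelated noise.
sample = """[18422.850000] ath: phy0: Timeout while waiting for nf to load: AR_PHY_AGC_CONTROL=0xd0dda
[18500.000000] br-client: port 2(client0) entered forwarding state
"""

messages = ath_messages(sample)
for seconds, phy, message in messages:
    print(f"{seconds / 3600:.1f} h after boot, {phy}: {message}")
```

The dmesg timestamp counts seconds since boot, so the message in this report was logged about 5.1 hours into the current uptime.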

Antaiir · 2016-10-15T13:17:44Z

We have put the current version on several nodes. On a "PicoStation M2HP" (mesh only), a problem has recently started to appear where it permanently drops out of the WLAN as soon as "wifi" is issued on the console. The Pico can then only be revived with a reset. I have not noticed anything like this on other nodes so far.

Best regards
Antaiir

ghost · 2016-11-06T15:19:37Z

For us, since the update from a54e765 to 1c3d978 everything is back to the old stability. In the last 2 days no router has rebooted, and at least twice as much traffic is being pushed (test subjects are CPE210 with 802.11s).

CodeFetch · 2016-11-07T03:37:56Z

Could this perhaps be a calibration error? Linux master has several patches that switch the peak-detect calibration of the affected chips to being handled in software (this one and many others):
torvalds/linux@7da1ddd#diff-722db64892e310f0729be9025924ba52

I have been testing all sorts of patches for a few days that have not yet been merged into LEDE or OpenWrt, but I thought NeoRaider was already on it... ~~I'll report back the day after tomorrow if nothing else comes up here.~~
EDIT: It will take a bit longer. First I have to test 2016.2.1 👍 ...

Similar errors occurred with the WR940N v3, but I'm not sure whether the cause is the same. It could be that, with the introduction of the additional power-saving functions, the hardware calibration of further chips suddenly no longer works (by the way, these functions free up airtime, disabling them is not in the spirit of Freifunk, and it did not solve the problem for the affected communities either).

Why is everything here in German anyway?

viisauksena · 2016-11-13T17:11:51Z

As mentioned in the other tickets (#993 and maybe #889):
at least with release v2016.2.x there is a slight tendency for ibss0 (or the whole wifi?) to fail/hiccup, i.e. to become unresponsive.
This is also mentioned in https://forum.freifunk.net/t/announce-gluon-v2016-2-1 and https://forum.freifunk.net/t/wifihaenger-in-gluon-2016-2-x-quickfix/13821
A working solution seems to be wifi (I think @rotanid mentioned this) or, as I did successfully myself, iwinfo phy0 scan.

Here are some lines from other, closed tickets.. sorry for mixing German and English.

@rotanid you wanted concrete feedback:
Firmware built from master at the time 2016.2.1 came out, including make clean and make update beforehand. ibss0 and bat-v14 are in use; noticed on 2 different TP-Link 841 that had gone through autoupdate.

Symptom: the mesh links break away, simply dead ... ifconfig shows normal output, nothing in logread, nothing in dmesg; noticed because the node with uplink (!) had no mesh contacts in a dense mesh network.

Solution, observed live via the status page (which was empty for ibss0): "iwinfo phy0 scan" .. immediately afterwards, newly found mesh partners on the status page, and of course in the meshviewer after the corresponding delay.
This was solved temporarily with an hourly iwinfo phy0 scan as a micrond job,
hence the bug name "hickup".

I know that @Adorfer had reported something like this earlier, but I was never able to observe it this precisely. For us (Freiburg support branch), however, because of safeguard scripts this can ONLY happen on uplink routers; all others lose their uplink when the mesh breaks away and would ultimately even reboot (after trying to restart fastd and network) >> https://github.com/viisauksena/gluon-fffr/blob/master/files/lib/gluon/fffr/emergency.sh (this does not solve the mesh link problem, but indirectly it does through the restart/reboot)

viisauksena · 2016-11-13T18:50:52Z

Not sure whether this is related, but here is a "strange" observation, noted briefly:
on a running, meshing router without symptoms,
iwinfo phy0 scan ;dmesg |tail -n 2; logread | tail -n2
additionally produces, BEFORE the output of iwinfo:

[11062.610000] IPv6: ADDRCONF(NETDEV_UP): tmp.phy0: link is not ready
Sun Nov 13 19:45:50 2016 kern.info kernel: [10964.770000] IPv6: ADDRCONF(NETDEV_UP): tmp.phy0: link is not ready

But that is perhaps just an artefact of phy0 not being an actual interface; if you replace it with ibss0, client0 or similar, there is no error.

rotanid · 2016-11-13T20:12:15Z

Here are 2 more bug reports copied from the Freifunk forum so that we have everything in one place.
My questions were:

Was the build done with a complete clean?
Which version was the last one that really ran better?
ibss or 11s?
What does "logread" say about it? What does "dmesg" say?
What is the content of /sys/kernel/debug/ieee80211/phy0/ath9k/reset

from @Tarnatos

Build dir of 2016.1.5 cleaned with make clean GLUON_TARGET=
Gluon 16.5 (I did not install later ones)
The problem occurred with IBSS on a WDR4900 and with 11s on 2 CPE210v1
http://pastebin.com/iL93gBDf
http://pastebin.com/ttArb9YP

For me the error also did not show up as poor WiFi performance but as low bandwidth (<3 MBit/s); after an iw $dev scan, "normal" bandwidths were achieved again.

from DL3DCF

ibss (1 ibss link and 3 clients when it occurred)
logread: no entries; the clients disappear one by one with "deauthenticated due to inactivity (timer DEAUTH/REMOVE)"
dmesg: no entries
I did not perform a reset, but ran "iw client0 scan".

cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset
Baseband Hang: 0
Baseband Watchdog: 11
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 2
PLL RX Hang: 0
MAC Hang: 0
Stuck Beacon: 112
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 116
Rx DMA stop error: 0

from myself:

Baseband Hang:  0
Baseband Watchdog: 330
   Fatal HW Error:  0
      TX HW error:  0
 Transmit timeout:  0
     TX Path Hang:  0
      PLL RX Hang:  0
         MAC Hang:  0
     Stuck Beacon:  1
        MCI Reset:  0
Calibration error:  0
Tx DMA stop error:  0
Rx DMA stop error:  0
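To compare reset counters across nodes, the debugfs output above can be reduced to its non-zero entries. A minimal sketch, assuming the `name: value` layout of the two pastes; `parse_reset_counters` is a hypothetical helper and the sample is DL3DCF's output.

```python
def parse_reset_counters(text):
    """Parse the ath9k `reset` debugfs file into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip blank lines and anything that is not "name: number".
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters

dl3dcf_reset = """Baseband Hang: 0
Baseband Watchdog: 11
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 2
PLL RX Hang: 0
MAC Hang: 0
Stuck Beacon: 112
MCI Reset: 0
Calibration error: 0
Tx DMA stop error: 116
Rx DMA stop error: 0
"""

counters = parse_reset_counters(dl3dcf_reset)
nonzero = {name: count for name, count in counters.items() if count}
print(nonzero)
```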

viisauksena · 2016-11-14T01:29:24Z

What is the point of cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset .. I have routers here that show considerably more dramatic values there and have been running for days without problems, including meshing routers and traffic in the GB range.

Edit: who are "the" developers .. ath9k?
(here from a cleanly running router: a few days of meshing, GB of traffic and clients again and again)
ath9k_debug.txt

rotanid · 2016-11-14T16:49:14Z

It was requested by the developers back then; they will know why.

@viisauksena "the developers" for ath9k is primarily nbd / Felix from LEDE, see the LEDE issue 176 linked above by AKA-47

Tarnatos · 2016-11-18T15:13:15Z

We have another confirmed case here:
http://mesh.freifunk.in-kiel.de/#!v:m;n:10feede661f4

Gluon 16.5
IBSS on a WR1043v2

After an iw client0 scan all neighbours came back.

Logs:
wifi-down-gluon-605.txt

rotanid · 2016-11-20T16:56:05Z

upstream LEDE issue 176 has been closed as fixed, maybe we (well, @neoraider ...) should try another backport? but people would need to participate in debugging that this time before doing the release :)

viisauksena · 2016-11-20T21:46:50Z

Then an announcement with a proposed commit ID in the forum would be fine, or some extra branch/tag.

ghost · 2016-11-20T21:58:20Z

A commit id posted here and it will be on 20 nodes tomorrow

CodeFetch · 2016-11-21T12:17:33Z

Are you all going mad?! Can you please stop using this as a forum?
Nothing has been fixed - the issue was closed because people were acting like you, being no help.
It is clearly a DMA lock error. No matter how many logs you post with the same commands executed, it will be the same phenomenon. Tx DMA stop error, Stuck beacon... Google it if you don't know what it is.
Unless the LEDE kernel source is updated to a very recent version, chances are bad that it contains the needed fix. And nobody has said that the bug didn't exist before. Maybe it just occurs more frequently since e.g. the powersave update. There are many patches in kernel master which address issues similar to this. Go and test the patches or wait until someone else has, but this "forum culture" here is just distracting...

viisauksena · 2016-11-26T12:16:20Z

@CodeFetch I couldn't follow your argument so far, even with some sysctls specially arranged so that the device won't reboot on panic, to get some more debug output. Some may have this problem; I had different ones, I guess. But could you explain in more depth why "it is clearly a DMA lock error" and which recent kernel features are needed? It seems LEDE so far will have a 4.4.32 kernel (I built yesterday), with many ath9k improvements, whereas the latest stable on kernel.org is 4.8.11. Do you think that is recent enough?

rotanid · 2016-11-26T19:08:55Z

@viisauksena i already explained to you, that #753 and therefore your "reboot on panic" or "reboot on oom" doesn't have to do anything with this issue, the ath9k driver instabilities. would you please stop mixing those two issues finally? thanks.
@CodeFetch either you know the relevant patches, then please name them. perhaps you are testing the patches, then please report about the results or even provide instructions or binaries so others could participate in testing - or if you don't know or test, i don't see what your comment is about other than "distracting".

CodeFetch · 2016-11-29T01:33:20Z

It's the last I'm going to say on that issue here...
@viisauksena This issue has occurred often in kernel history and you can find lots of information on it. It can have many reasons, but all are related to the device driver (in this case ath9k). You can identify this as a DMA lockup because of the stuck beacons, the Tx DMA stop errors or the baseband watchdog interrupts. Tx DMA stop error means that an attempt to remove packets from the queue failed. Stuck beacons means that beacons in the queue were removed because they took too long to be sent (no airtime, and I mean NO airtime, e.g. radar), or that there was a lockup in the DMA queue (I mean the hardware thing in the wifi chip) and they just can't be sent anymore. I don't really know the baseband watchdog code, but it can also be triggered if there's a DMA lockup, and it tries to reset the device if there is a failure in general. So if you have either Tx DMA stop errors or stuck beacons or baseband watchdog issues, it is a good hint that the DMA locked up in a specific state.

No, that's not recent enough, I think. There are many, many patches for newer chipsets that were released in the last months which are simply being overlooked by LEDE and OpenWrt. They are quite relevant because the board design of newer devices changes the chips' behaviour (e.g. the temperature compensation patch works well on one device and makes the problem worse on another, both having identical chipsets but a different board revision). The Qualcomm devs release patches for their customers (e.g. TP-Link) if they find such weird behaviour in one of their boards. And that's one of the problems. On the one hand they don't test whether an "outdated" board will still work with their patch; on the other hand private folks do. It's hard to properly support every board. So you really have to be careful which patches you add.

@rotanid I'm testing, but I also have other things to do and I luckily don't need people to test it as I have a router where the error occurs reliably (one mesh-partner, dozens of clients). From the information given by me you could have found out which patches are relevant and I also named one: torvalds/linux@7da1ddd#diff-722db64892e310f0729be9025924ba52. Every calibration-related or queue-related ath9k patch could be the bad or the good guy.
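The three indicators named above (stuck beacons, Tx DMA stop errors, baseband watchdog) can be checked against the reset counters posted earlier in this thread. The sketch below is only a restatement of that rule of thumb, not a diagnosis: `dma_lockup_hints` is a made-up helper, and as noted earlier, routers with high counter values have also been reported as running fine.

```python
# Counter names as they appear in /sys/kernel/debug/ieee80211/phy0/ath9k/reset
LOCKUP_HINTS = ("Stuck Beacon", "Tx DMA stop error", "Baseband Watchdog")

def dma_lockup_hints(counters):
    """Return the hint counters that are non-zero, in LOCKUP_HINTS order."""
    return [name for name in LOCKUP_HINTS if counters.get(name, 0) > 0]

# Values copied from the two reset dumps posted earlier in this thread.
dl3dcf = {"Baseband Watchdog": 11, "Stuck Beacon": 112, "Tx DMA stop error": 116}
rotanid = {"Baseband Watchdog": 330, "Stuck Beacon": 1, "Tx DMA stop error": 0}

print("DL3DCF: ", dma_lockup_hints(dl3dcf))
print("rotanid:", dma_lockup_hints(rotanid))
```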

rotanid · 2016-11-29T03:02:24Z

thanks for your constructive answer. good to know there's someone with sufficient knowledge and a good test device&situation. most of us lack the first and some the latter. that may be why the related LEDE-issue was closed (if it isn't fixed in latest LEDE? didn't test because of missing reliable test location.)
i hope your findings will lead to improvements upstream, even if you're not going to comment again here.

viisauksena · 2016-12-01T08:32:28Z

I figured out one specific router in our net which has a regularly failing ibss0. Luckily it has its own fastd uplink. This is very obvious when looking at the map, because the router should have plenty of ibss mesh links, which it doesn't.
Since the router has its own uplink, the normal "emergency script" in fffr doesn't help, since it only targets loss of network and not a crippled mesh network (it does things like restarting wifi, iwinfo phy scan, restarting fastd, restarting network etc.).
This router has a working hourly micrond job running iwinfo phy0 scan, which I thought would prevent this loss of mesh links, which it does not! A manual check with batctl o -H|grep -v mesh-vpn supports this; the wifi command, on the other hand, does resolve the issue. An hourly wifi is clearly a hack and not the way to go. (extremely hacky micron.d: if [ $(uci get wireless.ibss_radio0.disabled) -eq 0 ]; then if [ $(batctl o -H|grep -v mesh-vpn|wc -l) -eq 0 ]; then iwinfo ibss0 scan|grep -q $(uci get wireless.ibss_radio0.bssid) && wifi ;fi;fi)
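For readers who find the one-liner hard to follow, here is its decision logic restated as a pure Python function. It is an illustration only and not meant to run on a router; `should_restart_wifi` and the sample strings (including the BSSID and the scan line) are hypothetical. The inputs stand for the outputs of the uci, batctl and iwinfo calls in the shell version.

```python
def should_restart_wifi(ibss_disabled, originator_lines, scan_output, bssid):
    """Mirror the shell one-liner: restart only if ibss is enabled, no
    non-VPN originators are left, and the mesh BSSID is still seen in a scan.
    Blank lines in the originator list are ignored."""
    if ibss_disabled:
        return False
    wifi_neighbours = [
        line for line in originator_lines
        if line.strip() and "mesh-vpn" not in line
    ]
    if wifi_neighbours:
        return False
    # Like `grep -q`, this is a plain case-sensitive substring match.
    return bssid in scan_output

# Hypothetical inputs for illustration.
bssid = "02:CA:FF:EE:BA:BE"
scan = "Cell 01 - Address: 02:CA:FF:EE:BA:BE"
vpn_only = ["aa:bb:cc:dd:ee:ff 0.500s (255) aa:bb:cc:dd:ee:ff [ mesh-vpn]: ..."]
with_mesh = vpn_only + ["11:22:33:44:55:66 0.300s (200) 11:22:33:44:55:66 [ ibss0]: ..."]

print(should_restart_wifi(False, vpn_only, scan, bssid))   # mesh lost, BSSID visible
print(should_restart_wifi(False, with_mesh, scan, bssid))  # mesh neighbours present
```

The scan check is what keeps an isolated node from restarting wifi every hour when no mesh partner is in range at all.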

So it's "nice" to have a reliably failing ibss0 router ... but I really don't know how to go on from here, whether to give hints or help with further debugging. If I can do anything, I'm happy to do so. At least I will leave this router on this FW for a while.

cat /sys/kernel/debug/ieee80211/phy0/ath9k/reset is all 0 (no DMA errors or stuck beacons as suggested by @CodeFetch - uptime 11h right now)

The FW was built around 23 Nov. with this commit
http://openfreiburg.de/freifunk/meshviewer/#!v:m;n:60e3275a2aec
http://openfreiburg.de/freifunk/meshi2/#!v:m;n:60e3275ffdd2

rotanid · 2017-02-12T20:03:58Z

as an update: we still have issues with ath9k devices running the gluon 2016.2.x branch - maybe it is fixed in the lede-based master, but we didn't have time to adapt everything to the big changes in the master branch, yet.

neocturne · 2017-02-23T18:41:36Z

Stability should have improved with the switch to LEDE, but it's still not perfect.

New upstream issue: https://bugs.lede-project.org/index.php?do=details&task_id=447

neocturne · 2017-05-09T14:03:11Z

There haven't been any reports of ath9k issues here or in the LEDE bug tracker for a while, so I think we can finally close this.

AKA-47 changed the title ~~WifiStack Problem in 2012.2?~~ WifiStack Problem in 2015.2? Dec 22, 2015

AKA-47 closed this as completed Mar 2, 2016

neocturne reopened this May 16, 2016

neocturne changed the title ~~WifiStack Problem in 2015.2?~~ Unstable ath9k WLAN May 16, 2016

neocturne added the 0. type: bug This is a bug label May 16, 2016

neocturne added this to the 2016.2 milestone May 16, 2016

This was referenced May 16, 2016

ath9k: "ath: phy0: Unable to reset channel, reset status -5" #750

Closed

ATH9K Problems in ubiquiti devices #602

Closed

viisauksena mentioned this issue Nov 11, 2016

ibss0 hickup in v2016.2++ #933

Closed

rotanid mentioned this issue Nov 13, 2016

ibss0 dies sometimes - ath: phy0: Timeout while waiting for nf to load #889

Closed

neocturne added the 9. meta: upstream issue Issue pertains to an upstream project label Feb 23, 2017

neocturne modified the milestones: 2017.1, 2016.2 Feb 24, 2017

rotanid mentioned this issue Mar 24, 2017

strange non-deterministic networking on some nodes #1079

Closed

neocturne closed this as completed May 9, 2017
