ubnt EdgeRouterX switch dies or sthg (affects ramips-mt7621) #494
Comments
There is also a discussion suggesting that the previously listed ideas might not lead to a solution: http://lists.infradead.org/pipermail/lede-dev/2017-November/009799.html |
Has anyone checked whether the original firmware behaves differently? Maybe it's a hardware issue? |
another one in the OpenWrt-Mailinglist: http://lists.infradead.org/pipermail/lede-dev/2018-April/011939.html |
some recent OpenWrt-commits:
|
Is there a build that incorporates the patches? |
I installed OpenWrt SNAPSHOT, r7050-9c409cb on an erx-sfp that we had to restart a few times in the past. The snapshot should include the fix. So far I see strange load patterns (constant load of 1): http://monitor.berlin.freifunk.net/detail.php?p=load&t=load&h=flughafen-core&s=86400 And we have had one exception in the kernel code so far:
I will report back if we have another crash with the new code. |
Still up and running but we get even more interesting output:
|
The problem persists even with the new OpenWrt version mentioned above. |
What a pity! |
01df4a2565, 0c285bd081, 2601e34fad might bring improvements for #494
b123921a92 include/prereq-build.mk: explicitly check for -f flag when using busybox time
36fa1bbf6f include/kernel-build.mk: fix kernel rebuild on backport patch changes
18533ff415 kernel: backport page fragment API changes from 4.10+ to 4.9
888a15ff83 ppp: add missing -fPIC to rp-pppoe.so CFLAGS
2601e34fad ramips: ethernet: disable fraglist support
154c0c4006 ubus: compile with LTO enabled
73fc67b614 procd: compile with LTO enabled
47b42137ce dropbear: compile with LTO enabled
ef96d1e34a firewall: compile with LTO enabled
ef16a394d2 iw: compile with LTO enabled
e7397eef69 ppp: compile with LTO enabled
dfbd49bd22 ppp: fix linker flags for the radius plugin
07940acc34 netifd: compile with LTO enabled
8c11133c9d busybox: compile with LTO enabled
4e56af5ab4 mt76: update to the latest version
16035a7dd3 include/feeds.mk: rework generation of opkg distfeeds.conf
6dac434c00 base-files: fix feed list in PKG_CONFIG_DEPENDS
9af22f1ac9 include/feeds.mk: always add available feeds to PACKAGE_SUBDIRS
6bdd5d8459 scripts/feeds: add src-dummy method
0c285bd081 ramips: ethernet: use own page_frag_cache
01df4a2565 ramips: ethernet: use skb_free_frag to free fragments
2eeb4b78c6 ramips: TP-Link TL-WR902AC v3: add missing wps button
33321ebefa ramips: TP-Link TL-WR902AC v3: don't build factory image
a07e1126bc tools: kernel2minor: update to latest version
11d6547455 config: extend small_flash feature
cf7154db07 kernel: only optimized for size if small_flash
621fa91a82 ar71xx: move boards to tiny subtarget
671999157d verbose.mk: quote SUBMAKE options
12915b105a arc: Update variables substitutions in u-boot env files
d238c7f995 mediatek: Fix memory node for U7623
d3b8e6b2a7 kernel: gpio-nct5104d remove boardname check
af70d86d62 netifd: update to latest git HEAD
33553a11ab ramips: clean up and fix MT7621 NAND driver issues
21ee8ce9b5 kernel: replace bridge port isolate hack with upstream patch backport on 4.14
68f9921ed8 netifd: update to the latest version
41a1c1af4b kernel: adjust bridge port isolate patch to match upstream attribute naming
e07ad61aec procd: update to the latest version, fixes gcc 8 build error
8b42a260ed mac80211: Expose support for ath9k Dynack
ba2b0f0ac6 kernel: bump 4.14 to 4.14.54
954faac7bc qos-scripts: fix indentation
4630159294 wireguard: bump to 0.0.20180708
7e82418372 iproute2: update to 4.17.0
6dac92a42e hostapd: build with LTO enabled (using jobserver for parallel build)
9b965d3b71 binutils: remove version 2.27
7c3e3eb098 binutils: update to version 2.30, resolves issues with LTO
55055aee50 binutils: backport an upstream fix for a linker bug that triggers with LTO
7ddba08d87 kernel: bcm47xxpart: fix getting user-space data partition name
a5188eb258 nasm: disable LTO, remove host specific workarounds
98a6bee09a odhcpd: update to latest git HEAD
e204717ef2 toolchain/nasm: force ar and ranlib only on macOSX
79b38047b9 build: README punctuation pendantry
5781fc6b3f build: Update README & github help
edf338f248 basefiles: Reword sysupgrade message
6476148034 ath79: add support for OCEDO Raccoon
da6c09eff4 kernel: move CONFIG_USB_MTU3 to generic config
29fa9ac559 kernel: disable some DRM_PANEL config options
Just found this on the OpenWrt-devel list: http://lists.infradead.org/pipermail/openwrt-devel/2018-October/014272.html Maybe someone can test it? |
This might also fix this problem: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=fe7d965ea95e78905328fe5425c8e90e3bf11e58 |
Ever since I upgraded the firmware on the erx-sfp (coloniaallee) from some 1.0.0-alpha version to 1.0.2 the router has been online. Not once has the problem described above happened again. |
As @booo mentioned, he was able to see this several times with a more recent kernel than the one Hedy-1.0.2 is using. So I'm quite sure this bug is still waiting to be triggered... |
830440d nodogsplash: Backport Version 4.0.1. (#494)
a93e684 alfred: Merge bugfixes from 2019.3
6ea9e9b batctl: Upgrade hardif settings patches to upstream version
d65d6f1 batctl: Merge bugfixes from 2019.3
9d559fd batman-adv: Merge bugfixes from 2019.3
784ae0e Merge pull request #496 from ecsv/batadv-for-19.07
There is a patch being discussed: http://lists.infradead.org/pipermail/openwrt-devel/2019-March/016146.html, which was replied to in October 2019: http://lists.infradead.org/pipermail/openwrt-devel/2019-October/019627.html |
There is a nice report that identifies "ethernet pause frames" as the cause of the problem: http://lists.infradead.org/pipermail/openwrt-devel/2020-February/021742.html |
openwrt/openwrt@c8f8e59 sounds like a fix for this issue. Can anyone test it? |
According to https://forum.openwrt.org/t/mtk-soc-eth-watchdog-timeout-after-r11573/50000/59 it didn't make a difference. But I am currently building with this patch. I don't have high hopes though.
After 7577 seconds of uptime:
|
The system is still running, but at 32978 seconds I got another kernel error:
|
I just noticed that there are 2 sources of the kernel error:
So are these probably two separate issues, or really the same issue causing different errors? |
I don't want to see any kernel dumps of any kind :) I'm leaving the router online until it crashes. Then I'll go back to the good old trusty WDR4900 with gonzo-rc2. I just hope I'm around when the router crashes and that the ca 70 people who use freifunk around here won't be cut-off from their youtube/facebook/ebay for too long. Here is a kernel log for another rb350gr3. It has a mix of
|
And here, an ERX-SFP (coloniaallee)
|
The test router had another kernel warning and rebooted itself. Uptime was 114195 seconds (just under 32 hours):
|
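For comparing the uptimes reported in this thread (7577 s, 32978 s and 114195 s), here is a minimal sketch that converts them to hours. The helper name `uptime_hours` is my own and not part of any firmware tooling; the three values are simply the ones quoted in the comments above.

```python
# Uptimes (in seconds) at which kernel errors were reported in this thread.
REPORTED_UPTIMES = [7577, 32978, 114195]


def uptime_hours(seconds: int) -> float:
    """Convert an uptime in seconds to hours (hypothetical helper)."""
    return seconds / 3600


if __name__ == "__main__":
    for s in REPORTED_UPTIMES:
        # One decimal place is enough to compare the runs.
        print(f"{s} s = {uptime_hours(s):.1f} h")
```

The last value comes out at roughly 31.7 hours, which matches the "just under 32 hours" stated above.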
So the MikroTik RB750Gr3 devices are affected as well? Even so, we don't see a complete freeze of the network here. |
All mt7621 devices are affected. |
Verklarung-core has almost no traffic. That's probably why it's not causing problems. Perleberger36 and coloniaallee have a lot of traffic, and if you look at the uptime, every reboot is caused by a kernel crash: either the router reboots itself or the watchdog reboots it. The test was done at scherer8, which also has a lot of traffic. After 32 hours it rebooted itself. |
In 179c140 there is a reference to OpenWrt commit 498f1f4f5d, whose description says it might fix the cause of the problem. |
A new patch made it into the OpenWrt master branch: openwrt/openwrt#2942 (comment) |
As usual, these OpenWrt commits have been added to the "daily/upstream-master" branch automatically in 1d4f5c9. So some tests need to be carried out. |
This seems to be fixed since OpenWrt 21.02? Should it be closed as fixed? |
We could do that, if you want to. In any case, this project has not been maintained for some years. We use the falter-firmware now: |
ubnt erx and erx-sfp routers have been seen in the wild where the switch suddenly dies, which shows up as lost connections and/or lost interfaces.
We could investigate this to find out what is causing it and try to help solve the problem, since there will soon be a significant number of these routers online (ff-Meko-project).
Some work is already going on in LEDE, which we could support. I attach the links in ascending date order:
http://lists.infradead.org/pipermail/lede-dev/2017-July/008268.html | mt7621 wdt reset- console not accepting commands
http://lists.infradead.org/pipermail/lede-dev/2017-August/008594.html | Transmit timeouts with mtk_eth_soc and MT7621
http://lists.infradead.org/pipermail/lede-dev/2017-August/008738.html | ramips: Improve stability of the mt7621 switch
https://patchwork.ozlabs.org/patch/808121/ | ramips: Improve stability of the mt7621 switch
Can somebody shed light on this?