These shell scripts for OpenWRT assist with diagnosing unexpected CPU resets/reboots on the Qualcomm™ IPQ806x platform.
In particular, they have been used to recreate a crash on the ZyXEL NBG6817 router featuring the IPQ8065 network processor. More details on the real-world workload below.
The computer helper script retries until a crash is detected, automatically organizes log files, etc. Though not required, it may be more convenient.
# Download
wget https://raw.githubusercontent.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset/main/debug-cpufreq-router.sh
wget https://raw.githubusercontent.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset/main/debug-cpufreq-ssh-loop.sh
# Mark launcher script as executable
chmod u+x debug-cpufreq-ssh-loop.sh
When running this QA script, the router will likely hard reboot without warning, as if unplugged from power supply. Save any changes on the router, finish up ongoing Internet transfers, voice chats, etc.
You can continue to use the router like normal during the test, just be prepared for a hard reboot, i.e. don't try to start a video conference with the CEO of Qualcomm™ :P
./debug-cpufreq-ssh-loop.sh "default" "case1" "openwrt"
Connects as user root
on SSH port 22
to the OpenWRT router at hostname openwrt
, then runs the QA test with default
max CPU frequency (1.75
GHz) while emulating the first set of crash conditions, case1
.
NOTE: It may take 8+ hours to trigger the crash!
./debug-cpufreq-ssh-loop.sh "default" "case1" "openwrt-router" "2222" "KDE Connect Pixel 4 XL"
Connects as user root
on SSH port 2222
to hostname openwrt-router
, then runs the QA test with default
max CPU frequency (1.75
GHz) while emulating the first set of crash conditions, case1
.
Also notifies the KDE Connect device KDE Connect Pixel 4 XL
of test results if kdeconnect-cli
is available and the device is paired and connected.
Note that if the router manages the local network, the KDE Connect device might not receive the message before the network is lost.
NOTE: It may take 8+ hours to trigger the crash!
./debug-cpufreq-ssh-loop.sh "1.4ghz" "case1" "openwrt"
Connects as user root
on SSH port 22
to the OpenWRT router at hostname openwrt
, then runs the QA test with 1.4ghz
max CPU frequency (1.4
GHz) while emulating the first set of crash conditions, case1
.
Update 2021-8-24: The crash may still happen, just less often. See CPU Frequencies below for more details.
./debug-cpufreq-ssh-loop.sh "1ghz" "case2" "openwrt"
Connects as user root
on SSH port 22
to the OpenWRT router at hostname openwrt
, then runs the QA test with 1ghz
max CPU frequency (1.0
GHz) while emulating the second set of crash conditions, case2
.
./debug-cpufreq-ssh-loop.sh "pin_default" "fake_load" "openwrt"
Connects as user root
on SSH port 22
to the OpenWRT router at hostname openwrt
, then runs the QA test with locking/pinning the CPU frequency to 1.75
GHz while repeatedly starting/stopping a fake load (yes >/dev/null
).
Update 2021-9-28: This recreates the CPU crash even with the CPU frequencies pinned!
This emulates the Déjà Dup bursty single-core CPU workload without needing to set up SFTP, Déjà Dup, etc.
# Download
wget https://raw.githubusercontent.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset/main/debug-cpufreq-router.sh
# Mark script as executable
chmod u+x debug-cpufreq-router.sh
See above for warnings; in brief, the router will likely hard reboot without warning, as if unplugged from power supply.
# Set router to default CPU frequency settings
./debug-cpufreq-router.sh "default"
# Run test
./debug-cpufreq-router.sh "test_cycle_freqs" "random" "case1"
Sets the router to default
max CPU frequency, then runs the QA test emulating the first set of crash conditions, case1
, across both CPUs by randomly selecting CPU 0
and CPU 1
for each change.
NOTE: It may take 8+ hours to trigger the crash!
CPU frequency mode | Outcome |
---|---|
default |
Sets max CPU frequency to 1.75 GHz (default) |
1.4ghz |
Sets max CPU frequency to 1.4 GHz (temporary workaround for issue) |
1ghz |
Sets max CPU frequency to 1.0 GHz (IPQ8064 limit for 1.0 GHz L2 cache) |
pin_default |
Locks CPU frequency to 1.75 GHz (performance governor) |
unchanged |
No change, uses current per-CPU […]/policy*/scaling_max_freq as upper limit |
All options other than unchanged
adjusts scaling_max_freq
for all CPUs, e.g. /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
.
Though this test is aimed at the IPQ8065
platform, the DTS hardware file modifies the IPQ8064
base definition (with a 1.4
GHz max CPU clock), hence trying 1ghz
as a CPU frequency selection.
NOTE: Setting a CPU frequency ceiling of 1.4
GHz is only a temporary workaround to use the router for workloads that cause crashes. It is not a permanent solution due to reducing performance.
Update 2021-8-24: At 1.4ghz
(1.4
GHz), the crash may still happen, just less often. In one case, instead of rebooting after around 5 minutes to 2 hours (as with 1.75
GHz), it rebooted at 9 hours 19 minutes. Any mitigation efforts will probably need to account for the 1.4
GHz speed as well.
Initial results suggest focusing on CPU frequency transitions that are near the L2 cache speed shift (1.0
and 1.4
→ 1.75
GHz).
Test mode | Outcome |
---|---|
random |
Randomizes CPU frequency between scaling_min_freq and scaling_max_freq |
case1 |
Cycles CPU frequency between maximum (scaling_max_freq ) and 800 MHz |
case2 |
Cycles CPU frequency between maximum (scaling_max_freq ) and 600 MHz (greater jump) |
ramp1 |
Smoothly ramps CPU frequency between scaling_min_freq and scaling_max_freq |
fake_load |
Runs a single core load at random duty cycle for 0 to 4 seconds |
CPU index | Outcome |
---|---|
all |
Frequency of all CPUs are changed at once |
random |
CPU 0 and 1 are randomly selected for each upcoming change |
<number> |
CPU <number> (0 or 1 ) is adjusted, other CPU remains at maximum (scaling_max_freq ) |
So far, limiting both CPUs maximum clock frequency to 1.0 GHz seems to stop all crashes.
Though not required, to simplify this, you can use the cpu-crash-workaround.sh
script.
NOTE: This will reduce performance! It's only a workaround, not a fix. I created this to help ensure my two NBG6817 routers are stable in between testing (one is at a remote location).
# Download
#
# (You might need to transfer the file to your router in a different way)
wget https://raw.githubusercontent.com/digitalcircuit/openwrt-ipq806x-qa-cpu-reset/main/cpu-crash-workaround.sh
# Mark script as executable
chmod u+x cpu-crash-workaround.sh
# Install
./cpu-crash-workaround.sh install
This automatically persists across sysupgrades by adding itself to the backup list, including backing up the fact that it's enabled.
The service will not modify the CPU max clock frequency if the CPU governor is not set to ondemand
.
If the script is no longer available (e.g. rebooted), you'll need to re-download it as per the install instructions.
# Remove
./cpu-crash-workaround.sh remove
Occasionally, this QA script results in date
itself segfaulting when getting current time since the Unix epoch in seconds. If connected through Mosh, sometimes Mosh will segfault instead.
When running the real workload (Déjà Dup SFTP backup), OpenSSH instead will often exit unexpectedly, presumably from segfaulting as well.
It doesn't seem to be an issue with the programs; instead, something about the CPU frequency shifting seems to rarely result in corruption of some running programs. This might be worse than a hard reboot since it's theoretically possible to silently corrupt persistent data.
Semi-reliable, "real" reproducer for this issue:
- Déjà Dup on a Linux computer
- Set destination to SFTP on OpenWRT router
- OpenWRT router has OpenSSH installed, bound to second port
- OpenSSH used so it can be locked down via
chroot
to secondary user, SFTP only - 1 TB USB 3.0 HDD plugged into OpenWRT serving as Network Attached Storage
- OpenSSH used so it can be locked down via
Déjà Dup uses duplicity
to back up to remote destinations (including SFTP) in 25 MB chunks, e.g. duplicity-full.20210821T213219Z.vol1036.difftar.gpg (25.1 MiB)
, before finishing with a larger single package of signatures, e.g. duplicity-full-signatures.20210821T213219Z.sigtar.gpg (1.6 GiB)
. In between uploads, Déjà Dup compresses and encrypts the files locally.
This results in a "bursty" workload involving 1-4 seconds of uploading to the router (local network to USB 3.0 HDD), then roughly 0.25-1.5 seconds of compressing & encrypting, during which no load is placed on the router.
When watching the router's CPU frequency, it tends to jump between 800 MHz and 1.75 GHz, switching between CPUs - notably, this is primarily a single CPU core workload, so stress-testing by loading both CPUs might not trigger the issue, even in a cyclic (load, pause, repeat) fashion.
A full Déjà Dup backup takes about 8 hours and stores around 200 GiB to the USB HDD.
For more details, see the Linux kernel documentation on dynamic debugging.
Enabling dynamic debugging to print CPU frequency changes, following logs
echo "file drivers/regulator/* =p" > /sys/kernel/debug/dynamic_debug/control
echo "file drivers/cpufreq/* =p" > /sys/kernel/debug/dynamic_debug/control
logread -f
Undoing the above, disabling dynamic debugging of CPU frequency changes
echo "file drivers/regulator/* =_" > /sys/kernel/debug/dynamic_debug/control
echo "file drivers/cpufreq/* =_" > /sys/kernel/debug/dynamic_debug/control
- Isolating the USB 3.0 HDD via a fully powered USB 3.0 hub
- USB current meter verifies at most
0.001
amps drawn from router port - Crash still happens
- USB current meter verifies at most
- Switching the USB 3.0 HDD to USB 3.0 SSD via USB 3.0 hub
- Crash still happens
- Connecting USB 3.0 SSD via USB 2.0 port, bypassing USB SuperSpeed driver
- Crash still happens
- Replacing the OEM 3.5 amp 12V DC power supply with 5 amp 12V DC power supply (12.3-ish V no load)
- Crash still happens
- Measuring USB 3.0 HDD load (around 600 mA peak), recreating via digital USB test load
- Leaving load on, setting load up to 900 mA, rapidly toggling load, etc
- No crash even under full load
- Running
iperf3
via Gigabit Ethernet alongsideopenssl benchmark
- No crash
stress-ng
in various permutations, including L2 cache testsnice -n 5 stress-ng --oomable -t 8h --times --cache 1 --cache-level 2
i=0 ; i_max=7200 ; while [ $i -lt $i_max ] ; do let i++; echo "[router] $(date -R): Iteration $i of $i_max" ; nice -n 5 stress-ng --oomable -t 3s --cache 1 --cache-level 2 || exit 1 ; sleep 1 ; done
- Above tests don't crash, other
stress-ng
tests result in near-instant crashes - Unable to determine if finding new bugs or recreating issue from real workload
- Python 3 usage of GNOME GIO library to SFTP upload a 25 MiB chunk of
/dev/urandom
- 25 MiB chunk has been created once, maybe needs to be unique each time?
- Not thoroughly tested, but does not seem to recreate crash
Not sure yet!
Ansuel had some initial suggestions on GitHub over here.
Result | Mitigation | Outcome |
---|---|---|
? unknown | Add transition frequencies (e.g. 1.75 → 1.4 → 1.0 GHz) |
Help needed to verify |
? unknown | Force both cores to same frequency (always, or at 1.4 & 1.75 GHz) |
Help needed to verify |
X fail | Increase clock latency (all, or just 1.4 & 1.75 GHz) |
No noticeable impact |
! pending | Pin L2 cache frequency to maximum | Not yet tested |
- Bug reports
- Mailing list entries
- GitHub conversations
Loosely in order of appearance: slh
, plntyk2
, zorun
, mangix
, PaulFertser
, enyc
, and Tusker
on the OpenWRT IRC channel at OFTC/#openwrt-devel
.
Ansuel
on the OpenWRT mailing list and OpenWRT GitHub repository.
And everyone else who offered advice, encouragement, and humor!