Improve sunxi fix-a64-timejump.patch #2824
Conversation
Current patch still throws BUG: WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38 Cause: Timer values are required to be exactly the same, while it may increase in between reads. Seems to happen especially during boot Timer values are like: x2 : 0000000015014561 - x0 : 0000000015014567 Solution: Normalize timer values and return normalized value.
|
So nice analysis, but why is this not happening on the mailing list, to get some more input and to fix this properly? I did some experiments with some bare metal code (running on a single core in EL2, after U-Boot), those are my observations:
|
|
@Uglymotha Could you please try the mainline version with one small change: diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index d0177824c518..f4881764bf8f 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -352,7 +352,7 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
do { \
_val = read_sysreg(reg); \
_retries--; \
- } while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries); \
+ } while (((_val + 1) & GENMASK(8, 0)) <= 1 && _retries); \
\
WARN_ON_ONCE(!_retries); \
_val; \This will have better performance than any workaround involving multiple counter reads. Some background: when developing the mainline workaround, I never experienced jumps with fewer than 11 bad bits. I received reports from other users about U-Boot MMC read failures that were resolved by adjusting the check to 10 bits, so I made the same change in the Linux workaround. More recently, another user provided proof of 9-bit errors, and I was able to reproduce those on a PinePhone with improved tools. So far there has been no evidence of 8-bit jumps, so the change above should be enough. |
|
@Uglymotha Very important research! If possible, please open a discussion here or on the forum. I'll join the testing. I ran into this problem on older kernels about two years ago and somehow got around it. I only remember that I redid the frequency table of the A64 processor in multiples of 24 MHz. And please issue the patches as (git format-patch) with a description of the problem and who the author of the fix is. Thanks. |
|
@The-going Can we PLEASE have this discussion not here or in some Armbian forums, but on the public mailing list? |
|
@apritzel Well, let's discuss where your suggestions are. |
On my pine64 with mainline kernel I get mentioned rcu_stall messages. Symptoms are stalled processes, that will mostly eventually continue, but in rare cases hang the entire machine, requiring reboot. |
I already tried that during my troubleshooting, which took a few months to go through. It will still cause BUG message to be logged on boot, because timer values differ 6 (or sometimes 9), which can not be detected by the bitmask. It can also not detect the inexplicably large dumps observed. It's the main reason I went down the way of creating a patch which can mathematically be proven to be correct and have as little impact as possible. |
Note. Small changes on upstream solution: GENMASK(9, 0 ... -> GENMASK(8, 0 ...
* Improve sunxi fix-a64-timejump.patch Current patch still throws BUG: WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38 Cause: Timer values are required to be exactly the same, while it may increase in between reads. Seems to happen especially during boot Timer values are like: x2 : 0000000015014561 - x0 : 0000000015014567 Solution: Normalize timer values and return normalized value. * Add patch for 5.12.y Note. Small changes on upstream solution: GENMASK(9, 0 ... -> GENMASK(8, 0 ... Co-authored-by: Igor Pecovnik <igor.pecovnik@gmail.com>
Current patch still throws BUG:
WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38
Short story:
Cause: Timer values are required to be exactly the same, while it may increase in between reads. Seems to happen especially during boot
Timer values are like: x2 : 0000000015014561 - x0 : 0000000015014567
Solution: Normalize timer values and return normalized value.
Full story (long read):
Linux mainline patch still suffers from (pretty serious) rcu stalls:
34107.703635] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[34107.703667] rcu: 2-...!: (17 ticks this GP) idle=c8a/1/0x4000000000000000 softirq=25484/25485 fqs=2431
[34107.703676] (detected by 1, t=238767 jiffies, g=60137, q=4756)
[34107.703681] Task dump for CPU 2:
[34107.703687] task:ksoftirqd/2 state:R running task stack: 0 pid: 21 ppid: 2 flags:0x00000028
[34107.703703] Call trace:
[34107.703725] __switch_to+0xf4/0x148
[34107.703733] 0xd100c1b192ba8400
[34107.703745] rcu: rcu_sched kthread starved for 233905 jiffies! g60137 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[34107.703750] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[34107.703753] rcu: RCU grace-period kthread stack dump:
[34107.703757] task:rcu_sched state:I stack: 0 pid: 11 ppid: 2 flags:0x00000028
[34107.703765] Call trace:
[34107.703773] __switch_to+0xf4/0x148
[34107.703782] __schedule+0x214/0x588
[34107.703788] schedule+0x48/0xd8
[34107.703796] schedule_timeout+0x1a0/0x2d8
[34107.703807] rcu_gp_kthread+0x418/0xbb8
[34107.703816] kthread+0x120/0x128
[34107.703823] ret_from_fork+0x10/0x30
Armbian supplied patch still logs BUG message at boot (mostly cosmetical), note the hrtimer: interrupt took message in there, clear indication timing is still not fully correct.
[ 6.433674] ------------[ cut here ]------------
[ 6.436322] WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38
[ 6.438397] Modules linked in: cpufreq_dt realtek pinctrl_axp209 dwmac_sun8i i2c_mv64xxx mdio_mux uas
[ 6.441577] CPU: 2 PID: 31 Comm: kworker/2:1 Not tainted 5.10.29-sunxi64 #trunk
[ 6.443701] Hardware name: Pine64+ (DT)
[ 6.446152] Workqueue: events dbs_work_handler
[ 6.448432] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[ 6.450658] pc : sun50i_a64_read_cntpct_el0+0x2c/0x38
[ 6.452858] lr : arch_counter_get_cntpct_stable+0x3c/0x80
[ 6.454903] sp : ffff800011663a30
[ 6.456983] x29: ffff800011663a30 x28: ffff00000506fd00
[ 6.459266] x27: ffff0000034fa600 x26: ffff000077b6f098
[ 6.461580] x25: 0000000000000000
[ 6.463624] hrtimer: interrupt took 24086000 ns
[ 6.465769] x24: ffff0000099c6028
[ 6.467972] x23: ffff800011663b30 x22: 0000000000000001
[ 6.470292] x21: 0000000000000001 x20: 0000000000000018
[ 6.472575] x19: ffff800011474000 x18: 0000000000000001
[ 6.474894] x17: 0000000000000018 x16: 0000000000000002
[ 6.477133] x15: 0000000000000001 x14: 0000000000000001
[ 6.479420] x13: 0000000000000005 x12: 0000000000000021
[ 6.481815] x11: 0000000000000005 x10: 00000000bcd3d800
[ 6.484087] x9 : 0000000000000004 x8 : 0000000000000004
[ 6.486339] x7 : ffff800011431180 x6 : 00000000112a8800
[ 6.488597] x5 : 0000000000000010 x4 : 0000000000000012
[ 6.490850] x3 : 0000000000000050 x2 : 0000000015014561
[ 6.493186] x1 : 0000000000000000 x0 : 0000000015014567
[ 6.495544] Call trace:
[ 6.497825] sun50i_a64_read_cntpct_el0+0x2c/0x38
[ 6.500060] __delay+0x24/0xa8
[ 6.502260] __udelay+0x2c/0x38
[ 6.504441] ccu_mux_notifier_cb+0x64/0xa0
[ 6.506654] srcu_notifier_call_chain+0x70/0xc0
[ 6.508789] __clk_notify+0x8c/0xc8
[ 6.510943] clk_propagate_rate_change+0xc4/0xe0
[ 6.513175] clk_core_set_rate_nolock+0x118/0x1f0
[ 6.515430] clk_set_rate+0x38/0xa8
[ 6.517704] dev_pm_opp_set_rate+0x258/0x5c8
[ 6.519905] set_target+0x30/0x40 [cpufreq_dt]
[ 6.522058] __cpufreq_driver_target+0x2b0/0x698
[ 6.524203] od_dbs_update+0x140/0x1a0
[ 6.526382] dbs_work_handler+0x40/0x78
[ 6.528501] process_one_work+0x1f0/0x3c0
[ 6.530667] worker_thread+0x140/0x520
[ 6.532837] kthread+0x120/0x128
[ 6.535081] ret_from_fork+0x10/0x30
[ 6.537277] ---[ end trace 565df19941ff3b35 ]---
I decided to build some debugging straight into the __sun50i_a64_read_reg(reg) function to establish what the kernel is actually seeing when it reads the CP15 registers. First step is to check whether compiler is generating code that does the reads in two consecutive instructions, which it does:
77c: d53be023 mrs x3, cntpct_el0
780: d53be034 mrs x20, cntpct_el0
Going from there I build in some counters and do some outputting of counter values as read by the kernel. Of course the already known issue is seen (in this case 3 consecutive reads are done):
Feb 5 14:08:22 pine64 kernel: [ 27.782699] arch_timer: (cpu 2): old: 34f6ffff new: 34f6c1ff cur: 34f70000
However I established some more, seemingly inexplicable behavior, like (4 consecutive reads):
[ 3593.682792] arch_timer: (cpu 0): try 4 tmp: 1422498e6b old: 1422498e6b new: 1422499578 cur: 142249957e
Apparently here, in between two consecutive reads (old - new), the timer value increased by almost 2000 ticks, ie almost 100.000 cpu cycles seemingly past between two consecutive instructions. This is not easily explained by either memory latency, branch prediction, cpu caching, out of order execution (cortex A53 doesn't), or preemption (Non-preemptable kernel and this is scheduler anyways), because well, 100.000 cpu cycles. And I have seen (unfortunately not saved the evidence) jumps up to representing almost half a million cpu cycles.
Therefor I decided to take a more mathematical approach, as follows.
So, lets take a closer look at the BUG message generated by Armbian patch:
[ 6.490850] x3 : 0000000000000050 x2 : 0000000015014561
[ 6.493186] x1 : 0000000000000000 x0 : 0000000015014567
Here x0 and x2 registers hold the last read values, where new value in x0 is 6 higher than old value in x2. This is why Armbian patch fails, during my research I have seen the timer values normally be the same or increase by 1, 6, or 9 (and then some inexplicably large jumps and the known backward jumps). So normalization of timer value should be doable by a requiring new to be within 16 of old.
Implementing this with some counters monitoring the success / failure rate of this algorithm shows this to be around 1 in 5000 to 1 in 10000, higher load seem to lead to less failures.
[ 20.701225] arch_timer: (cpu 2): tmp: 2aca3879 old: 2aca3879 new: 2aca3879 cur: 2aca387a ok: 3223553 fail: 2164
[ 255.909185] arch_timer: (cpu 2): tmp: 17b41f1ad old: 17b41f1ad new: 17b41f1ad cur: 17b41f1ae ok: 4165633 fail: 2473
[ 261.316976] arch_timer: (cpu 2): tmp: 182fe560f old: 182fe560f new: 182fe560f cur: 182fe5610 ok: 4169729 fail: 2473
[ 264.764859] arch_timer: (cpu 1): tmp: 187ecfd12 old: 187ecfd12 new: 187ecfd12 cur: 187ecfd13 ok: 4173825 fail: 2473
[ 270.988662] arch_timer: (cpu 2): tmp: 190d43690 old: 190d43690 new: 190d43691 cur: 190d43692 ok: 4177921 fail: 2473
[ 276.159639] arch_timer: (cpu 3): tmp: 19839e16e old: 19839e16f new: 19839e16f cur: 19839e170 ok: 4182017 fail: 2473
[ 276.201489] arch_timer: (cpu 2): tmp: 198492c1a old: 198492c1a new: 198492c1a cur: 198492c1b ok: 4186113 fail: 2475
Issuing an synchronization barrier (isb) before reading the values (as is recommended by ARM platform documentation on doing consecutive reads) drastically reduces the failure rate:
Feb 20 08:03:26 pine64 kernel: [ 13.929052] arch_timer: (cpu 3): tmp: 211a959a old: 211a959a new: 211a959a cur: 211a959f ok: 209001 fail: 113
Feb 20 08:03:26 pine64 kernel: [ 14.290863] arch_timer: (cpu 3): tmp: 219f15e7 old: 219f15e7 new: 219f15e7 cur: 219f15e9 ok: 20b001 fail: 114
[94229.670320] arch_timer: (cpu 3): tmp: 20e99cd9e75 old: 20e99cd9e75 new: 20e99cd9e75 cur: 20e99cd9e76 ok: 4a86001 fail: 925
ISB upon every timer read is not preferred, since this may be many 1000s of times a second on a busy system and it will severely impact performance. From what I have seen an isb may take anywhere between a few, to almost a 1000 cpu cycles.
A little note should be made, that the counters were not thread safe, and are off by a few percent, but this should still be accurate enough.
So the proposed normalization algorithm is:
If we go with the low end of failure rate (1 in 5000), step 4 will be reached 1 / 125000000 timer reads. That should be irregularly enough to warrant the isb in step 4.
Step 6 (after isb low end failure rate of 1/20000) should be reached once in 5e19 reads. Even with 1000000000 timer reads/s that should not be more than once every 1600 years or so. Allowing for an error message every time this happens.
This chance of course should be increased with the chance that 2 consecutive reads yield an accepted value while it is actually undefined. This chance is a little bit harder to calculate, but the chance of this occurring is about 16 / 2e56 (56 bit counter). This may of course happen in any of the 5 iterations, so around 1 - (1 - (16 / 2e56)^5) = around 1.1e-15 = around once every 9e14 reads, ie with a billion reads per second maybe once every other month or so.
This is mitigated by the mathematically provable fact that (assuming the correctness of the second assumption of the timer continuously increasing) even if two consecutive reads are undefined that if they lie within the acceptable bounds, they must still be correct. This is than also the largest caveat of this patch, if said assumption is not correct, this patch will also lead to error messages seen by users of the platform, but in this case at least every occurence can be logged.
The proposed patch should therefore provide a better handling of the hardware inconsistency than both currently in use by mainline and armbian. Added bonus is that this algorithm will perform better than both, since it only requires of few iterations to establish a timer value, instead of banging the registers 200 times before bailing out. Also the retries counter decrease has been moved outside of the loop, this will save a few cpu cycles for every successful read,
So without further ado, the patch now running on my pine64, seeing no more BUG messages, no more stalls, no more time jumps, no more hrtimer: interrupt took, no more. System is now running silky smooth on an otherwise mainline 5.10 kernel.
The text was updated successfully, but these errors were encountered: