Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sunxi fix-a64-timejump.patch #2824

Merged
merged 3 commits into from May 22, 2021
Merged

Improve sunxi fix-a64-timejump.patch #2824

merged 3 commits into from May 22, 2021

Conversation

@Uglymotha
Copy link
Contributor

@Uglymotha Uglymotha commented May 13, 2021

Current patch still throws BUG:
WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38

Short story:
Cause: Timer values are required to be exactly the same, while it may increase in between reads. Seems to happen especially during boot
Timer values are like: x2 : 0000000015014561 - x0 : 0000000015014567

Solution: Normalize timer values and return normalized value.

Full story (long read):
Linux mainline patch still suffers from (pretty serious) rcu stalls:
34107.703635] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[34107.703667] rcu: 2-...!: (17 ticks this GP) idle=c8a/1/0x4000000000000000 softirq=25484/25485 fqs=2431
[34107.703676] (detected by 1, t=238767 jiffies, g=60137, q=4756)
[34107.703681] Task dump for CPU 2:
[34107.703687] task:ksoftirqd/2 state:R running task stack: 0 pid: 21 ppid: 2 flags:0x00000028
[34107.703703] Call trace:
[34107.703725] __switch_to+0xf4/0x148
[34107.703733] 0xd100c1b192ba8400
[34107.703745] rcu: rcu_sched kthread starved for 233905 jiffies! g60137 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[34107.703750] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[34107.703753] rcu: RCU grace-period kthread stack dump:
[34107.703757] task:rcu_sched state:I stack: 0 pid: 11 ppid: 2 flags:0x00000028
[34107.703765] Call trace:
[34107.703773] __switch_to+0xf4/0x148
[34107.703782] __schedule+0x214/0x588
[34107.703788] schedule+0x48/0xd8
[34107.703796] schedule_timeout+0x1a0/0x2d8
[34107.703807] rcu_gp_kthread+0x418/0xbb8
[34107.703816] kthread+0x120/0x128
[34107.703823] ret_from_fork+0x10/0x30

Armbian supplied patch still logs BUG message at boot (mostly cosmetical), note the hrtimer: interrupt took message in there, clear indication timing is still not fully correct.
[ 6.433674] ------------[ cut here ]------------
[ 6.436322] WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38
[ 6.438397] Modules linked in: cpufreq_dt realtek pinctrl_axp209 dwmac_sun8i i2c_mv64xxx mdio_mux uas
[ 6.441577] CPU: 2 PID: 31 Comm: kworker/2:1 Not tainted 5.10.29-sunxi64 #trunk
[ 6.443701] Hardware name: Pine64+ (DT)
[ 6.446152] Workqueue: events dbs_work_handler
[ 6.448432] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[ 6.450658] pc : sun50i_a64_read_cntpct_el0+0x2c/0x38
[ 6.452858] lr : arch_counter_get_cntpct_stable+0x3c/0x80
[ 6.454903] sp : ffff800011663a30
[ 6.456983] x29: ffff800011663a30 x28: ffff00000506fd00
[ 6.459266] x27: ffff0000034fa600 x26: ffff000077b6f098
[ 6.461580] x25: 0000000000000000
[ 6.463624] hrtimer: interrupt took 24086000 ns
[ 6.465769] x24: ffff0000099c6028
[ 6.467972] x23: ffff800011663b30 x22: 0000000000000001
[ 6.470292] x21: 0000000000000001 x20: 0000000000000018
[ 6.472575] x19: ffff800011474000 x18: 0000000000000001
[ 6.474894] x17: 0000000000000018 x16: 0000000000000002
[ 6.477133] x15: 0000000000000001 x14: 0000000000000001
[ 6.479420] x13: 0000000000000005 x12: 0000000000000021
[ 6.481815] x11: 0000000000000005 x10: 00000000bcd3d800
[ 6.484087] x9 : 0000000000000004 x8 : 0000000000000004
[ 6.486339] x7 : ffff800011431180 x6 : 00000000112a8800
[ 6.488597] x5 : 0000000000000010 x4 : 0000000000000012
[ 6.490850] x3 : 0000000000000050 x2 : 0000000015014561
[ 6.493186] x1 : 0000000000000000 x0 : 0000000015014567
[ 6.495544] Call trace:
[ 6.497825] sun50i_a64_read_cntpct_el0+0x2c/0x38
[ 6.500060] __delay+0x24/0xa8
[ 6.502260] __udelay+0x2c/0x38
[ 6.504441] ccu_mux_notifier_cb+0x64/0xa0
[ 6.506654] srcu_notifier_call_chain+0x70/0xc0
[ 6.508789] __clk_notify+0x8c/0xc8
[ 6.510943] clk_propagate_rate_change+0xc4/0xe0
[ 6.513175] clk_core_set_rate_nolock+0x118/0x1f0
[ 6.515430] clk_set_rate+0x38/0xa8
[ 6.517704] dev_pm_opp_set_rate+0x258/0x5c8
[ 6.519905] set_target+0x30/0x40 [cpufreq_dt]
[ 6.522058] __cpufreq_driver_target+0x2b0/0x698
[ 6.524203] od_dbs_update+0x140/0x1a0
[ 6.526382] dbs_work_handler+0x40/0x78
[ 6.528501] process_one_work+0x1f0/0x3c0
[ 6.530667] worker_thread+0x140/0x520
[ 6.532837] kthread+0x120/0x128
[ 6.535081] ret_from_fork+0x10/0x30
[ 6.537277] ---[ end trace 565df19941ff3b35 ]---

I decided to build some debugging straight into the __sun50i_a64_read_reg(reg) function to establish what the kernel is actually seeing when it reads the CP15 registers. First step is to check whether compiler is generating code that does the reads in two consecutive instructions, which it does:
77c: d53be023 mrs x3, cntpct_el0
780: d53be034 mrs x20, cntpct_el0
Going from there I build in some counters and do some outputting of counter values as read by the kernel. Of course the already known issue is seen (in this case 3 consecutive reads are done):
Feb 5 14:08:22 pine64 kernel: [ 27.782699] arch_timer: (cpu 2): old: 34f6ffff new: 34f6c1ff cur: 34f70000

However I established some more, seemingly inexplicable behavior, like (4 consecutive reads):
[ 3593.682792] arch_timer: (cpu 0): try 4 tmp: 1422498e6b old: 1422498e6b new: 1422499578 cur: 142249957e
Apparently here, in between two consecutive reads (old - new), the timer value increased by almost 2000 ticks, ie almost 100.000 cpu cycles seemingly past between two consecutive instructions. This is not easily explained by either memory latency, branch prediction, cpu caching, out of order execution (cortex A53 doesn't), or preemption (Non-preemptable kernel and this is scheduler anyways), because well, 100.000 cpu cycles. And I have seen (unfortunately not saved the evidence) jumps up to representing almost half a million cpu cycles.

Therefor I decided to take a more mathematical approach, as follows.

  1. It shall be assumed that reading the timer may yield an undefined value. This immediately invalidates the mainline patch, as it relies on only one read (which may be undefined).
  2. It shall be assumed that (apart from already known backwards jump), the timer is a continuously incrementing timer .
  3. It shall thus be required to normalize the timer value and return the normalized value. IE will not accept inexplicable jumps. This invalidates Armbian supplied patch as it requires the values to be exactly the same.

So, lets take a closer look at the BUG message generated by Armbian patch:
[ 6.490850] x3 : 0000000000000050 x2 : 0000000015014561
[ 6.493186] x1 : 0000000000000000 x0 : 0000000015014567

Here x0 and x2 registers hold the last read values, where new value in x0 is 6 higher than old value in x2. This is why Armbian patch fails, during my research I have seen the timer values normally be the same or increase by 1, 6, or 9 (and then some inexplicably large jumps and the known backward jumps). So normalization of timer value should be doable by a requiring new to be within 16 of old.
Implementing this with some counters monitoring the success / failure rate of this algorithm shows this to be around 1 in 5000 to 1 in 10000, higher load seem to lead to less failures.
[ 20.701225] arch_timer: (cpu 2): tmp: 2aca3879 old: 2aca3879 new: 2aca3879 cur: 2aca387a ok: 3223553 fail: 2164
[ 255.909185] arch_timer: (cpu 2): tmp: 17b41f1ad old: 17b41f1ad new: 17b41f1ad cur: 17b41f1ae ok: 4165633 fail: 2473
[ 261.316976] arch_timer: (cpu 2): tmp: 182fe560f old: 182fe560f new: 182fe560f cur: 182fe5610 ok: 4169729 fail: 2473
[ 264.764859] arch_timer: (cpu 1): tmp: 187ecfd12 old: 187ecfd12 new: 187ecfd12 cur: 187ecfd13 ok: 4173825 fail: 2473
[ 270.988662] arch_timer: (cpu 2): tmp: 190d43690 old: 190d43690 new: 190d43691 cur: 190d43692 ok: 4177921 fail: 2473
[ 276.159639] arch_timer: (cpu 3): tmp: 19839e16e old: 19839e16f new: 19839e16f cur: 19839e170 ok: 4182017 fail: 2473
[ 276.201489] arch_timer: (cpu 2): tmp: 198492c1a old: 198492c1a new: 198492c1a cur: 198492c1b ok: 4186113 fail: 2475

Issuing an synchronization barrier (isb) before reading the values (as is recommended by ARM platform documentation on doing consecutive reads) drastically reduces the failure rate:
Feb 20 08:03:26 pine64 kernel: [ 13.929052] arch_timer: (cpu 3): tmp: 211a959a old: 211a959a new: 211a959a cur: 211a959f ok: 209001 fail: 113
Feb 20 08:03:26 pine64 kernel: [ 14.290863] arch_timer: (cpu 3): tmp: 219f15e7 old: 219f15e7 new: 219f15e7 cur: 219f15e9 ok: 20b001 fail: 114
[94229.670320] arch_timer: (cpu 3): tmp: 20e99cd9e75 old: 20e99cd9e75 new: 20e99cd9e75 cur: 20e99cd9e76 ok: 4a86001 fail: 925

ISB upon every timer read is not preferred, since this may be many 1000s of times a second on a busy system and it will severely impact performance. From what I have seen an isb may take anywhere between a few, to almost a 1000 cpu cycles.

A little note should be made, that the counters were not thread safe, and are off by a few percent, but this should still be accurate enough.

So the proposed normalization algorithm is:

  1. Do two consecutive reads of the timer.
  2. Accept the value if the second is within 0 to 16 larger then the first
  3. Repeat step 1 and 2 for 2 more tries.
  4. If still unable to establish normalized value issue isb. and repeat step 1 and 2.
  5. Repeat step 4.
  6. Bail out.

If we go with the low end of failure rate (1 in 5000), step 4 will be reached 1 / 125000000 timer reads. That should be irregularly enough to warrant the isb in step 4.
Step 6 (after isb low end failure rate of 1/20000) should be reached once in 5e19 reads. Even with 1000000000 timer reads/s that should not be more than once every 1600 years or so. Allowing for an error message every time this happens.
This chance of course should be increased with the chance that 2 consecutive reads yield an accepted value while it is actually undefined. This chance is a little bit harder to calculate, but the chance of this occurring is about 16 / 2e56 (56 bit counter). This may of course happen in any of the 5 iterations, so around 1 - (1 - (16 / 2e56)^5) = around 1.1e-15 = around once every 9e14 reads, ie with a billion reads per second maybe once every other month or so.
This is mitigated by the mathematically provable fact that (assuming the correctness of the second assumption of the timer continuously increasing) even if two consecutive reads are undefined that if they lie within the acceptable bounds, they must still be correct. This is than also the largest caveat of this patch, if said assumption is not correct, this patch will also lead to error messages seen by users of the platform, but in this case at least every occurence can be logged.

The proposed patch should therefore provide a better handling of the hardware inconsistency than both currently in use by mainline and armbian. Added bonus is that this algorithm will perform better than both, since it only requires of few iterations to establish a timer value, instead of banging the registers 200 times before bailing out. Also the retries counter decrease has been moved outside of the loop, this will save a few cpu cycles for every successful read,

So without further ado, the patch now running on my pine64, seeing no more BUG messages, no more stalls, no more time jumps, no more hrtimer: interrupt took, no more. System is now running silky smooth on an otherwise mainline 5.10 kernel.

Current patch still throws BUG:
WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38

Cause: Timer values are required to be exactly the same, while it may increase in between reads. Seems to happen especially during boot
Timer values are like: x2 : 0000000015014561 - x0 : 0000000015014567

Solution: Normalize timer values and return normalized value.
@apritzel
Copy link

@apritzel apritzel commented May 14, 2021

So nice analysis, but why is this not happening on the mailing list, to get some more input and to fix this properly? I did some experiments with some bare metal code (running on a single core in EL2, after U-Boot), those are my observations:

  • on some boots I see no issues at all
  • depending on the frequency two back-to-back counter reads give the same value in between 1/4th till 1/8th of the cases. This makes the Freescale workaround (the current Armbian hack) rather expensive (only one of multiple problems with this version)
  • at 408 MHz there are no jumps
  • at 24 MHz the counter reads are always 6 ticks apart
  • this altogether tells me that a counter mrs takes 6 cycles: at 1152 MHz there are 48 CPU cycles per timer tick, so a 1/8 chance of a 6-cycle period to be split up: and indeed around 12% of read pairs differ by 1.
  • in all my experiments the value pairs being not 0 or 1 apart (at >= 408 MHz) all had their lowest 11 bits either all cleared or all set - which is what the mainline workaround covers
    So can you describe what the problems with the current mainline workaround are? Apparently there are issues, but no one seemed to have reported this on the list? I will try to run some more tests overnight, and with more boards, but would be grateful if we get some hints at how to reproduce the mainline issue.

@smaeul
Copy link

@smaeul smaeul commented May 14, 2021

@Uglymotha Could you please try the mainline version with one small change:

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index d0177824c518..f4881764bf8f 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -352,7 +352,7 @@ static u64 notrace arm64_858921_read_cntvct_el0(void)
        do {                                                            \
                _val = read_sysreg(reg);                                \
                _retries--;                                             \
-       } while (((_val + 1) & GENMASK(9, 0)) <= 1 && _retries);        \
+       } while (((_val + 1) & GENMASK(8, 0)) <= 1 && _retries);        \
                                                                        \
        WARN_ON_ONCE(!_retries);                                        \
        _val;                                                           \

This will have better performance than any workaround involving multiple counter reads.

Some background: when developing the mainline workaround, I never experienced jumps with fewer than 11 bad bits. I received reports from other users about U-Boot MMC read failures that were resolved by adjusting the check to 10 bits, so I made the same change in the Linux workaround. More recently, another user provided proof of 9-bit errors, and I was able to reproduce those on a PinePhone with improved tools. So far there has been no evidence of 8-bit jumps, so the change above should be enough.

@The-going
Copy link
Contributor

@The-going The-going commented May 15, 2021

@Uglymotha Very important research! If possible, please open a discussion here or on the forum. I'll join the testing. I ran into this problem on older kernels about two years ago and somehow got around it. I only remember that I redid the frequency table of the A64 processor in multiples of 24 MHz. And please issue the patches as (git format-patch) with a description of the problem and who the author of the fix is. Thanks.

@apritzel
Copy link

@apritzel apritzel commented May 16, 2021

@The-going Can we PLEASE have this discussion not here or in some Armbian forums, but on the public mailing list?
@smaeul posted a patch to LAKML already: http://lists.infradead.org/pipermail/linux-arm-kernel/2021-May/657682.html
Please report any findings there!
This is Linux after all, and it lives from people contributing BACK, to the source: mainlinux Linux.
If those issues would have been reported on the list already when Armbian took the Freescale hack, we probably would have a solution working for ALL users already.

@The-going
Copy link
Contributor

@The-going The-going commented May 17, 2021

@apritzel Well, let's discuss where your suggestions are.

@Uglymotha
Copy link
Contributor Author

@Uglymotha Uglymotha commented May 18, 2021

So can you describe what the problems with the current mainline workaround are? Apparently there are issues, but no one seemed to have reported this on the list? I will try to run some more tests overnight, and with more boards, but would be grateful if we get some hints at how to reproduce the mainline issue.

On my pine64 with mainline kernel I get mentioned rcu_stall messages. Symptoms are stalled processes, that will mostly eventually continue, but in rare cases hang the entire machine, requiring reboot.

@Uglymotha
Copy link
Contributor Author

@Uglymotha Uglymotha commented May 18, 2021

motha Could you please try the mainline version with one small change:

I already tried that during my troubleshooting, which took a few months to go through. It will still cause BUG message to be logged on boot, because timer values differ 6 (or sometimes 9), which can not be detected by the bitmask. It can also not detect the inexplicably large dumps observed.

It's the main reason I went down the way of creating a patch which can mathematically be proven to be correct and have as little impact as possible.

Note. Small changes on upstream solution:
GENMASK(9, 0 ... -> GENMASK(8, 0 ...
igorpecovnik added a commit that referenced this issue May 22, 2021
* Improve sunxi fix-a64-timejump.patch

Current patch still throws BUG:
WARNING: CPU: 2 PID: 31 at drivers/clocksource/arm_arch_timer.c:364 sun50i_a64_read_cntpct_el0+0x2c/0x38

Cause: Timer values are required to be exactly the same, while it may increase in between reads. Seems to happen especially during boot
Timer values are like: x2 : 0000000015014561 - x0 : 0000000015014567

Solution: Normalize timer values and return normalized value.

* Add patch for 5.12.y

Note. Small changes on upstream solution:
GENMASK(9, 0 ... -> GENMASK(8, 0 ...

Co-authored-by: Igor Pecovnik <igor.pecovnik@gmail.com>
@igorpecovnik igorpecovnik merged commit 314061c into armbian:master May 22, 2021
2 checks passed
2 checks passed
@github-actions[bot]
shellcheck
Details
@github-actions[bot]
Compile changed kernel
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants