fix race condition and NULL ptr deref in PCIe threaded probe #458

Merged
igorpecovnik merged 1 commit into armbian:rk-6.1-rkr5.1 from AlomeProg:fix-threaded-pcie-init
Mar 23, 2026

AlomeProg commented Mar 20, 2026

Description

This PR adds a patch that fixes a kernel panic in the Rockchip PCIe driver (pcie-dw-rockchip.c). The panic occurs when threaded initialization fails.

Documentation summary for feature / change

When CONFIG_PCIE_RK_THREADED_INIT is enabled, a race condition exists between the probe and remove paths. If the hardware initialization fails (e.g., power supply issues or PHY errors), rk_pcie_remove may execute concurrently with the error handling path in the probe thread.

The original implementation had multiple flaws:

  1. Unsafe Synchronization: It used a manual while loop with schedule_timeout instead of proper kernel synchronization primitives.
  2. Broken Timeout Logic: The loop condition time_before(start, start + timeout) compared start against a deadline derived from start itself and never consulted the current time, so it was always true and could loop forever.
  3. NULL Pointer Dereference: The finish_probe flag was set to true even during error cleanup. As a result, rk_pcie_remove tried to access structures that were never initialized (rk_pcie->pci->pp.bridge->bus), causing a kernel oops (Unable to handle kernel NULL pointer dereference at virtual address 00000000000002f8).
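
To illustrate flaw 2, here is a minimal userspace sketch, not the driver code. The two macros mirror the wrap-safe comparisons from <linux/jiffies.h> without the kernel's typecheck() wrappers. The function names keep_waiting_broken and keep_waiting_fixed are hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Userspace copies of the kernel's wrap-safe tick comparisons
 * (the real macros also type-check their arguments). */
#define time_after(a, b)  ((long)((b) - (a)) < 0)
#define time_before(a, b) time_after(b, a)

/* Condition as described above: both operands derive from 'start',
 * so the current time never enters the comparison. For any timeout
 * in the range 1..LONG_MAX this evaluates to true. */
bool keep_waiting_broken(unsigned long start, unsigned long timeout)
{
	return time_before(start, start + timeout);
}

/* Conventional form (an assumption about what was intended):
 * compare the current tick count against the deadline. */
bool keep_waiting_fixed(unsigned long now, unsigned long start,
			unsigned long timeout)
{
	return time_before(now, start + timeout);
}
```

With start = 1000 and timeout = 100, the broken condition stays true no matter how much time has passed, while the fixed form becomes false once now moves past 1100.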

My solution:

The patch refactors the synchronization mechanism using standard kernel APIs:

  1. Replaces the busy-wait loop with struct completion (probe_done) to ensure safe thread synchronization.
  2. Replaces the ambiguous finish_probe flag with probe_ok to explicitly signal successful initialization.
  3. Adds a check for probe_ok in rk_pcie_remove to exit early if initialization failed, preventing access to invalid pointers.
  4. Ensures completion is initialized immediately after memory allocation to prevent usage of uninitialized data.
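
The pattern in the four points above can be sketched in userspace with pthreads. This is an illustration under stated assumptions, not the patch itself: struct completion here is a hand-written stand-in for the kernel type, the helpers completion_init, completion_signal and completion_wait are hypothetical names playing the roles of the kernel's init_completion(), complete_all() and wait_for_completion(), and struct fake_pcie, probe_thread, fake_remove and run_scenario are invented for the example. Only probe_done and probe_ok come from the description. The sketch models the case where remove runs in a different thread from the probe.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

void completion_init(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Wakes every waiter, similar in spirit to complete_all(). */
void completion_signal(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

void completion_wait(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical device state for the sketch. */
struct fake_pcie {
	struct completion probe_done;
	bool probe_ok;       /* true only after a successful probe */
	int *bridge;         /* stays NULL when the probe fails */
	int bridge_storage;
	bool hw_init_fails;  /* knob to simulate a hardware failure */
};

void *probe_thread(void *arg)
{
	struct fake_pcie *p = arg;

	if (!p->hw_init_fails) {
		p->bridge_storage = 42;
		p->bridge = &p->bridge_storage;
		p->probe_ok = true;
	}
	/* Signal on both the success and the error path. */
	completion_signal(&p->probe_done);
	return NULL;
}

/* Returns the bridge value on teardown, or -1 if teardown is skipped. */
int fake_remove(struct fake_pcie *p)
{
	completion_wait(&p->probe_done);
	if (!p->probe_ok)
		return -1;   /* probe failed: do not touch p->bridge */
	return *p->bridge;
}

int run_scenario(bool hw_init_fails)
{
	struct fake_pcie p = { .hw_init_fails = hw_init_fails };
	pthread_t t;
	int ret;

	/* Initialize before the probe thread can possibly signal. */
	completion_init(&p.probe_done);
	if (pthread_create(&t, NULL, probe_thread, &p) != 0)
		return -2;
	ret = fake_remove(&p);
	pthread_join(t, NULL);
	pthread_cond_destroy(&p.probe_done.cond);
	pthread_mutex_destroy(&p.probe_done.lock);
	return ret;
}
```

Here run_scenario(false) reaches the teardown and reads the bridge value, while run_scenario(true) returns early without dereferencing the NULL bridge pointer. Because probe_ok is written before the completion is signalled and read only after the wait returns, the mutex inside the stand-in orders the two accesses.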

Testing

[x] Tested on RK3588 platform.
[x] Verified that the kernel boots successfully.
[x] Verified that no regression is introduced for successful probe scenarios.

dmesg log
[   27.170851] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   27.171511] pc : rk_pcie_remove+0x94/0x1bc
[   27.171915] lr : platform_remove+0x5c/0x74
[   27.172315] sp : ffff80000ccdbd50
[   27.172508] rk-pcie fe160000.pcie: can't get current limit.
[   27.172634] x29: ffff80000ccdbd50
[   27.173007] rk-pcie fe160000.pcie: host bridge /pcie@fe160000 ranges:
[   27.173161]  x28: 0000000000000000
[   27.173209] rk-pcie fe160000.pcie:       IO 0x00f1100000..0x00f11fffff -> 0x00f1100000
[   27.173231]  x27: 0000000000000000
[   27.173265] rk-pcie fe160000.pcie:      MEM 0x00f1200000..0x00f1ffffff -> 0x00f1200000
[   27.173306] 
[   27.173336] rk-pcie fe160000.pcie:      MEM 0x0940000000..0x097fffffff -> 0x0940000000
[   27.173382] x26: 0000000000000000
[   27.173449] rk-pcie fe160000.pcie: iATU unroll: enabled
[   27.173739]  x25: 00000000ffffffed
[   27.173764] rk-pcie fe160000.pcie: iATU regions: 8 ob, 8 ib, align 64K, limit 8G
[   27.173799]  x24: ffff00010149c000
[   27.179307] x23: 0000000000000001 x22: 0000000000000001 x21: ffff000107ee0000
[   27.179983] x20: ffff00010149c010 x19: ffff000100b04880 x18: 0000000000000000
[   27.180659] x17: 00656963702e3030 x16: 3030353165663a6d x15: 726f6674616c702b
[   27.181335] x14: 0000000000000000 x13: 64656c696166206f x12: 0000000000000000
[   27.182011] x11: ffff000100753120 x10: ffff0001003f7a10 x9 : ffff80000899f028
[   27.182687] x8 : 0101010101010101 x7 : 7f7f7f7f7f7f7f7f x6 : 736d47ff6364626d
[   27.183363] x5 : 0000008000000000 x4 : 0000000000000006 x3 : 000000000000000c
[   27.184038] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000101b7a480
[   27.184715] Call trace:
[   27.184958]  rk_pcie_remove+0x94/0x1bc
[   27.185321]  platform_remove+0x5c/0x74
[   27.185681]  device_remove+0x54/0x7c
[   27.186039]  device_release_driver_internal+0x94/0x150
[   27.186535]  device_release_driver+0x20/0x2c
[   27.186949]  rk_pcie_really_probe+0x8f0/0x918
[   27.187373]  kthread+0xc4/0xd4
[   27.187684]  ret_from_fork+0x10/0x20
[   27.159256] Unable to handle kernel NULL pointer dereference at virtual address 00000000000002f8
[   27.159263] Mem abort info:
[   27.159266]   ESR = 0x0000000096000004
[   27.159270]   EC = 0x25: DABT (current EL), IL = 32 bits
[   27.159275]   SET = 0, FnV = 0
[   27.159279]   EA = 0, S1PTW = 0
[   27.159283]   FSC = 0x04: level 0 translation fault
[   27.159288] Data abort info:
[   27.159291]   ISV = 0, ISS = 0x00000004
[   27.159295]   CM = 0, WnR = 0
[   27.159299] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000109c4d000
[   27.159305] [00000000000002f8] pgd=0000000000000000, p4d=0000000000000000
[   27.159319] Internal error: Oops: 0000000096000004 [#1] SMP

Signed-off-by: AlomeProg <alomeprog@gmail.com>
@igorpecovnik igorpecovnik merged commit 32af029 into armbian:rk-6.1-rkr5.1 Mar 23, 2026
1 check passed