Panic when running sysctl sys.device.drmn0.pcie_replay_count with amdgpu #15

NorwegianRockCat · 2020-07-18T06:46:43Z

There seems to be a panic when running the sysctl device.drmn0.pcie_replay_count with the amdgpu module loaded (on a Navi10 card).

The dump does not look very helpful:

panic: page fault

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 23; apic id = 17
fault virtual address   = 0x0
fault code              = supervisor read instruction, page not present
instruction pointer     = 0x20:0x0
stack pointer           = 0x28:0xfffffe00f834e7f8
frame pointer           = 0x28:0xfffffe00f834e810
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 3330 (sysctl)
trap number             = 12
panic: page fault
cpuid = 23
time = 1595050508
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00f834e4b0
vpanic() at vpanic+0x182/frame 0xfffffe00f834e500
panic() at panic+0x43/frame 0xfffffe00f834e560
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00f834e5c0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00f834e610
trap() at trap+0x271/frame 0xfffffe00f834e720
calltrap() at calltrap+0x8/frame 0xfffffe00f834e720
--- trap 0xc, rip = 0, rsp = 0xfffffe00f834e7f8, rbp = 0xfffffe00f834e810 ---
??() at 0/frame 0xfffffe00f834e810
sysctl_handle_attr() at sysctl_handle_attr+0x70/frame 0xfffffe00f834e860
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x91/frame 0xfffffe00f834e8b0
sysctl_root() at sysctl_root+0x249/frame 0xfffffe00f834e930
userland_sysctl() at userland_sysctl+0x173/frame 0xfffffe00f834e9e0
sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe00f834ea90
amd64_syscall() at amd64_syscall+0x119/frame 0xfffffe00f834ebb0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00f834ebb0
--- syscall (202, FreeBSD ELF64, sys___sysctl), rip = 0x800414caa, rsp = 0x7fffffffc3b8, rbp = 0x7fffffffc3f0 ---
KDB: enter: panic

I put the core.txt up on pastebin if more information is wanted.


evadot · 2020-07-24T11:31:48Z

I cannot reproduce this on my AMD machine.
Does it happen every time for you?
Do you have other kernel modules loaded that could cause this?

NorwegianRockCat · 2020-07-25T08:16:36Z

It seems to happen every time I tried this. This core dump was done right after I loaded the system. I typed sysctl -a to see that it worked.

then

kldload amdgpu

then sysctl -a and the system crashed.

Otherwise, cpuctl and amdtemp are the only other modules I loaded explicitly. I think the core dump might show what other required modules were loaded.

I am away from the system until Friday, but I can try and investigate a little more when I am back.

NorwegianRockCat · 2020-08-01T08:14:56Z

Did a new rebuild on the system and it still happens when amdgpu is loaded. I'll see if I can dig more.

Perusing the source code I did find this interesting comment in amdgpu_pm.c:418:

    /* sysctl -a can panic if this data is uninitialized */
    memset(&data, 0, sizeof(struct pp_states_info));

So, I wonder if there are other Navi 10-specific structures that might have a similar issue? Any suggestions on where one could start looking? Are all snprintf() calls likely being fed to sysctl?
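As a rough illustration of what that comment guards against, here is a minimal user-space sketch. It is not the driver's code: `states_info_sketch`, `query_states_sketch` and `format_states_sketch` are hypothetical names, and the assumption is simply that a handler formats a stack-allocated structure which a backend may or may not fill in. Zeroing the structure first means an unfilled `nums` is 0, so the loop emits nothing instead of walking garbage.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_STATES 16

/* Hypothetical stand-in for a structure like pp_states_info. */
struct states_info_sketch {
    unsigned int nums;
    int states[SKETCH_MAX_STATES];
};

/* Hypothetical backend: fills the structure only when supported. */
static int query_states_sketch(struct states_info_sketch *data, int supported)
{
    if (!supported)
        return -1;          /* data is left untouched */
    data->nums = 2;
    data->states[0] = 10;
    data->states[1] = 20;
    return 0;
}

/* Formats the states into buf; returns the number of bytes written. */
static size_t format_states_sketch(char *buf, size_t len, int supported)
{
    struct states_info_sketch data;
    size_t off = 0;
    unsigned int i;

    assert(buf != NULL);
    if (len == 0)
        return 0;
    buf[0] = '\0';

    /* Zero first, so an unfilled structure reads as "no states". */
    memset(&data, 0, sizeof(data));
    (void)query_states_sketch(&data, supported);

    for (i = 0; i < data.nums && i < SKETCH_MAX_STATES; i++) {
        int n = snprintf(buf + off, len - off, "%u: %d\n", i, data.states[i]);
        if (n < 0 || (size_t)n >= len - off) {
            buf[off] = '\0';    /* drop the partial entry */
            break;
        }
        off += (size_t)n;
    }
    return off;
}
```

Calling `format_states_sketch(buf, sizeof(buf), 0)` leaves `buf` empty, while passing `1` as the last argument produces one line per state.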

NorwegianRockCat · 2020-08-02T15:05:21Z

I had a little bit of time and read the sysctl manpage. It seems that the OID that causes the problem is sys.device.drmn0.pcie_replay_count. I've updated the bug report to reflect that information.

DarkKirb · 2020-11-04T12:16:07Z

I poked around with kgdb a bit, and it appears that nv_asic_funcs (https://github.com/freebsd/drm-kmod/blob/master/drivers/gpu/drm/amd/amdgpu/nv.c#L564-L582) is not fully defined. It is missing these fields:

int (*get_pcie_lanes)(struct amdgpu_device *adev);
void (*set_pcie_lanes)(struct amdgpu_device *adev, int lanes);
uint64_t (*get_pcie_replay_count)(struct amdgpu_device *adev);

This bug affects Navi GPUs (the RX 5xxx series). The other instantiations of the amdgpu_asic_funcs structure contain the get_pcie_replay_count field, meaning it is not reproducible there.

This commit adds a stub "nv_get_pcie_replay_count" function that prevents a NULL function pointer from being called in kernel space on systems with Navi GPUs. This commit fixes freebsd#15
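The failure mode described above can be sketched in plain user-space C. This is an illustration under stated assumptions, not the driver source: `device_sketch`, `asic_funcs_sketch` and every function below are hypothetical stand-ins. A member left out of a designated initializer is zero-initialized, so calling through it jumps to address 0, which matches the `rip = 0` in the backtrace. Supplying a stub (or checking before the call) avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the device type. */
struct device_sketch {
    int id;
};

/* Hypothetical, simplified stand-in for a per-ASIC function table. */
struct asic_funcs_sketch {
    int (*get_pcie_lanes)(struct device_sketch *dev);
    uint64_t (*get_pcie_replay_count)(struct device_sketch *dev);
};

static int sketch_get_pcie_lanes(struct device_sketch *dev)
{
    (void)dev;
    return 16;
}

/* Stub: reports zero replays rather than leaving the slot empty. */
static uint64_t sketch_get_pcie_replay_count(struct device_sketch *dev)
{
    (void)dev;
    return 0;
}

/* get_pcie_replay_count is not listed, so it is implicitly NULL. */
static const struct asic_funcs_sketch incomplete_funcs = {
    .get_pcie_lanes = sketch_get_pcie_lanes,
};

/* Same table with the stub filled in. */
static const struct asic_funcs_sketch complete_funcs = {
    .get_pcie_lanes = sketch_get_pcie_lanes,
    .get_pcie_replay_count = sketch_get_pcie_replay_count,
};

/*
 * Defensive caller: returns -1 when the callback is missing instead of
 * calling through a NULL pointer, 0 on success.
 */
static int read_replay_count(const struct asic_funcs_sketch *funcs,
                             struct device_sketch *dev, uint64_t *out)
{
    assert(out != NULL);
    if (funcs->get_pcie_replay_count == NULL)
        return -1;
    *out = funcs->get_pcie_replay_count(dev);
    return 0;
}
```

With `incomplete_funcs`, `read_replay_count` returns -1 and leaves `*out` untouched; with `complete_funcs` it returns 0 and stores the stub's value. Whether a stub or a caller-side check is preferable in the real driver is a design choice this sketch does not settle.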

NorwegianRockCat · 2020-11-04T18:37:23Z

Great sleuthing @DarkKirb! Thanks for tracking this down! I've been meaning to track this more, but my current machine with the Navi card is currently packed away waiting for a move.

NorwegianRockCat · 2021-02-13T10:42:12Z

Hi @evadot, this problem is still present in 13-STABLE (and -BETA I imagine).

You mentioned that you would cherry-pick torvalds/linux@2af8153 from 5.5 to 5.4-lts (ref #37). The cherry-pick indeed solves the problem.

It would be nice to have this working for 13-RELEASE as sysctl -a |grep … is a pattern that shows up from time to time in some of my workflows.

Can you do the cherry-pick, please?

evadot · 2021-03-11T16:06:30Z

Mhm, I was somehow sure that I had cherry-picked the patch, but it seems not...
Anyway, it's done now.
I have a few other things to include before I cut a new release, but it would help if you could confirm that building from the 5.4-lts branch fixes the issue for you.
Thanks.

NorwegianRockCat · 2021-03-11T17:54:10Z

I already have that patch applied locally to the 5.4-lts branch, and I can confirm that it does fix the problem.

I thought you had applied the patch too, but when I looked at the code, I couldn't see it. Hence my nudge here; I'm glad it was finally dealt with before 13.0-RELEASE.

I'll let you do the honors of closing the issue :-).

Recently we got a hard hang during the boot on DCN 3.0.1, which caused the below null pointer exception:

[ +0.000426] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ +0.000003] #PF: supervisor read access in kernel mode
[ +0.000003] #PF: error_code(0x0000) - not-present page
[ +0.000003] PGD 0 P4D 0
[ +0.000004] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000005] CPU: 6 PID: 874 Comm: Xorg Not tainted 5.16.0.asdn-apr28+ #15
[ +0.000004] Hardware name: AMD Chachani-VN/Chachani-VN, BIOS WCH2303N 03/03/2022
[ +0.000003] RIP: 0010:resource_map_pool_resources+0x431/0xa70 [amdgpu]
[ +0.000356] Code: c1 4d 89 c8 49 c1 e0 07 4d 01 c8 49 c1 e0 04 4d 01 f0 49 83 b8 f0 01 00 00 00 0f 85 16 02 00 00 49 8b b8 e0 02 00 00 89 45 c0 <48> 8b 17 4c 8b 92 a0 01 00 00 4d 85 d2 74 24 4c 89 4d 88 48 8d 4d
[ +0.000003] RSP: 0018:ffffa92a4142f718 EFLAGS: 00010246
[ +0.000003] RAX: 0000000000000000 RBX: ffff9a0b86d93000 RCX: 0000000000000000
[ +0.000002] RDX: 0000000000000000 RSI: 000000000000554b RDI: 0000000000000000
[ +0.000002] RBP: ffffa92a4142f798 R08: ffff9a0bdb3c0000 R09: 0000000000000000
[ +0.000002] R10: 0000000000000000 R11: 000000000000f000 R12: 0000000000000000
[ +0.000001] R13: ffff9a0b88360000 R14: ffff9a0bdb3c0000 R15: ffff9a0b86273000
[ +0.000003] FS: 00007f4b5641ca40(0000) GS:ffff9a0cb7f80000(0000) knlGS:0000000000000000
[ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000002] CR2: 0000000000000000 CR3: 0000000102cb2000 CR4: 00000000003506e0
[ +0.000003] Call Trace:
[ +0.000002] <TASK>
[ +0.000004] ? kvmalloc_node+0x5c/0x90
[ +0.000009] dcn20_add_stream_to_ctx+0x1c/0x90 [amdgpu]
[ +0.000330] dcn30_add_stream_to_ctx+0xe/0x10 [amdgpu]
[ +0.000313] dc_add_stream_to_ctx+0x67/0x80 [amdgpu]
[ +0.000300] dm_update_crtc_state+0x4dd/0x6e0 [amdgpu]
[ +0.000320] amdgpu_dm_atomic_check+0x63b/0x1270 [amdgpu]
[ +0.000311] ? __drm_mode_object_add+0x90/0xc0 [drm]
[ +0.000043] ? preempt_count_add+0x74/0xc0
[ +0.000005] ? _raw_spin_lock_irqsave+0x2a/0x60
[ +0.000006] ? _raw_spin_unlock_irqrestore+0x29/0x3d
[ +0.000003] ? drm_connector_list_iter_next+0x8e/0xb0 [drm]
[ +0.000038] drm_atomic_check_only+0x5dd/0xa20 [drm]
[ +0.000044] drm_atomic_commit+0x18/0x60 [drm]
[ +0.000046] drm_client_modeset_commit_atomic+0x1e5/0x220 [drm]
[ +0.000051] drm_client_modeset_commit_locked+0x57/0x160 [drm]
[ +0.000038] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xd0 [drm_kms_helper]
[ +0.000027] drm_fb_helper_set_par+0x40/0x50 [drm_kms_helper]
[ +0.000022] fb_set_var+0x1c8/0x3d0
[ +0.000007] ? __ext4_mark_inode_dirty+0x83/0x210
[ +0.000006] ? __ext4_journal_stop+0x3c/0xb0
[ +0.000008] fbcon_blank+0x228/0x290
[ +0.000007] do_unblank_screen+0xae/0x150
[ +0.000005] vt_ioctl+0xcf4/0x1360
[ +0.000005] ? get_max_files+0x20/0x20
[ +0.000005] ? get_max_files+0x20/0x20
[ +0.000004] ? debug_smp_processor_id+0x17/0x20
[ +0.000004] tty_ioctl+0x373/0x8a0
[ +0.000005] ? __fput+0x123/0x260
[ +0.000004] ? __fget_light+0xc5/0x100
[ +0.000005] __x64_sys_ioctl+0x91/0xc0
[ +0.000005] do_syscall_64+0x3b/0xc0
[ +0.000005] entry_SYSCALL_64_after_hwframe+0x44/0xae

This issue happens because "pipe_ctx->stream_res.tg" needs to be initialized first before reading its members. This commit fixes this issue by properly initializing the pointer before accessing the target data.

Fixes: 663d2daeaee6 ("drm/amd/display: Add odm seamless boot support")
Cc: Agustin Gutierrez <agustin.gutierrez@amd.com>
Signed-off-by: Sung Joon Kim <Sungjoon.Kim@amd.com>
Reviewed-by: Agustin Gutierrez <agustin.gutierrez@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

evadot added the bug label Jul 24, 2020

NorwegianRockCat changed the title ~~Panic when running sysctl -a with amdgpu loaded~~ Panic when running sysctl sys.device.drmn0.pcie_replay_count with amdgpu Aug 2, 2020

DarkKirb mentioned this issue Nov 4, 2020

Add stub get_pcie_replay_count for navi gpus #37

Closed

valpackett mentioned this issue Nov 28, 2020

Panic running "sysctl sys" with Radeon Pro W5700 #41

Closed

tsujp mentioned this issue Nov 28, 2020

AMD Navi 5700XT on 13-CURRENT does not work #42

Closed

valpackett mentioned this issue Mar 9, 2021

panic on 14-CURRENT with AMD NAVI10 #64

Closed

sausunoki mentioned this issue Apr 12, 2021

amdgpu navi10, panic on 13-STABLE or 14-CURRENT #68

Closed

ctipper mentioned this issue May 2, 2021

kernel panic from linux_dump_stack() caused by drm_atomic_helper.c:621 #69

Closed

evadot closed this as completed Oct 8, 2021

thisplacestinksoffascism mentioned this issue Apr 11, 2022

panic on 14-CURRENT; AMD NAVI10; X suspension/resumption; vm_fault_lookup: fault on nodefault entry ... #157

Open

JustAnotherHumanBeing mentioned this issue Jan 16, 2023

Update to Linux 5.13 drivers #224

Merged

JustAnotherHumanBeing mentioned this issue Mar 20, 2023

Update to Linux 5.17 drivers #236

Merged

daemonblade mentioned this issue Dec 31, 2023

drm-515-kmod 5.15.118_3 panic on STABLE-14/amd64 built 27-Dec-2023 #276

Open
