RX570/POLARIS12 panic during GPU post on 13/stable aarch64 #84

agrajag9 · 2021-06-06T11:03:10Z

When attempting to kldload amdgpu, the system panics.

FreeBSD version
FreeBSD honeycomb 13.0-STABLE FreeBSD 13.0-STABLE #2 stable/13-n245851-02966cbdf03: Wed Jun 2 23:16:06 UTC 2021 agrajag9@honeycomb:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm64

PCI Info

pciconf -lv

nvme0@pci2:1:0:0:   class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa808 subvendor=0x144d subdevice=0xa801
    vendor     = 'Samsung Electronics Co Ltd'
    device     = 'NVMe SSD Controller SM981/PM981/PM983'
    class      = mass storage
    subclass   = NVM
vgapci0@pci4:1:0:0: class=0x030000 rev=0xc7 hdr=0x00 vendor=0x1002 device=0x699f subvendor=0x1da2 subdevice=0xe367
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]'
    class      = display
    subclass   = VGA
none0@pci4:1:0:1:   class=0x040300 rev=0x00 hdr=0x00 vendor=0x1002 device=0xaae0 subvendor=0x1da2 subdevice=0xaae0
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]'
    class      = multimedia
    subclass   = HDA

DRM KMOD version

drm-fbsd13-kmod 5.4.92.g20210419
drm-kmod g20190710_1

To Reproduce
Steps to reproduce the behavior:
kldload -v amdgpu

Additional context

root@honeycomb:~ # kldload -v amdgpu
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
<6>[drm] register mmio base: 0x40000000
<6>[drm] register mmio size: 262144
<6>[drm] add ip block number 0 <vi_common>
<6>[drm] add ip block number 1 <gmc_v8_0>
<6>[drm] add ip block number 2 <tonga_ih>
<6>[drm] add ip block number 3 <gfx_v8_0>
<6>[drm] add ip block number 4 <sdma_v3_0>
<6>[drm] add ip block number 5 <powerplay>
<6>[drm] add ip block number 6 <dm>
<6>[drm] add ip block number 7 <uvd_v6_0>
<6>[drm] add ip block number 8 <vce_v3_0>
<6>[drm] UVD is enabled in VM mode
<6>[drm] UVD ENC is enabled in VM mode
<6>[drm] VCE enabled in VM mode
<6>[drm] GPU posting now...
[drm ERROR :atom_op_jump] atombios stuck in loop for more than 10secs aborting
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing AD44 (len 428, WS 20, PS 0) @ 0xAE76
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing A984 (len 158, WS 0, PS 8) @ 0xA9E7
drmn0: gpu post error!
drmn0: Fatal error during GPU init
<6>[drm] amdgpu: finishing device.
Warning: can't remove non-dynamic nodes (dri)!
device_attach: drmn0 attach returned 22
  x0:                b
  x1:                0
  x2:     ffffffffee00
  x3:               33
  x4:         40100401
  x5:    800208000aaaa
  x6:                1
  x7:            f5ff5
  x8:              130
  x9:                0
 x10:                0
 x11:         80130000
 x12:              427
 x13:                0
 x14:         80000000
 x15:         402bd5e1
 x16:         403cd89c
 x17:     ffffffffe540
 x18:                0
 x19:     ffffffffeb30
 x20:                0
 x21:           200bd5
 x22:                1
 x23:     ffffffffee13
 x24:                0
 x25:                1
 x26:           200b2a
 x27:           200c82
 x28:                1
 x29:     ffffffffea80
  sp:     ffffffffe550
  lr:           2110fc
 elr:         403cd8a4
spsr:         80000200
 far:                0
 esr:         bf000000
panic: Unhandled System Error
cpuid = 7
time = 1613602221
KDB: stack backtrace:
#0 0xffff000000443e6c at kdb_backtrace+0x60
#1 0xffff0000003ee0cc at vpanic+0x184
#2 0xffff0000003edf44 at panic+0x44
#3 0xffff0000007048ac at do_serror+0x40
#4 0xffff0000006e5c9c at handle_serror+0x88
Uptime: 10m52s
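
For anyone reading the register dump above, here is a minimal sketch of how the `esr: bf000000` value splits into fields. It assumes the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0); the helper names `esr_decode` and `esr_print` are made up for illustration and are not taken from the FreeBSD sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ESR_ELx layout per the Arm architecture: EC[31:26], IL[25], ISS[24:0]. */
#define ESR_EC_SHIFT   26
#define ESR_EC_MASK    0x3fu
#define ESR_IL_BIT     (1u << 25)
#define ESR_ISS_MASK   0x01ffffffu
#define ESR_EC_SERROR  0x2fu        /* exception class: SError interrupt */
#define ESR_ISS_IDS    (1u << 24)   /* SError ISS: implementation-defined syndrome */

/* Sanity check on the layout above: EC must cover exactly bits 31:26. */
static_assert((ESR_EC_MASK << ESR_EC_SHIFT) == 0xfc000000u, "EC field misplaced");

struct esr_fields {
    unsigned ec;    /* exception class */
    unsigned il;    /* instruction length bit */
    uint32_t iss;   /* instruction specific syndrome */
};

/* Hypothetical helper: only splits the raw register value into its fields. */
struct esr_fields esr_decode(uint32_t esr)
{
    struct esr_fields f;

    f.ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
    f.il = (esr & ESR_IL_BIT) ? 1u : 0u;
    f.iss = esr & ESR_ISS_MASK;
    return f;
}

/* Hypothetical helper: prints a one-line summary of the decoded value. */
void esr_print(uint32_t esr)
{
    struct esr_fields f = esr_decode(esr);
    int serror = (f.ec == ESR_EC_SERROR);

    printf("esr %08x: EC=0x%02x%s IL=%u ISS=0x%07x%s\n",
        (unsigned)esr, f.ec, serror ? " (SError)" : "",
        f.il, (unsigned)f.iss,
        (serror && (f.iss & ESR_ISS_IDS)) ?
        " (implementation-defined syndrome)" : "");
}
```

Decoding 0xbf000000 this way gives EC 0x2f (SError interrupt), IL set, and an ISS with only bit 24 set. That means the rest of the syndrome is implementation defined, so the ESR alone says little about which access triggered the error.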


valpackett · 2021-06-08T23:18:19Z

Just to be sure, try CURRENT with manually built 5.4-lts. But looks like it might be a PCIe issue. Is this the early revision LX2160 or the newer one?

Also, why is it POSTing here, doesn't the UEFI include the QEMU package to run the VBIOS? Would be interesting to see what happens on an already-POSTed GPU.

(On my mcbin, POSTing POLARIS10 from the driver works fine too, but still)

agrajag9 · 2021-06-09T10:21:39Z

Interesting thought. I haven't updated the firmware in a while, will pull a new BSP image from SR later today for testing.

I'm not sure why it seems to be POSTing so late actually, especially since it clearly already did and I typically live in efifb just fine. Is it worth reflashing the GPU firmware with an arm64 blob? Currently my GPUs only have the vendor-installed x64 GOP driver, but adding the AArch64 GOP driver seems easy enough and might help? (https://www.workofard.com/2020/12/aarch64-option-roms-for-amd-gpus/).

In the meantime, another similar coredump with a different AMD GPU, although still POLARIS12.

Jun  9 10:23:11 honeycomb devd[79118]: notify_clients: send() failed; dropping unresponsive client
Jun  9 10:23:11 honeycomb kernel: anon_inodefs registered
Jun  9 10:23:11 honeycomb kernel: debugfs registered
Jun  9 10:23:11 honeycomb kernel: [drm] amdgpu kernel modesetting enabled.
Jun  9 10:23:11 honeycomb kernel: drmn0: <drmn> on vgapci0
Jun  9 10:23:11 honeycomb kernel: vgapci0: child drmn0 requested pci_enable_io
Jun  9 10:23:11 honeycomb syslogd: last message repeated 1 times
Jun  9 10:23:11 honeycomb kernel: sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
Jun  9 10:23:11 honeycomb kernel: [drm] initializing kernel modesetting (POLARIS12 0x1002:0x6995 0x1028:0x0B0C 0x00).
Jun  9 10:23:11 honeycomb kernel: [drm] register mmio base: 0x40000000
Jun  9 10:23:11 honeycomb kernel: [drm] register mmio size: 262144
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 0 <vi_common>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 1 <gmc_v8_0>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 2 <tonga_ih>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 3 <gfx_v8_0>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 4 <sdma_v3_0>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 5 <powerplay>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 6 <dm>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 7 <uvd_v6_0>
Jun  9 10:23:11 honeycomb kernel: [drm] add ip block number 8 <vce_v3_0>
Jun  9 10:23:21 honeycomb kernel: ATOM BIOS: 113-D0910602-101
Jun  9 10:23:21 honeycomb kernel: [drm] UVD is enabled in VM mode
Jun  9 10:23:21 honeycomb kernel: [drm] UVD ENC is enabled in VM mode
Jun  9 10:23:21 honeycomb kernel: [drm] VCE enabled in VM mode
Jun  9 10:23:21 honeycomb kernel: [drm] GPU posting now...
Jun  9 10:23:21 honeycomb kernel: [drm ERROR :atom_op_jump] atombios stuck in loop for more than 10secs aborting
Jun  9 10:23:21 honeycomb kernel: [drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing B25A (len 428, WS 20, PS 0) @ 0xB38C
Jun  9 10:23:21 honeycomb kernel: [drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing AE7A (len 158, WS 0, PS 8) @ 0xAEDD
Jun  9 10:23:21 honeycomb kernel: drmn0: gpu post error!
Jun  9 10:23:21 honeycomb kernel: drmn0: Fatal error during GPU init
Jun  9 10:23:21 honeycomb kernel: [drm] amdgpu: finishing device.
Jun  9 10:23:21 honeycomb kernel: Warning: can't remove non-dynamic nodes (dri)!
Jun  9 10:23:21 honeycomb kernel: device_attach: drmn0 attach returned 22
Jun  9 10:23:21 honeycomb kernel:   x0:               10

$ doas kgdb kernel.debug /var/crash/vmcore.last
Password:
GNU gdb (GDB) 10.2 [GDB v10.2 for FreeBSD]
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-portbld-freebsd13.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from kernel.debug...

Unread portion of the kernel message buffer:
  x1:                0
  x2:     ffffffffee40
  x3:               33
  x4:         40100401
  x5:    800208000aaaa
  x6:                1
  x7:            f4f6f
  x8:              130
  x9:                0
 x10:                0
 x11:         80130000
 x12:              427
 x13:                0
 x14:         80000000
 x15:         402be5e1
 x16:         403ce990
 x17:     ffffffffe5a0
 x18:                0
 x19:     ffffffffeb90
 x20:                0
 x21:           200bd5
 x22:                1
 x23:     ffffffffee53
 x24:                0
 x25:                1
 x26:           200b2a
 x27:           200c82
 x28:                1
 x29:     ffffffffeae0
  sp:     ffffffffe5b0
  lr:           2110fc
 elr:         403ce998
spsr:         80000200
 far:                0
 esr:         bf000000
panic: Unhandled System Error
cpuid = 5
time = 1623234201
KDB: stack backtrace:
#0 0xffff000000448a0c at kdb_backtrace+0x60
#1 0xffff0000003f2224 at vpanic+0x184
#2 0xffff0000003f209c at panic+0x44
#3 0xffff000000712c28 at do_serror+0x40
#4 0xffff0000006f3494 at handle_serror+0x88
Uptime: 11m23s
Dumping 1269 out of 32157 MB:..1%..11%..21%

get_curthread () at /usr/src/sys/arm64/include/pcpu.h:68
68              __asm __volatile("ldr   %0, [x18]" : "=&r"(td));
(kgdb) backtrace
#0  get_curthread () at /usr/src/sys/arm64/include/pcpu.h:68
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffff0000003f1d20 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffff0000003f22b4 in vpanic (fmt=<optimized out>, ap=...) at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffff0000003f20a0 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffff000000712c2c in do_serror (frame=<optimized out>) at /usr/src/sys/arm64/arm64/trap.c:599
#6  0xffff0000006f3498 in handle_serror () at /usr/src/sys/arm64/arm64/exception.S:216
...
#5246 0xffff0000006f3498 in handle_serror () at /usr/src/sys/arm64/arm64/exception.S:216

pciconf -lv

vgapci0@pci4:1:0:0: class=0x030000 rev=0x00 hdr=0x00 vendor=0x1002 device=0x6995 subvendor=0x1028 subdevice=0x0b0c
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Lexa XT [Radeon PRO WX 2100]'
    class      = display
    subclass   = VGA
none0@pci4:1:0:1:   class=0x040300 rev=0x00 hdr=0x00 vendor=0x1002 device=0xaae0 subvendor=0x1028 subdevice=0xaae0
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]'
    class      = multimedia
    subclass   = HDA

valpackett · 2021-06-09T14:36:34Z

Is it worth reflashing the GPU firmware with an arm64 blob?

I don't think it is.

it clearly already did and I typically live in efifb just fine

huh. Well then probably the same thing that causes the POST to fail also causes the driver to wrongly detect the GPU as uninitialized.

Again, is this the early revision LX2160 or the newer one? And please try CURRENT with manually built 5.4-lts from this repo.

agrajag9 · 2021-06-10T11:43:50Z

Yep, still the early hardware revision, but I haven't had any PCI problems since the last we talked about it.

Will test more with CURRENT GENERIC + 5.4-lts once it's all built.

agrajag9 · 2021-06-10T16:51:11Z

No dice:

FreeBSD honeycomb.a9development.com 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n247275-aa310ebfba3: Thu Jun 10 14:01:13 UTC 2021     agrajag9@honeycomb.a9development.com:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC  arm64

with drm_v5.4.92_4

# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
<6>[drm] register mmio base: 0x40000000
<6>[drm] register mmio size: 262144
<6>[drm] add ip block number 0 <vi_common>
<6>[drm] add ip block number 1 <gmc_v8_0>
<6>[drm] add ip block number 2 <tonga_ih>
<6>[drm] add ip block number 3 <gfx_v8_0>
<6>[drm] add ip block number 4 <sdma_v3_0>
<6>[drm] add ip block number 5 <powerplay>
<6>[drm] add ip block number 6 <dm>
<6>[drm] add ip block number 7 <uvd_v6_0>
<6>[drm] add ip block number 8 <vce_v3_0>
<6>[drm] UVD is enabled in VM mode
<6>[drm] UVD ENC is enabled in VM mode
<6>[drm] VCE enabled in VM mode
<6>[drm] GPU posting now...
[drm ERROR :atom_op_jump] atombios stuck in loop for more than 10secs aborting
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing AD44 (len 428, WS 20, PS 0) @ 0xAE76
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing A984 (len 158, WS 0, PS 8) @ 0xA9E7
drmn0: gpu post error!
drmn0: Fatal error during GPU init
<6>[drm] amdgpu: finishing device.
Warning: can't remove non-dynamic nodes (dri)!
device_attach: drmn0 attach returned 22
  x0:                b
  x1:                0
  x2:     ffffffffee60
  x3:               13
  x4:         40100401
  x5:            80020
  x6:                1
  x7:            f4e3b
  x8:              130
  x9:                0
 x10:                0
 x11:         80130000
 x12:              427
 x13:         403c2000
 x14:         40805ee0
 x15:             2000
 x16:         402d3208
 x17:     ffffffffe5c0
 x18:               3f
 x19:     ffffffffeba8
 x20:                0
 x21:           100cd5
 x22:                1
 x23:     ffffffffee63
 x24:                0
 x25:                1
 x26:           100c2a
 x27:           100d82
 x28:                1
 x29:     ffffffffeb00
  sp:     ffffffffe5d0
  lr:           111234
 elr:         402d3210
spsr:         80000200
 far:                0
 esr:         bf000000
panic: Unhandled System Error
cpuid = 7
time = 1623343552
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x184
panic() at panic+0x44
do_serror() at do_serror+0x40
handle_serror() at handle_serror+0x88
--- system error, esr 0xbf000000
KDB: enter: panic
[ thread pid 30094 tid 194022 ]
Stopped at      kdb_enter+0x44: undefined       f904411f

valpackett · 2021-06-10T20:49:33Z

There's this amdgpu.pcie_gen_cap=0x00040004 thing that e.g. this post is about. It forces PCIe gen3; by default gen 1/2 are also allowed. I thought it would be just some performance thing, but I just found this tweet (also here):

I tracked down the amdgpu hang to the PCIe bus link scaling. We can force the link to PCIe Gen3 permanently and all other memory and clock scaling works flawlessly

Soooooo try hw.amdgpu.pcie_gen_cap=0x00040004 in kenv (e.g. /boot/loader.conf)?
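
To make the 0x00040004 value less magic, here is a rough sketch of how the mask splits into a platform half (upper 16 bits) and an ASIC half (lower 16 bits). The CAIL_* values are written from memory of the upstream amd_pcie.h and should be treated as assumptions to check against your tree; the `gen_mask_*` helpers are hypothetical names, not driver functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values, recalled from upstream amd_pcie.h; verify against your tree. */
#define CAIL_PCIE_LINK_SPEED_SUPPORT_GEN1       0x00010000u  /* platform side */
#define CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2       0x00020000u
#define CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3       0x00040000u
#define CAIL_PCIE_LINK_SPEED_SUPPORT_MASK       0xffff0000u
#define CAIL_PCIE_LINK_SPEED_SUPPORT_SHIFT      16
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN1  0x00000001u  /* GPU side */
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN2  0x00000002u
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3  0x00000004u
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_MASK  0x0000ffffu
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_SHIFT 0

/* The two halves must not overlap, otherwise the split below is meaningless. */
static_assert((CAIL_PCIE_LINK_SPEED_SUPPORT_MASK &
    CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_MASK) == 0, "mask halves overlap");

/* Bitmap of generations the platform half allows (bit 0 = Gen1, bit 1 = Gen2, ...). */
unsigned gen_mask_platform(uint32_t mask)
{
    return (mask & CAIL_PCIE_LINK_SPEED_SUPPORT_MASK) >>
        CAIL_PCIE_LINK_SPEED_SUPPORT_SHIFT;
}

/* Bitmap of generations the ASIC half allows, same bit numbering. */
unsigned gen_mask_asic(uint32_t mask)
{
    return (mask & CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_MASK) >>
        CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_SHIFT;
}

/* Generations allowed by both halves. */
unsigned gen_mask_common(uint32_t mask)
{
    return gen_mask_platform(mask) & gen_mask_asic(mask);
}

/* Print which generations each half of the mask allows. */
void gen_mask_print(uint32_t mask)
{
    unsigned gen;

    printf("mask 0x%08x\n", (unsigned)mask);
    for (gen = 1; gen <= 4; gen++)
        printf("  gen%u: platform=%s asic=%s\n", gen,
            (gen_mask_platform(mask) >> (gen - 1)) & 1u ? "yes" : "no",
            (gen_mask_asic(mask) >> (gen - 1)) & 1u ? "yes" : "no");
}
```

Under those assumed values, 0x00040004 leaves only gen3 set in both halves, which matches the "forces PCIe gen3" description.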

agrajag9 · 2021-06-10T22:53:13Z

Of course the error is in a screenshot where it's not text-searchable...

Anyways, added the following to /boot/loader.conf.local but still panicking in the same way:

hw.amdgpu.pcie_gen_cap=0x00040004
hw.syscons.disable=1

Do we still need the syscons disable? I see issue 60 where you finally puzzled that out.

valpackett · 2021-06-11T09:40:50Z

Do we still need the syscons disable? I see issue 60 where you finally puzzled that out.

The fix is in #61 which is still unmerged :/ but you can apply that yourself (rebase the branch onto current 5.4-lts or cherry-pick the commit).

Also, you didn't necessarily need the syscons disable: it only matters if your efifb resolution was high enough that the memory overlapped. (I needed it for >=1440p.)

To 100% make sure there's no weirdness with the tunable stuff, try doing this in code instead, changing

drm-kmod/drivers/gpu/drm/amd/include/amd_pcie.h

Lines 43 to 47 in b45715c

    
    #define AMDGPU_DEFAULT_PCIE_GEN_MASK (CAIL_PCIE_LINK_SPEED_SUPPORT_GEN1 \
                                          | CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2 \
                                          | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN1 \
                                          | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN2 \
                                          | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3)

to #define AMDGPU_DEFAULT_PCIE_GEN_MASK (CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3)

agrajag9 · 2021-06-12T12:42:31Z

.if ${.CURDIR:M*/graphics/drm-current-kmod}
EXTRA_PATCHES+= /distfiles/local-patches/graphics/drm-current-kmod.patch
.endif

--- drivers/gpu/drm/amd/include/amd_pcie.h.orig 2021-06-12 11:02:26.030476000 +0000
+++ drivers/gpu/drm/amd/include/amd_pcie.h      2021-06-12 11:03:24.635405000 +0000
@@ -40,10 +40,7 @@
 #define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_SHIFT  0

 /* gen: chipset 1/2, asic 1/2/3 */
-#define AMDGPU_DEFAULT_PCIE_GEN_MASK (CAIL_PCIE_LINK_SPEED_SUPPORT_GEN1 \
-                                     | CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2 \
-                                     | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN1 \
-                                     | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN2 \
+#define AMDGPU_DEFAULT_PCIE_GEN_MASK (CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 \
                                      | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3)

 /* Following flags shows PCIe lane width switch supported in driver which are decided by chipset and ASIC */

=======================<phase: patch          >============================
===>  Patching for drm-current-kmod-5.4.92.g20210526
===>  Applying extra patch /distfiles/local-patches/graphics/drm-current-kmod.patch
===========================================================================

Same panic :(

It looks like I'm still using a UEFI build from a while ago, and there may be some updates there. I'll drop a new one in and see if that helps as well...

agrajag9 · 2021-06-12T13:20:20Z

Tried with a fresh firmware build, same panic.

agrajag9 · 2021-06-12T15:48:39Z

Found this: ROCm/ROCK-Kernel-Driver#62 (comment)

The error "GPU posting now" appears when a secondary card is initialized that didn't get posted by the BIOS. You can enable more debug messages in the GPU driver with the kernel parameter drm.debug=0xff.

This is curious, because there shouldn't be another GPU attached?

And this: http://macchiatobin.net/forums/topic/gpu/#post-7368

Added hw.drm.debug=0xff to /boot/loader.conf.local but didn't see anything new in the trace.

agrajag9 · 2021-06-12T15:51:33Z

I also found some more threads about IOMMU. IIRC we do not support IOMMU (SMMU on Arm?), and I think that's something Jon was building into the firmware. Possibly that's an issue?

valpackett · 2021-06-12T16:07:13Z

Debug flag for us is hw.dri.drm_debug.

The error "GPU posting now" appears when a secondary card is initialized that didn't get posted by the BIOS

That's just one possible cause, absolutely not the only one. (Also "GPU posting now" on its own is not an error, if you don't have efifb it's expected.)

And this: http://macchiatobin.net/forums/topic/gpu/#post-7368

Oh, as you can see this post is already talking about OpenGL stuff, not early init. I've seen the corruption myself :) This was solved upstream some time ago, my backport was FreeBSDDesktop/kms-drm@7fe2f58

we do not support IOMMU (SMMU on Arm?)

We support SMMU 3 since https://reviews.freebsd.org/D24618 but not SMMU 2. In any case it shouldn't be mandatory to use the IOMMU. Especially since you do have other PCIe cards working…

hmm hmm I wonder if the PCIe link gen is not being applied through LinuxKPI somehow

agrajag9 · 2021-06-12T16:09:32Z

Wrong button...

Latest acpidump -dt for you in case there's something wacky in there: https://gist.github.com/4cfb12e38d4f5845069d8f3f92c96fb6

valpackett · 2021-06-12T16:16:29Z

pciconf -lvbc might be more useful

valpackett · 2021-06-12T16:22:28Z

Oddly the AMDGPU_DEFAULT_PCIE_GEN_MASK is only applied for APUs (?? check is actually just pci_is_root_bus which would be true on the mcbin IIUC :D)

Try this

--- i/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ w/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4013,9 +4013,11 @@ static void amdgpu_device_get_pcie_info(struct amdgpu_device *adev)
 			adev->pm.pcie_gen_mask = AMDGPU_DEFAULT_PCIE_GEN_MASK;
 		if (adev->pm.pcie_mlw_mask == 0)
 			adev->pm.pcie_mlw_mask = AMDGPU_DEFAULT_PCIE_MLW_MASK;
-		return;
+		// return;
 	}
 
+	adev->pm.pcie_gen_mask = CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3;
+
 	if (adev->pm.pcie_gen_mask && adev->pm.pcie_mlw_mask)
 		return;

agrajag9 · 2021-06-13T10:41:29Z

# pciconf -lvbc
nvme0@pci2:1:0:0:       class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa808 subvendor=0x144d subdevice=0xa801
    vendor     = 'Samsung Electronics Co Ltd'
    device     = 'NVMe SSD Controller SM981/PM981/PM983'
    class      = mass storage
    subclass   = NVM
    bar   [10] = type Memory, range 64, base 0x40000000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 128(256) FLR RO NS
                 max read 512
                 link x4(x4) speed 8.0(8.0) ASPM disabled(L1) ClockPM disabled
    cap 11[b0] = MSI-X supports 33 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = L1 PM Substates 1
vgapci0@pci4:1:0:0:     class=0x030000 rev=0xc7 hdr=0x00 vendor=0x1002 device=0x699f subvendor=0x1da2 subdevice=0xe367
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]'
    class      = display
    subclass   = VGA
    bar   [10] = type Prefetchable Memory, range 64, base 0xa400000000, size 268435456, enabled
    bar   [18] = type Prefetchable Memory, range 64, base 0xa410000000, size 2097152, enabled
    bar   [20] = type I/O Port, range 32, base 0, size 256, disabled
    bar   [24] = type Memory, range 32, base 0x40000000, size 262144, enabled
    cap 09[48] = vendor (length 8)
    cap 01[50] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 10[58] = PCI-Express 2 legacy endpoint max data 128(256) RO NS
                 max read 512
                 link x8(x8) speed 8.0(8.0) ASPM disabled(L1) ClockPM disabled
    cap 05[a0] = MSI supports 1 message, 64 bit
    ecap 000b[100] = Vendor [1] ID 0001 Rev 1 Length 16
    ecap 0001[150] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0015[200] = Resizable BAR 1
    ecap 0019[270] = PCIe Sec 1 lane errors 0
    ecap 000f[2b0] = ATS 1
    ecap 0013[2c0] = Page Page Request 1
    ecap 001b[2d0] = Process Address Space ID 1
    ecap 0018[320] = LTR 1
    ecap 000e[328] = ARI 1
    ecap 001e[370] = L1 PM Substates 1
none0@pci4:1:0:1:       class=0x040300 rev=0x00 hdr=0x00 vendor=0x1002 device=0xaae0 subvendor=0x1da2 subdevice=0xaae0
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]'
    class      = multimedia
    subclass   = HDA
    bar   [10] = type Memory, range 64, base 0x40040000, size 16384, enabled
    cap 09[48] = vendor (length 8)
    cap 01[50] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 10[58] = PCI-Express 2 legacy endpoint max data 128(256) RO NS
                 max read 512
                 link x8(x8) speed 8.0(8.0) ASPM disabled(L1) ClockPM disabled
    cap 05[a0] = MSI supports 1 message, 64 bit
    ecap 000b[100] = Vendor [1] ID 0001 Rev 1 Length 16
    ecap 0001[150] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 000e[328] = ARI 1

=======================<phase: patch          >============================
===>  Patching for drm-current-kmod-5.4.92.g20210526
===>  Applying extra patch /distfiles/local-patches/graphics/drm-current-kmod/amdgpu_device_c.patch
===>  Applying extra patch /distfiles/local-patches/graphics/drm-current-kmod/amd_pcie_h.patch
===========================================================================

Same panic...

agrajag9 · 2021-06-13T10:46:05Z

I just noticed this:

cap 10[58] = PCI-Express 2 legacy endpoint

valpackett · 2021-06-13T22:42:07Z

Hm, very similar on my mcbin actually, even 1 more corrected error, but everything works

pciconf here

vgapci0@pci0:0:0:0:	class=0x030000 rev=0xc7 hdr=0x00 vendor=0x1002 device=0x67df subvendor=0x1002 subdevice=0x0b37
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]'
    class      = display
    subclass   = VGA
    bar   [10] = type Prefetchable Memory, range 64, base 0x800000000, size 268435456, enabled
    bar   [18] = type Prefetchable Memory, range 64, base 0x810000000, size 2097152, enabled
    bar   [20] = type I/O Port, range 32, base 0, size 256, disabled
    bar   [24] = type Memory, range 32, base 0xc0000000, size 262144, enabled
    cap 09[48] = vendor (length 8)
    cap 01[50] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 10[58] = PCI-Express 2 legacy endpoint max data 128(256) RO NS
                 max read 512
                 link x4(x16) speed 8.0(8.0) ASPM disabled(L1)
    cap 05[a0] = MSI supports 1 message, 64 bit 
    ecap 000b[100] = Vendor [1] ID 0001 Rev 1 Length 16
    ecap 0001[150] = AER 2 0 fatal 0 non-fatal 2 corrected
    ecap 0015[200] = Resizable BAR 1
    ecap 0019[270] = PCIe Sec 1 lane errors 0
    ecap 000f[2b0] = ATS 1
    ecap 0013[2c0] = Page Page Request 1
    ecap 001b[2d0] = Process Address Space ID 1
    ecap 0018[320] = LTR 1
    ecap 000e[328] = ARI 1
    ecap 001e[370] = L1 PM Substates 1

Changes after loading amdgpu:

@@ -5,14 +5,14 @@
     subclass   = VGA
     bar   [10] = type Prefetchable Memory, range 64, base 0x800000000, size 268435456, enabled
     bar   [18] = type Prefetchable Memory, range 64, base 0x810000000, size 2097152, enabled
-    bar   [20] = type I/O Port, range 32, base 0, size 256, disabled
+    bar   [20] = type I/O Port, range 32, base 0, size 256, enabled
     bar   [24] = type Memory, range 32, base 0xc0000000, size 262144, enabled
     cap 09[48] = vendor (length 8)
     cap 01[50] = powerspec 3  supports D0 D1 D2 D3  current D0
     cap 10[58] = PCI-Express 2 legacy endpoint max data 128(256) RO NS
                  max read 512
-                 link x4(x16) speed 8.0(8.0) ASPM disabled(L1)
-    cap 05[a0] = MSI supports 1 message, 64 bit 
+                 link x4(x16) speed 5.0(8.0) ASPM disabled(L1)
+    cap 05[a0] = MSI supports 1 message, 64 bit enabled with 1 message
     ecap 000b[100] = Vendor [1] ID 0001 Rev 1 Length 16
     ecap 0001[150] = AER 2 0 fatal 0 non-fatal 2 corrected
     ecap 0015[200] = Resizable BAR 1
@@ -33,7 +33,7 @@
     cap 01[50] = powerspec 3  supports D0 D1 D2 D3  current D3
     cap 10[58] = PCI-Express 2 legacy endpoint max data 128(256) RO NS
                  max read 512
-                 link x4(x16) speed 8.0(8.0) ASPM disabled(L1)
+                 link x4(x16) speed 5.0(8.0) ASPM disabled(L1)
     cap 05[a0] = MSI supports 1 message, 64 bit 
     ecap 000b[100] = Vendor [1] ID 0001 Rev 1 Length 16
     ecap 0001[150] = AER 2 0 fatal 0 non-fatal 2 corrected

cap 10[58] = PCI-Express 2 legacy endpoint

That's just what the device is, that's static data IIUC. speed 8.0(8.0) is the indication that it's running at Gen 3 speed. (And yeah it will always be at that speed before the driver loads, so this was not very useful unfortunately…)

Seems like link speed renegotiation is initiated from the GPU side firmware, and that parameter in the driver should tell it what speeds to support, so there kinda shouldn't be differences between Linux and FreeBSD in terms of that stuff.
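
As a side note for comparing before/after dumps, here is a small sketch that turns the `speed X.Y(Z.W)` field of a pciconf link line into PCIe generations. It assumes the first number is the current rate and the parenthesised one is the maximum, which is how the diff above reads; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct link_speed {
    int cur_gen;    /* generation of the current link rate, 0 if unknown */
    int max_gen;    /* generation of the maximum link rate, 0 if unknown */
};

/* Map a per-lane rate in tenths of GT/s to a PCIe generation. */
int gen_from_tenths(int tenths)
{
    switch (tenths) {
    case 25:  return 1;     /* 2.5 GT/s */
    case 50:  return 2;     /* 5.0 GT/s */
    case 80:  return 3;     /* 8.0 GT/s */
    case 160: return 4;     /* 16.0 GT/s */
    default:  return 0;
    }
}

/*
 * Parse a line such as "link x4(x16) speed 5.0(8.0) ASPM disabled(L1)".
 * Returns 0 on success, -1 if the line has no parsable speed field.
 */
int parse_pciconf_link(const char *line, struct link_speed *out)
{
    const char *p;
    int ci, cf, mi, mf;

    assert(line != NULL && out != NULL);
    p = strstr(line, "speed ");
    if (p == NULL)
        return -1;
    if (sscanf(p, "speed %d.%d(%d.%d)", &ci, &cf, &mi, &mf) != 4)
        return -1;
    out->cur_gen = gen_from_tenths(ci * 10 + cf);
    out->max_gen = gen_from_tenths(mi * 10 + mf);
    return 0;
}
```

Fed the two variants from the diff above, this reports Gen 3 of Gen 3 before the driver loads and Gen 2 of Gen 3 afterwards.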

To be sure: have you tested under Linux?

agrajag9 · 2021-06-14T10:46:52Z

I have not had a chance to test under Linux, will try to get to that this week.

Meanwhile, I just saw this on the OpenBSD 6.9 release notes:

Fixed panics on the HoneyComb LX2K with amdgpu(4).

That got me to this commit: openbsd/src@9e1dc75

From bluerise on Discord:

That only helped a little. The one thing that helped a lot, but still doesn't fix all, was switching from WT (for write-combine mappings) to DEVICE.

valpackett · 2021-06-14T11:16:02Z

That got me to this commit: openbsd/src@9e1dc75

Huh.

Linux' iowrite32/ioread32 explicitly contain barriers

Well, our implementations do too.

Even though there is an odd "XXX This is all x86 specific" comment, it works fine on the MACCHIATObin so we can be sure there's nothing affecting arm64-in-general. Looking at the impl, ioread32 does readl which is

	__io_br(); // __compiler_membar() // __asm __volatile(" " : : : "memory")
	v = le32toh(__raw_readl(addr));
	__io_ar(); // rmb() ifdef rmb // defined as dmb(ld) in arm64/include/atomic.h

and that's how it is on Linux too.
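
For readers who want to poke at that shape outside the kernel, here is a userspace sketch of the same barrier, load, barrier sequence. Everything prefixed `sketch_` is a stand-in and not the LinuxKPI code: the acquire fence only approximates `rmb()`, the byte-swap macro approximates `le32toh()`, and it relies on GCC/Clang builtins.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for le32toh(): swap bytes only on big-endian hosts. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define sketch_le32toh(x) __builtin_bswap32(x)
#else
#define sketch_le32toh(x) (x)
#endif

/* Compiler-only barrier: stops the compiler moving memory accesses across it. */
#define sketch_compiler_membar() __asm__ __volatile__("" ::: "memory")

/* Stand-in for rmb(); an acquire fence is an approximation, not the same barrier. */
#define sketch_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)

/* Read one little-endian 32-bit register with a barrier on each side of the load. */
uint32_t sketch_readl(const volatile void *addr)
{
    uint32_t v;

    /* A 32-bit register access is expected to be naturally aligned. */
    assert(((uintptr_t)addr % sizeof(uint32_t)) == 0);

    sketch_compiler_membar();                               /* like __io_br() */
    v = sketch_le32toh(*(const volatile uint32_t *)addr);   /* like __raw_readl() */
    sketch_rmb();                                           /* like __io_ar() */
    return v;
}
```

The point is the ordering: the load cannot be hoisted above the first barrier by the compiler, and later reads cannot be satisfied before it completes. A plain variable can stand in for the register when trying it out.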

switching from WT (for write-combine mappings) to DEVICE

https://reviews.freebsd.org/rS351693 ;)

agrajag9 · 2021-06-14T15:07:02Z

Sorry, forgot to answer - Yes, still the early board rev, but I haven't had problems with PCI in months.

agrajag9 · 2021-06-14T15:14:32Z

Confirmed the WX 2100 works just fine with drm-fbsd13-kmod-5.4.92.g20210419 in stable/13-n245876-088dbb4b8d3 on amd64. Waiting on some new USB drives in the mail so I can make a bootable Linux because somehow I ran out of spare thumb drives...

agrajag9 · 2021-06-14T21:54:28Z

And just to be sure I'm not doing something super dumb elsewhere, here's /boot/loader.conf.local:

boot_multicons="YES"
boot_serial="YES"
console="efi"
exec="gop set 0"
verbose_loading="YES"
boot_verbose="-v"

No kld lines in rc.conf

agrajag9 · 2021-06-20T17:49:00Z

Having an absolutely atrocious time getting Linux to run on this thing, but Fedora 34 gave me this:

[   51.047877] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=82, emitted seq=85
[   51.065702] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1188 thread gnome-shel:cs0 pid 1236
[   51.078119] amdgpu 0004:01:00.0: amdgpu: GPU reset begin!
[   51.555537] amdgpu 0004:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[   51.566035] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[   51.819517] amdgpu: cp is busy, skip halt cp
[   52.070271] amdgpu: rlc is busy, skip halt rlc
[   52.075738] amdgpu 0004:01:00.0: amdgpu: BACO reset
[   52.611663] amdgpu 0004:01:00.0: amdgpu: GPU reset succeeded, trying to resume
[   52.620026] [drm] PCIE GART of 256M enabled (table at 0x000000F400E10000).
[   52.626909] [drm] VRAM is lost due to GPU reset!
[   52.871100] [drm] UVD and UVD ENC initialized successfully.
[   52.977077] [drm] VCE initialized successfully.
[   52.986042] amdgpu 0004:01:00.0: amdgpu: recover vram bo from shadow start
[   52.994516] amdgpu 0004:01:00.0: amdgpu: recover vram bo from shadow done
[   53.001310] [drm] Skip scheduling IBs!
[   53.005049] [drm] Skip scheduling IBs!
[   53.008901] [drm] Skip scheduling IBs!
[   53.008902] amdgpu 0004:01:00.0: amdgpu: GPU reset(2) succeeded!
[   53.018652] [drm] Skip scheduling IBs!
[   53.117794] fbcon: Taking over console
[   53.138909] Console: switching to colour frame buffer device 320x90
[  110.957884] [drm:amdgpu_dm_commit_planes [amdgpu]] *ERROR* Waiting for fences timed out!
[  121.447875] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=300, emitted seq=303
[  121.465868] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2050 thread gnome-shel:cs0 pid 2080
[  121.478280] amdgpu 0004:01:00.0: amdgpu: GPU reset begin!
[  121.955323] amdgpu 0004:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[  121.965824] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  122.218874] amdgpu: cp is busy, skip halt cp
[  122.469444] amdgpu: rlc is busy, skip halt rlc
[  122.474906] amdgpu 0004:01:00.0: amdgpu: BACO reset
[  123.021687] amdgpu 0004:01:00.0: amdgpu: GPU reset succeeded, trying to resume
[  123.030047] [drm] PCIE GART of 256M enabled (table at 0x000000F400E10000).
[  123.036930] [drm] VRAM is lost due to GPU reset!
[  123.281082] [drm] UVD and UVD ENC initialized successfully.
[  123.387057] [drm] VCE initialized successfully.
[  123.396054] amdgpu 0004:01:00.0: amdgpu: recover vram bo from shadow start
[  123.404751] amdgpu 0004:01:00.0: amdgpu: recover vram bo from shadow done
[  123.411547] [drm] Skip scheduling IBs!
[  123.415285] [drm] Skip scheduling IBs!
[  123.419108] amdgpu 0004:01:00.0: amdgpu: GPU reset(4) succeeded!
[  123.419108] [drm] Skip scheduling IBs!

There's a lot more, but the system hangs for arbitrary amounts of time and I get significant graphical distortions.

Importantly I had to set arm-smmu.disable_bypass=0 on grub's linux line or it would hang somewhere during init while spamming:

[   27.276297] arm-smmu arm-smmu.0.auto: Blocked unknown Stream ID 0x4000; boot with "arm-smmu.disable_bypass=0" to allow, but this may have security implications
[   27.290536] arm-smmu arm-smmu.0.auto:        GFSR 0x80000002, GFSYNR0 0x00000008, GFSYNR1 0x00004000, GFSYNR2 0x00000000

If I can get it to behave, I'll be able to post cleaner, fuller output...

agrajag9 · 2021-06-20T19:42:46Z

Some more context: https://gist.github.com/agrajag9/b0c3722f472d4e8ef6f27c194fc2cf19

valpackett · 2021-06-29T13:08:15Z

That read on my RX 480 + mcbin returns 0x00ec030a.

Is there a chance this is related to the issues you found with the RX 480 and EDK2 ECAM on the MACCHIATObin?

No. The thing on the mcbin (and socionext developerbox) is that the DesignWare controller doesn't filter TLPs properly, so some devices — mostly "legacy" ones — would appear duplicated, possibly into all the slots. AMD GPUs actually do their own filtering (just like devices supporting ARI, except it doesn't support ARI), so all we had to do was remove the workaround that basically only allowed legacy devices to work.

The gen4 controller used in the early rev LX2160 is a completely different controller.

drm-kmod/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

Lines 179 to 182 in e5194b8

    
    spin_lock_irqsave(&adev->mmio_idx_lock, flags);
    writel((reg * 4), ((void __iomem *)adev->rmmio) + (mmMM_INDEX * 4));
    ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
    spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);

This is how these reads work. You could add logging right there to see what other similar reads return. Maybe also add an msleep(2) between the writel and the readl.
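
To make the access pattern concrete: the driver first writes a byte offset to an index register, then reads the selected value back through a data register. Here is a minimal userspace sketch of that two-step sequence. Everything in it (`fake_gpu`, `fake_writel`, `fake_readl`, the `FAKE_*` offsets) is a hypothetical stand-in for the real device and for `writel`/`readl`; it models only the pattern, not real hardware.

```c
#include <stdint.h>

/* Hypothetical model of a device exposing an index/data register pair. */
#define FAKE_MM_INDEX 0u      /* dword offset of the index register (illustrative) */
#define FAKE_MM_DATA  1u      /* dword offset of the data register (illustrative) */
#define FAKE_NREGS    0x2000u /* number of indirect registers in the model */

struct fake_gpu {
	uint32_t index;            /* last byte offset written to the index register */
	uint32_t regs[FAKE_NREGS]; /* backing store for the indirect registers */
};

/* Stand-in for writel(): only the index register is writable in this model. */
static void
fake_writel(struct fake_gpu *gpu, uint32_t value, uint32_t byte_off)
{
	if (byte_off == FAKE_MM_INDEX * 4)
		gpu->index = value;
}

/* Stand-in for readl(): reading the data register returns whichever
 * register the index register currently selects. */
static uint32_t
fake_readl(const struct fake_gpu *gpu, uint32_t byte_off)
{
	uint32_t reg;

	if (byte_off != FAKE_MM_DATA * 4)
		return (0);
	reg = gpu->index / 4;
	return (reg < FAKE_NREGS ? gpu->regs[reg] : 0);
}

/* Same two-step sequence as the snippet above: select, then read. */
static uint32_t
indirect_read(struct fake_gpu *gpu, uint32_t reg)
{
	fake_writel(gpu, reg * 4, FAKE_MM_INDEX * 4);
	return (fake_readl(gpu, FAKE_MM_DATA * 4));
}
```

In this model a read that always comes back as 0 means either the register really holds 0 or the accesses never reach the backing store, which is the question the logging is meant to answer.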

agrajag9 · 2021-06-29T14:00:35Z

dmesg spammed with <6>[drm] In amdgpu_mm_rreg: ret == 0 :(

Current patch is looking like this:

--- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c.orig     2021-06-29 00:04:05.165929000 +0000
+++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  2021-06-29 13:16:16.977577000 +0000
@@ -178,10 +178,12 @@

                spin_lock_irqsave(&adev->mmio_idx_lock, flags);
                writel((reg * 4), ((void __iomem *)adev->rmmio) + (mmMM_INDEX * 4));
+               msleep(2);
                ret = readl(((void __iomem *)adev->rmmio) + (mmMM_DATA * 4));
                spin_unlock_irqrestore(&adev->mmio_idx_lock, flags);
        }
        trace_amdgpu_mm_rreg(adev->pdev->device, reg, ret);
+       DRM_INFO("In amdgpu_mm_rreg: ret == %x\n", ret);
        return ret;
 }

@@ -836,8 +838,10 @@
 {
        uint32_t reg;

-       if (amdgpu_sriov_vf(adev))
+       if (amdgpu_sriov_vf(adev)) {
+               DRM_INFO("amdgpu_sriov_vf(adev) == false\n");
                return false;
+       }

        if (amdgpu_passthrough(adev)) {
                /* for FIJI: In whole GPU pass-through virtualization case, after VM reboot
@@ -846,6 +850,7 @@
                 * vpost executed for smc version below 22.15
                 */
                if (adev->asic_type == CHIP_FIJI) {
+                       DRM_INFO("adev->asic_type == CHIP_FIJI\n");
                        int err;
                        uint32_t fw_ver;
                        err = request_firmware(&adev->pm.fw, "amdgpu/fiji_smc.bin", adev->dev);
@@ -860,13 +865,17 @@
        }

        if (adev->has_hw_reset) {
+               DRM_INFO("adev->has_hw_reset == false\n");
                adev->has_hw_reset = false;
                return true;
        }

        /* bios scratch used on CIK+ */
-       if (adev->asic_type >= CHIP_BONAIRE)
+       if (adev->asic_type >= CHIP_BONAIRE) {
+               DRM_INFO("adev->asic_type >= CHIP_BONAIRE\n");
+               DRM_INFO("calling amdgpu_atombios_scratch_need_asic_init(adev)\n");
                return amdgpu_atombios_scratch_need_asic_init(adev);
+    }

        /* check MEM_SIZE for older asics */
        reg = amdgpu_asic_get_config_memsize(adev);
@@ -4013,8 +4022,10 @@
                        adev->pm.pcie_gen_mask = AMDGPU_DEFAULT_PCIE_GEN_MASK;
                if (adev->pm.pcie_mlw_mask == 0)
                        adev->pm.pcie_mlw_mask = AMDGPU_DEFAULT_PCIE_MLW_MASK;
-               return;
+               // return;
        }
+
+       adev->pm.pcie_gen_mask = CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3;

        if (adev->pm.pcie_gen_mask && adev->pm.pcie_mlw_mask)
                return;

agrajag9 · 2021-06-29T18:00:56Z

Added a little more to the patch and seeing some other interesting things:

# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
<6>[drm] register mmio base: 0x40000000
<6>[drm] register mmio size: 262144
<6>[drm] add ip block number 0 <vi_common>
<6>[drm] add ip block number 1 <gmc_v8_0>
<6>[drm] add ip block number 2 <tonga_ih>
<6>[drm] add ip block number 3 <gfx_v8_0>
<6>[drm] add ip block number 4 <sdma_v3_0>
<6>[drm] add ip block number 5 <powerplay>
<6>[drm] add ip block number 6 <dm>
<6>[drm] add ip block number 7 <uvd_v6_0>
<6>[drm] add ip block number 8 <vce_v3_0>
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == fc3
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 5cb
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 5cf
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 3301
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 3348
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 1ad
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] UVD is enabled in VM mode
<6>[drm] UVD ENC is enabled in VM mode
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 1ad
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] VCE enabled in VM mode
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 1ad
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 1ad
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] adev->asic_type >= CHIP_BONAIRE
<6>[drm] calling amdgpu_atombios_scratch_need_asic_init(adev)
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 5d0
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c:
<6>[drm]        adev->bios_scratch_reg_offset      == 5c9
<6>[drm] RREG32(adev->bios_scratch_reg_offset + 7) == 0
<6>[drm] GPU posting now...
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 82
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 82

This repeats several thousand times until the eventual panic at 10s. Either we're failing to properly read the registers or we're pointed at the wrong location in memory.

I went further down the rabbit hole and I'm wondering if maybe the linuxkpi pci code is doing something wrong here?

drm-kmod/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

Lines 2684 to 2686 in b45715c

    
    if (adev->asic_type >= CHIP_BONAIRE) {
        adev->rmmio_base = pci_resource_start(adev->pdev, 5);
        adev->rmmio_size = pci_resource_len(adev->pdev, 5);

valpackett · 2021-06-29T18:23:04Z

WAIT WAIT WAIT.

nvme0@pci2:1:0:0:       class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa808 subvendor=0x144d subdevice=0xa801
    bar   [10] = type Memory, range 64, base 0x40000000, size 16384, enabled
vgapci0@pci4:1:0:0:     class=0x030000 rev=0xc7 hdr=0x00 vendor=0x1002 device=0x699f subvendor=0x1da2 subdevice=0xe367
    bar   [24] = type Memory, range 32, base 0x40000000, size 262144, enabled

Is this.. supposed to happen.. or are these host physical addresses and did both PCIe controllers map their devices into the same address??

Just in case I'm not completely stupid, please test without an NVMe drive, using SATA or USB for the system disk.

agrajag9 · 2021-06-29T19:03:14Z

Removed the NVMe and booted from USB. It still tries to post and panics, but it looks like it might be at least accessing different registers eventually?

<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 83
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == e
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == f
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == e
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == e
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
...
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 16f5
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 16f4
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 16f4
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
<6>[drm] In amdgpu_mm_rreg:
<6>[drm] (reg * 4) < adev->rmmio_size
<6>[drm] reg == 16fb
<6>[drm] adev->rmmio_size == 40000
<6>[drm] ret == 0
[drm ERROR :atom_op_jump] atombios stuck in loop for more than 10secs aborting
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing CA56 (len 130, WS 0, PS 0) @ 0xCA75
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing A984 (len 158, WS 0, PS 8) @ 0xA9B9
drmn0: gpu post error!

agrajag9 · 2021-06-29T19:11:38Z

Well, that's curious...
https://gist.github.com/agrajag9/6986f05cd70baa774384fe8249aa356e#file-honeycomb-aml-L4604
and
https://gist.github.com/agrajag9/6986f05cd70baa774384fe8249aa356e#file-honeycomb-aml-L4847
Looks like that's where we're getting our base address and it's the same for both busses.

valpackett · 2021-06-29T19:57:29Z

That's the range minimum, but the two have different translation offsets. In the pciconf output we only see that offset applied to "Prefetchable Memory" BARs, not to ones marked just "Memory". So it might not be an issue after all (?). I really don't know at this point; it just looks suspicious.

agrajag9 · 2021-06-30T17:00:38Z

This is interesting...

https://gist.github.com/agrajag9/cfc0a6887a8a001d0b35335ba4de0400

Starts with

vgapci0: In sys/compat/linuxkpi/common/src/linux_pci.c:
vgapci0: rle->type  == 0x3
vgapci0: rle->rid   == 0x10
vgapci0: rle->flags == 0x1
vgapci0: rle->start == 0xa400000000
vgapci0: rle->end   == 0xa40fffffff
vgapci0: rle->count == 0x10000000

but then almost immediately after:

vgapci0: In sys/compat/linuxkpi/common/src/linux_pci.c:
vgapci0: rle->type  == 0x3
vgapci0: rle->rid   == 0x24
vgapci0: rle->flags == 0x1
vgapci0: rle->start == 0x40000000
vgapci0: rle->end   == 0x4003ffff
vgapci0: rle->count == 0x40000

From pciconf -lvbc:

    bar   [10] = type Prefetchable Memory, range 64, base 0xa400000000, size 268435456, enabled
    bar   [18] = type Prefetchable Memory, range 64, base 0xa410000000, size 2097152, enabled
    bar   [20] = type I/O Port, range 32, base 0, size 256, disabled
    bar   [24] = type Memory, range 32, base 0x40000000, size 262144, enabled

valpackett · 2021-06-30T17:23:58Z

You've replicated pciconf with log statements :) Look at the rle->rid, the first output matches [10], the second is for [24].

BTW, full info about the BARs is found at https://rocmdocs.amd.com/en/latest/GCN_ISA_Manuals/PCIe-features.html#bar-memory-overview — so the [10] is VRAM, [18] is doorbell, [24] is of course the configuration registers that all return 0 on reads for you…
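
For anyone matching the two outputs up: the bracketed numbers in pciconf are config-space offsets of the BAR registers, not BAR indices. BAR n sits at offset 0x10 + 4n in a type 0 header, so Linux's "BAR 5" and the `[24]` entry are the same register. A small sketch of that arithmetic, with helper names made up for illustration:

```c
#include <stdint.h>

/* Config-space offset of BAR n in a type 0 header: 0x10, 0x14, ... 0x24. */
static uint32_t
bar_index_to_offset(unsigned int bar)
{
	return (0x10u + 4u * bar);
}

/* Inverse mapping; returns -1 for offsets that are not a BAR register. */
static int
bar_offset_to_index(uint32_t off)
{
	if (off < 0x10u || off > 0x24u || (off & 3u) != 0)
		return (-1);
	return ((int)((off - 0x10u) / 4u));
}
```

With this mapping, `[10]`, `[18]`, and `[24]` correspond to BAR 0, BAR 2, and BAR 5.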

Looking at the Linux dmesg again

[    1.774856] pci 0002:01:00.0: BAR 0: assigned [mem 0x9400000000-0x9400003fff 64bit]

[    1.993408] pci 0004:01:00.0: BAR 0: assigned [mem 0xa400000000-0xa40fffffff 64bit pref]
[    2.001497] pci 0004:01:00.0: BAR 2: assigned [mem 0xa410000000-0xa4101fffff 64bit pref]
[    2.009580] pci 0004:01:00.0: BAR 5: assigned [mem 0xa040000000-0xa04003ffff]
[    2.016707] pci 0004:01:00.0: BAR 6: assigned [mem 0xa040040000-0xa04005ffff pref]
[    2.024264] pci 0004:01:00.1: BAR 0: assigned [mem 0xa410200000-0xa410203fff 64bit]
[    2.031916] pci 0004:01:00.0: BAR 4: assigned [io  0x10000-0x100ff]

So we are supposed to assign 0xa040000000 to this BAR and 0x9400000000 to the NVMe drive's only BAR; otherwise they collide. Looks like this is the bug after all.
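
The collision can be checked mechanically if both bases are treated as CPU physical addresses. A minimal sketch using the bases and sizes quoted earlier in the thread; `struct range` and `ranges_overlap` are illustrative helpers, not kernel APIs:

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [base, base + size). */
struct range {
	uint64_t base;
	uint64_t size;
};

/* Two non-empty ranges overlap when each starts before the other ends. */
static bool
ranges_overlap(struct range a, struct range b)
{
	return (a.base < b.base + b.size && b.base < a.base + a.size);
}

/* Bases as reported by pciconf on FreeBSD (both at 0x40000000). */
static const struct range fbsd_nvme_bar = { 0x40000000ULL, 16384 };
static const struct range fbsd_gpu_mmio = { 0x40000000ULL, 262144 };

/* Bases Linux assigned, taken from the dmesg excerpt. */
static const struct range linux_nvme_bar = { 0x9400000000ULL, 0x4000 };
static const struct range linux_gpu_mmio = { 0xa040000000ULL, 0x40000 };
```

The FreeBSD pair overlaps and the Linux pair does not. That only matters if the untranslated values are what ends up being mapped, since the two devices sit behind different host bridges.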

valpackett · 2021-06-30T18:04:18Z

Potentially useful logging:

diff --git i/sys/dev/pci/pci_host_generic.c w/sys/dev/pci/pci_host_generic.c
index 0c45f5d316e..3f999d86c5b 100644
--- i/sys/dev/pci/pci_host_generic.c
+++ w/sys/dev/pci/pci_host_generic.c
@@ -345,6 +345,7 @@ generic_pcie_translate_resource(device_t dev, int type, rman_res_t start,
 			phys_base = sc->ranges[i].phys_base;
 			size = sc->ranges[i].size;
 
+			device_printf(dev, "translate: start %lx pci_base %lx phys_base %lx size %x\n", start, pci_base, phys_base, size);
 			if (start < pci_base || start >= pci_base + size)
 				continue;
 
@@ -364,6 +365,7 @@ generic_pcie_translate_resource(device_t dev, int type, rman_res_t start,
 			if (type == space) {
 				*new_start = start - pci_base + phys_base;
 				*new_end = end - pci_base + phys_base;
+				device_printf(dev, "translate: new start %lx end %lx\n", *new_start, *new_end);
 				found = true;
 				break;
 			}
@@ -412,6 +414,10 @@ pci_host_generic_core_alloc_resource(device_t dev, device_t child, int type,
 		    device_get_nameunit(child));
 		return (NULL);
 	}
+	device_printf(dev,
+	    "translated resource %jx-%jx type %x for %s to %lx-%lx\n",
+	    (uintmax_t)start, (uintmax_t)end, type,
+	    device_get_nameunit(child), phys_start, phys_end);
 
 	if (bootverbose) {
 		device_printf(dev,
@@ -456,9 +462,14 @@ generic_pcie_activate_resource(device_t dev, device_t child, int type,
 
 	start = rman_get_start(r);
 	end = rman_get_end(r);
+	rman_res_t ostart = start, oend = end;
 	if (!generic_pcie_translate_resource(dev, type, start, end, &start,
 	    &end))
 		return (EINVAL);
+	device_printf(dev,
+	    "activate:translated resource %jx-%jx type %x for %s to %lx-%lx\n",
+	    (uintmax_t)ostart, (uintmax_t)oend, type,
+	    device_get_nameunit(child), start, end);
 	rman_set_start(r, start);
 	rman_set_end(r, end);
 
diff --git i/sys/dev/pci/pci_host_generic_acpi.c w/sys/dev/pci/pci_host_generic_acpi.c
index 763a84d2fd5..d16d614f5b1 100644
--- i/sys/dev/pci/pci_host_generic_acpi.c
+++ w/sys/dev/pci/pci_host_generic_acpi.c
@@ -157,6 +157,7 @@ pci_host_generic_acpi_parse_resource(ACPI_RESOURCE *res, void *arg)
 	    res->Data.Address.ResourceType == ACPI_IO_RANGE) {
 		sc->base.ranges[r].pci_base = min;
 		sc->base.ranges[r].phys_base = min + off;
+		device_printf(dev, "ACPIPCI-parse range %d pci_base %lx phys_base %lx\n", r, min, min + off);
 		sc->base.ranges[r].size = max - min + 1;
 		if (res->Data.Address.ResourceType == ACPI_MEMORY_RANGE)
 			sc->base.ranges[r].flags |= FLAG_TYPE_MEM;

Not tested so you'd have to fix the errors if there are any.

agrajag9 · 2021-06-30T19:42:17Z

No change to the output when loading the module, but dmesg here: https://gist.github.com/agrajag9/be5c9c58b91497923ae9512dac32f0d3

pcib0: <Generic PCI host controller> on acpi0
pcib0: ACPIPCI-parse range 0 pci_base 40000000 phys_base 9040000000
pcib0: ACPIPCI-parse range 1 pci_base 9400000000 phys_base 9400000000
pcib0: ACPIPCI-parse range 2 pci_base 0 phys_base 9010000000
pcib0: Bus is cache-coherent
pcib0: ECAM for bus 1-255 at mem 9000100000-900fffffff

pcib1: <Generic PCI host controller> on acpi0
pcib1: ACPIPCI-parse range 0 pci_base 40000000 phys_base a040000000
pcib1: ACPIPCI-parse range 1 pci_base a400000000 phys_base a400000000
pcib1: ACPIPCI-parse range 2 pci_base 0 phys_base a010000000
pcib1: Bus is cache-coherent
pcib1: ECAM for bus 1-255 at mem a000100000-a00fffffff
pcib1: translate: start a400000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start a400000000 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: new start a400000000 end a40fffffff
pcib1: translated resource a400000000-a40fffffff type 3 for (null) to a400000000-a40fffffff
pcib1: rman_reserve_resource: start=0xa400000000, end=0xa40fffffff, count=0x10000000
pcib1: translate: start a410000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start a410000000 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: new start a410000000 end a4101fffff
pcib1: translated resource a410000000-a4101fffff type 3 for (null) to a410000000-a4101fffff
pcib1: rman_reserve_resource: start=0xa410000000, end=0xa4101fffff, count=0x200000
pcib1: translate: start 40000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: new start a040000000 end a04003ffff
pcib1: translated resource 40000000-4003ffff type 3 for (null) to a040000000-a04003ffff
pcib1: rman_reserve_resource: start=0x40000000, end=0x4003ffff, count=0x40000
pcib1: translate: start 40040000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: new start a040040000 end a040043fff
pcib1: translated resource 40040000-40043fff type 3 for (null) to a040040000-a040043fff
pcib1: rman_reserve_resource: start=0x40040000, end=0x40043fff, count=0x4000

I also had to change a few things in your patch to make it work; the modified version is in that gist too.
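
The translate lines above follow a simple rule: find the window whose PCI range contains the start address, then rebase the address onto that window's physical base. Here is a userspace sketch of that arithmetic using the three pcib1 windows from the log. `struct window`, `enum space`, and `translate` are hypothetical names, and treating the third window (pci_base 0, size 0x10000) as the I/O window is an assumption. It mirrors only the arithmetic visible in the patch, not the full kernel function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum space { SPACE_MEM, SPACE_IO };

/* One host-bridge window, as printed by the ACPIPCI-parse/translate lines. */
struct window {
	enum space space;
	uint64_t pci_base;   /* address as seen on the PCI bus */
	uint64_t phys_base;  /* address as seen by the CPU */
	uint64_t size;
};

/* pcib1 windows copied from the log; the I/O classification is assumed. */
static const struct window pcib1_windows[] = {
	{ SPACE_MEM, 0x40000000ULL,   0xa040000000ULL, 0xc0000000ULL },
	{ SPACE_MEM, 0xa400000000ULL, 0xa400000000ULL, 0x400000000ULL },
	{ SPACE_IO,  0x0ULL,          0xa010000000ULL, 0x10000ULL },
};

/* Translate a PCI bus address to a CPU physical address.
 * Returns true and stores the result when a matching window exists. */
static bool
translate(const struct window *w, size_t n, enum space space,
    uint64_t start, uint64_t *new_start)
{
	for (size_t i = 0; i < n; i++) {
		if (w[i].space != space)
			continue;
		if (start < w[i].pci_base || start - w[i].pci_base >= w[i].size)
			continue;
		*new_start = start - w[i].pci_base + w[i].phys_base;
		return (true);
	}
	return (false);
}
```

Feeding it 0x40000000 yields 0xa040000000, which is the address Linux assigned to BAR 5, while addresses in the 0xa400000000 window come back unchanged.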

agrajag9 · 2021-06-30T20:03:56Z

Interestingly I think we only see this for pcib1. There are no translation lines for pcib0, when I'm pretty sure there should be.

valpackett · 2021-06-30T20:37:11Z

Oh, you should've loaded amdgpu in that dmesg, the activate:translated line is for that case. The translated addresses are actually set on the resource only on activation. Which, hmm, 1) why? and 2) maybe the activation is just not happening..somehow?

In any case,

diff --git i/sys/dev/pci/pci_host_generic.c w/sys/dev/pci/pci_host_generic.c
index 0c45f5d316e..99927487e29 100644
--- i/sys/dev/pci/pci_host_generic.c
+++ w/sys/dev/pci/pci_host_generic.c
@@ -419,7 +419,7 @@ pci_host_generic_core_alloc_resource(device_t dev, device_t child, int type,
 		    start, end, count);
 	}
 
-	res = rman_reserve_resource(rm, start, end, count, flags, child);
+	res = rman_reserve_resource(rm, phys_start, phys_end, count, flags, child);
 	if (res == NULL)
 		goto fail;

Try pciconf -lvbc with this patch, then load amdgpu.

agrajag9 · 2021-06-30T22:11:04Z

New dmesg: https://gist.github.com/agrajag9/7a46de7807c43a8bea3c876727d85820

Except now it panics WAY faster:

# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: In sys/compat/linuxkpi/common/src/linux_pci.c:
vgapci0: rle->type  == 0x3
vgapci0: rle->rid   == 0x10
vgapci0: rle->flags == 0x1
vgapci0: rle->start == 0xa400000000
vgapci0: rle->end   == 0xa40fffffff
vgapci0: rle->count == 0x10000000
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
panic: Assertion size > 0 failed at /usr/src/sys/kern/subr_vmem.c:1332
cpuid = 0
time = 1625091021
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x184
panic() at panic+0x44
vmem_alloc() at vmem_alloc+0x104
kva_alloc() at kva_alloc+0x28
pmap_mapdev_attr() at pmap_mapdev_attr+0xbc
_ioremap_attr() at _ioremap_attr+0x10
amdgpu_device_init() at amdgpu_device_init+0x8fc
amdgpu_driver_load_kms() at amdgpu_driver_load_kms+0x48
drm_dev_register() at drm_dev_register+0xcc
amdgpu_pci_probe() at amdgpu_pci_probe+0x210
linux_pci_attach_device() at linux_pci_attach_device+0x294
device_attach() at device_attach+0x400
device_probe_and_attach() at device_probe_and_attach+0x7c
bus_generic_driver_added() at bus_generic_driver_added+0x74
devclass_driver_added() at devclass_driver_added+0x44
devclass_add_driver() at devclass_add_driver+0x140
_linux_pci_register_driver() at _linux_pci_register_driver+0xc8
amdgpu_evh() at amdgpu_evh+0xb4
module_register_init() at module_register_init+0xc4
linker_load_module() at linker_load_module+0xb2c
kern_kldload() at kern_kldload+0x15c
sys_kldload() at sys_kldload+0x64
do_el0_sync() at do_el0_sync+0x4a0
handle_el0_sync() at handle_el0_sync+0x90
--- exception, esr 0x56000000
KDB: enter: panic
[ thread pid 65343 tid 100525 ]
Stopped at      kdb_enter+0x44: undefined       f904411f
db>

valpackett · 2021-06-30T23:38:46Z

pci_host_generic_core_alloc_resource FAIL

oh.. okay. So it must be done this way for a reason.
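
One plausible reading, assuming the host bridge's resource manager is populated with PCI bus addresses rather than translated ones (an assumption, not something confirmed in this thread): a reservation for the translated range then falls outside everything the manager owns. In the prefetchable window the PCI and physical addresses are equal, so the swap goes unnoticed there, but for the 0x40000000 window it fails. A toy model with hypothetical names:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy resource manager: a list of inclusive [start, end] regions it owns. */
struct region {
	uint64_t start;
	uint64_t end;
};

/* Regions keyed by PCI bus address, sized like the two pcib1 memory windows. */
static const struct region managed[] = {
	{ 0x40000000ULL,   0xffffffffULL },
	{ 0xa400000000ULL, 0xa7ffffffffULL },
};

/* A reservation succeeds only if it lies entirely inside one managed region. */
static bool
can_reserve(uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < sizeof(managed) / sizeof(managed[0]); i++) {
		if (start >= managed[i].start && end <= managed[i].end)
			return (true);
	}
	return (false);
}
```

Under that assumption, reserving 0x40000000-0x4003ffff works, while reserving the translated 0xa040000000-0xa04003ffff does not.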

BTW:

pci0: on pcib0

I'm not sure if there's any harm caused by this (quite possibly none), but I've noticed https://reviews.freebsd.org/D30953 has appeared recently to fix this.

valpackett · 2021-06-30T23:51:43Z

Okay now I think I see it.

LinuxKPI uses BUS_TRANSLATE_RESOURCE to actually get the physical address.

pci_host_generic does not implement it. Only ofw_pcib currently does in the whole tree.

Because of that, LinuxKPI returns the PCI address instead of the translated physical address to the driver.

diff --git i/sys/dev/pci/pci_host_generic.c w/sys/dev/pci/pci_host_generic.c
index 0c45f5d316e..6694da9d43c 100644
--- i/sys/dev/pci/pci_host_generic.c
+++ w/sys/dev/pci/pci_host_generic.c
@@ -324,7 +324,7 @@ pci_host_generic_core_release_resource(device_t dev, device_t child, int type,
 }
 
 static bool
-generic_pcie_translate_resource(device_t dev, int type, rman_res_t start,
+generic_pcie_translate_resource_end(device_t dev, int type, rman_res_t start,
     rman_res_t end, rman_res_t *new_start, rman_res_t *new_end)
 {
 	struct generic_pcie_core_softc *sc;
@@ -380,6 +380,16 @@ generic_pcie_translate_resource(device_t dev, int type, rman_res_t start,
 	return (found);
 }
 
+static int
+generic_pcie_translate_resource(device_t bus, int type,
+    rman_res_t start, rman_res_t *newstart)
+{
+	rman_res_t newend; /* unused */
+
+	return (generic_pcie_translate_resource_end(
+	    bus, type, start, 0, newstart, &newend));
+}
+
 struct resource *
 pci_host_generic_core_alloc_resource(device_t dev, device_t child, int type,
     int *rid, rman_res_t start, rman_res_t end, rman_res_t count, u_int flags)
@@ -404,7 +414,7 @@ pci_host_generic_core_alloc_resource(device_t dev, device_t child, int type,
 		    type, rid, start, end, count, flags));
 
 	/* Translate the address from a PCI address to a physical address */
-	if (!generic_pcie_translate_resource(dev, type, start, end, &phys_start,
+	if (!generic_pcie_translate_resource_end(dev, type, start, end, &phys_start,
 	    &phys_end)) {
 		device_printf(dev,
 		    "Failed to translate resource %jx-%jx type %x for %s\n",
@@ -456,7 +466,7 @@ generic_pcie_activate_resource(device_t dev, device_t child, int type,
 
 	start = rman_get_start(r);
 	end = rman_get_end(r);
-	if (!generic_pcie_translate_resource(dev, type, start, end, &start,
+	if (!generic_pcie_translate_resource_end(dev, type, start, end, &start,
 	    &end))
 		return (EINVAL);
 	rman_set_start(r, start);
@@ -527,6 +537,7 @@ static device_method_t generic_pcie_methods[] = {
 	DEVMETHOD(bus_activate_resource,	generic_pcie_activate_resource),
 	DEVMETHOD(bus_deactivate_resource,	generic_pcie_deactivate_resource),
 	DEVMETHOD(bus_release_resource,		pci_host_generic_core_release_resource),
+	DEVMETHOD(bus_translate_resource,	generic_pcie_translate_resource),
 	DEVMETHOD(bus_setup_intr,		bus_generic_setup_intr),
 	DEVMETHOD(bus_teardown_intr,		bus_generic_teardown_intr),
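
One detail worth double-checking in a wrapper like this is the return convention. The helper returns a `bool` that is true when a window was found, whereas int-returning FreeBSD bus methods conventionally return 0 on success and an errno value on failure. A caller that tests for nonzero would therefore read a successful translation as an error. The sketch below shows the adaptation in isolation. `lookup_window` and `translate_method` are hypothetical stand-ins, and whether callers of `BUS_TRANSLATE_RESOURCE` check the result this way is an assumption to verify against the tree.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bool-style helper: true when the address falls in the window.
 * The window values are the first pcib1 range from the log. */
static bool
lookup_window(uint64_t start, uint64_t *new_start)
{
	const uint64_t pci_base = 0x40000000ULL;
	const uint64_t phys_base = 0xa040000000ULL;
	const uint64_t size = 0xc0000000ULL;

	if (start < pci_base || start - pci_base >= size)
		return (false);
	*new_start = start - pci_base + phys_base;
	return (true);
}

/* errno-style wrapper: 0 on success, ENXIO when no window matches. */
static int
translate_method(uint64_t start, uint64_t *new_start)
{
	return (lookup_window(start, new_start) ? 0 : ENXIO);
}
```

Returning the `bool` unchanged would invert the meaning for any caller written as `if (method(...)) fail;`.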

agrajag9 · 2021-07-01T02:25:20Z

No dice:

# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: In sys/compat/linuxkpi/common/src/linux_pci.c:
vgapci0: rle->type  == 0x3
vgapci0: rle->rid   == 0x10
vgapci0: rle->flags == 0x1
vgapci0: rle->start == 0xa400000000
vgapci0: rle->end   == 0xa40fffffff
vgapci0: rle->count == 0x10000000
pcib1: translate: start a400000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start a400000000 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: new start a400000000 end 0
drmn0: translate of 0xa400000000 failed
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
panic: Assertion size > 0 failed at /usr/src/sys/kern/subr_vmem.c:1332
cpuid = 1
time = 1625106242
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x184
panic() at panic+0x44
vmem_alloc() at vmem_alloc+0x104
kva_alloc() at kva_alloc+0x28
pmap_mapdev_attr() at pmap_mapdev_attr+0xbc
_ioremap_attr() at _ioremap_attr+0x10
amdgpu_device_init() at amdgpu_device_init+0x8fc
amdgpu_driver_load_kms() at amdgpu_driver_load_kms+0x48
drm_dev_register() at drm_dev_register+0xcc
amdgpu_pci_probe() at amdgpu_pci_probe+0x210
linux_pci_attach_device() at linux_pci_attach_device+0x294
device_attach() at device_attach+0x400
device_probe_and_attach() at device_probe_and_attach+0x7c
bus_generic_driver_added() at bus_generic_driver_added+0x74
devclass_driver_added() at devclass_driver_added+0x44
devclass_add_driver() at devclass_add_driver+0x140
_linux_pci_register_driver() at _linux_pci_register_driver+0xc8
amdgpu_evh() at amdgpu_evh+0xb4
module_register_init() at module_register_init+0xc4
linker_load_module() at linker_load_module+0xb2c
kern_kldload() at kern_kldload+0x15c
sys_kldload() at sys_kldload+0x64
do_el0_sync() at do_el0_sync+0x4a0
handle_el0_sync() at handle_el0_sync+0x90
--- exception, esr 0x56000000
KDB: enter: panic
[ thread pid 33922 tid 100502 ]
Stopped at      kdb_enter+0x44: undefined       f904411f

dmesg: https://gist.github.com/agrajag9/8c7d7f03b536f9287d913b6c6ea3a4e4

valpackett · 2021-07-01T08:25:19Z

pci_host_generic_core_alloc_resource FAIL

Looks like you haven't reverted the bad one-line patch from above that changes the rman_reserve_resource args?

agrajag9 · 2021-07-01T12:31:55Z

🤦‍♂️ yes, you are correct...

But still broken :(

pcib1: <Generic PCI host controller> on acpi0
pcib1: ACPIPCI-parse range 0 pci_base 40000000 phys_base a040000000
pcib1: ACPIPCI-parse range 1 pci_base a400000000 phys_base a400000000
pcib1: ACPIPCI-parse range 2 pci_base 0 phys_base a010000000
pcib1: Bus is cache-coherent
pcib1: ECAM for bus 1-255 at mem a000100000-a00fffffff
pci1: <PCI bus> on pcib1
pci1: domain=4, physical bus=1
found->	vendor=0x1002, dev=0x699f, revid=0xc7
	domain=4, bus=1, slot=0, func=0
	class=03-00-00, hdrtype=0x00, mfdev=1
	cmdreg=0x0006, statreg=0x0010, cachelnsz=0 (dwords)
	lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
	intpin=a, irq=255
	powerspec 3  supports D0 D1 D2 D3  current D0
	MSI supports 1 message, 64 bit
	map[10]: type Prefetchable Memory, range 64, base 0xa400000000, size 28, enabled
pcib1: translate: start a400000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start a400000000 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: new start a400000000 end a40fffffff
pcib1: translated resource a400000000-a40fffffff type 3 for (null) to a400000000-a40fffffff
pcib1: rman_reserve_resource: start=0xa400000000, end=0xa40fffffff, count=0x10000000
	map[18]: type Prefetchable Memory, range 64, base 0xa410000000, size 21, enabled
pcib1: translate: start a410000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start a410000000 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: new start a410000000 end a4101fffff
pcib1: translated resource a410000000-a4101fffff type 3 for (null) to a410000000-a4101fffff
pcib1: rman_reserve_resource: start=0xa410000000, end=0xa4101fffff, count=0x200000
	map[20]: type I/O Port, range 32, base 0, size  8, port disabled
	map[24]: type Memory, range 32, base 0x40000000, size 18, enabled
pcib1: translate: start 40000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: new start a040000000 end a04003ffff
pcib1: translated resource 40000000-4003ffff type 3 for (null) to a040000000-a04003ffff
pcib1: rman_reserve_resource: start=0x40000000, end=0x4003ffff, count=0x40000
pcib1: pci_host_generic_core_alloc_resource FAIL: type=3, rid=36, start=0000000040000000, end=000000004003ffff, count=0000000000040000, flags=4800
pcib1: translate: start 0 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start 0 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: start 0 pci_base 0 phys_base a010000000 size 10000
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: Failed to translate resource 0-ffffffffffffffff type 3 for (null)
pci1: pci4:1:0:0 bar 0x24 failed to allocate
found->	vendor=0x1002, dev=0xaae0, revid=0x00
	domain=4, bus=1, slot=0, func=1
	class=04-03-00, hdrtype=0x00, mfdev=1
	cmdreg=0x0000, statreg=0x0010, cachelnsz=0 (dwords)
	lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
	intpin=b, irq=255
	powerspec 3  supports D0 D1 D2 D3  current D0
	MSI supports 1 message, 64 bit
	map[10]: type Memory, range 64, base 0x40040000, size 14, memory disabled
pcib1: translate: start 40040000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: new start a040040000 end a040043fff
pcib1: translated resource 40040000-40043fff type 3 for (null) to a040040000-a040043fff
pcib1: rman_reserve_resource: start=0x40040000, end=0x40043fff, count=0x4000
pcib1: pci_host_generic_core_alloc_resource FAIL: type=3, rid=16, start=0000000040040000, end=0000000040043fff, count=0000000000004000, flags=3800
pcib1: translate: start 0 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start 0 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: start 0 pci_base 0 phys_base a010000000 size 10000
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: translate: start 0 pci_base 0 phys_base 0 size 0
pcib1: Failed to translate resource 0-ffffffffffffffff type 3 for (null)
pci1: pci4:1:0:1 bar 0x10 failed to allocate
vgapci0: <VGA-compatible display> mem 0xa400000000-0xa40fffffff,0xa410000000-0xa4101fffff at device 0.0 on pci1
pci1: <multimedia, HDA> at device 0.1 (no driver attached)

]# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: In sys/compat/linuxkpi/common/src/linux_pci.c:
vgapci0: rle->type  == 0x3
vgapci0: rle->rid   == 0x10
vgapci0: rle->flags == 0x1
vgapci0: rle->start == 0xa400000000
vgapci0: rle->end   == 0xa40fffffff
vgapci0: rle->count == 0x10000000
pcib1: translate: start a400000000 pci_base 40000000 phys_base a040000000 size c0000000
pcib1: translate: start a400000000 pci_base a400000000 phys_base a400000000 size 400000000
pcib1: translate: new start a400000000 end 0
drmn0: translate of 0xa400000000 failed
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
panic: Assertion size > 0 failed at /usr/src/sys/kern/subr_vmem.c:1332
cpuid = 2
time = 1625142389
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x184
panic() at panic+0x44
vmem_alloc() at vmem_alloc+0x104
kva_alloc() at kva_alloc+0x28
pmap_mapdev_attr() at pmap_mapdev_attr+0xbc
_ioremap_attr() at _ioremap_attr+0x10
amdgpu_device_init() at amdgpu_device_init+0x8fc
amdgpu_driver_load_kms() at amdgpu_driver_load_kms+0x48
drm_dev_register() at drm_dev_register+0xcc
amdgpu_pci_probe() at amdgpu_pci_probe+0x210
linux_pci_attach_device() at linux_pci_attach_device+0x294
device_attach() at device_attach+0x400
device_probe_and_attach() at device_probe_and_attach+0x7c
bus_generic_driver_added() at bus_generic_driver_added+0x74
devclass_driver_added() at devclass_driver_added+0x44
devclass_add_driver() at devclass_add_driver+0x140
_linux_pci_register_driver() at _linux_pci_register_driver+0xc8
amdgpu_evh() at amdgpu_evh+0xb4
module_register_init() at module_register_init+0xc4
linker_load_module() at linker_load_module+0xb2c
kern_kldload() at kern_kldload+0x15c
sys_kldload() at sys_kldload+0x64
do_el0_sync() at do_el0_sync+0x4a0
handle_el0_sync() at handle_el0_sync+0x90
--- exception, esr 0x56000000
KDB: enter: panic
[ thread pid 25859 tid 121455 ]
Stopped at      kdb_enter+0x44: undefined       f904411f
db>

dmesg.boot and the current patchset: https://gist.github.com/agrajag9/5d5242d920f4b4b1a90ea2d0c29f479b

valpackett · 2021-07-01T12:54:21Z

That's weird.

First, revert ALL patches and rebuild everything so that you are back in the state from the start of this thread, then post the error.

Then apply this:

--- i/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ w/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2682,7 +2682,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	/* Registers mapping */
 	/* TODO: block userspace mapping of io register */
 	if (adev->asic_type >= CHIP_BONAIRE) {
-		adev->rmmio_base = pci_resource_start(adev->pdev, 5);
+		adev->rmmio_base = pci_resource_start(adev->pdev, 5) + 0xa000000000;
 		adev->rmmio_size = pci_resource_len(adev->pdev, 5);
 	} else {
 		adev->rmmio_base = pci_resource_start(adev->pdev, 2);
@@ -4013,9 +4013,11 @@ static void amdgpu_device_get_pcie_info(struct amdgpu_device *adev)
 			adev->pm.pcie_gen_mask = AMDGPU_DEFAULT_PCIE_GEN_MASK;
 		if (adev->pm.pcie_mlw_mask == 0)
 			adev->pm.pcie_mlw_mask = AMDGPU_DEFAULT_PCIE_MLW_MASK;
-		return;
+		// return;
 	}
 
+	adev->pm.pcie_gen_mask = CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 | CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3;
+
 	if (adev->pm.pcie_gen_mask && adev->pm.pcie_mlw_mask)
 		return;

If it works, there are two things to test:

Revert the second hunk and remove hw.amdgpu.pcie_gen_cap=0x00040004 from loader.conf, then see whether the PCIe generation is a problem at all. If it is, add the loader.conf setting back and check that it works fine.
Revert the first hunk (the hardcoded 0xa000000000) and apply the last patch (BUS_TRANSLATE_RESOURCE) instead.
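
The hardcoded 0xa000000000 in the first hunk matches the offset visible in the `pcib1: translate:` lines earlier in the thread, where the window with pci_base 0x40000000 maps to phys_base 0xa040000000. A minimal sketch of that kind of range lookup follows. `struct pcie_range` and `translate_addr` are hypothetical names for illustration, not the kernel's actual types or functions, and the table only contains the two memory windows printed in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one host-bridge window. */
struct pcie_range {
	uint64_t pci_base;	/* address as seen on the PCI bus */
	uint64_t phys_base;	/* address as seen by the CPU */
	uint64_t size;		/* window length in bytes */
};

/* The two memory windows printed by "pcib1: translate:" in the log. */
static const struct pcie_range ranges[] = {
	{ 0x40000000ULL,   0xa040000000ULL, 0xc0000000ULL },
	{ 0xa400000000ULL, 0xa400000000ULL, 0x400000000ULL },
};

/* Translate a PCI bus address to a CPU physical address.
 * Returns false when no window contains the address. */
static bool
translate_addr(uint64_t pci_addr, uint64_t *phys_addr)
{
	size_t i;

	assert(phys_addr != NULL);
	for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
		const struct pcie_range *r = &ranges[i];

		if (pci_addr >= r->pci_base &&
		    pci_addr - r->pci_base < r->size) {
			*phys_addr = pci_addr - r->pci_base + r->phys_base;
			return (true);
		}
	}
	return (false);
}
```

With this table, the BAR 5 address 0x40000000 comes out as 0xa040000000, which is the same result as adding 0xa000000000 by hand, while the 64-bit BARs at 0xa400000000 map to themselves.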

agrajag9 · 2021-07-01T14:49:23Z

No panic! Need to test all the other stuff too, but certainly this is progress of some sort.

[root@honeycomb ~]# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
<6>[drm] register mmio base: 0x40000000
<6>[drm] register mmio size: 262144
<6>[drm] add ip block number 0 <vi_common>
<6>[drm] add ip block number 1 <gmc_v8_0>
<6>[drm] add ip block number 2 <tonga_ih>
<6>[drm] add ip block number 3 <gfx_v8_0>
<6>[drm] add ip block number 4 <sdma_v3_0>
<6>[drm] add ip block number 5 <powerplay>
<6>[drm] add ip block number 6 <dce_v11_0>
<6>[drm] add ip block number 7 <uvd_v6_0>
<6>[drm] add ip block number 8 <vce_v3_0>
<6>[drm] UVD is enabled in VM mode
<6>[drm] UVD ENC is enabled in VM mode
<6>[drm] VCE enabled in VM mode
<6>[drm] vm size is 128 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
polaris12_mc.bin: could not load firmware image, error 2
amdgpu/polaris12_mc.bin: could not load firmware image, error 2
amdgpu_polaris12_mc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_mc_bin' version 0: 32608 bytes loaded at 0xffff000179b0c620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mc.bin'
drmn0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
drmn0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
<6>[drm] Detected VRAM RAM=4096M, BAR=256M
<6>[drm] RAM width 128bits GDDR5
<6>[drm] amdgpu: 4096M of VRAM memory ready
<6>[drm] amdgpu: 4096M of GTT memory ready.
<6>[drm] GART: num cpu pages 65536, num gpu pages 65536
<6>[drm] PCIE GART of 256M enabled (table at 0x000000F400E10000).
vgapci0: attempting to allocate 1 MSI vectors (1 supported)
vgapci0: using IRQ 30 for MSI
<6>[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
<6>[drm] Driver supports precise vblank timestamp query.
polaris12_pfp_2.bin: could not load firmware image, error 2
amdgpu/polaris12_pfp_2.bin: could not load firmware image, error 2
amdgpu_polaris12_pfp_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_pfp_2_bin' version 0: 17044 bytes loaded at 0xffff000179b35620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_pfp_2.bin'
polaris12_me_2.bin: could not load firmware image, error 2
amdgpu/polaris12_me_2.bin: could not load firmware image, error 2
amdgpu_polaris12_me_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_me_2_bin' version 0: 17044 bytes loaded at 0xffff000179b5a620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_me_2.bin'
polaris12_ce_2.bin: could not load firmware image, error 2
amdgpu/polaris12_ce_2.bin: could not load firmware image, error 2
amdgpu_polaris12_ce_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_ce_2_bin' version 0: 8852 bytes loaded at 0xffff000179b7f620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_ce_2.bin'
<6>[drm] Chained IB support enabled!
polaris12_rlc.bin: could not load firmware image, error 2
amdgpu/polaris12_rlc.bin: could not load firmware image, error 2
amdgpu_polaris12_rlc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_rlc_bin' version 0: 16660 bytes loaded at 0xffff000179ba2620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_rlc.bin'
polaris12_mec_2.bin: could not load firmware image, error 2
amdgpu/polaris12_mec_2.bin: could not load firmware image, error 2
amdgpu_polaris12_mec_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_mec_2_bin' version 0: 262824 bytes loaded at 0xffff00017c000620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec_2.bin'
polaris12_mec2_2.bin: could not load firmware image, error 2
amdgpu/polaris12_mec2_2.bin: could not load firmware image, error 2
amdgpu_polaris12_mec2_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_mec2_2_bin' version 0: 262824 bytes loaded at 0xffff00017c061620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec2_2.bin'
Failed to add WC MTRR for [0xa400000000-0xa40fffffff]: -45; performance may suffer
polaris12_sdma.bin: could not load firmware image, error 2
amdgpu/polaris12_sdma.bin: could not load firmware image, error 2
amdgpu_polaris12_sdma.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_sdma_bin' version 0: 12692 bytes loaded at 0xffff000179bc7620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma.bin'
polaris12_sdma1.bin: could not load firmware image, error 2
amdgpu/polaris12_sdma1.bin: could not load firmware image, error 2
amdgpu_polaris12_sdma1.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_sdma1_bin' version 0: 12692 bytes loaded at 0xffff00017c0c2620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma1.bin'
<6>[drm] Connector DP-1: get mode from tunables:
<6>[drm]   - kern.vt.fb.modes.DP-1
<6>[drm]   - kern.vt.fb.default_mode
<6>[drm] Connector HDMI-A-1: get mode from tunables:
<6>[drm]   - kern.vt.fb.modes.HDMI-A-1
<6>[drm]   - kern.vt.fb.default_mode
<6>[drm] Connector DVI-D-1: get mode from tunables:
<6>[drm]   - kern.vt.fb.modes.DVI-D-1
<6>[drm]   - kern.vt.fb.default_mode
<6>[drm] AMDGPU Display Connectors
<6>[drm] Connector 0:
<6>[drm]   DP-1
<6>[drm]   HPD5
<6>[drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
<6>[drm]   Encoders:
<6>[drm]     DFP1: INTERNAL_UNIPHY1
<6>[drm] Connector 1:
<6>[drm]   HDMI-A-1
<6>[drm]   HPD3
<6>[drm]   DDC: 0x4874 0x4874 0x4875 0x4875 0x4876 0x4876 0x4877 0x4877
<6>[drm]   Encoders:
<6>[drm]     DFP2: INTERNAL_UNIPHY1
<6>[drm] Connector 2:
<6>[drm]   DVI-D-1
<6>[drm]   HPD4
<6>[drm]   DDC: 0x4878 0x4878 0x4879 0x4879 0x487a 0x487a 0x487b 0x487b
<6>[drm]   Encoders:
<6>[drm]     DFP3: INTERNAL_UNIPHY
polaris12_uvd.bin: could not load firmware image, error 2
amdgpu/polaris12_uvd.bin: could not load firmware image, error 2
amdgpu_polaris12_uvd.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_uvd_bin' version 0: 375424 bytes loaded at 0xffff00017c0e6620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_uvd.bin'
<6>[drm] Found UVD firmware Version: 1.130 Family ID: 16
polaris12_vce.bin: could not load firmware image, error 2
amdgpu/polaris12_vce.bin: could not load firmware image, error 2
amdgpu_polaris12_vce.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_vce_bin' version 0: 166816 bytes loaded at 0xffff00017c163620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_vce.bin'
<6>[drm] Found VCE firmware Version: 53.26 Binary ID: 3
polaris12_smc.bin: could not load firmware image, error 2
amdgpu/polaris12_smc.bin: could not load firmware image, error 2
amdgpu_polaris12_smc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_smc_bin' version 0: 130388 bytes loaded at 0xffff00017c1ad620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_smc.bin'
<6>[drm] UVD and UVD ENC initialized successfully.
<6>[drm] VCE initialized successfully.
<6>[drm] fb mappable at 0xA401340000
<6>[drm] vram apper at 0xA400000000
<6>[drm] size 14745600
<6>[drm] fb depth is 24
<6>[drm]    pitch is 10240
WARNING: Device "fb" is Giant locked and may be deleted before FreeBSD 14.0.
VT: initialize with new VT driver "fb".
taskqueue_drain with the following non-sleepable locks held:
exclusive sleep mutex vtdev (vtdev) r = 0 (0xffff000000aacad0) locked @ /usr/src/sys/dev/vt/vt_core.c:3012
stack backtrace:
#0 0xffff0000004de88c at witness_debugger+0x64
#1 0xffff0000004dfa20 at witness_warn+0x400
#2 0xffff0000004d1c14 at taskqueue_drain+0x34
#3 0xffff00017ab36cc8 at vt_kms_postswitch+0x78
#4 0xffff000000311558 at vt_fb_init+0x158
#5 0xffff0000003164d8 at vt_replace_backend+0x10c
#6 0xffff000000311604 at vt_fb_attach+0x14
#7 0xffff00017ab375dc at linux_register_framebuffer+0x45c
#8 0xffff00017ab3e234 at __drm_fb_helper_initial_config_and_unlock+0x3ec
#9 0xffff00017a8d7020 at amdgpu_fbdev_init+0xd8
#10 0xffff00017a8ce850 at amdgpu_device_init+0x1ce4
#11 0xffff00017a8e1d0c at amdgpu_driver_load_kms+0x48
#12 0xffff00017ab0f148 at drm_dev_register+0xcc
#13 0xffff00017a8d6454 at amdgpu_pci_probe+0x210
#14 0xffff00017ab844e0 at linux_pci_attach_device+0x294
#15 0xffff0000004aa50c at device_attach+0x400
#16 0xffff0000004aa074 at device_probe_and_attach+0x7c
#17 0xffff0000004ac134 at bus_generic_driver_added+0x74
start FB_INFO:
type=11 height=1440 width=2560 depth=32
cmsize=16 size=14745600
pbase=0xa401340000 vbase=0xffff00017c65f000
name=drmn0 flags=0x0 stride=10240 bpp=32
cmap[0]=0 cmap[1]=7f0000 cmap[2]=7f00 cmap[3]=c4a000
end FB_INFO
drmn0: fb0: amdgpudrmfb frame buffer device
<6>[drm] Initialized amdgpu 3.35.0 20150101 for drmn0 on minor 0

agrajag9 · 2021-07-01T15:27:44Z

Loads without forcing PCIe 3.0, even without the hw.amdgpu.pcie_gen_cap tunable in loader.conf!

[root@honeycomb ~]# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
<6>[drm] register mmio base: 0x40000000
<6>[drm] register mmio size: 262144
<6>[drm] add ip block number 0 <vi_common>
<6>[drm] add ip block number 1 <gmc_v8_0>
<6>[drm] add ip block number 2 <tonga_ih>
<6>[drm] add ip block number 3 <gfx_v8_0>
<6>[drm] add ip block number 4 <sdma_v3_0>
<6>[drm] add ip block number 5 <powerplay>
<6>[drm] add ip block number 6 <dm>
<6>[drm] add ip block number 7 <uvd_v6_0>
<6>[drm] add ip block number 8 <vce_v3_0>
<6>[drm] UVD is enabled in VM mode
<6>[drm] UVD ENC is enabled in VM mode
<6>[drm] VCE enabled in VM mode
<6>[drm] vm size is 128 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
polaris12_mc.bin: could not load firmware image, error 2
amdgpu/polaris12_mc.bin: could not load firmware image, error 2
amdgpu_polaris12_mc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_mc_bin' version 0: 32608 bytes loaded at 0xffff000187961620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mc.bin'
drmn0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
drmn0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
<6>[drm] Detected VRAM RAM=4096M, BAR=256M
<6>[drm] RAM width 128bits GDDR5
<6>[drm] amdgpu: 4096M of VRAM memory ready
<6>[drm] amdgpu: 4096M of GTT memory ready.
<6>[drm] GART: num cpu pages 65536, num gpu pages 65536
<6>[drm] PCIE GART of 256M enabled (table at 0x000000F400E10000).
vgapci0: attempting to allocate 1 MSI vectors (1 supported)
vgapci0: using IRQ 30 for MSI
polaris12_pfp_2.bin: could not load firmware image, error 2
amdgpu/polaris12_pfp_2.bin: could not load firmware image, error 2
amdgpu_polaris12_pfp_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_pfp_2_bin' version 0: 17044 bytes loaded at 0xffff00018798a620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_pfp_2.bin'
polaris12_me_2.bin: could not load firmware image, error 2
amdgpu/polaris12_me_2.bin: could not load firmware image, error 2
amdgpu_polaris12_me_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_me_2_bin' version 0: 17044 bytes loaded at 0xffff0001879af620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_me_2.bin'
polaris12_ce_2.bin: could not load firmware image, error 2
amdgpu/polaris12_ce_2.bin: could not load firmware image, error 2
amdgpu_polaris12_ce_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_ce_2_bin' version 0: 8852 bytes loaded at 0xffff0001879d4620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_ce_2.bin'
<6>[drm] Chained IB support enabled!
polaris12_rlc.bin: could not load firmware image, error 2
amdgpu/polaris12_rlc.bin: could not load firmware image, error 2
amdgpu_polaris12_rlc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_rlc_bin' version 0: 16660 bytes loaded at 0xffff000188a00620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_rlc.bin'
polaris12_mec_2.bin: could not load firmware image, error 2
amdgpu/polaris12_mec_2.bin: could not load firmware image, error 2
amdgpu_polaris12_mec_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_mec_2_bin' version 0: 262824 bytes loaded at 0xffff000188a25620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec_2.bin'
polaris12_mec2_2.bin: could not load firmware image, error 2
amdgpu/polaris12_mec2_2.bin: could not load firmware image, error 2
amdgpu_polaris12_mec2_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_mec2_2_bin' version 0: 262824 bytes loaded at 0xffff000188a86620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec2_2.bin'
polaris12_sdma.bin: could not load firmware image, error 2
amdgpu/polaris12_sdma.bin: could not load firmware image, error 2
amdgpu_polaris12_sdma.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_sdma_bin' version 0: 12692 bytes loaded at 0xffff000188ae7620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma.bin'
polaris12_sdma1.bin: could not load firmware image, error 2
amdgpu/polaris12_sdma1.bin: could not load firmware image, error 2
amdgpu_polaris12_sdma1.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_sdma1_bin' version 0: 12692 bytes loaded at 0xffff000188b0b620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma1.bin'
polaris12_uvd.bin: could not load firmware image, error 2
amdgpu/polaris12_uvd.bin: could not load firmware image, error 2
amdgpu_polaris12_uvd.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_uvd_bin' version 0: 375424 bytes loaded at 0xffff000188b2f620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_uvd.bin'
<6>[drm] Found UVD firmware Version: 1.130 Family ID: 16
polaris12_vce.bin: could not load firmware image, error 2
amdgpu/polaris12_vce.bin: could not load firmware image, error 2
amdgpu_polaris12_vce.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_vce_bin' version 0: 166816 bytes loaded at 0xffff000188bac620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_vce.bin'
<6>[drm] Found VCE firmware Version: 53.26 Binary ID: 3
polaris12_smc.bin: could not load firmware image, error 2
amdgpu/polaris12_smc.bin: could not load firmware image, error 2
amdgpu_polaris12_smc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_smc_bin' version 0: 130388 bytes loaded at 0xffff000189200620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_smc.bin'
<6>[drm] DM_PPLIB: values for Engine clock
<6>[drm] DM_PPLIB:       214000
<6>[drm] DM_PPLIB:       551000
<6>[drm] DM_PPLIB:       734000
<6>[drm] DM_PPLIB:       980000
<6>[drm] DM_PPLIB:       1046000
<6>[drm] DM_PPLIB:       1098000
<6>[drm] DM_PPLIB:       1124000
<6>[drm] DM_PPLIB:       1206000
<6>[drm] DM_PPLIB: Validation clocks:
<6>[drm] DM_PPLIB:    engine_max_clock: 120600
<6>[drm] DM_PPLIB:    memory_max_clock: 175000
<6>[drm] DM_PPLIB:    level           : 8
<6>[drm] DM_PPLIB: values for Memory clock
<6>[drm] DM_PPLIB:       300000
<6>[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
<6>[drm] Driver supports precise vblank timestamp query.
<6>[drm] UVD and UVD ENC initialized successfully.
<6>[drm] VCE initialized successfully.
Failed to add WC MTRR for [0xa400000000-0xa40fffffff]: -45; performance may suffer
<6>[drm] fb mappable at 0xA401340000
<6>[drm] vram apper at 0xA400000000
<6>[drm] size 14745600
<6>[drm] fb depth is 24
<6>[drm]    pitch is 10240
WARNING: Device "fb" is Giant locked and may be deleted before FreeBSD 14.0.
VT: initialize with new VT driver "fb".
taskqueue_drain with the following non-sleepable locks held:
exclusive sleep mutex vtdev (vtdev) r = 0 (0xffff000000aacad0) locked @ /usr/src/sys/dev/vt/vt_core.c:3012
stack backtrace:
#0 0xffff0000004de88c at witness_debugger+0x64
#1 0xffff0000004dfa20 at witness_warn+0x400
#2 0xffff0000004d1c14 at taskqueue_drain+0x34
#3 0xffff000179b7ecc8 at vt_kms_postswitch+0x78
#4 0xffff000000311558 at vt_fb_init+0x158
#5 0xffff0000003164d8 at vt_replace_backend+0x10c
#6 0xffff000000311604 at vt_fb_attach+0x14
#7 0xffff000179b7f5dc at linux_register_framebuffer+0x45c
#8 0xffff000179b86234 at __drm_fb_helper_initial_config_and_unlock+0x3ec
#9 0xffff0001876d7030 at amdgpu_fbdev_init+0xd8
#10 0xffff0001876ce860 at amdgpu_device_init+0x1cf4
#11 0xffff0001876e1d1c at amdgpu_driver_load_kms+0x48
#12 0xffff000179b57148 at drm_dev_register+0xcc
#13 0xffff0001876d6464 at amdgpu_pci_probe+0x210
#14 0xffff000179bcc4e0 at linux_pci_attach_device+0x294
#15 0xffff0000004aa50c at device_attach+0x400
#16 0xffff0000004aa074 at device_probe_and_attach+0x7c
#17 0xffff0000004ac134 at bus_generic_driver_added+0x74
start FB_INFO:
type=11 height=1440 width=2560 depth=32
cmsize=16 size=14745600
pbase=0xa401340000 vbase=0xffff000189400000
name=drmn0 flags=0x0 stride=10240 bpp=32
cmap[0]=0 cmap[1]=7f0000 cmap[2]=7f00 cmap[3]=c4a000
end FB_INFO
drmn0: fb0: amdgpudrmfb frame buffer device
<6>[drm] Initialized amdgpu 3.35.0 20150101 for drmn0 on minor 0
Loaded amdgpu, id=11
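
For reference, the hw.amdgpu.pcie_gen_cap=0x00040004 setting that was dropped for this run looks like the same mask the second hunk hardcoded. The sketch below assumes the two constants have the values I remember from amd_pcie.h in the amdgpu sources; treat them as an assumption and check your own tree. `pcie_gen3_mask` is a hypothetical helper, not a driver function.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed to match amd_pcie.h; verify against the actual header. */
#define CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3	0x00040000u	/* upper 16 bits */
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3	0x00000004u	/* lower 16 bits */

/* Hypothetical helper: the mask the second hunk assigns to pcie_gen_mask. */
static uint32_t
pcie_gen3_mask(void)
{
	return (CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 |
	    CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3);
}

/* Self-check: under the assumed values, the mask equals the tunable value. */
static void
check_gen3_mask(void)
{
	assert(pcie_gen3_mask() == 0x00040004u);
}
```

If those values hold, the tunable and the hardcoded assignment request the same thing, so a clean load without either suggests the link generation was not the problem.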

valpackett · 2021-07-01T16:20:45Z

So that confirms the issue 🥳 🎉 🚀 The "suspicious overlap" wasn't the issue itself, but it did eventually point me in the right direction.

Now, using only the BUS_TRANSLATE_RESOURCE patch should result in the same successful load… hopefully…?

Funny how it still logs

<6>[drm] register mmio base: 0x40000000

because the logging statement truncates the address to a 32-bit type.
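
A small sketch of that truncation, assuming the log line goes through a cast to a 32-bit type before printing (I have not quoted the driver's exact format string here). `log_view_of_addr` is a hypothetical helper for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring a cast to a 32-bit type before printing. */
static uint32_t
log_view_of_addr(uint64_t addr)
{
	return ((uint32_t)addr);	/* keeps only the low 32 bits */
}

/* Self-check: the translated BAR 5 address prints like the untranslated one. */
static void
check_truncation(void)
{
	assert(log_view_of_addr(0xa040000000ULL) == 0x40000000u);
}
```

So a logged base of 0x40000000 does not tell you whether the driver is holding the bus address or the translated 0xa040000000.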

agrajag9 · 2021-07-01T17:32:41Z

Well this is interesting:

# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
drmn0: translate of 0xa400000000 failed
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
drmn0: translate of 0x40000000 failed
<6>[drm] register mmio base: 0x00000000
<6>[drm] register mmio size: 262144
<6>[drm] add ip block number 0 <vi_common>
<6>[drm] add ip block number 1 <gmc_v8_0>
<6>[drm] add ip block number 2 <tonga_ih>
<6>[drm] add ip block number 3 <gfx_v8_0>
<6>[drm] add ip block number 4 <sdma_v3_0>
<6>[drm] add ip block number 5 <powerplay>
<6>[drm] add ip block number 6 <dm>
<6>[drm] add ip block number 7 <uvd_v6_0>
<6>[drm] add ip block number 8 <vce_v3_0>
<6>[drm] UVD is enabled in VM mode
<6>[drm] UVD ENC is enabled in VM mode
<6>[drm] VCE enabled in VM mode
drmn0: translate of 0xa410000000 failed
<6>[drm] GPU posting now...
[drm ERROR :atom_op_jump] atombios stuck in loop for more than 10secs aborting
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing AD44 (len 428, WS 20, PS 0) @ 0xAE76
[drm ERROR :amdgpu_atom_execute_table_locked] atombios stuck executing A984 (len 158, WS 0, PS 8) @ 0xA9E7
drmn0: gpu post error!
drmn0: Fatal error during GPU init
<6>[drm] amdgpu: finishing device.
Warning: can't remove non-dynamic nodes (dri)!
device_attach: drmn0 attach returned 22
Loaded amdgpu, id=11

It still tries to POST and fails, but now it doesn't panic?

valpackett · 2021-07-01T17:45:40Z

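Oh, I didn't check what return value was expected, sorry *facepalm* The bus method has to return 0 on success, so the bool from the existing helper needs to be inverted: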

--- i/sys/dev/pci/pci_host_generic.c
+++ w/sys/dev/pci/pci_host_generic.c
@@ -324,7 +324,7 @@ pci_host_generic_core_release_resource(device_t dev, device_t child, int type,
 }
 
 static bool
-generic_pcie_translate_resource(device_t dev, int type, rman_res_t start,
+generic_pcie_translate_resource_end(device_t dev, int type, rman_res_t start,
     rman_res_t end, rman_res_t *new_start, rman_res_t *new_end)
 {
 	struct generic_pcie_core_softc *sc;
@@ -380,6 +380,16 @@ generic_pcie_translate_resource(device_t dev, int type, rman_res_t start,
 	return (found);
 }
 
+static int
+generic_pcie_translate_resource(device_t bus, int type,
+    rman_res_t start, rman_res_t *newstart)
+{
+	rman_res_t newend; /* unused */
+
+	return (!generic_pcie_translate_resource_end(
+	    bus, type, start, 0, newstart, &newend));
+}
+
 struct resource *
 pci_host_generic_core_alloc_resource(device_t dev, device_t child, int type,
     int *rid, rman_res_t start, rman_res_t end, rman_res_t count, u_int flags)
@@ -404,7 +414,7 @@ pci_host_generic_core_alloc_resource(device_t dev, device_t child, int type,
 		    type, rid, start, end, count, flags));
 
 	/* Translate the address from a PCI address to a physical address */
-	if (!generic_pcie_translate_resource(dev, type, start, end, &phys_start,
+	if (!generic_pcie_translate_resource_end(dev, type, start, end, &phys_start,
 	    &phys_end)) {
 		device_printf(dev,
 		    "Failed to translate resource %jx-%jx type %x for %s\n",
@@ -456,7 +466,7 @@ generic_pcie_activate_resource(device_t dev, device_t child, int type,
 
 	start = rman_get_start(r);
 	end = rman_get_end(r);
-	if (!generic_pcie_translate_resource(dev, type, start, end, &start,
+	if (!generic_pcie_translate_resource_end(dev, type, start, end, &start,
 	    &end))
 		return (EINVAL);
 	rman_set_start(r, start);
@@ -527,6 +537,7 @@ static device_method_t generic_pcie_methods[] = {
 	DEVMETHOD(bus_activate_resource,	generic_pcie_activate_resource),
 	DEVMETHOD(bus_deactivate_resource,	generic_pcie_deactivate_resource),
 	DEVMETHOD(bus_release_resource,		pci_host_generic_core_release_resource),
+	DEVMETHOD(bus_translate_resource,	generic_pcie_translate_resource),
 	DEVMETHOD(bus_setup_intr,		bus_generic_setup_intr),
 	DEVMETHOD(bus_teardown_intr,		bus_generic_teardown_intr),
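
The return-value mix-up that this patch fixes can be sketched in isolation. The helper below is a hypothetical stand-in that only knows the BAR 5 address from the log; the point is the wrapper, which turns a "found" bool into an errno-style int the same way the `!` in the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bool-returning helper: true on a match. */
static bool
lookup_found(uint64_t start, uint64_t *new_start)
{
	assert(new_start != NULL);
	if (start == 0x40000000ULL) {	/* BAR 5 address from the log */
		*new_start = 0xa040000000ULL;
		return (true);
	}
	return (false);
}

/* Errno-style wrapper: 0 on success, non-zero on failure. */
static int
translate_errno_style(uint64_t start, uint64_t *new_start)
{
	return (!lookup_found(start, new_start));
}
```

A caller that treats non-zero as failure would report "translate of 0x40000000 failed" if it were handed the raw bool, even though the lookup succeeded, which is consistent with the previous log.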

agrajag9 · 2021-07-01T19:48:52Z

That did it!

# kldload -v amdgpu
anon_inodefs registered
debugfs registered
<6>[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
<6>[drm] register mmio base: 0x40000000
<6>[drm] register mmio size: 262144
<6>[drm] add ip block number 0 <vi_common>
<6>[drm] add ip block number 1 <gmc_v8_0>
<6>[drm] add ip block number 2 <tonga_ih>
<6>[drm] add ip block number 3 <gfx_v8_0>
<6>[drm] add ip block number 4 <sdma_v3_0>
<6>[drm] add ip block number 5 <powerplay>
<6>[drm] add ip block number 6 <dm>
<6>[drm] add ip block number 7 <uvd_v6_0>
<6>[drm] add ip block number 8 <vce_v3_0>
<6>[drm] UVD is enabled in VM mode
<6>[drm] UVD ENC is enabled in VM mode
<6>[drm] VCE enabled in VM mode
<6>[drm] vm size is 128 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
polaris12_mc.bin: could not load firmware image, error 2
amdgpu/polaris12_mc.bin: could not load firmware image, error 2
amdgpu_polaris12_mc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_mc_bin' version 0: 32608 bytes loaded at 0xffff000179b0c620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mc.bin'
drmn0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
drmn0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
<6>[drm] Detected VRAM RAM=4096M, BAR=256M
<6>[drm] RAM width 128bits GDDR5
<6>[drm] amdgpu: 4096M of VRAM memory ready
<6>[drm] amdgpu: 4096M of GTT memory ready.
<6>[drm] GART: num cpu pages 65536, num gpu pages 65536
<6>[drm] PCIE GART of 256M enabled (table at 0x000000F400E10000).
vgapci0: attempting to allocate 1 MSI vectors (1 supported)
vgapci0: using IRQ 30 for MSI
polaris12_pfp_2.bin: could not load firmware image, error 2
amdgpu/polaris12_pfp_2.bin: could not load firmware image, error 2
amdgpu_polaris12_pfp_2.bin: could not load firmware image, error 2
b kernel: Failed to add firmware: 'amdgpu_polaris12_pfp_2_bin' version 0: 17044 bytes loaded at 0xffff000179b35620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_pfp_2.bin'
polaris12_me_2.bin: could not load firmware image, error 2
amdgpu/polaris12_me_2.bin: could not load firmware image, error 2
amdgpu_polaris12_me_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_me_2_bin' version 0: 17044 bytes loaded at 0xffff000179b5a620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_me_2.bin'
polaris12_ce_2.bin: could not load firmware image, error 2
amdgpu/polaris12_ce_2.bin: could not load firmware image, error 2
amdgpu_polaris12_ce_2.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_ce_2_bin' version 0: 8852 bytes loaded at 0xffff000179b7f620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_ce_2.bin'
<6>[drm] Chained IB support enabled!
polaris12_rlc.bin: could not load firmware image, error 2
amdgpu/polaris12_rlc.bin: could not load firmware image, error 2
amdgpu_polaris12_rlc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_rlc_bin' version 0: 16660 bytes loaded at 0xffff000179ba2620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_rlc.bin'
polaris12_mec_2.bin: could not load firmware image, error 2
amdgpu/polaris12_mec_2.bin: could not load firmware image, error 2
amdgpu_polaris12_mec_2.bin: could not load firmware image, error 2
WC MTRR for [0xa40000000firmware: 'amdgpu_polaris12_mec_2_bin' version 0: 262824 bytes loaded at 0xffff00017c000620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec_2.bin'
polaris12_mec2_2.bin: could not load firmware image, error 2
amdgpu/polaris12_mec2_2.bin: could not load firmware image, error 2
amdgpu_polaris12_mec2_2.bin: could not load firmware image, error 2
0-0xa40fffffff]: -45; pefirmware: 'amdgpu_polaris12_mec2_2_bin' version 0: 262824 bytes loaded at 0xffff00017c061620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_mec2_2.bin'
polaris12_sdma.bin: could not load firmware image, error 2
amdgpu/polaris12_sdma.bin: could not load firmware image, error 2
amdgpu_polaris12_sdma.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_sdma_bin' version 0: 12692 bytes loaded at 0xffff000179bc7620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma.bin'
polaris12_sdma1.bin: could not load firmware image, error 2
amdgpu/polaris12_sdma1.bin: could not load firmware image, error 2
amdgpu_polaris12_sdma1.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_sdma1_bin' version 0: 12692 bytes loaded at 0xffff00017c0c2620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_sdma1.bin'
polaris12_uvd.bin: could not load firmware image, error 2
amdgpu/polaris12_uvd.bin: could not load firmware image, error 2
amdgpu_polaris12_uvd.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_uvd_bin' version 0: 375424 bytes loaded at 0xffff00017c0e6620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_uvd.bin'
<6>[drm] Found UVD firmware Version: 1.130 Family ID: 16
polaris12_vce.bin: could not load firmware image, error 2
amdgpu/polaris12_vce.bin: could not load firmware image, error 2
amdgpu_polaris12_vce.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_vce_bin' version 0: 166816 bytes loaded at 0xffff00017c163620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_vce.bin'
<6>[drm] Found VCE firmware Version: 53.26 Binary ID: 3
polaris12_smc.bin: could not load firmware image, error 2
amdgpu/polaris12_smc.bin: could not load firmware image, error 2
amdgpu_polaris12_smc.bin: could not load firmware image, error 2
firmware: 'amdgpu_polaris12_smc_bin' version 0: 130388 bytes loaded at 0xffff00017c1ad620
drmn0: successfully loaded firmware image 'amdgpu/polaris12_smc.bin'
<6>[drm] DM_PPLIB: values for Engine clock
<6>[drm] DM_PPLIB:       214000
<6>[drm] DM_PPLIB:       551000
<6>[drm] DM_PPLIB:       734000
<6>[drm] DM_PPLIB:       980000
<6>[drm] DM_PPLIB:       1046000
<6>[drm] DM_PPLIB:       1098000
<6>[drm] DM_PPLIB:       1124000
<6>[drm] DM_PPLIB:       1206000
<6>[drm] DM_PPLIB: Validation clocks:
<6>[drm] DM_PPLIB:    engine_max_clock: 120600
<6>[drm] DM_PPLIB:    memory_max_clock: 175000
<6>[drm] DM_PPLIB:    level           : 8
<6>[drm] DM_PPLIB: values for Memory clock
<6>[drm] DM_PPLIB:       300000
<6>[drm] DM_PPLIB:       625000
<6>[drm] DM_PPLIB:       1750000
<6>[drm] DM_PPLIB: Validation clocks:
<6>[drm] DM_PPLIB:    engine_max_clock: 120600
<6>[drm] DM_PPLIB:    memory_max_clock: 175000
<6>[drm] DM_PPLIB:    level           : 8
<6>[drm] Display Core initialized with v3.2.48!
<6>[drm] Connector DP-1: get mode from tunables:
<6>[drm]   - kern.vt.fb.modes.DP-1
<6>[drm]   - kern.vt.fb.default_mode
<6>[drm] Connector HDMI-A-1: get mode from tunables:
<6>[drm]   - kern.vt.fb.modes.HDMI-A-1
<6>[drm]   - kern.vt.fb.default_mode
<6>[drm] Connector DVI-D-1: get mode from tunables:
<6>[drm]   - kern.vt.fb.modes.DVI-D-1
<6>[drm]   - kern.vt.fb.default_mode
<6>[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
<6>[drm] Driver supports precise vblank timestamp query.
<6>[drm] UVD and UVD ENC initialized successfully.
<6>[drm] VCE initialized successfully.
<6>[drm] fb mappable at 0xA401340000
<6>[drm] vram apper at 0xA400000000
<6>[drm] size 14745600
<6>[drm] fb depth is 24
<6>[drm]    pitch is 10240
WARNING: Device "fb" is Giant locked and may be deleted before FreeBSD 14.0.
VT: initialize with new VT driver "fb".
taskqueue_drain with the following non-sleepable locks held:
exclusive sleep mutex vtdev (vtdev) r = 0 (0xffff000000aacc50) locked @ /usr/src/sys/dev/vt/vt_core.c:3012
stack backtrace:
#0 0xffff0000004de8f8 at witness_debugger+0x64
#1 0xffff0000004dfa8c at witness_warn+0x400
#2 0xffff0000004d1c80 at taskqueue_drain+0x34
#3 0xffff00017ab36cc8 at vt_kms_postswitch+0x78
#4 0xffff000000311558 at vt_fb_init+0x158
#5 0xffff0000003164d8 at vt_replace_backend+0x10c
#6 0xffff000000311604 at vt_fb_attach+0x14
#7 0xffff00017ab375dc at linux_register_framebuffer+0x45c
#8 0xffff00017ab3e234 at __drm_fb_helper_initial_config_and_unlock+0x3ec
#9 0xffff00017a8d701c at amdgpu_fbdev_init+0xd8
#10 0xffff00017a8ce84c at amdgpu_device_init+0x1ce0
#11 0xffff00017a8e1d08 at amdgpu_driver_load_kms+0x48
#12 0xffff00017ab0f148 at drm_dev_register+0xcc
#13 0xffff00017a8d6450 at amdgpu_pci_probe+0x210
#14 0xffff00017ab844e0 at linux_pci_attach_device+0x294
#15 0xffff0000004aa578 at device_attach+0x400
#16 0xffff0000004aa0e0 at device_probe_and_attach+0x7c
#17 0xffff0000004ac1a0 at bus_generic_driver_added+0x74
start FB_INFO:
type=11 height=1440 width=2560 depth=32
cmsize=16 size=14745600
pbase=0xa401340000 vbase=0xffff00017c65f000
name=drmn0 flags=0x0 stride=10240 bpp=32
cmap[0]=0 cmap[1]=7f0000 cmap[2]=7f00 cmap[3]=c4a000
end FB_INFO
drmn0: fb0: amdgpudrmfb frame buffer device
<6>[drm] Initialized amdgpu 3.35.0 20150101 for drmn0 on minor 0

Now to see what happens if I run glmark2...

agrajag9 · 2021-07-03T10:01:20Z

Confirmed: glmark2 runs at >5k FPS with GL_RENDERER: Radeon RX550/550 Series (POLARIS12, DRM 3.35.0, 14.0-CURRENT, LLVM 10.0.1)! 🥳

https://gist.github.com/agrajag9/62cbce1b92662079e34c8d445eb4116d

I think we can call this one solved and move further discussion to https://reviews.freebsd.org/D30986 👍

In D21096 BUS_TRANSLATE_RESOURCE was introduced to allow LinuxKPI to get physical addresses in pci_resource_start for PowerPC and implemented in ofw_pci. When the translation was implemented in pci_host_generic in 372c142, this method was not implemented; instead a local static function was added for a similar purpose.

Rename the static function to "_common" and implement the bus function as a wrapper around that. With this a LinuxKPI driver using physical addresses correctly finds the configuration registers of the GPU.

This unbreaks amdgpu on NXP Layerscape LX2160A SoC (SolidRun HoneyComb LX2K workstation) which has a Translation Offset in ACPI for below-4G PCI addresses.

More info: freebsd/drm-kmod#84
Tested by: dan.kotowski_a9development.com
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D30986
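
For readers who have not followed the review, the shape of the change described in the commit message can be sketched roughly as below. This is a standalone illustration, not the code from D30986: the struct layouts, the function names `translate_resource_common` and `bridge_translate_resource`, and the window addresses used with it are all hypothetical. It only shows the general idea of one shared helper that applies a host-bridge translation offset, with a thin bus-method-style wrapper around it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of one host-bridge window: PCI bus addresses
 * [pci_base, pci_base + size) are visible to the CPU starting at phys_base.
 */
struct pci_range {
	uint64_t pci_base;
	uint64_t phys_base;
	uint64_t size;
};

/* Hypothetical stand-in for the bridge softc that owns the windows. */
struct host_bridge {
	const struct pci_range *ranges;
	size_t nranges;
};

/*
 * Shared helper (the "_common" function in the commit's terms).
 * Translates the PCI bus range [start, end] into CPU physical addresses.
 * Returns 0 on success, -1 if no single window covers the whole range.
 */
int
translate_resource_common(const struct pci_range *ranges, size_t nranges,
    uint64_t start, uint64_t end, uint64_t *new_start, uint64_t *new_end)
{
	assert(ranges != NULL || nranges == 0);

	for (size_t i = 0; i < nranges; i++) {
		const struct pci_range *r = &ranges[i];

		if (r->size == 0)
			continue;

		/* Last PCI address covered by this window. */
		uint64_t last = r->pci_base + (r->size - 1);

		if (start <= end && start >= r->pci_base && end <= last) {
			/* The translation offset is constant within a window. */
			uint64_t offset = r->phys_base - r->pci_base;

			*new_start = start + offset;
			*new_end = end + offset;
			return (0);
		}
	}
	return (-1);
}

/*
 * Thin wrapper in the style of a bus method: callers that only need the
 * translated start address (as pci_resource_start does) go through here.
 */
int
bridge_translate_resource(const struct host_bridge *sc, uint64_t start,
    uint64_t *new_start)
{
	uint64_t new_end;

	return (translate_resource_common(sc->ranges, sc->nranges,
	    start, start, new_start, &new_end));
}
```

With a made-up window that maps PCI address 0x40000000 to physical 0xA040000000, the wrapper would turn a BAR at 0x40001000 into 0xA040001000, and return -1 for an address outside the window. The real offsets on the HoneyComb come from the ACPI tables and are not reproduced here.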

agrajag9 closed this as completed Jun 12, 2021

agrajag9 reopened this Jun 12, 2021

agrajag9 closed this as completed Jun 14, 2021

agrajag9 reopened this Jun 14, 2021

agrajag9 closed this as completed Jul 5, 2021
