Open
Description
Test environment:
-kernel 5.14.0
-qemu-kvm-9.1
-libvirt-10.10.0
-gim-dkms-8.1.0.K-0.noarch
-MI210 or MI300X
How to reproduce the issue:
- create the MxGPU from MI210 or MI300X
# modprobe gim
- make sure the VM has AMD GPU/vGPU driver installed
# cat /etc/yum.repos.d/amdgpu.repo
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/latest/rhel/9.6/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
# cat rocm.repo
[ROCm]
name=ROCm
baseurl=https://repo.radeon.com/rocm/el9/latest/main/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
# dnf -y install amdgpu-dkms rocm
# rpm -q amdgpu-dkms rocm
- start a VM with AMD MxGPU
- check the vGPU status in the VM
# amd-smi list
- reset the VM 5 times
# /bin/virsh reset --domain rhel96
- check the VM dmesg via console
[ 10.912244] [drm] amdgpu kernel modesetting enabled.
[ 10.912655] amdgpu: Virtual CRAT table created for CPU
[ 10.912812] amdgpu: Topology: Add CPU node
[ 10.913253] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x74B5 0x1002:0x74A1 0x00).
[ 10.913920] [drm] register mmio base: 0x82400000
[ 10.914054] [drm] register mmio size: 2097152
[ 10.919517] [drm] host supports REQ_INIT_DATA handshake
[ 10.919782] [drm] MCBP is enabled
[ 25.950941] [drm] add ip block number 0 <soc15_common>
[ 25.951334] [drm] add ip block number 1 <gmc_v9_0>
[ 25.951645] [drm] add ip block number 2 <psp>
[ 25.951908] [drm] add ip block number 3 <vega20_ih>
[ 25.952162] [drm] add ip block number 4 <smu>
[ 25.952391] [drm] add ip block number 5 <gfx_v9_4_3>
[ 25.952682] [drm] add ip block number 6 <sdma_v4_4_2>
[ 25.952926] [drm] add ip block number 7 <vcn_v4_0_3>
[ 25.953179] [drm] add ip block number 8 <jpeg_v4_0_3>
[ 25.958915] amdgpu 0000:04:00.0: amdgpu: Fetched VBIOS from VRAM BAR
[ 25.959159] amdgpu: ATOM BIOS: 113-M3000100-102
[ 25.959450] amdgpu 0000:04:00.0: Direct firmware load for amdgpu/psp_13_0_6_cap.bin failed with error -2
[ 25.959727] amdgpu 0000:04:00.0: amdgpu: cap microcode does not exist, skip
[ 25.962247] amdgpu 0000:04:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 25.962785] amdgpu 0000:04:00.0: amdgpu: MEM ECC is active.
[ 25.963014] amdgpu 0000:04:00.0: amdgpu: SRAM ECC is active.
[ 25.963226] amdgpu 0000:04:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[18127] ras_mask[18127]
[ 25.963463] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 25.963739] amdgpu 0000:04:00.0: amdgpu: VRAM: 196288M 0x0000020000000000 - 0x0000022FEBFFFFFF (196288M used)
[ 25.963955] amdgpu 0000:04:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 25.964298] [drm] Detected VRAM RAM=196288M, BAR=262144M
[ 25.964513] [drm] RAM width 8192bits HBM
[ 25.965383] [drm] amdgpu: 196288M of VRAM memory ready
[ 25.965644] [drm] amdgpu: 515584M of GTT memory ready.
[ 25.965889] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 26.087915] [drm] PCIE GART of 512M enabled.
[ 26.088153] [drm] PTB located at 0x0000020000100000
[ 26.160930] [drm] Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 9
[ 26.165454] [drm] MM table gpu addr = 0x200007a1000, cpu addr = 00000000124a76da.
[ 27.228916] amdgpu 0000:04:00.0: amdgpu: smu driver if version = 0x08042024, smu fw if version = 0x08042027, smu fw program = 0, smu fw version = 0x0055708e (85.112.142)
[ 27.229471] amdgpu 0000:04:00.0: amdgpu: SMU driver if version not matched
[ 27.231535] amdgpu 0000:04:00.0: amdgpu: SMU is initialized successfully!
[ 27.385901] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.387025] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.388196] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.389377] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.390611] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.391815] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.393076] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.394326] [drm] kiq ring mec 2 pipe 1 q 0
[ 27.417626] amdgpu 0000:04:00.0: amdgpu: XGMI: Add node 0, hive 0x2101b92d193f83b3.
[ 28.064738] amdgpu: HMM registered 196288MB device memory
[ 28.069929] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 28.070134] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 28.070296] kfd kfd: amdgpu: KFD node 0 partition 0 size 196288M
[ 28.070856] kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 7777
[ 28.075888] BUG: kernel NULL pointer dereference, address: 00000000000000d7
[ 28.075889] #PF: supervisor read access in kernel mode
[ 28.075890] #PF: error_code(0x0000) - not-present page
[ 28.075892] PGD 119622067 P4D 0
[ 28.075894] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 28.075896] CPU: 3 PID: 738 Comm: systemd-udevd Not tainted 5.14.0-xxx.el9.x86_64 #1
[ 28.075898] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20241117-2.el9 11/17/2024
[ 28.075899] RIP: 0010:kgd_gfx_v9_hiq_mqd_load+0xc1/0x450 [amdgpu]
[ 28.076288] Code: c3 78 03 00 00 45 8b 9c 07 68 6f 01 00 45 85 db 0f 8e 23 03 00 00 48 69 c3 78 03 00 00 c1 e5 0d 4c 01 f8 48 8b 90 50 6f 01 00 <a0> d7 00 00 00 00 00 00 00 00 00 00 00 00 00 01 53 e0 0b 00 00 00
[ 28.076289] RSP: 0018:ff67ee13811775e8 EFLAGS: 00010286
[ 28.076290] RAX: ff2ca365f0900000 RBX: 0000000000000000 RCX: ff67ee1381600000
[ 28.076291] RDX: 0000000000000200 RSI: 0000000000000106 RDI: ff2ca365f0916d68
[ 28.076292] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 28.076292] R10: ffffffffc0aa3220 R11: 0000000000000100 R12: ff67ee1381539000
[ 28.076293] R13: 0000000000000000 R14: ff2ca365f0916d60 R15: ff2ca365f0900000
[ 28.076294] FS: 00007f96628a9b40(0000) GS:ff2ca4613f6c0000(0000) knlGS:0000000000000000
[ 28.076295] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 28.076296] CR2: 00000000000000d7 CR3: 0000000119640003 CR4: 0000000000771ef0
[ 28.076300] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 28.076301]
[ 28.076301] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 28.076301] PKRU: 55555554
[ 28.076302] Call Trace:
[ 28.076303] <TASK>
[ 28.076304] ? show_trace_log_lvl+0x1c4/0x2df
[ 28.076310] ? show_trace_log_lvl+0x1c4/0x2df
[ 28.076312] ? hiq_load_mqd_kiq_v9_4_3+0xbc/0x120 [amdgpu]
[ 28.076591] ? __die_body.cold+0x8/0xd
[ 28.076594] ? page_fault_oops+0x132/0x170
[ 28.076599] ? exc_page_fault+0x61/0x150
[ 28.076601] ? asm_exc_page_fault+0x22/0x30
[ 28.076605] ? __pfx_hiq_load_mqd_kiq_v9_4_3+0x10/0x10 [amdgpu]
[ 28.076885] ? kgd_gfx_v9_hiq_mqd_load+0xc1/0x450 [amdgpu]
[ 28.077188] ? kgd_gfx_v9_hiq_mqd_load+0x89/0x450 [amdgpu]
[ 28.077476] hiq_load_mqd_kiq_v9_4_3+0xbc/0x120 [amdgpu]
[ 28.077759] kq_initialize.constprop.0+0x312/0x450 [amdgpu]
[ 28.078039] kernel_queue_init+0x3c/0x60 [amdgpu]
[ 28.078306] pm_init+0x64/0xd0 [amdgpu]
[ 28.078571] start_cpsch+0x1a4/0x2c0 [amdgpu]
[ 28.078849] kfd_resume+0x18/0x36 [amdgpu]
[ 28.079166] kfd_init_node+0x15e/0x1de [amdgpu]
[ 28.079460] kgd2kfd_device_init.cold+0x46f/0x6ce [amdgpu]
[ 28.079748] amdgpu_amdkfd_device_init+0x141/0x1e0 [amdgpu]
[ 28.080044] amdgpu_device_ip_init+0x4b4/0x4cc [amdgpu]
[ 28.080347] amdgpu_device_init.cold+0x6ef/0xbd6 [amdgpu]
[ 28.080641] amdgpu_driver_load_kms+0x15/0x70 [amdgpu]
[ 28.080877] amdgpu_pci_probe+0x18d/0x3d0 [amdgpu]
[ 28.081107] ? rpm_resume+0x28e/0x770
[ 28.081112] local_pci_probe+0x4c/0xa0
[ 28.081116] pci_call_probe+0x56/0x160
[ 28.081118] pci_device_probe+0x7c/0x100
[ 28.081120] ? driver_sysfs_add+0x59/0xc0
[ 28.081124] really_probe+0xde/0x390
[ 28.081126] ? pm_runtime_barrier+0x50/0x90
[ 28.081128] __driver_probe_device+0xd6/0x130
[ 28.081130] driver_probe_device+0x1e/0x90
[ 28.081132] __driver_attach+0xd2/0x1c0
[ 28.081134] ? __pfx___driver_attach+0x10/0x10
[ 28.081136] bus_for_each_dev+0x75/0xd0
[ 28.081139] bus_add_driver+0xc2/0x1f0
[ 28.081141] driver_register+0x70/0xd0
[ 28.081142] ? __pfx_init_module+0x10/0x10 [amdgpu]
[ 28.081360] do_one_initcall+0x41/0x210
[ 28.081365] do_init_module+0x64/0x230
[ 28.081368] __do_sys_init_module+0x12e/0x1b0
[ 28.081371] do_syscall_64+0x5c/0xe0
[ 28.081374] ? __mod_memcg_lruvec_state+0x8a/0x120
[ 28.081379] ? __mod_lruvec_page_state+0x97/0x150
[ 28.081381] ? folio_add_new_anon_rmap+0x41/0xb0
[ 28.081384] ? _raw_spin_unlock+0xa/0x30
[ 28.081388] ? do_anonymous_page+0x1bb/0x3e0
[ 28.081391] ? __handle_mm_fault+0x2fe/0x650
[ 28.081394] ? __count_memcg_events+0x4f/0xb0
[ 28.081395] ? mm_account_fault+0x6c/0x100
[ 28.081397] ? handle_mm_fault+0x120/0x250
[ 28.081398] ? do_user_addr_fault+0x35d/0x620
[ 28.081399] ? clear_bhb_loop+0x25/0x80
[ 28.081402] ? clear_bhb_loop+0x25/0x80
[ 28.081404] ? clear_bhb_loop+0x25/0x80
[ 28.081406] ? clear_bhb_loop+0x25/0x80
[ 28.081408] ? clear_bhb_loop+0x25/0x80
[ 28.081409] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 28.081412] RIP: 0033:0x7f96635c24ae
[ 28.081415] Code: 48 8b 0d 6d 99 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3a 99 0e 00 f7 d8 64 89 01 48
[ 28.081416] RSP: 002b:00007ffdcf018ba8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 28.081417] RAX: ffffffffffffffda RBX: 000055fb6d28f750 RCX: 00007f96635c24ae
[ 28.081418] RDX: 00007f966372032c RSI: 0000000001d83408 RDI: 00007f9660689010
[ 28.081419] RBP: 00007f9660689010 R08: 000055fb6d2c6d70 R09: 0000000001d83000
[ 28.081419] R10: 0000000000000005 R11: 0000000000000246 R12: 00007f966372032c
[ 28.081420] R13: 000055fb6d2c0990 R14: 0000000000000007 R15: 000055fb6d2c50d0
[ 28.081421] </TASK>
[ 28.081421] Modules linked in: amdgpu(+) video wmi amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_display_helper drm_kms_helper nvme_tcp nvme_fabrics drm nvme_core ahci crct10dif_pclmul libahci crc32_pclmul crc32c_intel nvme_keyring nvme_auth libata virtio_net ghash_clmulni_intel virtio_blk cec net_failover failover serio_raw dm_mirror dm_region_hash dm_log dm_mod
[ 28.081438] CR2: 00000000000000d7
[ 28.097518] ---[ end trace 0000000000000000 ]---
[ 28.097519] RIP: 0010:kgd_gfx_v9_hiq_mqd_load+0xc1/0x450 [amdgpu]
[ 28.098140] Code: c3 78 03 00 00 45 8b 9c 07 68 6f 01 00 45 85 db 0f 8e 23 03 00 00 48 69 c3 78 03 00 00 c1 e5 0d 4c 01 f8 48 8b 90 50 6f 01 00 <a0> d7 00 00 00 00 00 00 00 00 00 00 00 00 00 01 53 e0 0b 00 00 00
[ 28.098512] RSP: 0018:ff67ee13811775e8 EFLAGS: 00010286
[ 28.098702] RAX: ff2ca365f0900000 RBX: 0000000000000000 RCX: ff67ee1381600000
[ 28.098895] RDX: 0000000000000200 RSI: 0000000000000106 RDI: ff2ca365f0916d68
[ 28.099094] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 28.099287] R10: ffffffffc0aa3220 R11: 0000000000000100 R12: ff67ee1381539000
[ 28.099481] R13: 0000000000000000 R14: ff2ca365f0916d60 R15: ff2ca365f0900000
[ 28.099677] FS: 00007f96628a9b40(0000) GS:ff2ca4613f6c0000(0000) knlGS:0000000000000000
[ 28.099876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 28.100080] CR2: 00000000000000d7 CR3: 0000000119640003 CR4: 0000000000771ef0
[ 28.100285] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 28.100489] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 28.100690] PKRU: 55555554
[ 28.100890] Kernel panic - not syncing: Fatal exception
[ 29.195807] Shutting down cpus with NMI
[ 29.196242] Kernel Offset: 0x20400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 29.197902] ---[ end Kernel panic - not syncing: Fatal exception ]---
- check the host dmesg
[43998.754088] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[43998.819541] gim error libgv: [0:bd:0:0][PF][amdgv_ecc_check_global_ras_errors:388] GPU detected ECC Fatal Error.
[43998.828302] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829235] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829478] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829633] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829661] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829680] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43998.829680] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[43999.312056] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312187] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312286] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312386] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312484] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312584] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312683] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312786] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[43999.312874] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313088] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313306] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313556] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.313857] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.314295] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.314671] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.315122] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[43999.317562] gim error libgv: [0:3d:0:0][PF][mi300_mca_push_unknown_bank_count:236] socket: 0, 1 new hardware errors detected in UNKNOWN Block. 1 total UNKNOWN Block ECC errors since GPU load.
[43999.319990] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[43999.322483] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44057.809137] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44058.367650] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368027] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368369] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368677] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.368989] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.369321] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.369650] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.369992] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44058.370291] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.370903] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.371762] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.372623] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.373596] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.374508] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.375458] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.376424] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44058.377830] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44058.379262] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379340] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379384] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379401] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379410] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.379412] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.382356] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44058.383075] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.067072] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44119.624647] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.625140] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.625549] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.625922] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.626405] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.626767] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.627158] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.627477] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44119.627767] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.628510] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.629174] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.629797] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.630653] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.631513] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.632367] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.633372] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44119.634631] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44119.635910] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637180] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637208] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637211] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637864] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.637889] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.638093] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44119.641235] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.056990] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44180.121605] gim error libgv: [0:5f:0:0][PF][amdgv_ecc_check_global_ras_errors:388] GPU detected ECC Fatal Error.
[44180.131195] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.132200] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.132489] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.132549] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.134103] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.134194] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.134327] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44180.614942] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615239] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615451] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615631] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.615817] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616002] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616182] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616352] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44180.616508] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.616820] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.617227] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.617549] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.617888] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.618312] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.618649] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.618994] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44180.620580] gim error libgv: [0:3d:0:0][PF][mi300_mca_push_unknown_bank_count:236] socket: 0, 1 new hardware errors detected in UNKNOWN Block. 2 total UNKNOWN Block ECC errors since GPU load.
[44180.622399] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44180.623823] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.259070] gim error libgv: [0:3d:0:0][VF00][amdgv_sched_exit_full_access_timeout:1962] VF 0 full access timeout. |start time: 0| - |end time: 44215251092|
[44215.271296] gim error libgv: [0:3d:0:0][VF00][amdgv_reset_vf_flr:187] Issuing FLR on vf: 0.
[44215.343528] gim error libgv: [0:9d:0:0][PF][amdgv_ecc_check_global_ras_errors:388] GPU detected ECC Fatal Error.
[44215.353289] gim error libgv: [0:bd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354194] gim error libgv: [0:9d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354484] gim error libgv: [0:cd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354611] gim error libgv: [0:4e:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354647] gim error libgv: [0:dd:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.354651] gim error libgv: [0:5f:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.355866] gim error libgv: [0:1b:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
[44215.836164] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836370] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836565] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836751] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.836938] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837179] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837367] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837552] gim error libgv: [0:3d:0:0][PF][wait_for_first_cmd_complete:471] Timeout of waiting command complete (500000 us).
[44215.837720] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH0_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.838082] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH1_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.838429] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH2_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.838774] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH3_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.839222] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH4_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.839615] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH5_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.840018] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH6_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44215.840411] gim error libgv: [0:3d:0:0][world_switch_bulk_goto_state_manual:294] WSSM: GFX_SCH7_RLCV Timeout moving from VF0(SHUTDOWN VF) to VF31(LOAD)
[44217.850044] gim error libgv: [0:3d:0:0][PF][amdgv_psp_cmd_km_submit:525] PSP command fence wait failed.
[44217.851135] gim error libgv: [0:3d:0:0][PF][mi300_psp_set_mb_int:474] Failed to execute VF gate command.
[44217.854317] gim error libgv: [0:3d:0:0][PF][mi300_mca_push_unknown_bank_count:236] socket: 0, 1 new hardware errors detected in UNKNOWN Block. 3 total UNKNOWN Block ECC errors since GPU load.
[44217.857788] gim error libgv: [0:3d:0:0][PF][mi300_reset_notify_engine_status:1224] Graphics Virtualization Scheduler has entered an abnormal state
[44217.861241] gim error libgv: [0:3d:0:0][PF][amdgv_reset_gpu:142] Issuing Whole GPU reset.
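For anyone trying to reproduce this, the manual reset step and the host log check above could be scripted roughly as follows. This is only a sketch under assumptions: the `VIRSH` override, the function names, and the 60-second default interval are illustrative and not part of the original report (the interval is only loosely based on the spacing of the FLR messages in the host dmesg).

```shell
#!/bin/sh
# Sketch only: automates the "reset the VM 5 times" step and tallies
# host-side gim messages. VIRSH, the function names and the default
# interval are illustrative assumptions, not taken from the report.

VIRSH="${VIRSH:-/bin/virsh}"

# Reset a libvirt domain COUNT times, sleeping INTERVAL seconds between resets.
# Usage: reset_vm_repeatedly DOMAIN [COUNT] [INTERVAL]
reset_vm_repeatedly() {
    domain="$1"
    count="${2:-5}"
    interval="${3:-60}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "reset $i/$count of $domain"
        "$VIRSH" reset --domain "$domain" || return 1
        # No need to wait after the final reset.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Count selected gim messages in a host dmesg capture read from stdin.
summarize_gim_log() {
    awk '
        /amdgv_reset_vf_flr/            { flr++ }
        /amdgv_reset_gpu/               { whole++ }
        /GPU detected ECC Fatal Error/  { ecc++ }
        /wait_for_first_cmd_complete/   { timeout++ }
        END {
            printf "vf_flr=%d whole_gpu_reset=%d ecc_fatal=%d cmd_timeout=%d\n",
                   flr, whole, ecc, timeout
        }'
}
```

On the host this would be used as `reset_vm_repeatedly rhel96 5 60` followed by `dmesg | summarize_gim_log`; the guest panic itself still has to be confirmed on the VM console.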