
Bug: gpu hang after bde7cd3cd949c1a85d3a199498ac98e78039d46f #7730

Closed
rhjdvsgsgks opened this issue Jun 4, 2024 · 4 comments · Fixed by #7806
Labels
bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable)

Comments

@rhjdvsgsgks
Contributor

rhjdvsgsgks commented Jun 4, 2024

What happened?

After bde7cd3, inferring with any llama3 Q6 model causes a GPU hang. The previous version (a5735e4) is not affected.

Name and Version

bde7cd3
using the Vulkan backend

What operating system are you seeing the problem on?

Linux

Relevant log output

radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Queue::submit: ErrorDeviceLost

dmesg

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=898, emitted seq=899
rhjdvsgsgks added the bug-unconfirmed and medium severity labels on Jun 4, 2024
@rhjdvsgsgks
Contributor Author

I found that VRAM does not increase significantly while inferring, so maybe something else caused the issue.

rhjdvsgsgks closed this as not planned on Jun 6, 2024
rhjdvsgsgks changed the title from "Bug: memory usage increased after update from d7e852c1b to 3b38d4860" to "Bug: gpu hang after bde7cd3cd949c1a85d3a199498ac98e78039d46f" on Jun 6, 2024
@rhjdvsgsgks
Contributor Author

Bisected and found the commit that caused the issue, so keeping it open.

rhjdvsgsgks reopened this on Jun 6, 2024
@rhjdvsgsgks
Contributor Author

See also #7769 and #7640.

@slaren
Collaborator

slaren commented Jun 6, 2024

@0cc4m this is probably my bad, I made some changes to the way views are initialized in ggml-backend that may have created this issue. Views are now initialized in the buffer of their parent tensor, instead of on the compute buffer. The reason I made this change is because I came to the conclusion that allocating views on the compute buffer cannot work reliably because the compute buffer is not always of the same type as the buffer used to allocate the tensor originally, and backends should be able to use the same extra as their parent anyway. I thought it was safe to make this change because the CUDA backend no longer needs extras for normal buffers, but I didn't realize that the vulkan backend still does.

Looking at the ggml_tensor_extra_gpu of the vulkan backend, I think it should be possible to do this; the only change is that you would have to calculate the offset as t->extra->offset + t->view_offs. Essentially, add the offset of the view to the offset of the extra. Does that sound right?
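A minimal sketch of what that could look like on the Vulkan side, assuming the extra keeps an offset field for the parent tensor's position in its buffer (the helper name and exact field names here are hypothetical, not the actual ggml-vulkan code):

// Hypothetical helper: computes where a tensor's data starts inside the
// device buffer that backs it, following the suggestion above.
static size_t ggml_vk_tensor_offset(const struct ggml_tensor * t) {
    // extra is assumed to be the backend's ggml_tensor_extra_gpu for the
    // buffer the tensor (or its parent, for views) was allocated in.
    const ggml_tensor_extra_gpu * extra = (const ggml_tensor_extra_gpu *) t->extra;
    // Views now share their parent's extra, so the view's own displacement
    // (t->view_offs) must be added to the parent's offset; for non-view
    // tensors view_offs is 0 and this reduces to extra->offset.
    return extra->offset + t->view_offs;
}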
