OpenCL initialization crashes darktable when using AMD ROCm OpenCL runtime #14932

Closed
asdkant opened this issue Jul 23, 2023 · 15 comments
Labels
AMD OpenCL - Specific to AMD OpenCL hardware or driver
bug: upstream - The bug needs a fix outside of the scope of darktable, in an external lib or in a driver
scope: hardware support - Dealing with drivers and external devices: GPU, printers

Comments

@asdkant

asdkant commented Jul 23, 2023

Describe the bug

I get a message saying "PHI node has multiple entries for the same basic block with different incoming values" (see console output below)

Running darktable with --disable-opencl works, so this is OpenCL and not something else.

When this started happening I switched from the amdgpu-pro OCL driver to the rocm-opencl-runtime package as described in the Arch wiki, but the error persists. This suggests the issue is not related to a particular OpenCL implementation.

Issue #14900 looks similar, and the same user seems to have posted in the Arch forum. It was suggested there that the issue is related to multiple OpenCL implementations being shown in the output of clinfo --list, but I have only one, so it's not that.

The other suggested root cause is some incompatibility of LLVM versions (I suppose between what darktable expects and what the system has). My system has LLVM 15, but an LLVM 14 package is available; installing the LLVM 14 package does not resolve the issue.
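
For anyone who wants to cross-check the "multiple OpenCL implementations" theory without clinfo: on Linux the ICD loader typically discovers runtimes through .icd files in /etc/OpenCL/vendors (the same directory that comes up later in this thread). Below is a minimal Python sketch, not anything darktable ships. The helper name list_icd_files is made up for illustration, and the default path is an assumption that may differ on your distro.

```python
from pathlib import Path


def list_icd_files(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd file name: first line of the file} for every *.icd file.

    The first line of an ICD file normally names the runtime library.
    The default directory is an assumption; pass another one if needed.
    """
    result = {}
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return result
    for icd in sorted(directory.glob("*.icd")):
        lines = icd.read_text().splitlines()
        result[icd.name] = lines[0].strip() if lines else ""
    return result
```

Calling list_icd_files() and getting more than one entry back would mean more than one runtime is registered with the loader. Files renamed to something like amdocl64.icd.disabled are not matched by the *.icd pattern.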

Steps to reproduce

Open darktable (in a terminal so the error can be read)

Expected behavior

Darktable should open without issue

Logfile | Screenshot | Screencast

$ darktable -d opencl
Gtk-Message: 18:23:42.555: Failed to load module "appmenu-gtk-module"
     0,1665 [dt_get_sysresource_level] switched to 2 as `large'
     0,1665   total mem:       31995MB
     0,1665   mipmap cache:    3999MB
     0,1665   available mem:   21871MB
     0,1665   singlebuff:      499MB
     0,1665   OpenCL tune mem: WANTED
     0,1665   OpenCL pinned:   WANTED
[opencl_init] opencl related configuration options:
[opencl_init] opencl: ON
[opencl_init] opencl_scheduling_profile: 'default'
[opencl_init] opencl_library: 'default path'
[opencl_init] opencl_device_priority: '*/!0,*/*/*'
[opencl_init] opencl_mandatory_timeout: 200
[opencl_init] opencl library 'libOpenCL' found on your system and loaded
[opencl_init] found 1 platform
[opencl_init] found 1 device

[dt_opencl_device_init]
   DEVICE:                   0: 'gfx1032'
   PLATFORM NAME & VENDOR:   AMD Accelerated Parallel Processing, Advanced Micro Devices, Inc.
   CANONICAL NAME:           amdacceleratedparallelprocessinggfx1032
   DRIVER VERSION:           3570.0 (HSA1.1,LC)
   DEVICE VERSION:           OpenCL 2.0 
   DEVICE_TYPE:              GPU
   GLOBAL MEM SIZE:          8176 MB
   MAX MEM ALLOC:            6950 MB
   MAX IMAGE SIZE:           16384 x 16384
   MAX WORK GROUP SIZE:      256
   MAX WORK ITEM DIMENSIONS: 3
   MAX WORK ITEM SIZES:      [ 1024 1024 1024 ]
   ASYNC PIXELPIPE:          NO
   PINNED MEMORY TRANSFER:   WANTED
   MEMORY TUNING:            WANTED
   FORCED HEADROOM:          400
   AVOID ATOMICS:            NO
   MICRO NAP:                250
   ROUNDUP WIDTH:            16
   ROUNDUP HEIGHT:           16
   CHECK EVENT HANDLES:      128
   PERFORMANCE:              6.715
   TILING ADVANTAGE:         0.000
   DEFAULT DEVICE:           NO
   KERNEL BUILD DIRECTORY:   /usr/share/darktable/kernels
   KERNEL DIRECTORY:         /home/kant/.cache/darktable/cached_v1_kernels_for_AMDAcceleratedParallelProcessinggfx1032_35700HSA11LC
   CL COMPILER OPTION:       -cl-fast-relaxed-math
PHI node has multiple entries for the same basic block with different incoming values!
  %967 = phi float [ %largephi.extractslice0, %sw.default ], [ %largephi.extractslice055, %sw.bb667 ], [ %largephi.extractslice059, %sw.bb663 ], [ %largephi.extractslice063, %sw.bb659 ], [ %largephi.extractslice067, %sw.bb655 ], [ %largephi.extractslice071, %sw.bb646 ], [ %largephi.extractslice075, %_Z4fmodff.exit16 ], [ %largephi.extractslice079, %_Z4fmodff.exit13 ], [ %largephi.extractslice083, %_Z4fmodff.exit ], [ %largephi.extractslice087, %sw.bb562 ], [ %largephi.extractslice091, %sw.bb555 ], [ %largephi.extractslice095, %sw.bb533 ], [ %largephi.extractslice099, %if.then502 ], [ %largephi.extractslice0103, %if.else517 ], [ %largephi.extractslice0107, %if.then456 ], [ %largephi.extractslice0111, %if.else471 ], [ %largephi.extractslice0115, %if.then393 ], [ %largephi.extractslice0119, %if.else408 ], [ %largephi.extractslice0123, %if.then338 ], [ %largephi.extractslice0127, %if.else353 ], [ %largephi.extractslice0131, %if.then283 ], [ %largephi.extractslice0135, %if.else298 ], [ %largephi.extractslice0139, %if.then224 ], [ %largephi.extractslice0143, %if.else241 ], [ %largephi.extractslice0147, %sw.bb193 ], [ %largephi.extractslice0151, %sw.bb180 ], [ %largephi.extractslice0155, %sw.bb168 ], [ %largephi.extractslice0159, %sw.bb158 ], [ %largephi.extractslice0163, %sw.bb147 ], [ %largephi.extractslice0167, %if.then116 ], [ %largephi.extractslice0171, %if.else131 ], [ %largephi.extractslice0175, %sw.bb71 ], [ %largephi.extractslice0179, %sw.bb ], [ %largephi.extractslice0183, %if.end ], [ %largephi.extractslice0187, %if.end ], [ %largephi.extractslice0191, %if.end ], [ %largephi.extractslice0195, %if.end ], [ %largephi.extractslice0199, %if.end ]
label %if.end
  %largephi.extractslice0183 = extractelement <4 x float> %div, i64 0
  %largephi.extractslice0195 = extractelement <4 x float> %div, i64 0
in function blendop_Lab
LLVM ERROR: Broken function found, compilation aborted!
Aborted (core dumped)
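
A note on reading the verifier message above: in LLVM IR a phi node picks its value depending on which predecessor block control arrived from, so the same predecessor must not be listed more than once with different values. In the dump, %if.end is listed five times with five different values, which is what the verifier rejects. The Python sketch below only illustrates that check on an abbreviated excerpt of the phi line. The regex and the function name conflicting_predecessors are my own, not part of LLVM or darktable.

```python
import re
from collections import defaultdict

# One "[ %value, %block ]" entry of a textual LLVM phi instruction.
PHI_ENTRY = re.compile(r"\[\s*(%[\w.]+)\s*,\s*(%[\w.]+)\s*\]")


def conflicting_predecessors(phi_line):
    """Return {block: [values]} for blocks listed with more than one distinct value."""
    incoming = defaultdict(list)
    for value, block in PHI_ENTRY.findall(phi_line):
        if value not in incoming[block]:
            incoming[block].append(value)
    return {block: values for block, values in incoming.items() if len(values) > 1}


# Abbreviated excerpt of the phi line from the log (most entries left out).
phi = ("%967 = phi float [ %largephi.extractslice0, %sw.default ], "
       "[ %largephi.extractslice0183, %if.end ], "
       "[ %largephi.extractslice0187, %if.end ]")
print(conflicting_predecessors(phi))
# {'%if.end': ['%largephi.extractslice0183', '%largephi.extractslice0187']}
```

This only helps to spot which block is duplicated when reporting upstream; it does not say whether the kernel source or the compiler produced the broken IR.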

Commit

No response

Where did you install darktable from?

distro packaging

darktable version

4.4.2 (package version: 2:4.4.2-1 )

What OS are you using?

Linux

What is the version of your OS?

Arch Linux (EndeavourOS)

Describe your system?

Operating System: EndeavourOS
KDE Plasma Version: 5.27.6
KDE Frameworks Version: 5.108.0
Qt Version: 5.15.10
Kernel Version: 6.1.39-1-lts (64-bit)
Graphics Platform: X11
Processors: 16 × AMD Ryzen 7 1700 Eight-Core Processor
Memory: 31,2 GiB of RAM
Graphics Processor: AMD Radeon RX 6600 XT
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: AX370-Gaming K7

Are you using OpenCL GPU in darktable?

Yes

If yes, what is the GPU card and driver?

AMD RX 6600 XT with amdgpu (xf86-video-amdgpu package)

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

No response

@asdkant
Author

asdkant commented Jul 23, 2023

I just tried with the 4.5.0+121.g2baf8590d9-x86_64 AppImage snapshot and the issue persists.

I also tried deleting my ~/.config/darktable/ directory (well, renaming it so I can restore it later) and it does not seem to make a difference.

@asdkant
Author

asdkant commented Jul 24, 2023

I managed to get it to work using the opencl-legacy-amdgpu-pro AUR package which provides the amdgpu-pro driver.

I think it'd be a good idea to figure out what is wrong with the ROCm ICD that breaks darktable. If there's something else I can do to help debug this issue, please let me know.

@asdkant changed the title from "OpenCL initialization crashes darktable" to "OpenCL initialization crashes darktable when using AMD ROCm OpenCL runtime" on Jul 24, 2023
@asdkant
Author

asdkant commented Jul 24, 2023

I've reported the issue on the upstream repo for the ROCm OpenCL runtime, to make sure it has visibility there.

Also of note, I tried to run darktable using the rusticl runtime and I get funny behaviour (processing in the darkroom seems to happen, but it finishes with a completely black image), so at least as of today that's not a viable option.

@trougnouf
Contributor

trougnouf commented Jul 24, 2023

Same issue with an AMD RX 6700S and 7900 XTX (which I believe is not supported by the legacy driver).

I would like to try using rusticl, but darktable crashes right away if I have both runtimes installed. Changing the config file devices to !0,* does not help.
edit: sudo mv /etc/OpenCL/vendors/amdocl64.icd /etc/OpenCL/vendors/amdocl64.icd.disabled does the trick, but then darktable reports: [opencl_init] no devices found for Mesa/X.org (vendor) - rusticl (name)
edit: RUSTICL_ENABLE=radeonsi darktable -d opencl works, but color balance rgb also results in a black image.

@jenshannoschwalm
Collaborator

Please note this is the darktable issue tracker, not a place to sort out how to install buggy CL drivers and/or installations.

@ralfbrown added the "scope: hardware support" label on Jul 25, 2023
@jenshannoschwalm
Collaborator

Besides installation or driver issues: if you want to dig into debugging this and suspect a problem in dt colorbalancergb and the driver used, you might go the hard way and use the cl compiler output ...

If darktable otherwise runs fine, the problem might be some specific cl functions used in that specific cl kernel.

You might compile dt yourself, including kernels modified by you. Look into data/extended.cl; the function in question would be kernel void colorbalancerg

If you do so, some ideas:

  1. AMD drivers have been notorious for going crazy if one component of a float4 is NaN, or if a maths function gets a bad argument (a negative value for sqrt or such). Nvidia seems to control that much better.
  2. You might disable the compiler flags or define others to exclude optimizer bugs.
  3. We have a number of native functions here; those have also been the cause of bugs when drivers were not fully stable. You could also look at other issues that sound like that.

Unfortunately - at least for you - none of the dt devs runs AMD hardware, so a number of AMD bugs have been fixed just by suspecting something like 1-3 while reading code. Also i am not aware of anyone on dt github or the pixls forums who is running rusticl in a stable way. (And we have had "issues" like yours before; "issues" in quotes, as in not dt business.)

Anyway - i would appreciate if you dig into this - we would love to hear about any fixable bugs in that code, a short reading by me did not find anything suspicious except the native functions. Good luck :-)

@Libert-Sin

Libert-Sin commented Aug 5, 2023

I am the person who filed issue #14900. I have resolved the problem with help from the archlinux forum. I am not sure if this solution applies to all situations, but it was a suitable solution for my case. I hope this can be a reference for anyone who encounters the same problem.

https://bbs.archlinux.org/viewtopic.php?pid=2113677#p2113677

Expressing gratitude to Lone_Wolf.

@trougnouf
Contributor

> I am the person who filed issue #14900. I have resolved the problem with help from the archlinux forum. I am not sure if this solution applies to all situations, but it was a suitable solution for my case. I hope this can be a reference for anyone who encounters the same problem.
>
> https://bbs.archlinux.org/viewtopic.php?pid=2113677#p2113677
>
> Expressing gratitude to Lone_Wolf.

That doesn't fix the issue with ROCm OpenCL, it just disables it and uses RustiCL instead, but RustiCL isn't compatible with darktable either ( #14937 , https://gitlab.freedesktop.org/mesa/mesa/-/issues/7746 )

@jenshannoschwalm
Collaborator

OK, closing this as not a dt issue as long as there is no better evidence.

@jenshannoschwalm closed this as not planned on Aug 6, 2023
@trougnouf
Contributor

trougnouf commented Aug 6, 2023

It's still a pending issue whether it stems from darktable or not. Closing it makes it harder to find the relevant information.
There seems to be no way to use darktable with OpenCL on AMD Radeon 7xxx GPUs since they're not supported by the pre-ROCm driver.

edit: it works with the AUR opencl-amd package (detected as DRIVER VERSION: "3581.0 (HSA1.1,LC)", instead of 3570.0 in the official package), so this seems to be an issue with the official Arch package.
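
To compare which runtime darktable actually picked up, the DRIVER VERSION field can be pulled out of the `darktable -d opencl` output. A minimal Python sketch, assuming the log layout shown in the original report; the helper name driver_versions is hypothetical.

```python
import re

# Matches the "DRIVER VERSION:" field of a device block in the
# `darktable -d opencl` log, e.g. "   DRIVER VERSION:   3570.0 (HSA1.1,LC)".
DRIVER_VERSION = re.compile(r"^\s*DRIVER VERSION:\s*(\S+)", re.MULTILINE)


def driver_versions(log_text):
    """Return the driver version string of every device block in the log."""
    return DRIVER_VERSION.findall(log_text)


sample = """\
   DEVICE:                   0: 'gfx1032'
   DRIVER VERSION:           3570.0 (HSA1.1,LC)
   DEVICE VERSION:           OpenCL 2.0
"""
print(driver_versions(sample))  # ['3570.0']
```

In this thread 3570.0 is the version that crashed and 3581.0 the one that worked, but that is only what was observed here, not a general rule.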

@ralfbrown added the "bug: upstream" label on Aug 6, 2023
@jenshannoschwalm
Collaborator

@trougnouf i understand your frustration but maybe you can understand the closing too. This is a dt issue tracker and not an OpenCL installation forum :-) You can be pretty sure i look at all OpenCL related reports, and if a dt problem is suspected i will most likely track it down and probably fix it - i have done so many times over the last years.

OpenCL is a bad boy sometimes. OpenCL problems on Arch or Windows in combination with AMD drivers are notorious but not dt specific. I don't know whether that is because the installation process on both systems is so tricky or because the AMD maintainers have problems keeping the stuff in shape. Also i don't know why people install several drivers for the same hardware and use workarounds that are just not right. We just have to keep our working-on-dt time under control and concentrate on non-upstream issues.

So - if you have a proposal to solve this in a more generic and helpful way due to your experience, i would love to include that somewhere in the docs or we might add some hints in the issue template.

@asdkant
Author

asdkant commented Aug 31, 2023

> It's still a pending issue whether it stems from darktable or not. Closing it makes it harder to find the relevant information. There seems to be no way to use darktable with OpenCL on AMD Radeon 7xxx GPUs since they're not supported by the pre-ROCm driver.
>
> edit: it works with the aur opencl-amd package (detected as DRIVER VERSION: "3581.0 (HSA1.1,LC)", instead of 3570.0 in the official package) so that seems to be an issue with the Arch official package

Just commenting to note that earlier today I ran DT with opencl-legacy-amdgpu-pro installed and it didn't detect the OpenCL device, but installing opencl-amd solved the issue.

Also relevant for anyone facing this issue and reading this discussion: according to this, the 5.7 release of rocm-opencl-runtime should solve the issue.

@prurigro

prurigro commented Nov 3, 2023

The 5.7 release doesn't appear to fix the issue for me.

Edit: Similarly, the opencl-amd package does resolve it

@mzannoni

mzannoni commented Nov 5, 2023

I got the exact same issue with Manjaro+KDE Plasma.
As already noted above, I "solved" it by removing the rocm packages from the official repos and installing opencl-amd from AUR.
Or also by running darktable with the --disable-opencl option.

So, also for me ver. 5.7 of rocm-opencl-runtime doesn't work with darktable.

EDIT: I wasn't actually running version 5.7 of rocm-opencl-runtime, but still 5.6. After ver 5.7 has been made available I installed it and now it works with darktable.

@asdkant
Author

asdkant commented Nov 13, 2023

@prurigro , @mzannoni , please follow this up in the ROCm-Developer-Tools/clr repo

In my previous issue there you can see what commands you need to run to provide the devs with useful debug information.

@jenshannoschwalm added the "AMD OpenCL" label on Aug 3, 2024

7 participants