Metal support for macOS #20817

Open
zisoft wants to merge 18 commits into darktable-org:master from zisoft:macos-metal

@zisoft
Collaborator

@zisoft zisoft commented Apr 14, 2026

As you might know, Apple deprecated OpenCL on macOS several years ago (in macOS 10.14).
It is recommended to transition to Metal (further reading).

This is work I began almost 2 years ago, and after a long pause I finally got it to this working stage (with the help of Claude for the complicated part in pixelpipe_hb.c).

This PR implements the following on macOS:

  • set up the toolchain to compile the Metal kernels at build time; kernel sources are located in ./data/metal
  • compiled kernels are placed in <install_dir>/share/darktable/metal
  • new CLI switch -d metal for logging

At runtime we have the following processing logic:

For an iop module, check if the module has a process_metal() function and the corresponding metal kernel. If yes, use it. If not, try OpenCL, with fallback to CPU.
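
The dispatch order described above can be sketched as follows. This is only an illustration in Python, not the code in pixelpipe_hb.c: `Module`, `run_module` and `has_metal_kernel` are hypothetical names, and falling through to OpenCL when a Metal call fails is an assumption (the description only says OpenCL is tried when no Metal path exists).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Module:
    """Hypothetical stand-in for an iop module; each backend returns True on success."""
    name: str
    process_cpu: Callable[[], bool]
    process_cl: Optional[Callable[[], bool]] = None
    process_metal: Optional[Callable[[], bool]] = None
    has_metal_kernel: bool = False


def run_module(module: Module) -> str:
    """Run one module and return the name of the backend that processed it."""
    # 1. Metal, only if both the host function and the compiled kernel exist.
    if module.process_metal is not None and module.has_metal_kernel:
        if module.process_metal():
            return "Metal"
    # 2. OpenCL (assumption: also used when the Metal call reports failure).
    if module.process_cl is not None and module.process_cl():
        return "OpenCL"
    # 3. CPU is the final fallback and is always available.
    module.process_cpu()
    return "CPU"


ok = lambda: True
exposure = Module("exposure", process_cpu=ok, process_cl=ok,
                  process_metal=ok, has_metal_kernel=True)
agx = Module("agx", process_cpu=ok, process_cl=ok)
print(run_module(exposure), run_module(agx))
```

With only exposure ported, every other module keeps taking the OpenCL or CPU branch, which matches the step-by-step migration proposed here.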

I have started with what is probably the simplest kernel, the exposure module.
So with this PR merged we would get the basic things working; other modules could then be added step by step.

Since darktable is Linux first, we don't have to care about compatibility with old OpenCL versions (OpenCL on macOS is stuck at version 1.2).
In fact, once all kernels are transferred to metal we can stop using OpenCL on macOS at all.

This is still draft and I need some review here.
To avoid code duplication, two helper functions are created in src/develop/pixelpipe_hb.c: _pixelpipe_pre_process() and _pixelpipe_post_process().
@jenshannoschwalm: May I ask you to check if everything is correct here?
The macOS part is guarded by #if defined __APPLE__ so there should hopefully be no impact on Linux and Windows.

If this PR gets merged I would continue with the other iop modules.

@zisoft zisoft marked this pull request as draft April 14, 2026 16:21
@zisoft
Collaborator Author

zisoft commented Apr 14, 2026

Forgot to mention that this is for Apple Silicon Macs only.
The unified memory of these chips simplifies things drastically.

Intel Macs don't have an integrated GPU, so we would have to check for different external GPUs, which would make all this much more complicated.

The last GitHub runner image for macos-intel will be retired in fall 2026, so darktable on Intel will definitely be EOL then and no longer supported.

@MStraeten
Collaborator

In fact, once all kernels are transferred to metal we can stop using OpenCL on macOS at all.

Unfortunately not, since that's needed for old Intel Macs (even if they will only be supported with MacPorts-based builds once GitHub retires the Intel runners ...)

@MStraeten
Collaborator

btw. it compiles fine (after xcodebuild -downloadComponent MetalToolchain) and runs fine

@zisoft
Collaborator Author

zisoft commented Apr 15, 2026

after xcodebuild -downloadComponent MetalToolchain

Yes, with Xcode 26 Apple decided to no longer include the Metal toolchain; it needs to be downloaded separately.

@jenshannoschwalm
Collaborator

Had a first quick look at pixelpipe_hb changes, looks pretty safe.

  1. we just have to keep in mind, that another code path adds some complexity when checking issues...
  2. do you check for available memory for correct tiling dimensions?
  3. did you check for the pipe->shutdown modes?

@zisoft
Collaborator Author

zisoft commented Apr 15, 2026

  1. we just have to keep in mind, that another code path adds some complexity when checking issues...

Right, and the code in pixelpipe_hb.c is complex and hard to read due to lots of conditional #ifdef...

  2. do you check for available memory for correct tiling dimensions?

I hope so. All the memory handling is checked in src/osx/dt_metal.cc.

  3. did you check for the pipe->shutdown modes?

Again, I hope so.

I have almost no knowledge of the pixelpipe handling; that's why I needed the help of Claude for that part.
And I need your expertise here.

On Linux and Windows these changes should ideally have no effect.

So on macOS we would start just with the exposure module and give it a field test.

@zisoft zisoft marked this pull request as ready for review April 15, 2026 08:37
@masterpiga
Contributor

Forgive me for the ignorant question. Does this mean that eventually for each module there will have to be three distinct implementations? CPU, OpenCL and Metal?

@zisoft
Collaborator Author

zisoft commented Apr 15, 2026

That's the conclusion. Yes.

@masterpiga
Contributor

That's the conclusion. Yes.

Ouch! I am no expert in GPU coding, but I know a thing or two about software engineering and code maintenance, and this does not seem like a very sustainable approach, especially considering that OpenCL - IIUC - is not exactly the platform of the future.

I am sure you have already considered the alternatives, but what about deprecating OpenCL instead, and writing GPU code in GLSL/HLSL targeting Vulkan? On Windows and Linux, it would run natively on Vulkan. On macOS, it would be piped through MoltenVK and map Vulkan API calls to Metal API calls in real-time.

This would condense the GPU path into a single, modern, heavily supported graphics API. It seems a more "future-proof" approach than the one suggested here.

It would require rewriting all OpenCL code into Vulkan compute shaders, but this would be a one-off effort that can probably be by and large automated with LLMs.

@zisoft
Collaborator Author

zisoft commented Apr 15, 2026

I am no expert in GPU coding

Me neither :)

For macOS we are just at the beginning of this route, so it would be ok to stop here.
But that would mainly be a decision for the OpenCL experts...

@masterpiga
Contributor

masterpiga commented Apr 15, 2026

An even more radical (but superior) alternative would be switching to Halide. It is an open-source DSL specifically designed for high-performance image processing. It is used heavily by Adobe, Google, and Instagram, among others.

In Halide, you write the algorithm once. You then write a schedule (how to compute it - loop unrolling, threading, GPU utilization) separately. The Halide compiler takes a single algorithm definition and compiles it natively to x86 (CPU), ARM, OpenCL, CUDA, and Apple Metal.

You would literally write only one algorithm, and Halide would generate the optimized C++ and GPU kernel code for all targets. Only one code path, instead of 3 (or 2).

@zisoft
Collaborator Author

zisoft commented Apr 15, 2026

That all sounds good, but that would require a huge redesign of the whole pipeline.

The approach here is an addition to the already existing pipeline and can now be migrated step by step to the other modules. Everything keeps working as it is.

And converting an OpenCL kernel source to metal is also an easy task for an LLM.

@masterpiga
Contributor

masterpiga commented Apr 15, 2026

And converting an OpenCL kernel source to metal is also an easy task for an LLM.

Yes, absolutely. It's the maintenance and code bloat that scare me. And, since my understanding is that darktable should eventually move on from OpenCL, it would make sense to make the effort only once.

@jenshannoschwalm
Collaborator

Honestly, I don't think we can replace all OpenCL code with Metal variants in any manageable time. So I somewhat doubt this PR is a good plan. How shall we test and review the code? That would put a big burden on the few Mac ARM devs, without a chance for others to test at all. I am not opposing, but ...

About replacing all current code using Halide ... not with me working on that :-) Leaving aside the question of supporting legacy code. Simply a nightmare.

Currently I don't think that OpenCL is in bad shape ... being stuck at version 1.2 is not a big thing for now. There are currently just a few workarounds.

@masterpiga
Contributor

masterpiga commented Apr 15, 2026

If transitioning away from OpenCL is not something that will happen in the short/medium term, then another alternative would be using an open-source toolchain like clspv (OpenCL C to SPIR-V) combined with SPIRV-Cross (SPIR-V to Metal Shading Language).

Devs would still write and maintain the GPU and OpenCL kernel, so no change there. The build system would compile it to SPIR-V and then transpile SPIR-V to Metal.

Mac users would still get native Metal performance without darktable developers having to learn or maintain MSL.

@zisoft
Collaborator Author

zisoft commented Apr 15, 2026

I don't think we can replace all OpenCL code with Metal variants in any manageable time.

No need to, and that is my intention.
With this PR we would have the things working in pixelpipe_hb and the exposure module working with metal. All other modules will continue to work with OpenCL.

Then the OpenCL kernel of the next module can be converted to metal and the corresponding process_metal() function as well. Step by step, module by module...

That's how I imagine the progress.

@jenshannoschwalm
Collaborator

open-source toolchain like clspv

Yes and nice. BUT no standard, nothing we can be sure of.

@da-phil
Contributor

da-phil commented Apr 15, 2026

That's the conclusion. Yes.

Ouch! I am no expert in GPU coding, but I know a thing or two about software engineering and code maintenance, and this does not seem like a very sustainable approach, especially considering that OpenCL - IIUC - is not exactly the platform of the future.

I am sure you have already considered the alternatives, but what about deprecating OpenCL instead, and writing GPU code in GLSL/HLSL targeting Vulkan? On Windows and Linux, it would run natively on Vulkan. On macOS, it would be piped through MoltenVK and map Vulkan API calls to Metal API calls in real-time.

This would condense the GPU path into a single, modern, heavily supported graphics API. It seems a more "future-proof" approach than the one suggested here.

It would require rewriting all OpenCL code into Vulkan compute shaders, but this would be a one-off effort that can probably be by and large automated with LLMs.

Those are very very valid points, maintaining 2 different GPU code paths is insane for such a small group of core developers, given that we do not even have people in that group that are experienced with the new GPU code path (metal).

Instead of writing another GPU code path in a proprietary GPU framework for only one target platform (macOS), I'd rather propose to already start the migration to a well established GPU programming framework (GLSL/HLSL shaders targeting Vulkan) which is going to work for all our target platforms in the long run. We could keep this code path only for macOS for the time being and slowly retire our OpenCL implementations, step by step, also for the other platforms.
What do you think about that?

@TurboGit
Member

I also agree about the maintenance nightmare that this would introduce. At some point we may want to find a common framework that can handle a single code base and multiple targets: CPU, OpenCL, Metal, Vulkan...

I have also heard about Khronos (just know nothing about it).

@MStraeten
Collaborator

I don't think we can replace all OpenCL code with Metal variants in any manageable time.

It would help to use Metal variants just for the most performance-hungry modules first; that's where the effort gives the most return.

@zisoft
Collaborator Author

zisoft commented Apr 16, 2026

Ok, it was worth a try, but what happens here?
Another topical discussion gets out of hand as soon as AI comes into play.

The attempt to implement metal support in small work units is being talked down by counter-proposals that require a complete redesign of the pixelpipe processing.

Who will take on that task? A human developer? Or will the future of darktable lie in the hands of AI coding agents?

Before this discussion here drifts off course like the ones on pixls.us, we should probably better close this PR.

@jenshannoschwalm
Collaborator

I think the refactoring of pixelpipe code is worth the effort anyway because it makes that complicated code more clear and readable.

@jenshannoschwalm
Collaborator

Maybe do the refactoring first in a separate PR and decide on Metal integration based on that after full testing?

Honestly I don't know how performant the Mac OpenCL code is. So checking that on some critical code might be worthwhile.

There is one point we have to remember: if we mix Metal and OpenCL code, that would break passing the cl image/buffer from one module to the next (the output image is used as input by the next module). We would have to "convert". That would certainly cost some performance. So calling a Metal module would only be beneficial if the algorithm's performance gain is larger than the loss from conversions.
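
The break-even condition in the last paragraph can be written down as a small check. This is a sketch only: the function name is hypothetical, the timings below are made-up placeholders rather than measurements, and it assumes one conversion into the Metal module and one back out to the next OpenCL module.

```python
def metal_is_worthwhile(t_opencl: float, t_metal: float,
                        t_convert_in: float, t_convert_out: float) -> bool:
    """True if the kernel-time gain outweighs the buffer conversions (all in seconds)."""
    gain = t_opencl - t_metal            # time saved by the Metal kernel itself
    loss = t_convert_in + t_convert_out  # extra cost at the OpenCL/Metal boundaries
    return gain > loss


# Placeholder numbers: a heavy module can absorb the conversions, a light one cannot.
print(metal_is_worthwhile(2.000, 1.200, 0.010, 0.010))
print(metal_is_worthwhile(0.005, 0.004, 0.010, 0.010))
```

Under this model, a run of consecutive Metal modules would pay the conversion cost only once at each end, so porting neighbouring modules together would amortize it.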

@zisoft
Collaborator Author

zisoft commented Apr 16, 2026

Maybe do the refactoring first in a separate PR and decide on Metal integration based on that after full testing?

That sounds good, but that cannot be done by me. I have very little knowledge of the pixelpipe logic.

Honestly I don't know how performant the Mac OpenCL code is. So checking that on some critical code might be worthwhile.

The main reason for this PR is the fear that Apple will one day completely abandon OpenCL. That would kick out darktable on macOS.

There is one point we have to remember: if we mix Metal and OpenCL code, that would break passing the cl image/buffer from one module to the next (the output image is used as input by the next module). We would have to "convert". That would certainly cost some performance. So calling a Metal module would only be beneficial if the algorithm's performance gain is larger than the loss from conversions.

Fully agreed, that's why I asked (especially you) for help and support here.

@jenshannoschwalm
Collaborator

Ok there is no time pressure :-)

I will prepare the refactoring pr (possibly including the mask-cache thing) as a first step.

@MStraeten
Collaborator

Honestly I don't know how performant the Mac OpenCL code is. So checking that on some critical code might be worthwhile.

The macOS OpenCL 1.2 implementation isn't known for benchmark-level performance. But simply porting Geekbench numbers won't be helpful. We need a fairly performance-hungry module for a darktable benchmark.

@zisoft
Collaborator Author

zisoft commented Apr 18, 2026

To have a more performance hungry module to test, I have now converted the diffuse or sharpen module and made some comparisons with OpenCL.

First, I created two presets:

  1. Took the local contrast|normal preset and set iterations to 50. --> local contrast 50
  2. Took the artistic effects|bloom preset and set iterations to 20. --> bloom 20

Both give unusable results, of course; they are just for measuring performance.

darktable -d perf

local contrast 50, OpenCL

    67.4616 [dev_pixelpipe] took 0.011 secs (0.006 CPU) [preview] processed `channelmixerrgb' on GPU
    69.6929 [dev_pixelpipe] took 2.231 secs (0.073 CPU) [preview] processed `diffuse' on GPU
    69.6975 [dev_pixelpipe] took 0.005 secs (0.000 CPU) [preview] processed `agx' on GPU
    69.6980 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0.000 secs (0.000 GPU) [colorout]
    69.7074 [dev_pixelpipe] took 0.010 secs (0.001 CPU) [preview] processed `colorout' on GPU
    69.7127 [dev_pixelpipe] took 0.005 secs (0.007 CPU) [preview] processed `gamma' on CPU
    69.7188 [dt_ioppr_transform_image_colorspace_rgb] `system display profile' -> `sRGB' took 0.006 secs (0.039 CPU) [final histogram]
    69.7287 [histogram] took 0.016 secs (0.142 CPU) final split
    69.7300 [dev_process_image] pixel pipeline took 2.324 secs (0.250 CPU) processing `20231103_143215_Tintling_0007.CR2'
    69.7316 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [full]
    69.7392 [dev_pixelpipe] took 0.008 secs (0.000 CPU) [full] processed `rawprepare' on GPU
    69.7435 [dev_pixelpipe] took 0.004 secs (0.004 CPU) [full] processed `temperature' on GPU
    69.7454 [histogram] took 0.005 secs (0.005 CPU) scope draw
    69.7482 [dev_pixelpipe] took 0.005 secs (0.005 CPU) [full] processed `highlights' on GPU
    69.7489 [resample_cl] plan 0.000 secs (0.000 CPU) resample 0.000 secs (0.000 CPU)
    69.8253 [dev_pixelpipe] took 0.077 secs (0.009 CPU) [full] processed `demosaic' on GPU
    69.8306 [dev_pixelpipe] took 0.005 secs (0.000 CPU) [full] processed `exposure' on GPU
    69.8330 [dev_pixelpipe] took 0.002 secs (0.000 CPU) [full] processed `colorin' on GPU
    69.8335 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.000 secs (0.000 GPU) [channelmixerrgb]
    69.8367 [dev_pixelpipe] took 0.004 secs (0.000 CPU) [full] processed `channelmixerrgb' on GPU
    70.9821 [dev_pixelpipe] took 1.145 secs (0.041 CPU) [full] processed `diffuse' on GPU
    70.9847 [dev_pixelpipe] took 0.002 secs (0.000 CPU) [full] processed `agx' on GPU

local contrast 50, Metal

    17.5061 process (Metal)           CPU [full]           diffuse                3500       (0/0)  1953x1303 sc=0.290; IOP_CS_RGB 81MB
    19.1715 [pixelpipe] `diffuse' processed with Metal
    19.1718 [dev_pixelpipe] took 1.667 secs (0.014 CPU) [full] processed `diffuse' on GPU
    19.1776 [dev_pixelpipe] took 0.006 secs (0.000 CPU) [full] processed `agx' on GPU
    19.1781 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0.000 secs (0.000 GPU) [colorout]
    19.1830 [dev_pixelpipe] took 0.005 secs (0.001 CPU) [full] processed `colorout' on GPU
    19.1863 [dev_pixelpipe] took 0.003 secs (0.004 CPU) [full] processed `gamma' on CPU
    19.1865 [dev_process_image] pixel pipeline took 1.709 secs (0.043 CPU) processing `20231103_143215_Tintling_0007.CR2'
    19.1974 [dev_pixelpipe] took 0.007 secs (0.004 CPU) [preview] processed `colorin' on GPU
    19.1979 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.000 secs (0.000 GPU) [channelmixerrgb]
    19.2040 [dev_pixelpipe] took 0.007 secs (0.001 CPU) [preview] processed `channelmixerrgb' on GPU
    19.2067 process (Metal)           CPU [preview]        diffuse                3500       (0/0)  2671x1783 sc=1.000; IOP_CS_RGB 152MB
    22.3744 [pixelpipe] `diffuse' processed with Metal
    22.3751 [dev_pixelpipe] took 3.171 secs (0.061 CPU) [preview] processed `diffuse' on GPU
    22.3876 [dev_pixelpipe] took 0.012 secs (0.000 CPU) [preview] processed `agx' on GPU

bloom 20, OpenCL

    13.7611 [dev_pixelpipe] took 0.003 secs (0.000 CPU) [full] processed `channelmixerrgb' on GPU
    13.9796 [dev_pixelpipe] took 0.218 secs (0.013 CPU) [full] processed `diffuse' on GPU
    13.9822 [dev_pixelpipe] took 0.003 secs (0.000 CPU) [full] processed `agx' on GPU
    13.9827 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0.000 secs (0.000 GPU) [colorout]
    13.9871 [dev_pixelpipe] took 0.005 secs (0.001 CPU) [full] processed `colorout' on GPU
    13.9896 [dev_pixelpipe] took 0.002 secs (0.004 CPU) [full] processed `gamma' on CPU
    13.9898 [dev_process_image] pixel pipeline took 0.359 secs (0.051 CPU) processing `20231103_143215_Tintling_0007.CR2'
    13.9958 [dev_pixelpipe] took 0.000 secs (0.000 CPU) initing base buffer [preview]
    13.9989 [dev_pixelpipe] took 0.003 secs (0.001 CPU) [preview] processed `rawprepare' on GPU
    13.9998 [dev_pixelpipe] took 0.001 secs (0.000 CPU) [preview] processed `temperature' on GPU
    14.0009 [dev_pixelpipe] took 0.001 secs (0.000 CPU) [preview] processed `highlights' on GPU
    14.0109 [dev_pixelpipe] took 0.010 secs (0.001 CPU) [preview] processed `demosaic' on GPU
    14.0145 [dev_pixelpipe] took 0.004 secs (0.001 CPU) [preview] processed `exposure' on GPU
    14.0177 [dev_pixelpipe] took 0.003 secs (0.000 CPU) [preview] processed `colorin' on GPU
    14.0182 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.000 secs (0.000 GPU) [channelmixerrgb]
    14.0239 [dev_pixelpipe] took 0.006 secs (0.000 CPU) [preview] processed `channelmixerrgb' on GPU
    14.4771 [dev_pixelpipe] took 0.453 secs (0.012 CPU) [preview] processed `diffuse' on GPU
    14.4839 [dev_pixelpipe] took 0.007 secs (0.000 CPU) [preview] processed `agx' on GPU

bloom 20, Metal

    16.2203 [dev_pixelpipe] took 0.004 secs (0.001 CPU) [full] processed `channelmixerrgb' on GPU
    16.2243 process (Metal)           CPU [full]           diffuse                3500       (0/0)  1953x1303 sc=0.290; IOP_CS_RGB 81MB
    16.5174 [pixelpipe] `diffuse' processed with Metal
    16.5174 [dev_pixelpipe] took 0.297 secs (0.012 CPU) [full] processed `diffuse' on GPU
    16.5234 [dev_pixelpipe] took 0.006 secs (0.000 CPU) [full] processed `agx' on GPU
    16.5240 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0.000 secs (0.000 GPU) [colorout]
    16.5280 [dev_pixelpipe] took 0.005 secs (0.001 CPU) [full] processed `colorout' on GPU
    16.5306 [dev_pixelpipe] took 0.003 secs (0.005 CPU) [full] processed `gamma' on CPU
    16.5308 [dev_process_image] pixel pipeline took 0.336 secs (0.043 CPU) processing `20231103_143215_Tintling_0007.CR2'
    16.5384 [dev_pixelpipe] took 0.007 secs (0.002 CPU) [preview] processed `colorin' on GPU
    16.5389 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.000 secs (0.000 GPU) [channelmixerrgb]
    16.5447 [dev_pixelpipe] took 0.006 secs (0.003 CPU) [preview] processed `channelmixerrgb' on GPU
    16.5486 process (Metal)           CPU [preview]        diffuse                3500       (0/0)  2671x1783 sc=1.000; IOP_CS_RGB 152MB
    17.1405 [pixelpipe] `diffuse' processed with Metal
    17.1406 [dev_pixelpipe] took 0.596 secs (0.007 CPU) [preview] processed `diffuse' on GPU
    17.1515 [dev_pixelpipe] took 0.011 secs (0.000 CPU) [preview] processed `agx' on GPU

@MStraeten: Can you please test on your system?

@MStraeten
Collaborator

MStraeten commented Apr 18, 2026

local contrast|normal preset and set iterations to 50:
perf_metal.txt
perf_opencl.txt

OpenCL speed is considerably better on my system

[opencl_init] found 1 device

   DEVICE:                   0: 'Apple M1 Max'
   CONF KEY:                 cldevice_v6_appleapplem1max
   PLATFORM, VENDOR & ID:    Apple, Apple, ID=16940800
   CANONICAL NAME:           appleapplem1max
   DRIVER VERSION:           1.2 1.0
   DEVICE VERSION:           OpenCL 1.2  API=120
   DEVICE_TYPE:              GPU, unified mem
   GLOBAL MEM SIZE:          25559 MB
   MAX MEM ALLOC:            4792 MB
   MAX IMAGE SIZE:           16384 x 16384
   MAX CONSTANT BUFFER:      1048576 KB
   LOCAL MEM SIZE:           32 KB
   ADDRESS ALIGN:            4096 B
   COMPUTE UNITS:            32
   MAX WORK GROUP SIZE:      256 (32)
   MAX WORK ITEM DIMENSIONS: 3 [ 256 256 256 ]
   ASYNC PIXELPIPE:          NO
   PINNED MEMORY TRANSFER:   NO
   SUPPORTED ATOMICS:        INT32
   EVENTS HANDLED:           YES
   TILING ADVANTAGE:         0.000
   DEFAULT DEVICE:           NO
   KERNEL BUILD DIRECTORY:   /Users/martinstraeten/src/darktable-ms/build/share/darktable/kernels
   KERNEL DIRECTORY:         /Users/martinstraeten/.cache/darktable/cached_v6_kernels_for_AppleAppleM1Max_1210
   CL COMPILER COMMAND:      -w -cl-fast-relaxed-math -DAPPLE=1 -I/Users/martinstraeten/src/darktable-ms/build/share/darktable/kernels
   KERNEL LOADING TIME:      0.0318 sec
[opencl_init] OpenCL successfully initialized. internal numbers and names of available devices:
[opencl_init]		0	'Apple Apple M1 Max'
[opencl_init] FINALLY: opencl PREFERENCE=YES is AVAILABLE and ENABLED
[opencl_init] opencl_scheduling_profile: 'default'
[opencl_init] opencl_device_priority: '*/!0,*/*/*/!0,*'
[opencl_init] opencl_mandatory_timeout: 1000
[opencl_update_priorities] these are your device priorities:
[opencl_update_priorities] 		image	preview	export	thumbs	preview2
[opencl_update_priorities]		0	-1	0	0	-1
[opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_update_priorities] 		image	preview	export	thumbs	preview2
[opencl_update_priorities]		NO	NO	NO	NO	NO
[opencl_synchronization_timeout] synchronization timeout set to 200
   UNIFIED MEM SIZE:         8192 MB reserved for 'appleapplem1max' id=0
[opencl_update_priorities] these are your device priorities:
[opencl_update_priorities] 		image	preview	export	thumbs	preview2
[opencl_update_priorities]		0	-1	0	0	-1
[opencl_update_priorities] show if opencl use is mandatory for a given pixelpipe:
[opencl_update_priorities] 		image	preview	export	thumbs	preview2
[opencl_update_priorities]		NO	NO	NO	NO	NO

@zisoft
Collaborator Author

zisoft commented Apr 18, 2026

Can you please check again? Latest commit gives ~50% improvement on my system.
local contrast 50:

    14.1933 process (Metal)           CPU [full]           diffuse                3500       (0/0)  1953x1303 sc=0.290; IOP_CS_RGB 81MB
    15.0213 [pixelpipe] `diffuse' processed with Metal
    15.0213 [dev_pixelpipe] took 0.830 secs (0.032 CPU) [full] processed `diffuse' on GPU
    15.0277 [dev_pixelpipe] took 0.006 secs (0.000 CPU) [full] processed `agx' on GPU
    15.0282 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0.000 secs (0.000 GPU) [colorout]
    15.0335 [dev_pixelpipe] took 0.006 secs (0.001 CPU) [full] processed `colorout' on GPU
    15.0359 [dev_pixelpipe] took 0.002 secs (0.004 CPU) [full] processed `gamma' on CPU
    15.0360 [dev_process_image] pixel pipeline took 0.865 secs (0.055 CPU) processing `20231103_143215_Tintling_0007.CR2'
    15.0463 [dev_pixelpipe] took 0.006 secs (0.001 CPU) [preview] processed `colorin' on GPU
    15.0468 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0.000 secs (0.000 GPU) [channelmixerrgb]
    15.0530 [dev_pixelpipe] took 0.007 secs (0.003 CPU) [preview] processed `channelmixerrgb' on GPU
    15.0555 process (Metal)           CPU [preview]        diffuse                3500       (0/0)  2671x1783 sc=1.000; IOP_CS_RGB 152MB
    16.5593 [pixelpipe] `diffuse' processed with Metal
    16.5599 [dev_pixelpipe] took 1.507 secs (0.032 CPU) [preview] processed `diffuse' on GPU
    16.5720 [dev_pixelpipe] took 0.012 secs (0.000 CPU) [preview] processed `agx' on GPU

@MStraeten
Collaborator

perf_metal.txt

     0,2308 [dt_metal_init] Initializing Metal devices
     0,2308 [dt_metal_init] metallib path: /Users/martinstraeten/src/darktable-ms/build/share/darktable/metal/darktable.metallib
     0,2309 [dt_metal_init] Device: Apple M1 Max
     0,2309 [dt_metal_create_library] Create Metal library for device: Apple M1 Max
     0,2309 [dt_metal_create_library] Library created
     0,2310 [dt_metal_create_library] Function: exposure
     0,2310 [dt_metal_create_library] Function: inpaint_mask
     0,2310 [dt_metal_create_library] Function: diffuse_pde
     0,2310 [dt_metal_create_library] Function: blur_2D_Bspline_vertical
     0,2310 [dt_metal_create_library] Function: build_mask
     0,2310 [dt_metal_create_library] Function: wavelets_detail_level
     0,2310 [dt_metal_create_library] Function: blur_2D_Bspline_horizontal
     0,6636 [dt_metal_create_kernel] Created kernel 'exposure' with id=0
     0,7691 [dt_metal_create_kernel] Created kernel 'blur_2D_Bspline_vertical' with id=1
     0,7691 [dt_metal_create_kernel] Created kernel 'blur_2D_Bspline_horizontal' with id=2
     0,7692 [dt_metal_create_kernel] Created kernel 'wavelets_detail_level' with id=3
     0,7692 [dt_metal_create_kernel] Created kernel 'build_mask' with id=4
     0,7692 [dt_metal_create_kernel] Created kernel 'inpaint_mask' with id=5
     0,7693 [dt_metal_create_kernel] Created kernel 'diffuse_pde' with id=6
     1,9077 [dt_dev_load_raw] loading the image. took 0,080 secs (0,267 CPU)
     1,9382 [export] creating pixelpipe took 0,027 secs (0,029 CPU)
     1,9390 [dev_pixelpipe] took 0,000 secs (0,000 CPU) initing base buffer [export]
     1,9500 [dev_pixelpipe] took 0,011 secs (0,001 CPU) [export] processed `rawprepare' on GPU
     1,9565 [dev_pixelpipe] took 0,006 secs (0,000 CPU) [export] processed `temperature' on GPU
     1,9706 [dev_pixelpipe] took 0,014 secs (0,001 CPU) [export] processed `highlights' on GPU
     2,0394 [dev_pixelpipe] took 0,069 secs (0,029 CPU) [export] processed `demosaic' on GPU
     2,0510 [dev_pixelpipe] took 0,012 secs (0,000 CPU) [export] processed `flip' on GPU
     2,1623 [dev_pixelpipe] took 0,111 secs (0,016 CPU) [export] processed `retouch' on GPU
     2,1729 process (Metal)           CPU [export]         exposure               2500   (396/895)  2982x3834 sc=1,000; IOP_CS_RGB 366MB
     2,2356 [pixelpipe] `exposure' processed with Metal
     2,2357 [dev_pixelpipe] took 0,073 secs (0,019 CPU) [export] processed `exposure' on GPU
     2,5053 [dev_pixelpipe] took 0,270 secs (1,705 CPU) [export] processed `toneequal' on CPU, blended on CPU
     2,5249 [dev_pixelpipe] took 0,020 secs (0,000 CPU) [export] processed `crop' on GPU
     2,5353 [dev_pixelpipe] took 0,010 secs (0,001 CPU) [export] processed `colorin' on GPU
     2,5363 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_LAB-->IOP_CS_RGB took 0,001 secs (0,000 GPU) [channelmixerrgb]
     2,5537 [dev_pixelpipe] took 0,018 secs (0,001 CPU) [export] processed `channelmixerrgb' on GPU
     2,5595 process (Metal)           CPU [export]         diffuse                3600       (0/0)  2982x3834 sc=1,000; IOP_CS_RGB 366MB
    18,7551 [pixelpipe] `diffuse' processed with Metal
    18,7569 [dev_pixelpipe] took 16,203 secs (0,057 CPU) [export] processed `diffuse' on GPU
    18,7877 [dev_pixelpipe] took 0,031 secs (0,000 CPU) [export] processed `primaries' on GPU
    19,0474 [dev_pixelpipe] took 0,260 secs (0,005 CPU) [export] processed `colorequal' on GPU
    19,0690 [dev_pixelpipe] took 0,022 secs (0,001 CPU) [export] processed `colorbalancergb' on GPU
    19,1829 [dev_pixelpipe] took 0,114 secs (0,033 CPU) [export] processed `colorbalancergb.1' on GPU, blended on GPU
    19,2019 [dev_pixelpipe] took 0,019 secs (0,000 CPU) [export] processed `agx' on GPU
    19,2023 [resample_cl] took 0,000 secs (0,000 CPU) 1:1 copy/crop of 2982x3834 pixels
    19,2109 [dev_pixelpipe] took 0,009 secs (0,000 CPU) [export] processed `finalscale' on GPU
    19,2118 [dt_ioppr_transform_image_colorspace_cl] IOP_CS_RGB-->IOP_CS_LAB took 0,001 secs (0,000 GPU) [colorout]
    19,2295 [dev_pixelpipe] took 0,019 secs (0,001 CPU) [export] processed `colorout' on GPU
    19,2339 [dev_process_export] pixel pipeline processing took 17,296 secs (1,871 CPU)

@zisoft
Collaborator Author

zisoft commented Apr 18, 2026

So no change for your system? Did you rebuild with the latest commit?

Exporting the image with opencl I get:

    74.0585 [dev_pixelpipe] took 14.602 secs (0.289 CPU) [export] processed `diffuse' on GPU

with metal:

    17.6731 process (Metal)           CPU [export]         diffuse                3500       (0/0)  6744x4500 sc=1.000; IOP_CS_RGB 971MB
    27.3086 [pixelpipe] `diffuse' processed with Metal
    27.3112 [dev_pixelpipe] took 9.647 secs (0.283 CPU) [export] processed `diffuse' on GPU
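
To compare runs like these without reading the logs by eye, the `[dev_pixelpipe]` lines from `darktable -d perf` can be parsed. A minimal sketch, assuming the line layout seen in the excerpts in this thread (including the decimal-comma variant from one system); `parse_perf_line` and `Timing` are hypothetical helper names, not part of darktable. Note that Metal runs are also reported as "on GPU", so the backend has to be known from context.

```python
import re
from typing import NamedTuple, Optional

# Matches e.g.: [dev_pixelpipe] took 9.647 secs (0.283 CPU) [export] processed `diffuse' on GPU
PERF_LINE = re.compile(
    r"\[dev_pixelpipe\] took ([\d.,]+) secs \(([\d.,]+) CPU\) "
    r"\[(\w+)\] processed `([^']+)' on (\w+)"
)


class Timing(NamedTuple):
    pipe: str
    module: str
    wall_secs: float
    cpu_secs: float
    device: str


def parse_perf_line(line: str) -> Optional[Timing]:
    """Extract one module timing, or None if the line is not a module timing."""
    m = PERF_LINE.search(line)
    if m is None:
        return None
    wall, cpu, pipe, module, device = m.groups()
    # Some locales print a decimal comma ("16,203"); normalise before converting.
    to_float = lambda s: float(s.replace(",", "."))
    return Timing(pipe, module, to_float(wall), to_float(cpu), device)


# The two export lines quoted above (OpenCL first, then Metal).
opencl = parse_perf_line(
    "    74.0585 [dev_pixelpipe] took 14.602 secs (0.289 CPU) [export] processed `diffuse' on GPU")
metal = parse_perf_line(
    "    27.3112 [dev_pixelpipe] took 9.647 secs (0.283 CPU) [export] processed `diffuse' on GPU")
print(f"OpenCL/Metal wall-time ratio: {opencl.wall_secs / metal.wall_secs:.2f}")
```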

@MStraeten
Collaborator

after git reset --hard and pulling again - same results.
not sure if there are limits which are exceeded on M1max - will ask perplexity, if it has an idea what might have impact

@jenshannoschwalm
Collaborator

I had an in-depth look at pixelpipe_hb.c, and it looks OK and safe with respect to the existing code, so no breakage is expected. Any refactoring seems pretty difficult at the moment, as the pre and post code basically works with CPU-bound data in RAM, so the OpenCL code is mostly different.

There seem to be some relevant design restrictions.

  1. There is no tiling support (I am sure you are aware of that :-)
  2. More importantly, there are no checks on available memory. From what I know about Apple ARM hardware, memory is unified, so probably whatever is available overall might be used for Metal (as with this PR we possibly copy clmem data back to main RAM). So we could prohibit the Metal code whenever tiling would be required or the Metal code would likely swap memory, if that check is implemented correctly. (Could that explain Martin's results?)
  3. I am not sure about your performance measurements when comparing OpenCL vs Metal.
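
To make the second point concrete, here is a minimal sketch of such a check. Everything in it is hypothetical: `ModuleCaps`, `required_bytes` and `pick_backend` are illustration names, not darktable functions, and the memory estimate assumes 4 float channels per pixel and only one input plus one output buffer. Under those assumptions the estimate comes out at about 366 MB for 2982x3834 and 971 MB for 6744x4500, which agrees with the figures printed in the `process (Metal)` log lines in this thread if those are decimal megabytes. A module like diffuse needs additional temporary buffers, so treat the number as a lower bound.

```python
from dataclasses import dataclass


@dataclass
class ModuleCaps:
    """Which code paths a module offers (hypothetical illustration type)."""
    has_metal: bool
    has_opencl: bool


def required_bytes(width, height, channels=4, bytes_per_channel=4, buffers=2):
    """Lower-bound estimate: `buffers` full-size float images (input + output)."""
    return width * height * channels * bytes_per_channel * buffers


def pick_backend(caps, width, height, available_bytes, metal_supports_tiling=False):
    """Prefer Metal, then OpenCL, then CPU.

    Metal is skipped when the buffers do not fit into the memory budget and the
    Metal path cannot tile, which is the restriction suggested above.
    """
    fits = required_bytes(width, height) <= available_bytes
    if caps.has_metal and (fits or metal_supports_tiling):
        return "metal"
    if caps.has_opencl:
        return "opencl"
    return "cpu"


caps = ModuleCaps(has_metal=True, has_opencl=True)
budget = 512 * 1000 * 1000  # assumed memory budget of 512 MB, for illustration only
print(pick_backend(caps, 2982, 3834, budget))  # buffers fit into the budget
print(pick_backend(caps, 6744, 4500, budget))  # buffers exceed the budget
```

How the real budget should be determined on unified-memory hardware is exactly the open question here; the sketch only shows where such a check would sit in the fallback order.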

@zisoft
Collaborator Author

zisoft commented Apr 18, 2026

  1. There is no tiling code supported (i am sure you are aware of that :-)

I am sure there is a lot of work still to be done :-)

  2. ... So we could prohibit metal code if tiling would be required

In the end that would mean falling back to the CPU, if we assume that OpenCL will no longer be available on Macs some day.

  3. not sure about your perf measuring here when comparing opencl vs metal.

I took an image, reset the history, and applied only the diffuse module (local contrast, normal) with 50 iterations.
Then I

  1. exported the image with current master (opencl, no metal code available).
  2. exported the image with this PR applied to use metal.

I am quite unsure whether we should really continue with this PR. Everything already said above about maintaining three processing routes will become a nightmare. What if someone changes an algorithm in one of the modules? We would have to remember to maintain not only the CPU and OpenCL code, but the Metal code as well.

@jenshannoschwalm
Collaborator

Plus colorspace handling.

@zisoft
Collaborator Author

zisoft commented Apr 19, 2026

@MStraeten : Can you please try again?

Now with tiling.

Just to get things working and see if all that stuff is worth the effort.

After that we can decide on how to proceed.

@MStraeten
Collaborator

MStraeten commented Apr 19, 2026

Similar numbers, unfortunately: log.txt
Since I don't have a clue about the GPU implementation, here is the reasoning of Perplexity / Claude Sonnet:
reasoning on root causes.txt

btw: the suggestion 'change to threadW=32, threadH=8' makes it even worse

@jenshannoschwalm
Collaborator

I don't know Metal, but with OpenCL the workgroup dimensions have a big impact, even on kernels that use no local memory.

Aligning the workgroup so that it covers more data horizontally is a common trick for locality.
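
As a rough illustration of what changing the workgroup shape does to the dispatch, the sketch below computes how many groups are needed to cover the 2982x3834 export from the log above and what fraction of the launched threads falls outside the image. The helper names `ceil_div` and `dispatch_stats` and the candidate shapes are assumptions for illustration. The sketch only captures padding overhead; it says nothing about memory locality or occupancy, which is where the real differences come from and which can only be found by measuring.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def dispatch_stats(width, height, thread_w, thread_h):
    """Groups needed to cover a width x height image, plus padding overhead."""
    groups_x = ceil_div(width, thread_w)
    groups_y = ceil_div(height, thread_h)
    launched = groups_x * thread_w * groups_y * thread_h
    useful = width * height
    return {
        "groups": (groups_x, groups_y),
        "threads_per_group": thread_w * thread_h,
        "idle_fraction": 1.0 - useful / launched,
    }


# Same number of threads per group, different aspect ratios.
for thread_w, thread_h in [(16, 16), (32, 8), (64, 4)]:
    stats = dispatch_stats(2982, 3834, thread_w, thread_h)
    print(f"{thread_w}x{thread_h}: groups={stats['groups']}, "
          f"idle threads={stats['idle_fraction']:.2%}")
```

All three shapes keep 256 threads per group, so any timing difference between them would come from the access pattern rather than from the amount of work launched.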

@haecksenwerk

@zisoft @jenshannoschwalm @MStraeten I understand that adding metal support to the pipeline would introduce a significant amount of maintenance work. Still, I wonder whether the effort could be reduced by focusing primarily on the modules that appear in the 'quick access panel', and on the 'color equalizer' in particular, which has a major performance impact. When it is enabled, the exposure slider lags by up to three seconds on an M2, whereas with the equalizer disabled, the slider responds almost instantly. Other modules, such as color balance, tone equalizer or local contrast, don’t cause nearly the same slowdown.

@jenshannoschwalm
Collaborator

and on the 'color equalizer' in particular, which has a major performance impact. When it is enabled, the exposure slider lags by up to three seconds on an M2...

@zisoft @MStraeten can you confirm the bad performance?

@haecksenwerk i am not sure about what exactly you observe. We need at least a log with -d pipe -d opencl.

@zisoft
Collaborator Author

zisoft commented Apr 30, 2026

@haecksenwerk : please open a new issue for that

@haecksenwerk

@haecksenwerk : please open a new issue for that

Never mind. I probably jumped the gun with my comment, but I was excited to see this MR and the possibility of some improvements coming to the macOS version. Given that you've mentioned potentially prioritizing modules that could be tackled first I just cranked out my view as a darktable user.

@MStraeten
Collaborator

When it is enabled, the exposure slider lags by up to three seconds on an M2, whereas with the equalizer disabled, the slider responds almost instantly.

Since exposure is quite early in the pipe, changing it means the whole pipeline up to color equalizer needs to be reprocessed, so this is not really surprising. But without an XMP it's hard to check the root cause.

@MStraeten
Collaborator

and on the 'color equalizer' in particular, which has a major performance impact. When it is enabled, the exposure slider lags by up to three seconds on an M2...

@zisoft @MStraeten can you confirm the bad performance?

Here are the logs. You might try to find out which changes were made with color equalizer on/off. In my opinion, background processes on the system have more impact ;)

    36.1736 [dev_pixelpipe] took 0.002 secs (0.004 CPU) [preview] processed `exposure' on CPU
    36.2083 [dev_pixelpipe] took 0.035 secs (0.232 CPU) [full] processed `exposure' on GPU
    37.6751 [dev_pixelpipe] took 0.001 secs (0.005 CPU) [preview] processed `exposure' on CPU
    37.7078 [dev_pixelpipe] took 0.033 secs (0.234 CPU) [full] processed `exposure' on GPU
    39.0784 [dev_pixelpipe] took 0.001 secs (0.007 CPU) [preview] processed `exposure' on CPU
    39.1109 [dev_pixelpipe] took 0.033 secs (0.223 CPU) [full] processed `exposure' on GPU
    40.8723 [dev_pixelpipe] took 0.001 secs (0.009 CPU) [preview] processed `exposure' on CPU
    40.9153 [dev_pixelpipe] took 0.044 secs (0.267 CPU) [full] processed `exposure' on GPU
    42.4240 [dev_pixelpipe] took 0.001 secs (0.004 CPU) [preview] processed `exposure' on CPU
    42.4577 [dev_pixelpipe] took 0.036 secs (0.244 CPU) [full] processed `exposure' on GPU
    48.5513 [dev_pixelpipe] took 0.029 secs (0.114 CPU) [full] processed `exposure' on GPU
    57.2903 [dev_pixelpipe] took 0.001 secs (0.004 CPU) [preview] processed `exposure' on CPU
    57.3244 [dev_pixelpipe] took 0.035 secs (0.231 CPU) [full] processed `exposure' on GPU
    58.5631 [dev_pixelpipe] took 0.001 secs (0.004 CPU) [preview] processed `exposure' on CPU
    58.5976 [dev_pixelpipe] took 0.036 secs (0.215 CPU) [full] processed `exposure' on GPU
    59.9065 [dev_pixelpipe] took 0.001 secs (0.005 CPU) [preview] processed `exposure' on CPU
    59.9402 [dev_pixelpipe] took 0.034 secs (0.224 CPU) [full] processed `exposure' on GPU
    61.9078 [dev_pixelpipe] took 0.001 secs (0.010 CPU) [preview] processed `exposure' on CPU
    61.9423 [dev_pixelpipe] took 0.036 secs (0.245 CPU) [full] processed `exposure' on GPU
    64.2438 [dev_pixelpipe] took 0.001 secs (0.004 CPU) [preview] processed `exposure' on CPU
    64.2792 [dev_pixelpipe] took 0.036 secs (0.221 CPU) [full] processed `exposure' on GPU

i don't see issues with exposure - overall processing time just differs based on the subsequent modules in the pipe...

@haecksenwerk

After removing both ./cache/darktable and ./config/darktable, the heavy latency I was experiencing is gone.

Sorry for stirring things up unnecessarily.

   259.3100 [dev_pixelpipe] took 0.015 secs (0.007 CPU) [full] processed `exposure' on GPU, blended on GPU
   259.3166 [dev_pixelpipe] took 0.007 secs (0.007 CPU) [full] processed `colorin' on GPU, blended on GPU
   259.3269 [dev_pixelpipe] took 0.010 secs (0.004 CPU) [full] processed `channelmixerrgb' on GPU, blended on GPU
   259.4730 [dev_pixelpipe] took 0.146 secs (0.009 CPU) [full] processed `colorequal' on GPU, blended on GPU
   259.4795 [dev_pixelpipe] took 0.007 secs (0.002 CPU) [full] processed `agx' on GPU, blended on GPU
   259.5054 [dev_pixelpipe] took 0.026 secs (0.003 CPU) [full] processed `bilat' on GPU, blended on GPU
   259.5120 [dev_pixelpipe] took 0.007 secs (0.000 CPU) [full] processed `colorout' on GPU, blended on GPU
   259.5159 [dev_pixelpipe] took 0.004 secs (0.006 CPU) [full] processed `gamma' on CPU, blended on CPU
   259.5163 dev_pixelpipe_change          [full]                                        top changed, 
   259.5183 dt_dev_pixelpipe_process 900x1351 x=0 y=0 problem DT_DEV_PIXELPIPE_STOP_NO
   259.5187 dev_pixelpipe_change          [preview]                                     top changed, 
   259.5237 [dev_pixelpipe] took 0.005 secs (0.007 CPU) [preview] processed `exposure' on GPU, blended on GPU
   259.5258 [dev_pixelpipe] took 0.002 secs (0.002 CPU) [preview] processed `colorin' on GPU, blended on GPU
   259.5286 [dev_pixelpipe] took 0.003 secs (0.001 CPU) [preview] processed `channelmixerrgb' on GPU, blended on GPU
   259.5884 [dev_pixelpipe] took 0.060 secs (0.018 CPU) [preview] processed `colorequal' on GPU, blended on GPU
   259.5914 [dev_pixelpipe] took 0.003 secs (0.000 CPU) [preview] processed `agx' on GPU, blended on GPU
   259.6016 [dev_pixelpipe] took 0.010 secs (0.002 CPU) [preview] processed `bilat' on GPU, blended on GPU
   259.6056 [dev_pixelpipe] took 0.004 secs (0.000 CPU) [preview] processed `colorout' on GPU, blended on GPU
   259.6072 [dev_pixelpipe] took 0.002 secs (0.003 CPU) [preview] processed `gamma' on CPU, blended on CPU
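
To see at a glance which module dominates a pipe run, timings like the ones above can be summed per pipe. The sketch below does this for the excerpt just shown; the regular expression and the helper `module_times` are assumptions for illustration, derived only from the log format visible in this thread, and the totals cover only the modules that appear in the excerpt.

```python
import re
from collections import defaultdict

# Wall time, pipe name and module name of a "[dev_pixelpipe] took ..." line.
TIMING_RE = re.compile(
    r"took (?P<wall>\d+[.,]\d+) secs \([^)]*\) "
    r"\[(?P<pipe>[^\]]+)\] processed `(?P<module>[^']+)'"
)

# Excerpt copied from the log above.
LOG = """\
259.3100 [dev_pixelpipe] took 0.015 secs (0.007 CPU) [full] processed `exposure' on GPU, blended on GPU
259.3166 [dev_pixelpipe] took 0.007 secs (0.007 CPU) [full] processed `colorin' on GPU, blended on GPU
259.3269 [dev_pixelpipe] took 0.010 secs (0.004 CPU) [full] processed `channelmixerrgb' on GPU, blended on GPU
259.4730 [dev_pixelpipe] took 0.146 secs (0.009 CPU) [full] processed `colorequal' on GPU, blended on GPU
259.4795 [dev_pixelpipe] took 0.007 secs (0.002 CPU) [full] processed `agx' on GPU, blended on GPU
259.5054 [dev_pixelpipe] took 0.026 secs (0.003 CPU) [full] processed `bilat' on GPU, blended on GPU
259.5120 [dev_pixelpipe] took 0.007 secs (0.000 CPU) [full] processed `colorout' on GPU, blended on GPU
259.5159 [dev_pixelpipe] took 0.004 secs (0.006 CPU) [full] processed `gamma' on CPU, blended on CPU
259.5237 [dev_pixelpipe] took 0.005 secs (0.007 CPU) [preview] processed `exposure' on GPU, blended on GPU
259.5258 [dev_pixelpipe] took 0.002 secs (0.002 CPU) [preview] processed `colorin' on GPU, blended on GPU
259.5286 [dev_pixelpipe] took 0.003 secs (0.001 CPU) [preview] processed `channelmixerrgb' on GPU, blended on GPU
259.5884 [dev_pixelpipe] took 0.060 secs (0.018 CPU) [preview] processed `colorequal' on GPU, blended on GPU
259.5914 [dev_pixelpipe] took 0.003 secs (0.000 CPU) [preview] processed `agx' on GPU, blended on GPU
259.6016 [dev_pixelpipe] took 0.010 secs (0.002 CPU) [preview] processed `bilat' on GPU, blended on GPU
259.6056 [dev_pixelpipe] took 0.004 secs (0.000 CPU) [preview] processed `colorout' on GPU, blended on GPU
259.6072 [dev_pixelpipe] took 0.002 secs (0.003 CPU) [preview] processed `gamma' on CPU, blended on CPU
"""


def module_times(log_text):
    """Return {pipe: {module: summed wall seconds}} for all timing lines."""
    totals = defaultdict(dict)
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match is None:
            continue  # not a timing line
        wall = float(match.group("wall").replace(",", "."))
        per_pipe = totals[match.group("pipe")]
        per_pipe[match.group("module")] = per_pipe.get(match.group("module"), 0.0) + wall
    return totals


for pipe, modules in module_times(LOG).items():
    total = sum(modules.values())
    slowest = max(modules, key=modules.get)
    print(f"[{pipe}] total {total:.3f} s, slowest: {slowest} "
          f"({modules[slowest] / total:.0%} of the listed modules)")
```

In this excerpt `colorequal` is the largest single contributor in both pipes, yet the listed modules of the full pipe still add up to well under a second.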
