Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: make sure correct HLR chroma corrections #13654

Conversation

jenshannoschwalm
Copy link
Collaborator

See #13632 and #13646

In some situations the calculated chroma corrections are vastly wrong. The first idea was about precision errors because of using floats for summing up data and the counters, that could be ruled out.

The reported errors could be related (this is my current understanding) to:

  1. improper data sampling because of the relaxed global memory policy, this could still be the case although unlikely because the errors keep happening after correcting data to be always in a gpu cacheline. (It might be we have to intruce a lock-per-line)
  2. Other issues on AMD often pinpointed to a) float data not being initialized or b) math errors like div by zero. (Nvidia never had problems on this) The code has been reviewed for strictly avoiding this.
  3. The for_each_channel usage was wrong in many parts of the code as we don't calc 4-channel pixels but we do a loop over a "color-range" of 0-2. This could have resulted to a div-by-zero.
  4. The dark-level was a code left-over, it's safe to use 0.2 of clip in all cases. (This is the only maths change in this pr. The dark clips lead to problems in very few images if there were a lot of black-to-white transitions)
  5. It is very good to avoid chroma corrections derived from a very small number of border pixels, we make sure now to have at least a "sensible" number.

Thanks to many people having tested #13646, could i ask you to check again @groutr @MStraeten @kofa73 @gi-man @sarunasb ?

To test this:

  1. Use the "duck-at-the-pond" image from HLR green artifacts with OpenCL #13632
  2. Start dt either with --disable-opencl or not
  3. use -d pipe and watch out for reports like [opposed chroma cache GPU/CPU]
  4. Compare the values reported, they must be "almost identical" for CPU vs GPU, any difference must be concidered to be a bug.
  5. If you made sure: rawprepare, temperature and clip in hlr are all set to default, and using opposed method your results should be
   13.2351 [opposed chroma cache GPU] red: 1.056767 (18406), green: -0.025674 (123257), blue: -0.394672 (16525) for opphash  11498776663581970802
    3.7372 [opposed chroma cache CPU] red: 1.056766 (18406), green: -0.025674 (123257), blue: -0.394673 (16525) for opphash  11498776663581970802

A sidenote, the algo has also changed slightly, the results might be slightly different from master but tested on this pr the reports must be identical for GPU and CPU

@groutr
Copy link

groutr commented Feb 16, 2023

For reference: The test "duck-at-the-pond" image can be downloaded from this post: https://discuss.pixls.us/t/using-darktable-master-with-intel-opencl-graphics-please-help/35371/5?u=ayro
(I'm the author of the image and it's being released under CC-BySA, ie the Play RAW license)

@groutr
Copy link

groutr commented Feb 16, 2023

On my machine (Intel HD520 on i5-6300U).

32.7156 [opposed chroma cache GPU] red: 1.266276 (18406), green: 0.675044 (123257), blue: 1.512822 (16525) for opphash   2192461363833705316

6.8555 [opposed chroma cache CPU] red: 0.207705 (18406), green: 0.021343 (123257), blue: 0.334533 (16525) for opphash   2192461363833705316

@sarunasb
Copy link
Contributor

sarunasb commented Feb 16, 2023

Intel Arc A750 and Ryzen 9 3900X. HLR result looks good.

this is darktable 3.3.0+10095~g5b846b9f1
... ... ...
   CANONICAL NAME:           intelrgraphics0x56a1
   PLATFORM NAME & VENDOR:   Intel(R) OpenCL HD Graphics, Intel(R) Corporation
   DRIVER VERSION:           1.0.0
   DEVICE VERSION:           OpenCL 3.0 NEO 
... ... ...
33.1990 [opposed chroma cache GPU] red: 1.056767 (18406), green: -0.025674 (123257), blue: -0.394672 (16525) for opphash   6947967410130095348
 4.5824 [opposed chroma cache CPU] red: 1.056767 (18406), green: -0.025674 (123257), blue: -0.394672 (16525) for opphash   6947967410130095348
 

@MStraeten
Copy link
Collaborator

MStraeten commented Feb 16, 2023

macos arm64 M1Max

--disable-opencl
25,8381 [opposed chroma cache CPU] red: 1,056767 (18406), green: -0,025674 (123257), blue: -0,394672 (16525) for opphash  11498776663580808806

opencl (ok after second build)
[opposed chroma cache GPU] red: 1,056767 (18406), green: -0,025674 (123257), blue: -0,394672 (16525) for opphash  11498776663580808806

@MStraeten
Copy link
Collaborator

MStraeten commented Feb 16, 2023

macos x86_64 AMD Radeon Pro 5500M

--disable-opencl
25,2197 [opposed chroma cache CPU] red: 1,056767 (18406), green: -0,025674 (123257), blue: -0,394672 (16525) for opphash  11498776663579427954
opencl
9,5132 [opposed chroma cache GPU] red: 0,310394 (764962), green: -0,003515 (2543940), blue: -0,084141 (1404466) for opphash  11498776663579427954
  DEVICE:                   2: 'AMD Radeon Pro 5500M Compute Engine'
   CANONICAL NAME:           amdradeonpro5500mcomputeengine
   PLATFORM NAME & VENDOR:   Apple, Apple
   DRIVER VERSION:           1.2 (Feb  8 2023 20:47:03)
   DEVICE VERSION:           OpenCL 1.2
   DEVICE_TYPE:              GPU
   GLOBAL MEM SIZE:          4080 MB
   MAX MEM ALLOC:            1020 MB
   MAX IMAGE SIZE:           16384 x 16384
   MAX WORK GROUP SIZE:      256
   MAX WORK ITEM DIMENSIONS: 3
   MAX WORK ITEM SIZES:      [ 256 256 256 ]
   ASYNC PIXELPIPE:          NO
   PINNED MEMORY TRANSFER:   NO
   MEMORY TUNING:            WANTED
   FORCED HEADROOM:          400
   AVOID ATOMICS:            NO
   MICRO NAP:                250
   ROUNDUP WIDTH:            16
   ROUNDUP HEIGHT:           16
   CHECK EVENT HANDLES:      1024
   PERFORMANCE:              3.251
   TILING ADVANTAGE:         0.000
   CL COMPILER OPTION:       -cl-fast-relaxed-math

@kofa73
Copy link
Contributor

kofa73 commented Feb 16, 2023

I'll be glad to run a test if it helps, but only on Windows, and only if someone can provide a build that I can install. Note that my original attempt to reproduce this failed, so is there a point in my trying again? (I'm not allowed to touch the OS of the laptop running Windows; my Linux machine has an NVidia card.)

@sarunasb
Copy link
Contributor

Linux 5.15.0-60-generic, Ryzen 5 5600X, GeForce RTX 2070.

this is darktable 3.3.0+10095~g5b846b9f1
...
   DEVICE:                   0: 'NVIDIA GeForce RTX 2070'
   DRIVER VERSION:           520.61.05
   DEVICE VERSION:           OpenCL 3.0 CUDA, SM_20 SUPPORT
 ...
20.9683 [opposed chroma cache GPU] red: 1.056767 (18406), green: -0.025674 (123257), blue: -0.394672 (16525) for opphash   6947967410130095348
 3.6977 [opposed chroma cache CPU] red: 1.056767 (18406), green: -0.025674 (123257), blue: -0.394671 (16525) for opphash   6947967410130095348

@groutr
Copy link

groutr commented Feb 16, 2023

On my machine (Intel HD520 on i5-6300U).

32.7156 [opposed chroma cache GPU] red: 1.266276 (18406), green: 0.675044 (123257), blue: 1.512822 (16525) for opphash   2192461363833705316

6.8555 [opposed chroma cache CPU] red: 0.207705 (18406), green: 0.021343 (123257), blue: 0.334533 (16525) for opphash   2192461363833705316

What is your compile process? Seeing different values for the CPU path than everyone else makes me think that you have another issue.

My compile process is executing ./build.sh that is in the darktable repo.

# darktable repo cloned to my ~/tmp.darktable directory and checking out jenshannoschwalm/fix_opposed_opencl2
$ mkdir build
$ ./build.sh
darktable build script

Building directory:  /home/grout/tmp/darktable/build
Installation prefix: /opt/darktable
Build type:          RelWithDebInfo
Build generator:     Unix Makefiles
Build tasks:         4

-- Found OpenMP_C: -fopenmp (found suitable version "4.5", minimum required is "4.5")
-- Found OpenMP_CXX: -fopenmp (found suitable version "4.5", minimum required is "4.5")
-- Found OpenMP: TRUE (found suitable version "4.5", minimum required is "4.5")
-- Building SSE2-optimized codepaths: ON
-- Performing Test C_COMPILER_UNDERSTANDS_-Wno-error=varargs
-- Performing Test C_COMPILER_UNDERSTANDS_-Wno-error=varargs - Success
-- Performing Test CXX_COMPILER_UNDERSTANDS_-Wno-error=varargs
-- Performing Test CXX_COMPILER_UNDERSTANDS_-Wno-error=varargs - Success
-- Performing Test C_COMPILER_UNDERSTANDS_-Wno-error=address-of-packed-member
-- Performing Test C_COMPILER_UNDERSTANDS_-Wno-error=address-of-packed-member - Success
-- Performing Test CXX_COMPILER_UNDERSTANDS_-Wno-error=address-of-packed-member
-- Performing Test CXX_COMPILER_UNDERSTANDS_-Wno-error=address-of-packed-member - Success
-- Looking for external programs
-- Found perl
-- Found intltool-merge
-- Found desktop-file-validate
-- Found LLVM 15.0.7
-- Found clang compiler - /bin/clang-15
-- Found clang opencl-c.h header in //lib/clang/15.0.7/include
-- Will be able to test-compile OpenCL programs. Nice.
-- Found jsonschema
-- Found xsltproc
-- Found xmllint
-- Found exiftool
-- All external programs found
-- Found msgfmt to convert .po file. Translation enabled
-- The following OPTIONAL packages have been found:

 * FFI
 * Terminfo
 * zstd
 * LibXml2
 * LLVM
 * Gettext

-- The following REQUIRED packages have been found:

 * OpenMP (required version >= 4.5)
 * Threads
 * ZLIB

-- Checking for -march=native support
-- Checking for -march=native support - works
-- Checking for -std=c++17 support
-- Checking for -std=c++17 support - works
-- Looking for OpenMP
-- Found OpenMP_C: -fopenmp (found suitable version "4.5", minimum required is "4.5")
-- Found OpenMP_CXX: -fopenmp (found suitable version "4.5", minimum required is "4.5")
-- Looking for OpenMP - found (system)
-- Looking for pugixml
-- Found Pugixml 1.13
-- Looking for pugixml - found (system)
-- Looking for JPEG
-- Looking for JPEG - found
-- Looking for ZLIB
-- Found ZLIB: /usr/lib/libz.so (found suitable version "1.2.13", minimum required is "1.2.11")
-- Looking for ZLIB - found (system)
-- Trying to query CPU L1d cache line size
-- Deciding that the CPU L1d cache line size is 64 bytes
-- Trying to query CPU page size
-- Deciding that the CPU page size is 4096 bytes
-- Trying to query CPU large page size
-- Deciding that the CPU large page size is 2097152 bytes
-- The following features have been enabled:

 * OpenMP-based threading, used for parallelization of the library
 * XML reading, used for loading of data/cameras.xml
 * Lossy JPEG decoding, used for DNG Lossy JPEG compression decoding
 * ZLIB decoding, used for DNG Deflate compression decoding

-- The following OPTIONAL packages have been found:

 * FFI
 * Terminfo
 * zstd
 * LibXml2
 * LLVM
 * Gettext
 * XMLLINT, command line XML tool, <http://xmlsoft.org/>
   Used for validation of data/cameras.xml

-- The following RECOMMENDED packages have been found:

 * JPEG, free library for handling the JPEG image data format, implements a JPEG codec
   Used for decoding DNG Lossy JPEG compression

-- The following REQUIRED packages have been found:

 * Threads
 * OpenMP (required version >= 4.5), Open Multi-Processing, <https://www.openmp.org/>
   Used for parallelization of the library
 * Pugixml (required version >= 1.8), Light-weight, simple and fast XML parser, <http://pugixml.org/>
   Used for loading of data/cameras.xml
 * ZLIB (required version >= 1.2.11), software library used for data compression
   Used for decoding DNG Deflate compression

-- Checking for -std=c++14 support
-- Checking for -std=c++14 support - works
-- Found Glib 2.74.5
-- Found LibXml2: /usr/lib/libxml2.so (found suitable version "2.10.3", minimum required is "2.6")
-- Found ZLIB: /usr/lib/libz.so (found version "1.2.13")
-- Could NOT find libheif (missing: libheif_DIR)
-- Building LibRaw from intree copy
-- Found Sqlite3 3.40.1
-- Sqlite3 version 3.24 or newer
-- Found GIO
-- Found LibXml2: /usr/lib/libxml2.so (found version "2.10.3")
-- Found CURL: /usr/lib/libcurl.so (found version "7.87.0")
-- Found JsonGlib
-- Exiv2 >= 0.27.4 found with ISOBMFF support (CR3, AVIF, HEIF)
-- Found OpenJPEG
-- Found GraphicsMagick
-- Found GMIC
-- Found the following ICU libraries:
--   i18n (required): /usr/lib/libicui18n.so
--   data (required): /usr/lib/libicudata.so
--   uc (required): /usr/lib/libicuuc.so
-- Does the compiler support __builtin_cpu_supports(): 1
-- Checking for -march=native support
-- Lua support: Enabled
-- Found Pugixml 1.13
-- Map mode: enabled
-- Print mode: enabled
-- Game: the good knight
-- Signal debug: print-trace possible
-- Found CURL: /usr/lib/libcurl.so (found suitable version "7.87.0", minimum required is "7.56")
-- Found recent CURL version to build piwigo.
-- building darktable-cmstest with colord support. nice.
-- LibRaw string version: 0.21.1
-- LibRaw ID version:     0x0211
-- LibRaw SO version:     23.0.0
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
--
-- ----------------------------------------------------------------------------------
--  Libraw 0.21.1 configuration            <http://www.libraw.org>
--
--  Libraw will be compiled with OpenMP support .................. YES
--  Libraw will be compiled with LCMS support .................... NO
--  Libraw will be compiled with example command-line programs ... NO
--  Libraw will be compiled with RedCine codec support ........... NO
--  Libraw will be compiled with DNG deflate codec support ....... YES
--  Libraw will be compiled with DNG lossy codec support ......... YES
--  Libraw will be compiled with RawSpeed support ................ NO
--  Libraw will be compiled with debug message from dcraw ........ NO
--  Libraw will be compiled with Foveon X3F support .............. NO
--  Libraw will be compiled with Raspberry Pi RAW support ........ NO
--  Libraw will be compiled as a static library
-- ----------------------------------------------------------------------------------
--
Missing po4a-translate. Can NOT create translated manpages
Missing po4a-updatepo. Can NOT update manpage translations
-- The following features have been enabled:

 * OpenMP-based threading, used for parallelization of the library
 * XML reading, used for loading of data/cameras.xml
 * Lossy JPEG decoding, used for DNG Lossy JPEG compression decoding
 * ZLIB decoding, used for DNG Deflate compression decoding

-- The following OPTIONAL packages have been found:

 * FFI
 * Terminfo
 * zstd
 * LLVM
 * Gettext
 * XMLLINT, command line XML tool, <http://xmlsoft.org/>
   Used for validation of data/cameras.xml
 * LibSoup2
 * Gphoto2 (required version >= 2.5)
 * OpenEXR (required version >= 3.0)
 * JXL (required version >= 0.7.0)
 * WebP (required version >= 0.3.0)
 * libavif (required version >= 0.8.2)
 * libheif (required version >= 1.13.0)
 * PortMidi, Portable MIDI library, <https://github.com/PortMidi/portmidi>
   Used for hardware MIDI input devices
 * OpenJPEG
 * IsoCodes (required version >= 3.66)
 * Libsecret
 * GraphicsMagick
 * GMIC
 * ICU
 * Lua54 (required version >= 5.4)
 * OSMGpsMap
 * Colord
 * ColordGTK
 * Cups
 * SDL2, low level access to audio, keyboard, mouse, joystick, and graphics hardware, <https://www.libsdl.org/>
 * X11

-- The following REQUIRED packages have been found:

 * GTK3 (required version >= 3.24.15)
 * Threads
 * Imath
 * LensFun
 * Sqlite3 (required version >= 3.15)
 * GIO
 * GThread
 * GModule
 * PangoCairo
 * Rsvg2
 * LibXml2
 * PNG
 * TIFF
 * LCMS2
 * JsonGlib
 * Exiv2 (required version >= 0.24)
 * Pugixml (required version >= 1.2), Light-weight, simple and fast XML parser, <http://pugixml.org/>
   Used for loading of data/cameras.xml
 * CURL (required version >= 7.56)
 * Glib
 * ZLIB, software library used for data compression
   Used for decoding DNG Deflate compression
 * JPEG, free library for handling the JPEG image data format, implements a JPEG codec
   Used for decoding DNG Lossy JPEG compression
 * OpenMP, Open Multi-Processing, <https://www.openmp.org/>
   Used for parallelization of the library

-- Configuring done
-- Generating done
-- Build files have been written to: /home/grout/tmp/darktable/build

@jenshannoschwalm
Copy link
Collaborator Author

The report from @MStraeten on AMD clearly showed a problem with the masking while calculating the chroma (the number of brackets should be identical or at least very close)

  1. Indeed there was a bug possibly leading to uninitialized data in the mask.
  2. Ref values are slightly different now as we make sure CPU & GPU codes work identical. Please note!
[opposed chroma cache CPU] red: 1.078559 (16302), green: -0.028281 (111939), blue: -0.394351 (14412) for opphash  11498776663581970802
[opposed chroma cache GPU] red: 1.078560 (16302), green: -0.028281 (111939), blue: -0.394350 (14412) for opphash  11498776663581970802

@groutr
Copy link

groutr commented Feb 17, 2023

With the latest commits:

39.4182 [opposed chroma cache GPU] red: 1.287707 (16302), green: 0.672347 (111939), blue: 1.519107 (14412) for opphash   2192461363833705316
7.0457 [opposed chroma cache CPU] red: 0.215559 (16302), green: 0.019315 (111939), blue: 0.337726 (14412) for opphash   2192461363833705316

Exported with OpenCL
20190822_003649_01

@TurboGit TurboGit added wip pull request in making, tests and feedback needed bugfix pull request fixing a bug labels Feb 17, 2023
@jenshannoschwalm
Copy link
Collaborator Author

The results for CPU path are irritating. Could you build and try with "Release" Version?

@jenshannoschwalm
Copy link
Collaborator Author

I oversaw this, you must have set some parameters in raw prepare, temperature or hlr clip, the hash is different

@MStraeten
Copy link
Collaborator

macos x86_64 AMD Radeon Pro 5500M

20,0987 [opposed chroma cache CPU] red: 1,078560 (16302), green: -0,028281 (111939), blue: -0,394350 (14412) for opphash  11498776663579427954
 9,9894 [opposed chroma cache GPU] red: 1,022740 (17233), green: -0,026178 (119963), blue: -0,317101 (17777) for opphash  11498776663579427954

@MStraeten
Copy link
Collaborator

macos arm64 M1Max

4,3778 [opposed chroma cache CPU] red: 1,078560 (16302), green: -0,028281 (111939), blue: -0,394350 (14412) for opphash  11498776663580808806
6,0246 [opposed chroma cache GPU] red: 1,078560 (16302), green: -0,028281 (111939), blue: -0,394350 (14412) for opphash  11498776663580808806

@jenshannoschwalm
Copy link
Collaborator Author

@MStraeten the latest commits should likely fix your AMD issue (fingers crossed)

@jenshannoschwalm
Copy link
Collaborator Author

@groutr could you check please again on intel too? (I have installed dt on my Intel notebook too, couldn't get OpenCL running yet ...)

@groutr
Copy link

groutr commented Feb 17, 2023

With the latest commits:

5.9997 [opposed chroma cache CPU] red: 0.215559 (16302), green: 0.019315 (111939), blue: 0.337726 (14412) for opphash   2192461363833705316
28.8494 [opposed chroma cache GPU] red: 1.287707 (16302), green: 0.672347 (111939), blue: 1.519107 (14412) for opphash   2192461363833705316

Exported with OpenCL.
20190822_003649_02

Not sure if this is related or not, but just noticed that when I'm in the darkroom module, darktable is using consistently about 60% CPU even when idle (and in the background).

@MStraeten
Copy link
Collaborator

macos x86_64 AMD Radeon Pro 5500M

9,8717 [opposed chroma cache CPU] red: 1,078560 (16302), green: -0,028281 (111939), blue: -0,394350 (14412) for opphash  11498776663579427954
22,3344 [opposed chroma cache GPU] red: 1,022740 (17233), green: -0,026178 (119963), blue: -0,317101 (17777) for opphash  11498776663579427954

@jenshannoschwalm
Copy link
Collaborator Author

Not sure if this is related or not, but just noticed that when I'm in the darkroom module, darktable is using consistently about 60% CPU even when idle (and in the background).

Can reproduce here. Goes away if the left panel is hidden. You should open a new issue

@jenshannoschwalm
Copy link
Collaborator Author

Friends @MStraeten and @groutr may i ask you again?
(Again found some issues with unitialized data for morphs, a xtrans issue so worth to try)

@jenshannoschwalm
Copy link
Collaborator Author

The latest commit modifies the debug output so we should be able to have identical results if working fine

 [opposed chroma cache GPU] red: 1.051852 (18560), green: -0.025731 (122349), blue: -0.394672 (16714) for hash  10526892061764007637
 [opposed chroma cache CPU] red: 1.051851 (18560), green: -0.025731 (122349), blue: -0.394673 (16714) for hash  10526892061764007637

See darktable-org#13632 and darktable-org#13646

In some situations the calculated chroma corrections are vastly wrong. The first idea was about precision errors
because of using floats cor summing up data and the counters, that could be ruled out.

The reported errors could be related (this is my current understanding) to:
1. improper data sampling because of the relaxed global memory policy, this could still be the case although
   unlikely because the errors keep happening after correcting data to be always in a gpu cacheline.
2. Errors on AMD often pinpointed to float data not initialized or div by zero. The code has been reviewed for
   strictly avoiding this.
3. The for_each_channel usage is wrong in many parts of the code as we don't calc 4-channel pixels but we do a
   loop over a "color-range" of 0-2
4. The dark-level was a code left-over, it's safe to use 0.2 of clip in all cases.
5. It is very good to avoid chroma corrections derived from a very small number of border pixels, we make sure
   now at least a "sensible" number.
The mask initializing before dilating to find out for unclipped locations
close to unclipped->clipped transitions could have not-initialized data in opencl code.

Fixed, also make sure the GPU and CPU code paths take exactly same locations.
The way we did this before worked fine on some architectuires, on some it failed completely
because of different implementations for "relaxed memory policy" on global memory.

As we only use one-update per line we can safely do this without bad performance drops now.

Otherwise we would require OpenCL V 2.0
- the area checked for initializing masks was wrong
- dilated buffer could have some data uninitialized
- subtle error for xtrans fixed
@jenshannoschwalm
Copy link
Collaborator Author

The good point: i have finally been able to reproduce the same issue on my intel notebook.
The bad point: Something is really wrong :-(

Will come back if i have reached substancial progress.

@jenshannoschwalm
Copy link
Collaborator Author

Closing this to proceed with a specific Intel & AMD OpenCl pr.

@jenshannoschwalm jenshannoschwalm deleted the fix_opposed_opencl2 branch March 30, 2023 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix pull request fixing a bug wip pull request in making, tests and feedback needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants