
Improve precision of HLR chroma corrections #13646

Conversation

jenshannoschwalm (Collaborator)

There have been differences between the OpenCL and CPU code paths in the calculated chroma correction coefficients due to different floating-point precision. In this PR we calculate the coefficients via double, leading to almost identical results. Downsides:

  1. Requires double support on the device (all drivers should support that).
  2. Some performance penalty. It is in fact very small, especially in darkroom, as we have cached data now.

Not sure if this fixes problems for certain Intel Neo drivers though.

See #13632

@groutr could you test this on your intel device?

@MStraeten could you please test on AMD and OSX?

@jenshannoschwalm (Collaborator, Author)

Just a short reminder, this requires clearing the OpenCL kernel cache :-)

@groutr

groutr commented Feb 15, 2023

@jenshannoschwalm it is much closer this time:

# With OpenCL
61.5096 [opposed chroma cache] 0.195899 -0.195747 0.430819 for opphash  16254282709875881148

# With --disable-opencl
6.5186 [opposed chroma cache] 0.185471 0.017018 0.319226 for opphash  16254282709875881148

The purple is back though.
(screenshot: 20190822_003649)

@jenshannoschwalm (Collaborator, Author)

Have you tried with the tuning options set to off?

@jenshannoschwalm (Collaborator, Author)

The numbers should be identical.

@groutr

groutr commented Feb 15, 2023

Yes, "tune OpenCL performance" was set to "nothing" (and the profile was set to "default"). FWIW, segmentation doesn't seem to have any artifacts on my system.

@jenshannoschwalm (Collaborator, Author)

You might try to log the cnt values; that would hint at a problem with the morph operation. It can't be just precision errors.

@TurboGit (Member)

  1. requires double support for the device (all drivers should support that)

I'm not sure that is true for the M1 chipset. I heard that only float is supported natively and that double might be very slow as it is emulated.

@jenshannoschwalm (Collaborator, Author)

The numbers should be identical. Not sure what goes wrong here. Will prepare a commit with better log info tonight. I would also love to get the issue confirmed by other Intel users.

@groutr

groutr commented Feb 15, 2023

My naive attempt to get cnt values:

# with opencl
105.7060 [opposed chroma cnt] 20377 136708 17082 0

# With --disable-opencl
6.9313 [opposed chroma cnt] 20377 136708 17082 0

@jenshannoschwalm (Collaborator, Author)

So that works correctly. Could you test again for the values?

@jenshannoschwalm (Collaborator, Author)

Another idea would be cacheline problems: Nvidia and AMD cards seem to use 16 bytes, Intel likely 64.

@groutr

groutr commented Feb 15, 2023

with opencl
    49.1254 [opposed chroma cache] 0.195899 -0.195747 0.430819 for opphash  16254282709875881148
    49.1254 [opposed chroma cnt] 20377 136708 17082 0
    49.1254 [opposed chroma sums] 3991.836510 -26760.136259 7359.251778 0.000000


disable-opencl
     7.1269 [opposed chroma cnt] 20377 136708 17082 0
     7.1270 [opposed chroma sums] 3779.345665 2326.475698 5453.015793 0.000000
     7.1270 [opposed chroma cache] 0.185471 0.017018 0.319226 for opphash  16254282709875881148

To improve the precision of the chroma corrections we use double instead of plain float.
The difference is subtle but it seems important to keep the difference between OpenCL and
CPU results as small as possible.

The overall performance penalty is below 1% (this depends slightly on image content).
The calculation of the chroma correction data should use doubles too for improved precision,
keeping differences with the CPU code as small as possible.
We use a global buffer to keep track of the chroma correction data for every line.

Because of OpenCL's relaxed global memory policy we want to make sure we write data without
cache read/write problems. This is faster than using locks.
@jenshannoschwalm (Collaborator, Author)

@groutr could you check with the latest commits? After reading some docs about Intel CL this might indeed be a cacheline problem.

@groutr

groutr commented Feb 15, 2023

@jenshannoschwalm This is probably going to be frustrating.

With your latest commits

5.8358 [opposed chroma cache GPU] 0.188866 -0.193616 0.430819 for opphash   2192461363833705316
6.1207 [opposed chroma cache CPU] 0.185471 0.017018 0.319226 for opphash   2192461363833705316

I was wondering what the sums represent, given the large differences seen in the second and third positions.
Could there be an issue with _calc_refavg?

@jenshannoschwalm (Collaborator, Author)

jenshannoschwalm commented Feb 15, 2023

Nope for refavg.

Frustrating? Nope again. I would really like to get my hands on an Intel machine. These Intel problems have a long history, btw.

@MStraeten (Collaborator)

Kernels compile fine with macOS M1 arm64.

I'm not sure what exactly to check, but applying the history stack taken from the jpg in the package seems weird:
(screenshot)

without this pr:
(screenshot)

@jenshannoschwalm (Collaborator, Author)

Could you check the reported coefficients in the log as described above? They should be identical for CPU and OpenCL.

@jenshannoschwalm (Collaborator, Author)

Also try on master first. The reported calculated data are so bad that you should see wrong colors immediately.

@MStraeten (Collaborator)

MStraeten commented Feb 15, 2023

arm64 Mac M1 Max
master:

with opencl
25,5326 [opposed chroma cache] 0,718308 -0,013508 0,269686 for opphash  17978588811350894852
--disable-opencl
7,6163 [opposed chroma cache] 0,718309 -0,013508 0,269686 for opphash  17978588811350894852

#13646

with opencl
15,1807 [opposed chroma cache GPU] -119299,070312 281610336899956736,000000 0,000000 for opphash  17978588811350894852
--disable-opencl
6,0172 [opposed chroma cache CPU] 0,718307 -0,013508 0,269686 for opphash  17978588811350894852

@MStraeten (Collaborator)

Intel Mac with AMD Radeon Pro 5500M
master:

opencl
102,8777 [opposed chroma cache] 0,004149 -0,007462 0,103744 for opphash  17978588811351607058
--disable-opencl
9,9507 [opposed chroma cache] 0,718307 -0,013508 0,269687 for opphash  17978588811351607058

#13646

opencl
22,2746 [opposed chroma cache GPU] -42,368896 9634933949682400051547078656,000000 0,000000 for opphash  17978588811351607058
--disable-opencl
11,0546 [opposed chroma cache CPU] 0,718307 -0,013508 0,269686 for opphash  17978588811351607058

@jenshannoschwalm (Collaborator, Author)

So there is something else wrong, as Martin's report on AMD shows.

@jenshannoschwalm (Collaborator, Author)

Will close this for now as it's not a precision issue. Stay tuned ...

jenshannoschwalm added a commit to jenshannoschwalm/darktable that referenced this pull request Feb 16, 2023
See darktable-org#13632 and darktable-org#13646

In some situations the calculated chroma corrections are vastly wrong. The first idea was precision errors
from using floats for summing up the data and the counters; that could be ruled out.

The reported errors could be related (this is my current understanding) to:
1. Improper data sampling because of the relaxed global memory policy; this could still be the case, although
   it is unlikely because the errors keep happening after correcting the data to always fit in a GPU cacheline.
2. Errors on AMD are often pinpointed to uninitialized float data or division by zero. The code has been
   reviewed to strictly avoid both.
3. The for_each_channel usage is wrong in many parts of the code, as we don't calculate 4-channel pixels but
   loop over a "color range" of 0-2.
4. The dark level was a code left-over; it's safe to use 0.2 of clip in all cases.
5. It is very good to avoid chroma corrections derived from a very small number of border pixels; we now make
   sure there is at least a "sensible" number.
jenshannoschwalm added a commit to jenshannoschwalm/darktable that referenced this pull request Feb 17, 2023
jenshannoschwalm added a commit to jenshannoschwalm/darktable that referenced this pull request Feb 17, 2023
jenshannoschwalm added a commit to jenshannoschwalm/darktable that referenced this pull request Feb 18, 2023
jenshannoschwalm added a commit to jenshannoschwalm/darktable that referenced this pull request Feb 18, 2023
@jenshannoschwalm jenshannoschwalm deleted the opposed_precision branch March 30, 2023 13:50