Improve precision of HLR chroma corrections #13646
Conversation
Just a short reminder, this requires clearing the OpenCL kernel cache :-)

@jenshannoschwalm it is much closer this time:

Have you tried with the tuning options set to off?

The numbers should be identical.

Yes, "tune OpenCL performance" was set to "nothing" (and the profile was set to "default"). FWIW, segmentation doesn't seem to have any artifacts on my system.

You might try to log the cnt values; that would hint at some problem with the morph operation. It can't be just precision errors.

I'm not sure that is true for the M1 chipset. I heard that only float is supported natively and using double might be very slow, as it is emulated.

The numbers should be identical. Not sure what goes wrong here. Will prepare a commit with better log info tonight. Would also love to get the issue confirmed by other Intel users.

My naive attempt to get cnt values:

So that works correctly; could you test again for the vals?

Another idea would be cacheline problems; NVIDIA and AMD cards seem to use 16 bytes, Intel likely 64.
To improve the precision of chroma corrections we use double instead of plain float. The difference is subtle but it seems to be important to have this to keep OpenCL vs CPU results as small as possible. Overall performance penalty is below 1% (this depends slightly on image content)
The calculation of the chroma correction data should use doubles too, for improved precision, to keep differences with the CPU code as small as possible.
We use a global buffer to keep track of chroma correction data for every line. Because of the relaxed OpenCL global memory policy we want to make sure to write data without cache read/write problems; this is faster than using locks.
Force-pushed from 96f8357 to 3d89b6c
@groutr could you check with the latest commits? After reading some docs about Intel CL this might indeed be a cacheline problem.

@jenshannoschwalm This is probably going to be frustrating. With your latest commits I was wondering what the sums represent, and about the large differences seen in the second and third positions.

Nope for refavg. Frustrating? Nope again. I would really like to get my hands on an Intel machine. These Intel problems have a long history btw.

Could you check the reported coefficients in the log as described above? They should be identical for CPU and OpenCL.

Also try on master first. The calculated data reported are so bad you should see wrong colors immediately.

arm64 mac M1 max

intel mac with AMD Radeon Pro 5500M

So there is something wrong elsewhere, as Martin's report on AMD shows.

Will close this for now as it's not a precision issue. Stay tuned ...
See darktable-org#13632 and darktable-org#13646. In some situations the calculated chroma corrections are vastly wrong. The first idea was precision errors from using floats for summing up the data and the counters; that could be ruled out. The reported errors could be related (this is my current understanding) to:
1. Improper data sampling because of the relaxed global memory policy. This could still be the case, although it is unlikely because the errors keep happening after correcting the data to always lie within a GPU cacheline.
2. Errors on AMD often pinpointed to uninitialized float data or division by zero. The code has been reviewed to strictly avoid this.
3. The for_each_channel usage is wrong in many parts of the code, as we don't calculate 4-channel pixels but loop over a "color range" of 0-2.
4. The dark level was a code left-over; it's safe to use 0.2 of clip in all cases.
5. It is very good to avoid chroma corrections derived from a very small number of border pixels; we now make sure there is at least a "sensible" number.



There has been some difference between the OpenCL and CPU code for the calculated chroma correction coeffs due to different floating point precision. In this PR we calculate the coeffs via double, leading to almost identical results. Downside: this requires double support on the device (all drivers should support that). Not sure if this fixes problems for certain Intel Neo drivers though.
See #13632
@groutr could you test this on your intel device?
@MStraeten could you please test on AMD and OSX?