
Introduce balanced OpenCL vs CPU tiling #12545

Merged (3 commits) Sep 28, 2022

Conversation

@jenshannoschwalm (Collaborator) commented Sep 24, 2022

Rationale: Many machines have pretty fast multicore CPUs and a lot of system RAM, but significantly less GPU memory with somewhat restricted number-crunching power. There is no problem using OpenCL code in modules as long as that doesn't lead to excessive tiling with large overlapping areas. Such situations used to be rare, but now we have examples like D&S or the laplacian highlights reconstruction that can behave badly and where it would have been better to use the CPU code instead. One related issue is #11955.

Suggested "solutions" like per-module settings (whatever the complexity in the UI might be) are not good at all, as runtime-defined parameters are not taken into account.

The proposed solution rests on two assumptions:

  1. We have a proper benchmarking result measuring the raw processing power of the CPU vs. the CL device. (The results we already have in opencl.c are by far not good enough, as they don't reflect real-world performance.)
  2. The "overall processed memory", weighted by the performance measured in 1), is a good estimate of the resulting workload.

This PR introduces a decision in the OpenCL pixelpipe code:

  1. If the module can't be processed without tiling,
  2. and we have a user-defined, device-specific advantage over the CPU,
  3. we estimate the memory consumption for the CPU vs. the GPU path
  4. and thus decide whether the CPU code has an advantage over the GPU.

The per-CL-device advantage is stored as the last entry in the device-specific conf key; the conf key is extended automatically and the advantage is set to zero, which disables the check for now.

How to set up the advantage hint?

Measure performance (-d perf) in the laplacian highlights reconstruction after setting "diameter of reconstruction" to something high. While doing so, make sure (via -d tiling) that the CL code doesn't tile. Check the execution times with OpenCL on and off. The advantage option can then be set to approximately cpu-time / gpu-time.
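
A minimal sketch of that last step. The timings below are invented and are not taken from this PR; in practice they would come from `-d perf` runs with OpenCL disabled and enabled:

```c
/* Hypothetical illustration of deriving the advantage hint; the timings
   here are invented stand-ins for values read off `-d perf` output. */
#include <stdio.h>

int main(void)
{
  const double cpu_seconds = 12.0; /* module runtime with OpenCL disabled           */
  const double gpu_seconds = 4.0;  /* module runtime with OpenCL enabled, untiled   */

  /* advantage ~ cpu-time / gpu-time */
  const double advantage = cpu_seconds / gpu_seconds;
  printf("suggested advantage hint: %.1f\n", advantage); /* -> 3.0 */
  return 0;
}
```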

The code presented here a) adds some complexity to the tiling code, b) does not lead to performance drops as long as the new parameter is left untouched or is properly set, and c) leads to performance gains in some modules, especially if your graphics card is not a "powerhouse" and you have a fast CPU and lots of RAM.

Edit: this is the third set of OpenCL-related changes for 4.2.

  1. I checked the existing benchmark code: pre-heating CPU & GPU gives somewhat better results, but nothing of real significance. Enlarging the size of the data in the benchmark is not always possible for small cards.

`float dt_tiling_estimate_cpumem()` and
`dt_tiling_estimate_clmem()`

can be used in pipeline processing to estimate the memory use of a specific
module at runtime before deciding to actually do the CL tiling.

Part of a PR introducing CPU vs. GPU balanced tiling.

Also a large number of improved debugging messages related to tiling; all
dt_print messages now also report the name of the pixelpipe to make the
information more useful.
The CL device struct gets another parameter, `advantage`.
It's used to estimate a possible CPU advantage over tiled CL code; it
- is used together with the estimated memory requirements in tiling.c
- is unused by default but can be set specifically for a device after proper benchmarking

As the default is 0.0 we don't require a version bump but can instead just extend the device conf option.
If the CL pipeline code finds an OpenCL-tiling-requested situation, we check for a possible CPU advantage.
The check assumes that the overall processed memory is a good enough indicator; it needs user assistance in the form of proper benchmarking of CPU vs. GPU performance.
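
A minimal standalone sketch of that check. The two estimator functions below are hypothetical stand-ins for the new `dt_tiling_estimate_cpumem()` / `dt_tiling_estimate_clmem()` in tiling.c (the real ones take the module, piece and ROIs into account), and all numbers are invented:

```c
/* Standalone sketch of the CPU-vs-GPU fallback decision described above.
   The estimator functions are invented stand-ins; their signatures and
   return values do not match the real tiling.c code. */
#include <stdbool.h>
#include <stdio.h>

/* Stand-in: total memory (in MB) processed by the CPU path, tiling included. */
static float estimate_cpumem_mb(void) { return 1800.0f; }

/* Stand-in: total memory (in MB) processed by the tiled OpenCL path,
   i.e. roughly number-of-tiles * memory-per-tile-including-overlap. */
static float estimate_clmem_mb(void) { return 9000.0f; }

/* advantage == 0 keeps the old behaviour: never fall back to the CPU. */
static bool prefer_cpu_fallback(const float advantage)
{
  if(advantage <= 0.0f) return false;                /* check disabled        */
  const float cpu_workload = estimate_cpumem_mb();
  const float gpu_workload = estimate_clmem_mb();
  return (gpu_workload / advantage) > cpu_workload;  /* CPU likely faster     */
}

int main(void)
{
  printf("advantage 3.0 -> use CPU? %s\n", prefer_cpu_fallback(3.0f) ? "yes" : "no");
  printf("advantage 9.0 -> use CPU? %s\n", prefer_cpu_fallback(9.0f) ? "yes" : "no");
  return 0;
}
```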
@TurboGit added this to the 4.2 milestone Sep 24, 2022
@TurboGit added the priority: high, feature: redesign, difficulty: hard, scope: performance, scope: codebase and release notes: pending labels Sep 24, 2022
@TurboGit (Member) commented

Do you think getting the number of tiles into the equation would help? I mean, if the GPU card doesn't have a lot of memory, a lot of tiles will be used, and in this case there is a huge drop in performance as each tile's data needs to be transferred to and back from the GPU. In the opposite case, if we have a lot of main memory, less tiling will be used, which also avoids the GPU to/from CPU transfers. That's why I'm thinking that using the number of tiles could maybe help in taking the "right" decision.

What do you think?

@jenshannoschwalm (Collaborator, Author) commented

Do you think getting the number of tiles in the equation would help?

I think this is already taken into account, as the "overall memory" is in fact roughly "number-of-tiles * memory-for-a-tile-including-overlap".

See the new functions in tiling.c. They look pretty complicated; I took the code for the case where the ROI sizes in & out are different. The result is not strictly correct, as the runtime tiling code might leave out the outermost tiles if not required. But it's good enough.
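
A hedged sketch of that estimate: dimensions, overlap and buffer layout are invented, and the real code in tiling.c is considerably more involved (it also accounts for differing in/out ROI sizes):

```c
/* Toy estimate of "overall memory": number of tiles times the memory of a
   tile including its overlap. All numbers here are illustrative only. */
#include <stdio.h>

int main(void)
{
  const int width = 6000, height = 4000;   /* full ROI in pixels               */
  const int tile = 3000;                   /* square tile edge incl. overlap   */
  const int overlap = 1000;                /* overlap on each tile border      */
  const int stride = tile - 2 * overlap;   /* effective payload per tile       */

  const int tiles_x = (width  + stride - 1) / stride;
  const int tiles_y = (height + stride - 1) / stride;
  const long long per_tile = (long long)tile * tile * 4 * sizeof(float); /* 4-channel float */
  const long long overall  = (long long)tiles_x * tiles_y * per_tile;

  printf("%d x %d tiles, ~%.1f MB processed overall\n",
         tiles_x, tiles_y, overall / (1024.0 * 1024.0));
  return 0;
}
```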

@gi-man (Contributor) commented Sep 26, 2022

4. and thus decide if cpu code has an advantage over gpu

I tried to read the code and could not determine how this decision is being made. Is it mainly just the tiles?

I've been thinking about this PR since you posted it. I see 4 different potential cases:

  1. Low CPU performance / low CPU memory - low GPU perf/memory = in this scenario, the user should not use D&S/GL with high iterations because there will be too many tiles and it could take a very long time.
  2. High CPU perf / enough memory to avoid significant tiling - low GPU perf/memory = in this scenario, using the CPU over the GPU should work since it reduces the tiles, but high iterations will still take time.
  3. High CPU perf / enough memory to avoid significant tiling - high GPU perf (nvidia 20 series and up) but low GPU memory (2 GB) = in this scenario, I think the GPU path should be better because the GPU is still faster at processing since it has more processing units, even with some tiles (of course, if the tile count gets extremely high, the CPU is better).
  4. High CPU perf / enough memory to avoid significant tiling - high GPU perf (nvidia 20 series and up) and high GPU memory (6 GB and up) = in this scenario, the GPU path is always better. The GPU is faster at processing since it has more processing units, and the GPU memory should avoid tiles.

I think biasing towards using the GPU should be the default, based on the performance benchmark but not so much on tiles. Why? Issue #11955 shows it. I have a case 4 system with a Ryzen 7 5600 CPU, 16 GB of memory and an nvidia 3060 with 12 GB of memory. The CPU path of just one module was 45 to 50 s vs. 1 s on the GPU. This is a significant difference. I think the difference is in the high iterations, since I don't get tiles reported from -d tiling. Because of this, I've set darktable to use Very Fast GPU, so it avoids the CPU path, even when it could be an advantage to use the CPU in some modules or processing (preview, thumbnail vs. full).

@jenshannoschwalm (Collaborator, Author) commented

I tried to read the code and could not determine how this decision is being made. Is it mainly just the tiles?

Not only. On your system (on mine too) there is no chance of tiling in the CL path. If you want to check, you could either set a fixed headroom of 11 GB, leaving 1 GB to be used, or maybe use the "notebook" option, or ...

Let me explain in other words, using an example system with low available CL memory, lots of system RAM and usage of the laplacian highlights reconstruction, looking only at the CL code path.

The decision whether we have to tile depends on a) the required overlap and b) the number & size of the internal buffers required by the algorithm.

If tiling is required because all the data doesn't fit into graphics memory, we might end up in a situation where a single internal buffer of the algorithm (defined by the tiling factor and the size of the ROI) can hold only about 9 million locations. The processed data might require an overlap of 1000 and we use square tiles. In this case the tiles would be 3000x3000, but the effective data processed would only be 1/9 of that (the centre 1000x1000 of the square). So the GPU would take ~9 times as long as it would if no tiling were necessary.

This ratio - how much data we have to process compared to the untiled case - is what we are interested in. Thus we don't look at the number of tiles but at the "amount of data".

In the above example this means: if your GPU is at least 9 times faster than the CPU, there is no reason to fall back to the CPU. If it were just 3 times faster, certainly yes.
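
A tiny numeric sketch of that ratio, using the tile size and overlap from the example above:

```c
/* Workload ratio from the example above: a 3000x3000 tile with an overlap
   of 1000 on each side only delivers its central 1000x1000 pixels. */
#include <stdio.h>

int main(void)
{
  const double tile = 3000.0, overlap = 1000.0;
  const double effective = tile - 2.0 * overlap;                /* 1000 */
  const double ratio = (tile * tile) / (effective * effective); /* ~9x  */
  printf("tiled GPU path processes ~%.0fx the data of the untiled path\n", ratio);
  return 0;
}
```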

The tricky part of the PR is calculating the ratio of required processing for both the CPU and GPU code paths (the new functions in tiling.c).

On your system you will never end up in a CPU-advantage situation. I have 64 GB of system RAM and 8 GB on my graphics card, which is ~9 times faster than the CPU, so it's only very rare here too.
But if I restrict OpenCL to 2 GB or 4 GB of RAM, yes, I certainly see a few CPU fallbacks, likely in the export pipe.

@gi-man (Contributor) commented Sep 26, 2022

Thanks for the information. I did a test to compare the different options. I used a 4640 x 3472 image. It had 2 D&S and GL set to 20 iterations at 2048 px diameter. I exported it in 4 different scenarios: Very Fast GPU, CPU only (turning off OpenCL), headroom set to 11000 (so 1 GB of VRAM), and then notebook. On Fedora 36, current master, nvidia 515.65.

Here are the results:
https://pastebin.com/5cKT1K9n - GPU - total 7.7s
https://pastebin.com/wA7dLZ8i - CPU - total 61s
https://pastebin.com/Sp07WN92 - headroom to 11000 - total 25.1s
https://pastebin.com/zMFBg3SB - notebook - total 27.1s

Some observations:

  • The difference between GPU and CPU is significant and there is no tiling (total of 7.7s vs 61s). HL and one of the D&S instances had the longest time.
  • Even with tiling, GPU notebook was faster than CPU only. For example the HL: 5s vs 9s vs 41s. Filmic RGB was 0.015s vs 0.118s vs 0.277s.

It would be nice if someone could do a similar analysis but with a 10xx series card vs. a 30xx. I think the number of processing units in the card could yield different results. Larger image sizes could also yield different results.

@jenshannoschwalm (Collaborator, Author) commented

I think I didn't make the idea clear enough. If you benchmark, you should look at the module in question.

Let me try to explain in other words.

  1. The rest of the argument is valid for every module.
  2. If we measure the times for the CPU and GPU code paths on the same data, we have the same workload for processing. So maybe your GPU is 10 times faster for the same workload.
  3. If you now increase the workload for the GPU by a factor of 10, it's fair to assume both paths, CPU and GPU, will take the same time. If your GPU were maybe just 3 times faster, we would only need a 3-fold workload for the same timing.
  4. So we need to know
    a) the workload ratio (GPU vs. CPU) for a module, depending on the memory requirements from the tiling factor, overhead and overlap. In a tiling situation all available memory will be used per tile, so locality and CPU/GPU cache hit rates will be approximately the same. This means we can get a good estimate of the workload by calculating the amount of RAM used, which is very roughly: number of tiles * available memory.
    b) a good (= significant) measurement of CPU vs. GPU performance.
  5. For every module in the pipeline we calculate the workload ratio at runtime, and if gpu-workload / gpu-advantage > cpu-workload we know the CPU will be faster.

As you said in your first comment, your GPU is 50 times faster than the CPU; that would be an advantage of 50. (With that, I don't know of any module in dt that would warrant a fallback to the CPU.) Many people have Intel graphics or older cards. They might only have an advantage of 3, and with heavy tiling they would certainly get better results from the CPU path.
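
Plugging those numbers into the rule from point 5 above (a tiling-induced workload ratio of ~9 from the earlier example, and advantages of 50 and 3); all values are illustrative:

```c
/* Decision rule from point 5, checked with the numbers used in this thread:
   a tiled CL workload of ~9x the untiled data, device advantages of 50 and 3. */
#include <stdio.h>

int main(void)
{
  const double cpu_workload = 1.0;   /* normalised: untiled amount of data    */
  const double gpu_workload = 9.0;   /* tiled CL path processes ~9x the data  */
  const double advantages[] = { 50.0, 3.0 };

  for(int i = 0; i < 2; i++)
  {
    const double adv = advantages[i];
    const int cpu_wins = (gpu_workload / adv) > cpu_workload;
    printf("advantage %.0f: %s\n", adv, cpu_wins ? "fall back to CPU" : "stay on GPU");
  }
  return 0;
}
```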

@TurboGit (Member) left a comment
Works for me, let's merge this now to get more field testing, as this really depends on the user's environment.
