
[REQ] Distribute work more evenly #96

Closed
lczech opened this issue Jan 27, 2023 · 2 comments
Labels
enhancement New feature or request


lczech commented Jan 27, 2023

Describe the new feature

Currently, the documentation and the code tell me that the last block of a parallel loop "may be a bit longer". In practice, this can be quite punishing and is less than optimal.

Simple example: a loop with 100 iterations across 15 threads results in blocks of size 6, except for the last one, which gets 16 elements, more than double the size of the other blocks. In addition, since it is submitted last, it is likely to also start last, adding to the wait time.

This seems to be the result of the computation rounding down (not explicitly in the code, but it is the effect of how the block sizes are computed). I'd suggest rounding up instead, so that the above example results in blocks of size 7, with the last one of size 2. That would be a more even distribution of work.

@lczech lczech added the enhancement New feature or request label Jan 27, 2023
@bshoshany
Owner

Hi @lczech, thanks for opening this issue! You raise a good point. But I don't think rounding up works every time. Consider for example 100 iterations parallelized across 14 threads. 100/14=7.14, so with rounding down, this will result in 13 threads of 7 iterations plus one thread of 9 iterations. But if I round up, I will have 13 threads of 8 iterations, which is 104, so thread 13 will actually only have 4 iterations and thread 14 will have no iterations at all, and that is definitely not optimal.

I think the optimal way to do this would be not to round down or up universally, but rather to round to the nearest integer. That way, 100/15 = 6.66 would round up to 7 and 100/14 = 7.14 would round down to 7, and both cases seem optimal.

Closing this issue since I will implement this in the next release. (But first I'll need to convince myself mathematically that this actually works no matter the number of iterations and threads; if you want to formulate a rigorous proof, that would be extremely helpful!)

Thanks for your contribution!

@lczech
Author

lczech commented Mar 8, 2023

In fact, a better solution would be to round down and distribute the remaining elements individually across the threads, ideally to the first ones, so that the blocks that are one element larger than the others run first, minimizing overall wait time.

Example: 22 elements, 4 threads. Threads 1 and 2 get 6 elements each, threads 3 and 4 get 5. That seems optimal. I've implemented that in my threading code now.
