
[REQ] Distribute work more evenly #96

Closed
lczech opened this issue Jan 27, 2023 · 2 comments
Labels
enhancement New feature or request


lczech commented Jan 27, 2023

Describe the new feature

Currently, the documentation and the code tell me that the last block of a parallel loop "may be a bit longer". In practice, this can be quite punishing and is less than optimal.

Simple example: a loop with 100 iterations across 15 threads results in blocks of size 6, except for the last one, which gets 16 elements, more than double the size of the other blocks. In addition, since it is submitted last, it is likely to also start last, adding to the wait time.

This seems to be the result of the computation rounding down (not explicitly in the code, but it is the effect of how the block sizes are computed). I'd suggest rounding up instead, so that the above example results in blocks of size 7, with the last one of size 2. That would be a more even distribution of work.

@lczech lczech added the enhancement New feature or request label Jan 27, 2023
@bshoshany
Owner

Hi @lczech, thanks for opening this issue! You raise a good point. But I don't think rounding up works every time. Consider for example 100 iterations parallelized across 14 threads. 100/14=7.14, so with rounding down, this will result in 13 threads of 7 iterations plus one thread of 9 iterations. But if I round up, I will have 13 threads of 8 iterations, which is 104, so thread 13 will actually only have 4 iterations and thread 14 will have no iterations at all, and that is definitely not optimal.

I think the optimal way to do this would be not to round down or up universally, but rather to round to the nearest integer. That way, 100/15 = 6.66 would round up to 7 and 100/14 = 7.14 would round down to 7, and both cases seem optimal.

Closing this issue since I will implement this in the next release. (But first I'll need to convince myself mathematically that this actually works no matter the number of iterations and threads; if you want to formulate a rigorous proof, that would be extremely helpful!)

Thanks for your contribution!

@lczech
Author

lczech commented Mar 8, 2023

In fact, a better solution would be to round down and distribute the remaining elements individually across the threads, ideally to the first ones, so that the blocks that are one element larger than the others run first, minimizing overall wait time.

Example: 22 elements, 4 threads. Threads 1 and 2 get 6 elements each, threads 3 and 4 get 5. That seems optimal. I've implemented that in my threading code now.
