Increase latency overhead in stealing cost calculation #5390
In my latest dive into the stealing code I investigated some of the logs and saw a lot of ridiculous steal requests: task durations of ~10ms with occupancy differences between thief and victim of ~100ms.
Not only do we not care about such a small difference, but the act of stealing is guaranteed to be more expensive than letting things be.
Stealing requires at least three network bounces (steal-request, steal-confirm, compute-task), which include code serialization if successful. It is almost impossible to do this in the currently hard-coded 1ms. The 100ms I propose is likely too conservative, but I don't think this is necessarily a bad thing for stealing. I don't have time for large-scale tests, but I am very confident that this should be much higher than it is right now. Thoughts, concerns?
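
To make the effect of the latency term concrete, here is a minimal sketch of how a fixed latency overhead can dominate the stealing cost for short tasks. It is an illustration only, not the scheduler's actual code: the function names, the `max_ratio` threshold, and the bandwidth figure are all assumptions made for this example.

```python
import math


def steal_cost_ratio(compute_time, nbytes, bandwidth, latency):
    """Estimated cost of moving a task, relative to its compute time.

    Hypothetical model: moving a task costs the time to transfer its
    dependencies plus a fixed latency overhead (all times in seconds).
    """
    transfer_time = nbytes / bandwidth + latency
    return transfer_time / compute_time


def is_worth_stealing(compute_time, nbytes, bandwidth, latency, max_ratio=2.0):
    """Illustrative rule: steal only if moving is cheap relative to computing.

    `max_ratio` is an assumed threshold for this sketch, not a real setting.
    """
    return steal_cost_ratio(compute_time, nbytes, bandwidth, latency) <= max_ratio


# A ~10ms task with no dependencies to transfer, on an assumed 100 MB/s link.
task = dict(compute_time=0.010, nbytes=0, bandwidth=100e6)

ratio_1ms = steal_cost_ratio(latency=0.001, **task)    # old 1ms overhead
ratio_100ms = steal_cost_ratio(latency=0.100, **task)  # proposed 100ms overhead

print(f"1ms latency:   ratio={ratio_1ms:.2f}, steal={is_worth_stealing(latency=0.001, **task)}")
print(f"100ms latency: ratio={ratio_100ms:.2f}, steal={is_worth_stealing(latency=0.100, **task)}")
```

Under these assumptions, the 1ms overhead makes a 10ms task look cheap to move, while the 100ms overhead makes the same task look roughly ten times more expensive to move than to run, so it would stay where it is.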
cc @gjoseph92 @crusaderky