Only a single build request is considered for scheduling if nextWorker returns None #6251
I agree that the code looks weird, with a loop that is only meant to run once. I can see that even if we did continue, _getNextUnclaimedBuildRequest() would return the same buildrequest again, so we would retry the same buildrequest forever, unless nextBuild is smarter than it needs to be and accounts for the fact that the build was just chosen and couldn't be assigned a worker. I think the whole API design is indeed flawed in its failure modes.

What would make more sense at first sight is a prioritizeBuild hook that sorts the buildrequests into whatever order the project desires, and then the while loop would consider those builds one by one. If no prioritizeBuild is set, then nextBuild would be called on the remaining buildrequests. Would you like to try your hand at building such a change?

Pierre
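The prioritizeBuild idea above can be sketched roughly as follows. This is a hypothetical illustration, not buildbot's actual API: the function and parameter names (`distribute`, `prioritize_build`, `find_worker`) are invented for the example, and the real distributor does considerably more bookkeeping.

```python
# Hypothetical sketch of the proposed prioritizeBuild hook: sort all
# pending build requests once, then try each one in turn instead of
# retrying the same head-of-queue request forever. Names are
# illustrative, not part of buildbot's real interface.

def distribute(build_requests, prioritize_build, find_worker):
    """Try requests in priority order; skip any with no free worker."""
    started = []
    for request in prioritize_build(build_requests):
        worker = find_worker(request)
        if worker is None:
            continue  # no worker for this request; try the next one
        started.append((request, worker))
    return started
```

The key property is the `continue`: a request that cannot be assigned a worker is skipped rather than blocking every request behind it.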
I'm pretty new to buildbot internals, so I might need a while to figure out how all these parts fit together, but I'm willing to work on it. For my use case, I think I would want to write a I think the existing The existing Unless you feel strongly that the API should be replaced, I'd be willing to try fixing the use of
One potential issue with my imagined fix: I cannot maintain the constraint described in the docstring of only calling
I've opened a draft PR with a patch that fixes the issue on my buildbot instance without changing the API. Please take a look and let me know if you like this approach!
We don't care about docstrings; the public interface is the documentation. I don't see anything in it related to the number of times the function is called, so in that regard the constraint is not important.
Agreed.
We're utilizing a custom nextWorker function in order to assign each build to a unique hardware resource that tests are run against. We have several pools of these with different configurations, and the pool to pull a resource from is selected based on the build request properties. Since a given resource can only be utilized by a single build's worth of tests at a time, sometimes a build request will be submitted and not have one available. In this case, nextWorker returns None in order to kick the request back into the queue for consideration later.
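A simplified sketch of that kind of pool-based `nextWorker` hook, to make the failure mode concrete. The pool names, the `in_use` set, and the function signature here are assumptions for illustration; the real hook receives buildbot builder and worker objects, not strings.

```python
# Illustrative model (not the reporter's actual code) of a nextWorker-style
# hook that assigns each build a resource from a hardware pool chosen by
# request properties, returning None when the pool is exhausted so the
# request goes back into the queue for later consideration.

POOLS = {
    "gpu": {"rig-1", "rig-2"},
    "fpga": {"board-1"},
}
in_use = set()  # resources currently claimed by running builds

def next_worker(pool_name):
    """Return a free resource from the requested pool, or None."""
    free = POOLS.get(pool_name, set()) - in_use
    if not free:
        return None  # pool exhausted: kick the request back into the queue
    resource = sorted(free)[0]
    in_use.add(resource)
    return resource
```

With this model, a request for an exhausted pool returns None even though requests for other pools could still be satisfied, which is exactly the situation where the distributor's early bail-out hurts.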
I have observed via watching buildbot's log output that it only seems to consider the first build request in the queue each time it tries to distribute builds. This means that even if there are requests in the queue which use a different pool with resources available, they will be stuck waiting until the pool required by the request at the head of the queue has a resource available.
After some digging, I believe this is caused by the following code:
buildbot/master/buildbot/process/buildrequestdistributor.py, lines 184 to 187 at 1cfba03
It seems like the intent here was to skip the request if no worker can build it, but instead it bails out without checking any of the remaining requests. Further down in this method it iterates over all the workers again in order to run canStartBuild, and that chunk appears to do the correct thing (continuing around the outer loop) when no worker can be selected. Should the initial check just be eliminated and folded into the loop below?
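The difference between the observed and the intended behavior boils down to `break` versus `continue` in the distributor's scheduling pass. This is a minimal stand-in for the real loop, under the assumption that `can_assign` models "nextWorker found a worker"; it is not the actual distributor code.

```python
# Minimal model of the reported bug: when no worker can take the request
# at the head of the queue, the distributor should move on to the next
# request (continue), not abandon the whole scheduling pass (break).

def distribute_buggy(queue, can_assign):
    started = []
    for request in queue:
        if not can_assign(request):
            break  # bails out: later requests are never considered
        started.append(request)
    return started

def distribute_fixed(queue, can_assign):
    started = []
    for request in queue:
        if not can_assign(request):
            continue  # skip just this request, keep scheduling others
        started.append(request)
    return started
```

With a queue where only the second request has a free worker, the buggy variant starts nothing at all, while the fixed variant starts the second request.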
It also seems like build requests may be dropped entirely if canStartBuild fails for all workers in this later loop, since the request was already removed on line 190. I haven't verified this theory, and I'm not sure whether the orphaned request is added back by some other process elsewhere.
(Side note on my particular use case: It seems like locks on the resources would be the better solution, but I think there's a separate issue involving a race with latent docker workers checking locks and then starting up which causes all the workers to get jammed in the acquiring locks phase. This results in effectively the same jamming because all the workers are consumed waiting for the same resource. I haven't had time to investigate this race yet. Hence, nextWorker as a workaround.)