Only a single build request is considered for scheduling if nextWorker returns None #6251

Open
arbor-bdoyle opened this issue Oct 5, 2021 · 5 comments


@arbor-bdoyle

We're using a custom nextWorker function to assign each build to a unique hardware resource that its tests run against. We have several pools of these resources with different configurations, and the pool to pull from is selected based on the build request properties. Since a given resource can only run one build's worth of tests at a time, a build request is sometimes submitted when no resource in its pool is available. In that case, nextWorker returns None to kick the request back into the queue for consideration later.
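For context, our nextWorker follows the standard (builder, workers, buildrequest) signature and looks roughly like the sketch below; the pool helpers are placeholders for our internal resource bookkeeping, not buildbot APIs.

# Illustrative sketch only: pool_for_request, try_reserve and worker_for_resource
# stand in for our internal pool bookkeeping and are not buildbot APIs.
def next_worker(builder, workers, buildrequest):
    pool = pool_for_request(buildrequest)       # choose a pool from the request properties
    resource = pool.try_reserve()               # None when every resource in the pool is busy
    if resource is None:
        return None                             # no resource -> leave the request in the queue
    return worker_for_resource(workers, resource)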

From watching buildbot's log output, it appears that only the first build request in the queue is considered each time buildbot tries to distribute builds. This means that even if there are requests in the queue which use a different pool that has resources available, they are stuck waiting until the pool required by the request at the head of the queue has a resource free.

After some digging, I believe this is caused by the following code:

# 2. pick a worker
worker = yield self._popNextWorker(breq)
if not worker:
    break

It seems like the intent here was to skip the request if no worker can build it, but instead the code bails out of the loop entirely, so no further requests are checked. Further down in this method, all the workers are checked again in order to run canStartBuild, and that chunk appears to do the correct thing (continuing around the outer loop) if no worker can be selected. Should the initial check just be eliminated and combined into the loop below?
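To illustrate the direction I have in mind, here is untested pseudocode (the real names and control flow in buildrequestdistributor.py differ, and this assumes a later call to _getNextUnclaimedBuildRequest would actually hand back a different request):

# Untested sketch only -- the idea is "skip this request and keep looking"
# rather than "stop scheduling for this builder entirely".
while True:
    # 1. pick a build
    breq = yield self._getNextUnclaimedBuildRequest()
    if not breq:
        break
    # 2. pick a worker
    worker = yield self._popNextWorker(breq)
    if not worker:
        continue  # no worker for this particular request; consider the next one
    # ... claim the request and start the build as before ...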

It also seems like build requests may be dropped entirely if canStartBuild fails for every worker in this later loop, since the request was already removed on line 190. I haven't verified this theory, and I'm not sure whether the orphaned request is added back by some process elsewhere.

(Side note on my particular use case: locks on the resources seem like the better solution, but I think there's a separate issue, a race in which latent Docker workers check locks and then start up, which jams all the workers in the lock-acquisition phase. The result is effectively the same kind of jam, because all the workers end up waiting for the same resource. I haven't had time to investigate this race yet; hence nextWorker as a workaround.)
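For reference, the lock-based setup I would prefer looks roughly like this: one counting master lock per hardware pool, with maxCount set to the number of devices in the pool (pool, worker, and factory names here are made up):

from buildbot.plugins import util

# One counting lock per hardware pool; maxCount is the number of devices in it.
pool_a_lock = util.MasterLock("hw-pool-a", maxCount=4)

c['builders'].append(util.BuilderConfig(
    name="tests-pool-a",
    workernames=["docker-worker-1", "docker-worker-2"],
    factory=test_factory,   # BuildFactory defined elsewhere in the config
    locks=[pool_a_lock.access('counting')],
))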

@tardyp
Member

tardyp commented Oct 5, 2021

I agree that the code looks weird, with a loop that is only meant to be run once.

I can see that even if we continued instead, _getNextUnclaimedBuildRequest() would anyway return the same build request, so we would retry the same request forever, unless nextBuild were smarter than it needs to be and accounted for the fact that the build was just chosen and couldn't be assigned a worker.

Indeed, I think the whole API design is flawed in its failure modes.

What would make more sense at first sight is a prioritizeBuild hook that would sort the build requests in whatever order the project desires; the while loop would then consider those builds one by one.

If prioritizeBuild is not set, then nextBuild would be called on the remaining build requests.
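Something along these lines, purely hypothetical, nothing with this name exists in buildbot today:

# Hypothetical hook, not an existing buildbot API. The distributor would call it
# once with all pending build requests and then walk the returned list in order.
def prioritizeBuild(builder, buildrequests):
    # priority_key is a placeholder for whatever ordering the project wants
    return sorted(buildrequests, key=priority_key, reverse=True)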

Would you like to try your hand at building such a change?

Pierre

@arbor-bdoyle
Author

I'm pretty new to buildbot internals so I might need a while to figure out how all these parts fit together, but I'm willing to work on it.

For my use case, I think I would want to write prioritizeBuilds so that each successive popped build ends up needing a different worker. It seems fairly easy to write a function that accidentally stalls the queue when the worker selected for each request is only implicit in the sort order rather than chosen explicitly per request: if the function fails to account for some property of the request and it ends up assigned a different worker than intended, debugging would be unpleasant.

I think the existing nextBuild hook is also sufficient for sorting the requests this way; it could maintain a consistent ordering across calls, if desired, by being a class implementing __call__ or similar.
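Something like this, assuming the standard (builder, requests) signature for nextBuild; request_key is a placeholder for whatever ordering I would want to keep stable across calls:

# Sketch of a stateful nextBuild callable; request_key is a placeholder.
class SortedNextBuild:
    def __init__(self, key):
        self.key = key

    def __call__(self, builder, requests):
        if not requests:
            return None
        return min(requests, key=self.key)

# e.g. util.BuilderConfig(..., nextBuild=SortedNextBuild(key=request_key))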

The existing nextWorker API has the advantage of considering only a single request per call, so whatever decision it makes has no risk of invalidating the decision for other requests. This is slightly less flexible, but I think overall more user-friendly. I believe the current workflow can be fixed by making the improvement you suggest to popNextBuild's method of iterating the requests, and then only ultimately removing the ones that could be given to a worker.

Unless you feel strongly that the API should be replaced, I'd be willing to try fixing the use of nextWorker, and if that doesn't work out then fall back to replacing it. Does this seem like a reasonable course of action?

@arbor-bdoyle
Author

One potential issue with my imagined fix: I cannot maintain the constraint described in the docstring of calling nextWorker only once for each worker; it would have to be once per worker per build request. Is this constraint actually necessary? It seems like it made more sense prior to being able to look at the build request (54300b5).

@arbor-bdoyle
Author

I've opened a draft PR with a patch that fixed the issue on my buildbot instance without changing the API. Please take a look and let me know if you like this approach!

@p12tic
Member

p12tic commented Oct 21, 2021

I cannot maintain the constraint described in the docstring of calling nextWorker only once for each worker; it would have to be once per worker per build request. Is this constraint actually necessary?

We don't care about docstrings; the public interface is the documentation. I don't see anything there related to the number of times the function is called, so in that regard the constraint is not important.

It seems like it made more sense prior to being able to look at the build request (54300b5).

Agreed.
