Only a single build request is considered for scheduling if nextWorker returns None #6251

Open
arbor-bdoyle opened this issue Oct 5, 2021 · 5 comments


@arbor-bdoyle

We're using a custom nextWorker function to assign each build to a unique hardware resource that its tests run against. We have several pools of these resources with different configurations, and the pool to pull from is selected based on the build request properties. Since a given resource can only run one build's worth of tests at a time, a build request is sometimes submitted when no resource in its pool is available. In that case, nextWorker returns None to kick the request back into the queue for consideration later.
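For context, our nextWorker follows the standard (builder, workers, buildrequest) signature and looks roughly like the sketch below; the pool helpers are placeholders for our internal resource bookkeeping, not buildbot APIs.

# Illustrative sketch only: pool_for_request, try_reserve and worker_for_resource
# stand in for our internal pool bookkeeping and are not buildbot APIs.
def next_worker(builder, workers, buildrequest):
    pool = pool_for_request(buildrequest)       # choose a pool from the request properties
    resource = pool.try_reserve()               # None when every resource in the pool is busy
    if resource is None:
        return None                             # no resource -> leave the request in the queue
    return worker_for_resource(workers, resource)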

From watching buildbot's log output, it appears that only the first build request in the queue is considered each time buildbot tries to distribute builds. This means that even if there are requests in the queue which use a different pool that has resources available, they are stuck waiting until the pool required by the request at the head of the queue has a resource free.

After some digging, I believe this is caused by the following code:

# 2. pick a worker
worker = yield self._popNextWorker(breq)
if not worker:
    break

It seems like the intent here was to skip the request if no worker can build it, but instead the code bails out of the loop entirely, so no further requests are checked. Further down in this method, all the workers are checked again in order to run canStartBuild, and that chunk appears to do the correct thing (continuing around the outer loop) if no worker can be selected. Should the initial check just be eliminated and combined into the loop below?
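To illustrate the direction I have in mind, here is untested pseudocode (the real names and control flow in buildrequestdistributor.py differ, and this assumes a later call to _getNextUnclaimedBuildRequest would actually hand back a different request):

# Untested sketch only -- the idea is "skip this request and keep looking"
# rather than "stop scheduling for this builder entirely".
while True:
    # 1. pick a build
    breq = yield self._getNextUnclaimedBuildRequest()
    if not breq:
        break
    # 2. pick a worker
    worker = yield self._popNextWorker(breq)
    if not worker:
        continue  # no worker for this particular request; consider the next one
    # ... claim the request and start the build as before ...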

It also seems like build requests may be dropped entirely if canStartBuild fails for every worker in this later loop, since the request was already removed on line 190. I haven't verified this theory, and I'm not sure whether the orphaned request is added back by some process elsewhere.

(Side note on my particular use case: locks on the resources seem like the better solution, but I think there's a separate issue, a race in which latent Docker workers check locks and then start up, which jams all the workers in the lock-acquisition phase. The result is effectively the same kind of jam, because all the workers end up waiting for the same resource. I haven't had time to investigate this race yet; hence nextWorker as a workaround.)
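For reference, the lock-based setup I would prefer looks roughly like this: one counting master lock per hardware pool, with maxCount set to the number of devices in the pool (pool, worker, and factory names here are made up):

from buildbot.plugins import util

# One counting lock per hardware pool; maxCount is the number of devices in it.
pool_a_lock = util.MasterLock("hw-pool-a", maxCount=4)

c['builders'].append(util.BuilderConfig(
    name="tests-pool-a",
    workernames=["docker-worker-1", "docker-worker-2"],
    factory=test_factory,   # BuildFactory defined elsewhere in the config
    locks=[pool_a_lock.access('counting')],
))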

@tardyp
Member

tardyp commented Oct 5, 2021

I agree that the code looks weird, with a loop that is only meant to be run once.

I can see that even if we continued instead, _getNextUnclaimedBuildRequest() would anyway return the same build request, so we would retry the same request forever, unless nextBuild were smarter than it needs to be and accounted for the fact that the build was just chosen and couldn't be assigned a worker.

Indeed, I think the whole API design is flawed in its failure modes.

What would make more sense at first sight is a prioritizeBuild hook that would sort the build requests in whatever order the project desires; the while loop would then consider those builds one by one.

If prioritizeBuild is not set, then nextBuild would be called on the remaining build requests.
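Something along these lines, purely hypothetical, nothing with this name exists in buildbot today:

# Hypothetical hook, not an existing buildbot API. The distributor would call it
# once with all pending build requests and then walk the returned list in order.
def prioritizeBuild(builder, buildrequests):
    # priority_key is a placeholder for whatever ordering the project wants
    return sorted(buildrequests, key=priority_key, reverse=True)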

Would you like to try your hand at building such a change?

Pierre

@arbor-bdoyle
Author

I'm pretty new to buildbot internals so I might need a while to figure out how all these parts fit together, but I'm willing to work on it.

For my use case, I think I would want to write prioritizeBuilds so that each successive popped build ends up needing a different worker. It seems fairly easy to write a function that accidentally stalls the queue when the worker selected for each request is only implicit in the sort order rather than chosen explicitly per request: if the function fails to account for some property of the request and it ends up assigned a different worker than intended, debugging would be unpleasant.

I think the existing nextBuild hook is also sufficient for sorting the requests this way; it could maintain a consistent ordering across calls, if desired, by being a class implementing __call__ or similar.
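Something like this, assuming the standard (builder, requests) signature for nextBuild; request_key is a placeholder for whatever ordering I would want to keep stable across calls:

# Sketch of a stateful nextBuild callable; request_key is a placeholder.
class SortedNextBuild:
    def __init__(self, key):
        self.key = key

    def __call__(self, builder, requests):
        if not requests:
            return None
        return min(requests, key=self.key)

# e.g. util.BuilderConfig(..., nextBuild=SortedNextBuild(key=request_key))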

The existing nextWorker API has the advantage of considering only a single request per call, so whatever decision it makes has no risk of invalidating the decision for other requests. This is slightly less flexible, but I think overall more user-friendly. I believe the current workflow can be fixed by making the improvement you suggest to popNextBuild's method of iterating the requests, and then only ultimately removing the ones that could be given to a worker.

Unless you feel strongly that the API should be replaced, I'd be willing to try fixing the use of nextWorker, and if that doesn't work out then fall back to replacing it. Does this seem like a reasonable course of action?

@arbor-bdoyle
Author

One potential issue with my imagined fix: I cannot maintain the constraint described in the docstring of calling nextWorker only once for each worker; it would have to be once per worker per build request. Is this constraint actually necessary? It seems like it made more sense prior to being able to look at the build request (54300b5).

@arbor-bdoyle
Author

I've opened a draft PR with a patch that fixed the issue on my buildbot instance without changing the API. Please take a look and let me know if you like this approach!

@p12tic
Member

p12tic commented Oct 21, 2021

I cannot maintain the constraint described in the docstring of calling nextWorker only once for each worker; it would have to be once per worker per build request. Is this constraint actually necessary?

We don't care about docstrings; the public interface is the documentation. I don't see anything there related to the number of times the function is called, so in that regard the constraint is not important.

It seems like it made more sense prior to being able to look at the build request (54300b5).

Agreed.
