
BuildChooser: refactor (slave,breq) logic into strategy object #615

Merged
merged 8 commits into buildbot:master on Feb 20, 2013

Conversation

@jaredgrubb (Member) commented Jan 23, 2013

  • moved the following logic from Builder to BuildChooser
    • slave choice
    • build choice
    • build merging
  • moved the following logic from Builder to BuildRequestDistributor
    • managing the brid reclamation
  • adjusted some of the logic on how locks are saved in member variables.
    This makes some of the code more uniform and reduces some redundancy.
  • moved unit tests from process_builder to process_BRD as appropriate; added a few more unit tests to cover new functionality

The slave-choice logic now checks whether slaves can grab build locks. Slaves that can are preferred to slaves that can't.

Also added a customization point that lets slaves be chosen based on particular builds; for example, a patch is coming that will add a ForceScheduler parameter to force a build onto a particular buildslave.

def canStartWithSlavebuilder(self, slavebuilder):
    locks = [(self.botmaster.getLockFromLockAccess(access), access)
             for access in self.config.locks ]
    return Build.canStartWithSlavebuilder(locks, slavebuilder)

@djmitche (Member) commented Jan 24, 2013

Build isn't in scope here.

@tomprince (Member) commented Jan 24, 2013

I've just had a chance to quickly skim this, so far.

  1. The lock stuff seems like it is simple enough/separate enough that it could be merged on its own, making this easier to review.
  2. I can imagine that people might want at some point to interleave scheduling builds between builders.
  3. The logic here seems fairly involved, at a glance. I wonder if it is possible to simplify it, and some of the complication is merely due to the original organization of the code. (I don't have any specific suggestions here though, so feel free to disregard)
             for access in self.config.locks ]
    return Build.canStartWithSlavebuilder(locks, slavebuilder)

def canStartBuild(self, slavebuilder, breq):

@djmitche (Member) commented Jan 24, 2013

I'd rather have this method always return a Deferred, using maybeDeferred when calling the user function. It takes a good bit of following back chains of function calls to find the yield self.canStartBuild(..) in botmaster.py. And in general, while I like having user-supplied functions able to return either a Deferred or a normal value, I'd prefer internal functions to always do one or the other.
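For illustration, the suggested pattern looks roughly like this sketch; the hook location (self.config.canStartBuild) is an assumption for the example, not the patch's actual attribute:

from twisted.internet import defer

def canStartBuild(self, slavebuilder, breq):
    # Always return a Deferred: wrap the user-supplied hook in
    # maybeDeferred, so callers need not care whether the hook
    # returned a Deferred or a plain value.
    if self.config.canStartBuild:
        return defer.maybeDeferred(self.config.canStartBuild,
                                   self, slavebuilder, breq)
    return defer.succeed(True)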

@jaredgrubb (Author, Member) commented Jan 24, 2013

Yeh, that makes a lot of sense. Will review the other parts of the patch and adjust for that.

@djmitche reviewed master/buildbot/process/botmaster.py (outdated)
"""Convert a lock-access object into an actual Lock instance.
@param access: a locks.LockAccess instance
@return: a locks.RealMasterLock or locks.RealSlaveLock instance
"""

@djmitche (Member) commented Jan 24, 2013

No more docstrings, please! They tend to be interpreted as APIs. Comments are OK, with no need for the formal parameter-description style.

@jaredgrubb (Author, Member) commented Jan 24, 2013

  1. Yeh, the lock stuff could be split out. Let me do that.
  2. I tried thinking about the concurrency stuff we briefly discussed, but this patch was hard enough that I thought it was more important to do the refactor work and keep parity ... and hopefully be able to approach that problem from an easier perspective.
  3. This was a tricky patch, and there are some parts of it that I'm not fully happy with. Input is definitely welcome. As a "mental map" to looking at this patch:
    • start with Builder and BuildRequestDistributor. I am happy with the results here, as I think they are fairly simple now.
    • then look at the BuildChooserBase ... the goal of this was to just provide a simple entry point to what needs to be done
    • then look at the BasicBuildChooser. This is where it gets a bit difficult to follow... I don't like juggling the brdicts and BuildRequests, but I was able to get rid of the need for weakrefs. I'll be curious what input you guys can offer here!
@jaredgrubb reviewed master/buildbot/process/build.py (outdated)
-buildmaster = self.builder.botmaster.parent
-props.updateFromProperties(buildmaster.config.properties)
+master = self.builder.botmaster.master
+props.updateFromProperties(master.config.properties)

@jaredgrubb (Author, Member) commented Jan 24, 2013

Regarding this change: I hit a problem in one of the unit tests as I was working, and I couldn't figure out how "self.builder.botmaster.parent" was ever supposed to get set ... everything I could find seemed to suggest that this really was supposed to be 'master', not 'parent'. But maybe someone can confirm that?

@djmitche (Member) commented Jan 27, 2013

Parent is part of the twisted Service class, and is set when the instance is attached to the process tree. We've had problems with it in the past because, when the service hierarchy shuts down, parent gets set to None, and cleanup methods tend to fail.

In this case, the two are interchangeable, but yes, master would be better.
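For context, a tiny standalone sketch of the Twisted behavior described here (plain Twisted, not buildbot code):

from twisted.application import service

botmaster = service.MultiService()
builder = service.Service()

builder.setServiceParent(botmaster)   # attaching sets builder.parent
assert builder.parent is botmaster

builder.disownServiceParent()         # detaching resets parent to None
assert builder.parent is None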

@djmitche (Member) commented Jan 27, 2013

Thanks for the mental map. I'm settling in for a long read now :)

        # until any currently-running invocations are done, so we know that
        # the builder is quiescent at that time.
        return self.maybeStartBuild()
    d.addCallback(flushMaybeStartBuilds)
    return d

@djmitche (Member) commented Jan 27, 2013

Is this functionality implemented differently somewhere else?

@djmitche (Member) commented Jan 27, 2013

At any rate, with this change the stopService method can just be omitted, since it just calls its superclass method.

@jaredgrubb (Author, Member) commented Jan 27, 2013

No, I didn't put this functionality back... I couldn't really figure out what "flush" it was trying to actually perform, so I wasn't sure how to re-implement it ... after the patch was done, I didn't see any "hole" that this functionality was fixing, so I left it out... please be sure to check whether my logic is right!

@djmitche (Member) commented Jan 27, 2013

I think it's not. The problem is that if the service is stopped in the middle of scheduling a build (e.g., while the claimRequests method is executing), then things can be left in disarray when the process exits. This is particularly problematic for integration tests, where we expect the master's stopService to not return until all activity has ceased.

The idea was that calls to maybeStartBuild were serialized with a DeferredLock, so that this call would automatically block until any other pending calls had completed. But it's not actually serialized, so I got that wrong :(
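A minimal sketch of the serialization being described, assuming a hypothetical Builder whose start attempts all funnel through one DeferredLock (as noted above, the code at the time was not actually serialized this way):

from twisted.application import service
from twisted.internet import defer

class Builder(service.Service):

    def __init__(self):
        # Serialize every start attempt through one lock.
        self._start_lock = defer.DeferredLock()

    def maybeStartBuild(self):
        return self._start_lock.run(self._maybeStartBuild)

    def _maybeStartBuild(self):
        pass  # choose a slave and requests, claim them, start the build

    def stopService(self):
        service.Service.stopService(self)
        # Acquiring the lock here blocks until any in-flight start
        # attempt has completed, so shutdown leaves nothing half-done.
        return self._start_lock.run(lambda: None)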

@djmitche reviewed master/buildbot/process/builder.py (outdated)
break
# Note that if the build fails from here on out (e.g.,
# because a slave has failed), it will be handled outside of this
# loop. TODO: test that!

@djmitche (Member) commented Jan 27, 2013

There's no loop left here, anyway. In fact, it looks like what's left of this method only ensures that updateBigStatus is called after _startBuildFor. Could this method be merged with _startBuildFor?

@jaredgrubb (Author, Member) commented Jan 27, 2013

I'll adjust that comment.

It could be added there, but that function has 6 exit points, so I'd have to put it on all of those, and by then it almost seems like it's not a good change: if someone ever adds a seventh, it might get forgotten.

@djmitche (Member) commented Jan 27, 2013

That sounds fair. Perhaps a comment to that effect, then?

@jaredgrubb (Author, Member) commented Jan 27, 2013

Actually, I think this line can be removed. _startBuildFor has a "cleanups" mechanism it uses internally, and I realized I could just use that rather than decorate the 6 exit points... and it turns out it's already in there. So it looks like the call here can be removed, as all 6 exit points already result in this call.
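The mechanism referred to is roughly a stack of cleanup callables run on every exit; an illustrative sketch (the helper _doStartBuild is hypothetical, not the real method body):

from twisted.internet import defer

@defer.inlineCallbacks
def _startBuildFor(self, slavebuilder, buildrequests):
    cleanups = []
    def run_cleanups():
        # run registered cleanups in LIFO order
        while cleanups:
            cleanups.pop()()

    # Registered once, this runs on every exit path, early or late,
    # so the status update cannot be forgotten at a new exit point.
    cleanups.append(lambda: self.updateBigStatus())
    try:
        yield self._doStartBuild(slavebuilder, buildrequests)  # hypothetical helper
    finally:
        run_cleanups()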

@djmitche reviewed master/buildbot/process/botmaster.py (outdated)
    BuildChooserBase.__init__(self, bldr, master)
    self.nextSlave = self.bldr.config.nextSlave
    if not self.nextSlave:
        self.nextSlave = lambda _,slaves: slaves and random.choice(slaves) or None

@djmitche (Member) commented Jan 27, 2013

We can use conditional expressions now: random.choice(slaves) if slaves else None
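That is, the default above would become this sketch of the suggested rewrite (assuming random is imported, as the original lambda already uses it):

self.nextSlave = self.bldr.config.nextSlave
if not self.nextSlave:
    # conditional expression instead of the and/or idiom
    self.nextSlave = lambda _, slaves: random.choice(slaves) if slaves else None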

@djmitche reviewed master/buildbot/process/botmaster.py (outdated)
    if canStart:
        break
    # try a different slave
    recycledSlaves.append(slave)

@djmitche (Member) commented Jan 27, 2013

why not just unpop the slave here?

@jaredgrubb (Author, Member) commented Jan 27, 2013

So that I'd only evaluate each slave once per build choice, until one is chosen or none work.

@djmitche (Member) commented Jan 27, 2013

I see - so this is digging down into the stack, rather than popping and pushing repeatedly. Thanks.

@djmitche reviewed master/buildbot/process/botmaster.py (outdated)
@@ -330,6 +596,8 @@ class BuildRequestDistributor(service.Service):
    methods.
    """

    BuildChooser = BasicBuildChooser

@djmitche (Member) commented Jan 27, 2013

We'll need to make this configurable somehow.

@jaredgrubb (Author, Member) commented Jan 27, 2013

Yeh, this is very hacky. I wanted a place I could configure it, at least for unit tests... but I wasn't sure how much we wanted to expose yet, or how it should be configured.

@djmitche (Member) commented Jan 27, 2013

OK, after a deeper read, I think this isn't quite as aggressive a refactoring as we need. That may mean reworking this patch to be more aggressive. Or it may mean merging this with an explicit warning that the BuildChooser API will be changed/replaced soon.

This is really the heart of the heart of Buildbot - Buildbot is fundamentally a batch scheduling system, with schedulers, build factories, and status orbiting around that. So I'd like to get it right, and solidly so.

Build Scheduling

For background, here's a rough rundown of how builds get scheduled with this patch.

  • Something calls BotMaster.maybeStartBuildsForXxx(..), indicating that some condition has changed and that it may be possible to start a build.
  • These methods all call the BuildRequestDistributor's maybeStartBuildsOn(bldrnames) method, indicating that any of the given builders may be able to start a build.
  • maybeStartBuildsOn adds the builders to a sorted list of builders pending a check, and if necessary kicks off a cycle of the activity loop.

The activity loop loops once for each builder, always selecting the highest-priority builder. For each such builder, it calls BuildRequestDistributor._maybeStartBuildsOnBuilder, which

  • creates and calls a BuildChooser to choose a new slave (actually a SlaveBuilder instance) and a set of build requests to be merged
  • claims the build requests
  • tries to start the build, calling Builder.maybeStartBuild
  • unclaims the build requests if the start failed
  • loops, with a new BuildChooser if the start failed, and otherwise with the same BuildChooser.

The maybeStartBuild method creates new build process and status objects, adds build rows to the DB, talks to the slave, and sets up the Deferred that represents the entire build.
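In rough pseudocode, the loop just described looks like this; createBuildChooser, _claimBuildRequests, and _unclaimBuildRequests are hypothetical names standing in for the real machinery:

from twisted.internet import defer

@defer.inlineCallbacks
def _maybeStartBuildsOnBuilder(self, bldr):
    buildChooser = self.createBuildChooser(bldr)        # hypothetical factory
    while True:
        slavebuilder, breqs = yield buildChooser.chooseNextBuild()
        if not slavebuilder:
            break                                       # nothing startable
        claimed = yield self._claimBuildRequests(breqs)     # hypothetical
        if not claimed:
            # someone else claimed them; conditions changed, start over
            buildChooser = self.createBuildChooser(bldr)
            continue
        started = yield bldr.maybeStartBuild(slavebuilder, breqs)
        if not started:
            yield self._unclaimBuildRequests(breqs)         # hypothetical
            # a failed start invalidates the chooser's cached state
            buildChooser = self.createBuildChooser(bldr)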

Complications

There are a bunch of complicating factors here. One will go away soon. The others can probably be handled cleanly with nice, solid abstractions.

  1. Serialization: because everything is governed by priorities, allocations must be serialized, with the highest-priority combinations tried first. Allocations can take a long time, with lots of iterations of the reactor in between. Conditions can change as that process proceeds, so this requires some careful attention to locking. DeferredLock and @deferredLocked can help with this.
  2. Termination: when a buildmaster stops normally, it needs to finish any activities that are in progress. Failing to do so leaves an unclean reactor in tests, and can cause unnecessary failures in a clustered environment (for example, if build requests are claimed in the database and subsequently not executed). This can probably be fixed with more careful design of the activity loop, and ensuring that builders are not stopped before the BuildRequestDistributor is.
  3. Process Objects: the documented interfaces for nextBuild, nextSlave, etc. assume access to process objects which are implementation-defined. To preserve compatibility with these interfaces, there's lots of conversion from brdicts to BuildRequests, etc. In nine, all of these will change to use data API resources.
  4. Merging build requests: a single "allocation" can grab a bunch of build requests. In a busy master, this can sometimes be hundreds or thousands of build requests. Finding and claiming these requests can be pretty slow. This, too, will be better in nine, where build requests will be merged (well, collapsed) when they are created, so a new build only involves a single build request resource.
  5. Complex build starting: Builder.maybeStartBuild is quite complicated, and the details of state transitions aren't very clear. When, exactly, has the build begun, and how does that affect how errors are reported?

Complete Flexibility

Setting aside the complexity for the moment, the fundamental operation here is this. The algorithm is given a set of slaves and set of build requests. Out of the cross-product of these sets, some tuples are impossible (the builder is not configured for the slave; the slave builder is not available; the required locks are not available; or a user-defined filter denies the combination). Of the remainder, a complex, user-defined priority applies -- preferring certain builders, certain build requests, certain slaves, or some combination of those. The algorithm should select the highest-priority tuple, merge it with any compatible tuples, and start the corresponding build.
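Stated naively, and ignoring the efficiency concern discussed below, that operation is just the following sketch (all names illustrative):

def choose_allocation(slaves, breqs, is_possible, priority):
    # Enumerate the cross-product, discard impossible pairs, and take
    # the pair the user-defined priority ranks highest.
    candidates = [(s, b) for s in slaves for b in breqs if is_possible(s, b)]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: priority(*pair))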

The current implementation forces a hierarchy on the priority. The tuples are first sorted by builder (using a custom priority), then sorted within that builder by nextBuild and nextSlave. This patch fixes the latter, allowing users to specify the order of tuples rather than independently specifying a build and slave, but it does not address flexibility at the builder level.

The motivation behind this priority is efficiency. In a busy Buildbot installation, there may be thousands of outstanding build requests and thousands of slaves. Considering the cross-product would be horribly inefficient in time or space.

This was the motivation behind the BuildRequestDistributor class. The current implementation leaves the hierarchy as-is, but the BRD's API doesn't assume any hierarchy. When I wrote it, I saw two possibilities: either find a way to efficiently implement a fully general distribution algorithm, or implement several versions of the algorithm with different hierarchies. For example, a slave-focused BuildRequestDistributor might start with under-utilized slaves and look for build requests they could satisfy, sorting those requests by a user-defined priority.
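A sketch of what that slave-focused variant might look like (all names illustrative):

def choose_slave_first(idle_slaves, requests_for, utilization, priority):
    # Consider the least-utilized slaves first, and hand each one the
    # highest-priority build request it is able to satisfy.
    for slave in sorted(idle_slaves, key=utilization):
        candidates = requests_for(slave)    # requests this slave could run
        if candidates:
            return slave, max(candidates, key=priority)
    return None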

So, the "more aggressive" refactoring I mentioned in the beginning would mean writing one or more new BuildRequestDistributor implementations, and allowing those to be configured by the user. The existing BRD could stick around, perhaps taking its (awkward) priority configuration parameters (nextSlave, nextBuild, mergeRequests, prioritizeBuilders, ..) directly, and be deprecated with removal slated for 0.9.0, when all of those parameters will change their interfaces anyway.

Sorry this got long. Hopefully it makes sense!

@djmitche (Member) commented Jan 27, 2013

A few less abstract comments:

  • the lifetime of the BuildChooser class is pretty critical to your implementation, but that's not clear on a first or second reading. From what I can tell, it basically represents a stack of potential (slavebuilder, build requests) tuples, and whenever one of those tuples fails to start, the stack will be restarted from scratch
  • the lastResort list seems a bit desperate - those are slaves for which the builder said it cannot start. Why try them? Why try them only once?
  • The "unpop" operation and the underlying "preferredSlaves" lists don't seem to go together - perhaps the list should be renamed?
  • It'd be nice to have build choosers -- and probably the BuildRequestDistributor, too, if I'm being honest with myself -- moved to different source files.
@jaredgrubb (Author, Member) commented Jan 27, 2013

I think your comments are spot on. My intent in this patch was to just gather existing functionality into one spot, hopefully in a way that can be experimented with. The only requirement was a "pop" function that generates (slave, [breq]), which seemed to be the minimal required solution.

Although even this assumes a per-build solution, whereas you're considering something even more global?

Your less-abstract comments:

  • the lifetime of BuildChooser is meant to be short ... it was the easiest way to deal with the cache problem (and avoid the original weakref solution). Maybe a slightly different name would be better... something like BuildGenerator, or something that implies ephemerality. I originally had a "restart()" method, but there didn't seem to be much advantage to that over just creating a new object.
  • the algorithm, for example, prefers slaves that can grab locks to slaves that cannot. However, if no slave can grab the lock, should the build remain pending, or should it be assigned and marked "Waiting For Lock" as it does now? I tried to make it so the 'nextSlave' function would be called at most N times (N = # slaves), in case the 'nextSlave' function was expensive. However, maybe that was a silly thing to do, given that at any time the whole process could get reset and started from scratch ... so the "unpop" and "preferred vs lastResort" was an attempt to avoid calling 'nextSlave' more than N times (see the sketch after this list).
  • I originally had "chosenSlaves" as the name, but that was even worse. Depending on your thoughts on my above comment, maybe this list goes away
  • a new source file does make sense; especially if it's something we want to plug-and-experiment, or if it's something that eventually will be configurable
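A rough illustration of that bookkeeping (an illustrative sketch, not the patch's actual classes):

import random

class SlaveStack(object):
    # Hand out slaves that can grab the build's locks first; keep the
    # rest as a last resort. Each slave is evaluated at most once per
    # build choice.
    def __init__(self, slaves, canGrabLocks):
        self.available = list(slaves)
        self.rejected = []
        self.canGrabLocks = canGrabLocks

    def pop(self):
        while self.available:
            slave = random.choice(self.available)
            self.available.remove(slave)
            if self.canGrabLocks(slave):
                return slave
            self.rejected.append(slave)
        # no slave can take the locks right now; fall back once rather
        # than leaving the build pending forever
        return self.rejected.pop(0) if self.rejected else None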
@djmitche (Member) commented Jan 27, 2013

OK, so I think we're both looking at this as, as you say, gathering existing functionality in one spot. In that case, I'm happy to see something of this form merged, but let's be careful not to give the impression that this is the new "right way" to do things, since I expect we'll want to make additional, substantial changes in the near future (at least in nine).

I'm fine with the lifetime as-is - only that it took a pretty thorough read of the code to figure it out. So some comments or docstrings to that effect would help.

That's a good explanation of the use of the lastResort list. Some comments to that effect would help. How about calling it rejectedSlaves?

So, it looks like this is on the right track. As a very minor note, please run pyflakes over the source code and fix up what it reports -- usually it's minor stuff, but for example it catches the fact that Build is not in scope.

@jaredgrubb (Author, Member) commented Feb 1, 2013

The last two commits are the only real change from the prior patch:

  • ebe0e61 is a pure move of code as-is from the prior patches
  • 7280054 beefs up comments alone.

Not sure if there's a way to make this patch more reviewable now...?

@jaredgrubb (Author, Member) commented Feb 2, 2013

I undid my last push and removed the commit that created 'buildrequestdistributor.py' and the rename of the unit test file. It made this patch impossible for anyone to review -- and this is already a difficult patch to review :)

So, if/when this patch gets merged, I'll apply the renames at that point (either before a pull, or as a separate request)

@djmitche (Member) commented Feb 3, 2013

OK, this has met all of the requests from my previous review. And we've agreed that while this isn't the full refactoring that BRD deserves, it's a step in the right direction. So I have no objections to merging as-is (with the renames applied).

@jaredgrubb (Author, Member) commented Feb 19, 2013

Any more review to do here, or should I push the file rename commit on top of this in prep for a merge?

@tardyp (Member) commented Feb 20, 2013

FWIW I like this a lot, and it looks like it fits my weird prioritization + slave selection needs. The resulting algorithm is probably not the most efficient, but it makes a good separation of configuration concerns.

choose a slave -> make the fastest slaves busier
choose a build -> choose the highest-priority build first
is this build compatible with this slave -> does the chosen slave have the capability to run the chosen build

good work, and ship it!

and merge it to nine?

@djmitche (Member) commented Feb 20, 2013

Agreed - ship it!

Jared Grubb added some commits Jan 24, 2013

Jared Grubb
BuildChooser: refactor (slave,breq) logic into strategy object
@jaredgrubb (Author, Member) commented Feb 20, 2013

Rebased to tip of master; included the BRD rename, plus a minor fix for a unit test that broke when I renamed the 'rejectedSlaves' member variable -- I had neglected to catch the failure. Pyflakes also passes.

@djmitche djmitche merged commit 894de08 into buildbot:master Feb 20, 2013
