fairshare: discussions on strategy, implementation #7
FYI - as a comparison, here is the list of factors Slurm uses in its multi-factor plugin: https://slurm.schedmd.com/priority_multifactor.html#mfjppintro Edit: note especially that fairshare is just one factor in a multi-factor priority calculation.
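To make the "fairshare is just one factor" point concrete, here is a minimal sketch of a Slurm-style multi-factor priority: a weighted sum of normalized factors. The factor names, weights, and `priority` function are illustrative assumptions, not flux-accounting's actual configuration or API.

```python
# Hypothetical multi-factor priority: each factor is normalized to [0, 1],
# then combined as a weighted sum. Weights below are made-up examples.
WEIGHTS = {
    "age": 1000,         # how long the job has been waiting in the queue
    "fairshare": 10000,  # usage-based factor -- just one of several inputs
    "job_size": 500,     # resources requested by the job
    "qos": 2000,         # quality-of-service factor
}

def priority(factors: dict) -> float:
    """Weighted sum of factors, each clamped to [0, 1] before weighting."""
    return sum(WEIGHTS[name] * max(0.0, min(1.0, value))
               for name, value in factors.items())

print(priority({"age": 0.2, "fairshare": 0.5, "job_size": 0.1, "qos": 1.0}))
# -> 7250.0
```

Note how a large fairshare weight lets historical usage dominate, while age and size still break ties between users with similar usage.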
Here's a summary of what we talked about; if I missed anything or summarized something incorrectly, feel free to correct me. Instead of defining partitions in their own tables (where limits would be defined in a second location, since they are also defined in the cluster_association_table), @SteVwonder had a good idea: we could instead provide a label that's used when users submit jobs, associating the job with the maximum amount of resources it can utilize. example: a

We also need to analyze where our gaps are in terms of tracking factors for a multi-factor job priority. I plan on doing this over the next couple of days, eventually posting a table containing all of the factors and where we would include each in our software architecture. This would help us narrow down the large scope that is user/job priority 😅.
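A rough sketch of the label idea described above, in case it helps the discussion: a submission-time label maps to a single set of resource limits, so there is no second table to keep in sync. The label names, limit fields, and `check_request` helper are all hypothetical.

```python
# Hypothetical: one place that maps a submission label to resource limits,
# instead of a separate partition table duplicating the association limits.
LIMITS_BY_LABEL = {
    "debug": {"max_nodes": 2, "max_walltime_s": 1800},
    "batch": {"max_nodes": 64, "max_walltime_s": 86400},
}

def check_request(label: str, nodes: int, walltime_s: int) -> bool:
    """Return True if a request fits within its label's limits."""
    lim = LIMITS_BY_LABEL[label]
    return nodes <= lim["max_nodes"] and walltime_s <= lim["max_walltime_s"]

print(check_request("debug", 1, 600))   # fits the debug limits
print(check_request("debug", 4, 600))   # exceeds max_nodes for debug
```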
FWIW, I think it is totally reasonable to start with this implementation as a proof of concept. Once you have a working version of this, you could then refactor for performance to cache certain historical values in memory, etc.
Let's make sure we all have the same big picture. Here's the set of building blocks I have in mind, though I may not have all the information so this is just a starting point for a discussion:
The flow of data for jobs might look like:
This design can have 3 work streams going in parallel
Each of these can go in parallel once the interfaces have been agreed upon. Interfaces include:
@grondo: Thank you for starting up the big-picture architecture discussion! We definitely need this to push the conversation forward. I have a few questions to make sure we are on the same page.
BTW, I love the way you framed the parallel work streams. We really need that to make this effort effective.
Yeah, you are right. My thought is that we need to get started somewhere, and this choice has the benefit of dividing up the work even further, which may have a big benefit. Another benefit is that this approach would allow a user to insert a custom priority plugin at runtime for a non-system flux instance. I'm not sure what exactly you could do with that, but it seems like it would be a nice feature.
That might be a good approach, though I think eventually the advanced multi-factor priority plugin could either be its own sub-project or just be included with flux-accounting...
Had a side discussion with @grondo. In the past it was assumed that there would be two job history databases, a "core" one and a "sched" one, mostly so that we could work in parallel without development being hindered on either path; then we could "merge" them if necessary. @grondo's feeling is that, in order to save time, we should nix that and bump the "job-info" job history DB to a higher priority.
Can this be its own sub-project when some of the data sources it requires would come from flux-core? For example, queue time?
If we decide to go with a unified database within flux-core, do we expect the user and account tables can be tracked there? Seems a bit monolithic... |
Like I said, I certainly do hope that our reasoning on
As part of job-manager priority plugin development we would design an interface that would allow all known information to be shared, e.g. |
No, I think the flux-core job-info DB could be used to store job accounting information; the flux-accounting project would then house the user/account hierarchy and would query the job accounting DB to update user banks, calculate historical usage to get fair-share priority, etc.
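As a sketch of that split, flux-accounting could periodically run an aggregate query against the job history DB to compute per-user usage. The table name, columns, and units (node-seconds) below are assumptions for illustration; the real job history schema may differ.

```python
# Hypothetical: aggregate per-user usage (node-seconds) from a job history
# DB, the way flux-accounting might query the flux-core job-info database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (userid INT, nnodes INT, t_run REAL, t_inactive REAL)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [(1001, 2, 0.0, 3600.0),   # 2 nodes for 1 hour
     (1001, 1, 0.0, 1800.0),   # 1 node for 30 minutes
     (1002, 4, 0.0, 900.0)],   # 4 nodes for 15 minutes
)

# Usage per user = sum over jobs of nnodes * runtime.
rows = conn.execute(
    "SELECT userid, SUM(nnodes * (t_inactive - t_run)) AS usage "
    "FROM jobs GROUP BY userid ORDER BY userid"
).fetchall()
for userid, usage in rows:
    print(userid, usage)
# -> 1001 9000.0
#    1002 3600.0
```

Keeping this as a read-only query over core's DB is what makes the two projects loosely coupled: only the query (not the accounting tables) has to change if the schema does.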
Agreed. The job-info module's database effectively stores job history for its own purposes. Anyone else who wants to read from it can do so at their own discretion. But of course, if the internal database changes, any scripts, fair-share calculations, etc. would have to adjust. This is the risk of having just one job history DB.
This is a good point. But as long as the core information needed for fair share calculation remains attainable, even if the interface to get the data changes, I think it should be okay. |
Does this call for an RFC for job history database schema, then? |
Maybe ... after the coffee time talk a few questions came up. I'm putting together a discussion in flux-core. |
Sorry I couldn't join. Stuck in creating a writeup. |
I think we have pretty much settled on the design/implementation for calculating fairshare values now (a combination of the weighted tree library introduced in #65 and fetching and calculating job usage values from the job-archive DB from #79), so I can close this issue. I don't mind re-opening it if others feel otherwise.
I figured this could serve as a place to document our discussion on the strategy for calculating fairshare. I plan to update this thread with information after our Webex meeting today. Here's my background information:
Originally, I was under the impression that fairshare values were calculated by passing in a user ID, fetching its association ID from the accounting database, and performing a Level Fairshare calculation based on the user's association information and the current jobs in the queue. Essentially, I had thought that fairshare calculations would constantly query the accounting database in order to generate a priority value.
I've since learned that these fairshare calculations (at least in Slurm's case) are performed in memory. The scheduler sorts all of the jobs in the queue using the fair tree algorithm, ranking users by the priority with which their jobs should be run. It calculates job usage that has occurred over the past couple of weeks (or some other configured window), also applying a decay factor so that more recent jobs are weighted more heavily.
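The decay idea above can be sketched as a half-life applied to per-week usage: this week's usage counts fully, and each older week is geometrically down-weighted. The half-life and the usage numbers are made up for illustration, not Slurm's or flux-accounting's actual parameters.

```python
# Hypothetical half-life decay of historical usage: weekly_usage[0] is the
# most recent week; week k is weighted by 2^(-k / half_life_weeks).
HALF_LIFE_WEEKS = 1

def decayed_usage(weekly_usage):
    """Total usage with older weeks down-weighted by a half-life decay."""
    return sum(u * 0.5 ** (k / HALF_LIFE_WEEKS)
               for k, u in enumerate(weekly_usage))

# Four weeks of node-hours, newest first:
print(decayed_usage([100, 80, 40, 20]))  # 100 + 40 + 10 + 2.5 = 152.5
```

With this weighting, two users with equal raw totals end up with different effective usage if one of them ran mostly in the distant past, which is exactly the behavior a decay factor is meant to produce.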
Chris Morrone had a modified fair tree implementation in flux-framework/flux-sched, but we've determined that implementation was very much a prototype/work-in-progress, and is probably not usable for our own fairshare calculation.