
fairshare: discussions on strategy, implementation #7

Closed · cmoussa1 opened this issue Mar 18, 2020 · 18 comments
Labels: discussion, question (further information is requested)

Comments

@cmoussa1 (Member) commented Mar 18, 2020

I figured this could serve as a place to document our discussion on the strategy for calculating fairshare. I plan to update this thread with information after our Webex meeting today. Here's my background information:

Originally, I was under the impression that fairshare values were calculated by passing in a user id, fetching its association id from the accounting database, and performing a Level Fairshare calculation based on the user's association information and current jobs in the queue. Essentially, I had thought that fairshare calculations would be constantly querying information from the accounting database in order to generate a priority value.

I've since learned that these fairshare calculations (at least in Slurm's case) are performed in memory. The scheduler sorts all of the jobs in the queue using the fair tree algorithm, ranking users by the order in which their jobs should be run. It calculates job usage that's occurred over the past couple of weeks (or some other configured window of time), applying a decay factor so that more recent jobs are weighted more heavily.
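To make the decay idea concrete, here is a minimal sketch (not Slurm's actual code, and assuming node-seconds as the usage metric and a one-week half-life, both of which are placeholders) of how decayed historical usage might be accumulated:

```python
import math
import time

def decayed_usage(job_records, half_life_secs=7 * 24 * 3600, now=None):
    """Sum per-job usage, weighting recent jobs more heavily.

    job_records: iterable of (t_end, usage) tuples, where usage is some
    charge metric such as node-seconds (an assumption for this sketch).
    """
    now = time.time() if now is None else now
    total = 0.0
    for t_end, usage in job_records:
        age = max(now - t_end, 0.0)
        # Each elapsed half-life halves the weight of a job's usage.
        weight = math.pow(0.5, age / half_life_secs)
        total += usage * weight
    return total
```

A user's fairshare factor would then come from comparing this decayed usage against their allocated share of the machine.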

Chris Morrone had a modified fair tree implementation in flux-framework/flux-sched, but we've determined that implementation was very much a prototype/work-in-progress, and is probably not usable for our own fairshare calculation.

@grondo (Contributor) commented Mar 18, 2020

FYI - as a comparison, here is the list of factors Slurm uses in its multi-factor plugin:

https://slurm.schedmd.com/priority_multifactor.html#mfjppintro

Edit: note especially that fairshare is just one factor in a multi-factor priority calculation
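To make the "one factor among many" point concrete, a multi-factor priority is essentially a weighted sum of normalized factors. The weights and factor values below are placeholders for illustration, not Slurm's actual configuration:

```python
def multifactor_priority(factors, weights):
    """Combine normalized factors (each in [0.0, 1.0]) into a single
    integer priority using site-configurable weights."""
    return int(sum(weights[name] * factors.get(name, 0.0) for name in weights))

# Hypothetical weights/factors: fairshare is only one contribution.
weights = {"age": 1000, "fairshare": 10000, "qos": 5000, "job_size": 1000}
factors = {"age": 0.3, "fairshare": 0.72, "qos": 1.0, "job_size": 0.1}
priority = multifactor_priority(factors, weights)  # 300 + 7200 + 5000 + 100 = 12600
```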

@cmoussa1 (Member, Author)

Here's a summary of what we talked about. If I missed anything or summarized something incorrectly, feel free to correct me:

Instead of defining partitions in their own tables (where limits would be defined in a second location, since they are also defined in a cluster_association_table), @SteVwonder had a good idea: we could instead provide a label that users attach when submitting jobs in order to associate the job with the maximum amount of resources it can utilize. For example, a debug label would limit a job to 30 minutes and to at most half of the nodes available on a cluster.
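As a rough illustration of the label idea (the names, fields, and storage format below are hypothetical, not an agreed-on schema), the limits attached to a label might look something like this:

```python
# Hypothetical label definitions; the real storage/format is still undecided.
LABELS = {
    "debug": {
        "max_walltime_secs": 30 * 60,  # 30 minutes
        "max_node_fraction": 0.5,      # at most half the cluster's nodes
    },
}

def within_label_limits(label, requested_nodes, requested_walltime_secs, cluster_nodes):
    limits = LABELS[label]
    return (requested_walltime_secs <= limits["max_walltime_secs"]
            and requested_nodes <= limits["max_node_fraction"] * cluster_nodes)
```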

It's necessary to analyze where our gaps are in terms of tracking factors for a multi-factor priority for jobs. I plan on doing this over the next couple of days, eventually posting a table containing all of the factors and where each one would fit in our software architecture. This would help us narrow down the large scope that is user/job priority 😅.

@SteVwonder (Member)

> Originally, I was under the impression that fairshare values were calculated by passing in a user id, fetching its association id from the accounting database, and performing a Level Fairshare calculation based on the user's association information and current jobs in the queue. Essentially, I had thought that fairshare calculations would be constantly querying information from the accounting database in order to generate a priority value.

FWIW, I think it is totally reasonable to start with this implementation as a proof-of-concept. Once you have a working version of this, you could then refactor for performance to cache certain historical values in memory, etc.

@grondo (Contributor) commented Mar 19, 2020

Let's make sure we all have the same big picture. Here's the set of building blocks I have in mind, though I may not have all the information so this is just a starting point for a discussion:

  • jobinfo db (flux-core): stores data for inactive jobs so they can be purged from memory. Enables out-of-band SQL queries against completed job information, etc.
  • utilization reports (external): should be able to query jobinfo db directly
  • bank/accounting db (flux-accounting): stores users, banks, accounts, and "associations", uses jobinfo db to do necessary updates of current user/bank usage.
  • priority plugin (flux-core): plugin in the job-manager used to adjust or supplement the primary job priority. A plugin may be a worker or set of workers, similar to the implementation of the job-ingest validator.
  • multi-factor priority plugin (flux-accounting): a job-manager priority plugin/script which calculates a multi-factor priority for jobs including fairshare priority

The flow of data for jobs might look like:

  1. inactive jobs are sucked into the jobinfo db, optionally purged from memory
  2. utilization reports are generated directly from this database when required
  3. accounting information is generated/derived from the jobinfo db and fed into the accounting/fairshare db on a periodic interval
  4. accounting/fairshare db is used to fetch or push fair-tree factor into multi-factor priority plugin/script
  5. priority plugin in job-manager runs multi-factor priority calculation on each job, possibly using worker script, like validator
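As a sketch of step 3 (the table and column names are assumptions, since no jobinfo db schema has been agreed on yet), deriving per-user usage could be as simple as a periodic query along these lines:

```python
import sqlite3

def fetch_user_usage(jobinfo_db_path, since_timestamp):
    """Aggregate node-seconds per user from a hypothetical "jobs" table
    with (userid, nnodes, t_run, t_inactive) columns."""
    conn = sqlite3.connect(jobinfo_db_path)
    cur = conn.execute(
        "SELECT userid, SUM(nnodes * (t_inactive - t_run)) "
        "FROM jobs WHERE t_inactive >= ? GROUP BY userid",
        (since_timestamp,),
    )
    usage = {userid: node_secs for userid, node_secs in cur}
    conn.close()
    return usage
```

The results would then be folded into the accounting/fairshare db (step 4) on whatever interval we choose.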

This design can have three work streams going in parallel:

  • accounting/fairshare db and multi-factor priority script (flux-accounting)
  • jobinfo db (flux-core) -- we'll need this anyway for system instance
  • job-manager priority plugin (flux-core)

Each of these can go in parallel once the interfaces have been agreed upon. Interfaces include:

  • job-manager priority plugin: How does a script-based priority worker get jobspec, userid, t_submit, etc.? (JSON on stdin? a sketch follows below)
  • jobinfo db: gather requirements from flux-accounting for query interface
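To pin down that first interface question, here is one possible shape for a script-based priority worker that reads a single job description as JSON on stdin and writes a priority on stdout. The field names and weights are entirely hypothetical; nothing here is a settled interface:

```python
#!/usr/bin/env python3
# Hypothetical priority worker: read one JSON job record on stdin,
# print an integer priority on stdout. Field names are placeholders.
import json
import sys
import time

def calc_priority(job):
    age = time.time() - job["t_submit"]
    age_factor = min(age / (7 * 24 * 3600), 1.0)    # saturates after a week
    fairshare = job.get("fairshare_factor", 0.5)    # e.g. supplied by flux-accounting
    return int(1000 * age_factor + 10000 * fairshare)

def main():
    job = json.load(sys.stdin)  # e.g. {"userid": 1234, "t_submit": 1584576000, ...}
    print(calc_priority(job))

if __name__ == "__main__":
    main()
```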

@dongahn (Member) commented Mar 19, 2020

@grondo: Thank you for starting up the big picture architecture discussion! We definitely need this to push the discussion forward. I have a few questions to make sure we are on the same page.

  1. We still haven't decided whether the multi-factor priority plugin will sort jobs at the job-manager level or at the external scheduler (e.g., flux-sched) level. While my preference is to do this at the job-manager level, we have to ensure this will not lead to an "ALLOC" thrashing problem. Let me open up a ticket and reason about whether "ALLOC" thrashing will be a real issue or not.

  2. It is not immediately clear to me if flux-accounting can provide a multi-factor priority plugin in its entirety. It will only have a subset of the data needed for the multi-factor priority calculation. I notice you mentioned "a job-manager priority plugin/script". So perhaps flux-accounting can provide a python command that will output some of the factors needed by the multi-factor priority plugin, and the plugin itself will be implemented at the level decided by the further discussion of point 1 above?
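One concrete (purely hypothetical) form that python command could take: flux-accounting exports only the factors it owns, as JSON, and whichever component ends up implementing the plugin merges them with the factors flux-core already knows. The command shape, table, and field names below are made up for illustration:

```python
# Hypothetical sketch of a flux-accounting helper that dumps per-association
# factors as JSON; "association_table" and its columns are not a real schema.
import json
import sqlite3

def export_priority_factors(accounting_db_path):
    conn = sqlite3.connect(accounting_db_path)
    cur = conn.execute("SELECT username, bank, fairshare FROM association_table")
    factors = [
        {"username": user, "bank": bank, "fairshare": fs}
        for user, bank, fs in cur
    ]
    conn.close()
    print(json.dumps(factors, indent=2))
```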

BTW, I love your approach of framing this as parallel work streams. We really need this in order to be effective on this item.

@grondo (Contributor) commented Mar 19, 2020

> While my preference is to do this at the job-manager level, we have to ensure this will not lead to an "ALLOC" thrashing problem. Let me open up a ticket and reason about whether "ALLOC" thrashing will be a real issue or not.

Yeah, you are right. My thought is that we need to get started somewhere, and this choice has the benefit of dividing up the work even further, which may have a big benefit.

Another benefit is that this approach would allow a user to insert a custom priority plugin at runtime for a non-system flux instance. I'm not sure what exactly you could do with that, but it seems like it would be a nice feature.

> So perhaps flux-accounting can provide a python command that will output some of the factors needed by the multi-factor priority plugin, and the plugin itself will be implemented at the level decided by the further discussion of point 1 above?

That might be a good approach, though I think eventually maybe the advanced multi-factor priority plugin could either be its own sub-project or just included with flux-accounting...

@chu11 (Member) commented Mar 19, 2020

> jobinfo db (flux-core) -- we'll need this anyway for system instance

I had a side discussion with @grondo. In the past, it was assumed that there would be two job history databases, a "core" one and a "sched" one, mostly so that we could work in parallel and not have development hindered on either path. Then we could "merge together" if necessary.

@grondo's feeling is that, in order to save time, we should nix that and bump the "job-info" job history DB up to a higher priority.

@dongahn (Member) commented Mar 19, 2020

> That might be a good approach, though I think eventually maybe the advanced multi-factor priority plugin could either be its own sub-project or just included with flux-accounting...

Can this be its own sub-project when some of the data sources it requires would come from flux-core? For example, queue time?

@dongahn (Member) commented Mar 19, 2020

> Then we could "merge together" if necessary.

If we decide to go with a unified database within flux-core, do we expect the user and account tables can be tracked there? Seems a bit monolithic...

@dongahn (Member) commented Mar 19, 2020

> Yeah, you are right. My thought is that we need to get started somewhere, and this choice has the benefit of dividing up the work even further, which may have a big benefit.
>
> Another benefit is that this approach would allow a user to insert a custom priority plugin at runtime for a non-system flux instance. I'm not sure what exactly you could do with that, but it seems like it would be a nice feature.

Like I said, I certainly do hope that our reasoning on the ALLOC thrashing problem can lead us to this architecture.

@grondo (Contributor) commented Mar 19, 2020

> Can this be its own sub-project when some of the data sources it requires would come from flux-core? For example, queue time?

As part of job-manager priority plugin development we would design an interface that would allow all known information to be shared, e.g. t_submit (queue time), primary priority, etc.

@grondo (Contributor) commented Mar 19, 2020

> If we decide to go with a unified database within flux-core, do we expect the user and account tables can be tracked there? Seems a bit monolithic...

No, I think the flux-core job-info db could be used to store job accounting information; the flux-accounting project would then house the user/account hierarchy and would query the job accounting db to update user banks, calculate historical usage to get fair-share priority, etc.

@chu11 (Member) commented Mar 19, 2020

> No, I think the flux-core job-info db could be used to store job accounting information; the flux-accounting project would then house the user/account hierarchy and would query the job accounting db to update user banks, calculate historical usage to get fair-share priority, etc.

Agreed. The job-info module's database is effectively storing job history for its own purposes. Anyone else who wants to read from it can do so at their own discretion.

But of course if the internal database changes, any scripts / fair share calculations, etc. would have to adjust. This is the risk of having just 1 job history db.

@cmoussa1 (Member, Author)

> But of course if the internal database changes, any scripts / fair share calculations, etc. would have to adjust. This is the risk of having just 1 job history db.

This is a good point. But as long as the core information needed for fair share calculation remains attainable, even if the interface to get the data changes, I think it should be okay.

@dongahn (Member) commented Mar 19, 2020

> But of course if the internal database changes, any scripts / fair share calculations, etc. would have to adjust. This is the risk of having just 1 job history db.
>
> This is a good point. But as long as the core information needed for fair share calculation remains attainable, even if the interface to get the data changes, I think it should be okay.

Does this call for an RFC for job history database schema, then?

@chu11 (Member) commented Mar 19, 2020

> Does this call for an RFC for job history database schema, then?

Maybe ... after the coffee time talk a few questions came up. I'm putting together a discussion in flux-core.

@dongahn (Member) commented Mar 19, 2020

Sorry I couldn't join. I was stuck working on a writeup.

@cmoussa1 (Member, Author)

I think we have pretty much settled on the design/implementation for calculating fairshare values now (a combination of the weighted tree library introduced in #65 and fetching and calculating job usage values from the job-archive DB from #79), so I can close this issue. I don't mind re-opening it if others feel otherwise.
