
fairshare: discussions on strategy, implementation #7

Closed · cmoussa1 opened this issue Mar 18, 2020 · 18 comments
Labels: discussion, question (further information is requested)

Comments

@cmoussa1 (Member) commented Mar 18, 2020

I figured this could serve as a place to document our discussion on the strategy for calculating fairshare. I plan to update this thread with information after our Webex meeting today. Here's my background information:

Originally, I was under the impression that fairshare values were calculated by passing in a user id, fetching its association id from the accounting database, and performing a Level Fairshare calculation based on the user's association information and current jobs in the queue. Essentially, I had thought that fairshare calculations would be constantly querying information from the accounting database in order to generate a priority value.

I've since learned that these fairshare calculations (at least in Slurm's case) are performed in memory. The scheduler sorts all of the jobs in the queue using the fair tree algorithm, ranking users by the order in which their jobs should be run. It calculates job usage that's occurred over the past couple of weeks (or some other configured window of time), applying a decay factor so that more recent jobs are weighted more heavily.
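To make the decay idea concrete, here is a minimal sketch (not Slurm's actual code, and assuming node-seconds as the usage metric and a one-week half-life, both of which are placeholders) of how decayed historical usage might be accumulated:

```python
import math
import time

def decayed_usage(job_records, half_life_secs=7 * 24 * 3600, now=None):
    """Sum per-job usage, weighting recent jobs more heavily.

    job_records: iterable of (t_end, usage) tuples, where usage is some
    charge metric such as node-seconds (an assumption for this sketch).
    """
    now = time.time() if now is None else now
    total = 0.0
    for t_end, usage in job_records:
        age = max(now - t_end, 0.0)
        # Each elapsed half-life halves the weight of a job's usage.
        weight = math.pow(0.5, age / half_life_secs)
        total += usage * weight
    return total
```

A user's fairshare factor would then come from comparing this decayed usage against their allocated share of the machine.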

Chris Morrone had a modified fair tree implementation in flux-framework/flux-sched, but we've determined that implementation was very much a prototype/work-in-progress, and is probably not usable for our own fairshare calculation.

@grondo (Contributor) commented Mar 18, 2020

FYI - as a comparison, here is the list of factors Slurm uses in its multi-factor plugin:

https://slurm.schedmd.com/priority_multifactor.html#mfjppintro

Edit: note especially that fairshare is just one factor in a multi-factor priority calculation
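To make the "one factor among many" point concrete, a multi-factor priority is essentially a weighted sum of normalized factors. The weights and factor values below are placeholders for illustration, not Slurm's actual configuration:

```python
def multifactor_priority(factors, weights):
    """Combine normalized factors (each in [0.0, 1.0]) into a single
    integer priority using site-configurable weights."""
    return int(sum(weights[name] * factors.get(name, 0.0) for name in weights))

# Hypothetical weights/factors: fairshare is only one contribution.
weights = {"age": 1000, "fairshare": 10000, "qos": 5000, "job_size": 1000}
factors = {"age": 0.3, "fairshare": 0.72, "qos": 1.0, "job_size": 0.1}
priority = multifactor_priority(factors, weights)  # 300 + 7200 + 5000 + 100 = 12600
```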

@cmoussa1 (Member, Author)

Here's a summary of what we talked about. If I missed anything or summarized something incorrectly, feel free to correct me:

Instead of defining partitions in their own tables (where limits would be defined in a second location, since they are also defined in a cluster_association_table), @SteVwonder had a good idea: we could instead provide a label that users attach when submitting jobs in order to associate the job with the maximum amount of resources it can utilize. For example, a debug label would limit a job to 30 minutes and to at most half of the nodes available on a cluster.
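As a rough illustration of the label idea (the names, fields, and storage format below are hypothetical, not an agreed-on schema), the limits attached to a label might look something like this:

```python
# Hypothetical label definitions; the real storage/format is still undecided.
LABELS = {
    "debug": {
        "max_walltime_secs": 30 * 60,  # 30 minutes
        "max_node_fraction": 0.5,      # at most half the cluster's nodes
    },
}

def within_label_limits(label, requested_nodes, requested_walltime_secs, cluster_nodes):
    limits = LABELS[label]
    return (requested_walltime_secs <= limits["max_walltime_secs"]
            and requested_nodes <= limits["max_node_fraction"] * cluster_nodes)
```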

It's necessary to analyze where our gaps are in terms of tracking factors for a multi-factor priority for jobs. I plan on doing this over the next couple of days, eventually posting a table containing all of the factors and where each one would fit in our software architecture. This would help us narrow down the large scope that is user/job priority 😅.

@SteVwonder (Member)

> Originally, I was under the impression that fairshare values were calculated by passing in a user id, fetching its association id from the accounting database, and performing a Level Fairshare calculation based on the user's association information and current jobs in the queue. Essentially, I had thought that fairshare calculations would be constantly querying information from the accounting database in order to generate a priority value.

FWIW, I think it is totally reasonable to start with this implementation as a proof-of-concept. Once you have a working version of this, you could then refactor for performance to cache certain historical values in memory, etc.

@grondo (Contributor) commented Mar 19, 2020

Let's make sure we all have the same big picture. Here's the set of building blocks I have in mind, though I may not have all the information so this is just a starting point for a discussion:

  • jobinfo db (flux-core): stores data for inactive jobs so they can be purged from memory. Enables out-of-band SQL queries against completed job information, etc.
  • utilization reports (external): should be able to query jobinfo db directly
  • bank/accounting db (flux-accounting): stores users, banks, accounts, and "associations", uses jobinfo db to do necessary updates of current user/bank usage.
  • priority plugin (flux-core): plugin in the job-manager used to adjust or supplement the primary job priority. A plugin may be a worker or set of workers, similar to the implementation of the job-ingest validator.
  • multi-factor priority plugin (flux-accounting): a job-manager priority plugin/script which calculates a multi-factor priority for jobs including fairshare priority

The flow of data for jobs might look like:

  1. inactive jobs are sucked into the jobinfo db, optionally purged from memory
  2. utilization reports are generated directly from this database when required
  3. accounting information is generated/derived from the jobinfo db and fed into the accounting/fairshare db on a periodic interval
  4. accounting/fairshare db is used to fetch or push fair-tree factor into multi-factor priority plugin/script
  5. priority plugin in job-manager runs multi-factor priority calculation on each job, possibly using worker script, like validator
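As a sketch of step 3 (the table and column names are assumptions, since no jobinfo db schema has been agreed on yet), deriving per-user usage could be as simple as a periodic query along these lines:

```python
import sqlite3

def fetch_user_usage(jobinfo_db_path, since_timestamp):
    """Aggregate node-seconds per user from a hypothetical "jobs" table
    with (userid, nnodes, t_run, t_inactive) columns."""
    conn = sqlite3.connect(jobinfo_db_path)
    cur = conn.execute(
        "SELECT userid, SUM(nnodes * (t_inactive - t_run)) "
        "FROM jobs WHERE t_inactive >= ? GROUP BY userid",
        (since_timestamp,),
    )
    usage = {userid: node_secs for userid, node_secs in cur}
    conn.close()
    return usage
```

The results would then be folded into the accounting/fairshare db (step 4) on whatever interval we choose.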

This design can have three work streams going in parallel:

  • accounting/fairshare db and multi-factor priority script (flux-accounting)
  • jobinfo db (flux-core) -- we'll need this anyway for system instance
  • job-manager priority plugin (flux-core)

Each of these can go in parallel once the interfaces have been agreed upon. Interfaces include:

  • job-manager priority plugin: How does a script-based priority worker get jobspec, userid, t_submit, etc.? (JSON on stdin? a sketch follows below)
  • jobinfo db: gather requirements from flux-accounting for query interface
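To pin down that first interface question, here is one possible shape for a script-based priority worker that reads a single job description as JSON on stdin and writes a priority on stdout. The field names and weights are entirely hypothetical; nothing here is a settled interface:

```python
#!/usr/bin/env python3
# Hypothetical priority worker: read one JSON job record on stdin,
# print an integer priority on stdout. Field names are placeholders.
import json
import sys
import time

def calc_priority(job):
    age = time.time() - job["t_submit"]
    age_factor = min(age / (7 * 24 * 3600), 1.0)    # saturates after a week
    fairshare = job.get("fairshare_factor", 0.5)    # e.g. supplied by flux-accounting
    return int(1000 * age_factor + 10000 * fairshare)

def main():
    job = json.load(sys.stdin)  # e.g. {"userid": 1234, "t_submit": 1584576000, ...}
    print(calc_priority(job))

if __name__ == "__main__":
    main()
```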

@dongahn (Member) commented Mar 19, 2020

@grondo: Thank you for starting up the big picture architecture discussion! We definitely need this to push the discussion forward. I have a few questions to make sure we are on the same page.

  1. We still haven't decided whether the multi-factor priority plugin will sort jobs at the job-manager level or at the external scheduler (e.g., flux-sched) level. While my preference is to do this at the job-manager level, we have to ensure this will not lead to an "ALLOC" thrashing problem. Let me open up a ticket and reason about whether "ALLOC" thrashing will be a real issue or not.

  2. It is not immediately clear to me if flux-accounting can provide a multi-factor priority plugin in its entirety. It will only have a subset of the data needed for the multi-factor priority calculation. I notice you mentioned "a job-manager priority plugin/script". So perhaps flux-accounting can provide a python command that will output some of the factors needed by the multi-factor priority plugin, and the plugin itself will be implemented at the level decided by the further discussion of point 1 above?
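One concrete (purely hypothetical) form that python command could take: flux-accounting exports only the factors it owns, as JSON, and whichever component ends up implementing the plugin merges them with the factors flux-core already knows. The command shape, table, and field names below are made up for illustration:

```python
# Hypothetical sketch of a flux-accounting helper that dumps per-association
# factors as JSON; "association_table" and its columns are not a real schema.
import json
import sqlite3

def export_priority_factors(accounting_db_path):
    conn = sqlite3.connect(accounting_db_path)
    cur = conn.execute("SELECT username, bank, fairshare FROM association_table")
    factors = [
        {"username": user, "bank": bank, "fairshare": fs}
        for user, bank, fs in cur
    ]
    conn.close()
    print(json.dumps(factors, indent=2))
```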

BTW, I love your approach of framing this as parallel work streams. We really need this in order to be effective on this item.

@grondo (Contributor) commented Mar 19, 2020

> While my preference is to do this at the job-manager level, we have to ensure this will not lead to an "ALLOC" thrashing problem. Let me open up a ticket and reason about whether "ALLOC" thrashing will be a real issue or not.

Yeah, you are right. My thought is that we need to get started somewhere, and this choice has the benefit of dividing up the work even further, which may have a big benefit.

Another benefit is that this approach would allow a user to insert a custom priority plugin at runtime for a non-system flux instance. I'm not sure what exactly you could do with that, but it seems like it would be a nice feature.

> So perhaps flux-accounting can provide a python command that will output some of the factors needed by the multi-factor priority plugin, and the plugin itself will be implemented at the level decided by the further discussion of point 1 above?

That might be a good approach, though I think eventually maybe the advanced multi-factor priority plugin could either be its own sub-project or just included with flux-accounting...

@chu11 (Member) commented Mar 19, 2020

> jobinfo db (flux-core) -- we'll need this anyway for system instance

I had a side discussion with @grondo. In the past, it was assumed that there would be two job history databases, a "core" one and a "sched" one, mostly so that we could work in parallel and not have development hindered on either path. Then we could "merge together" if necessary.

@grondo's feeling is that, in order to save time, we should nix that and bump the "job-info" job history DB up to a higher priority.

@dongahn (Member) commented Mar 19, 2020

> That might be a good approach, though I think eventually maybe the advanced multi-factor priority plugin could either be its own sub-project or just included with flux-accounting...

Can this be its own sub-project when some of the data sources it requires would come from flux-core? For example, queue time?

@dongahn (Member) commented Mar 19, 2020

> Then we could "merge together" if necessary.

If we decide to go with a unified database within flux-core, do we expect the user and account tables can be tracked there? Seems a bit monolithic...

@dongahn (Member) commented Mar 19, 2020

> Yeah, you are right. My thought is that we need to get started somewhere, and this choice has the benefit of dividing up the work even further, which may have a big benefit.
>
> Another benefit is that this approach would allow a user to insert a custom priority plugin at runtime for a non-system flux instance. I'm not sure what exactly you could do with that, but it seems like it would be a nice feature.

Like I said, I certainly do hope that our reasoning on the ALLOC thrashing problem can lead us to this architecture.

@grondo (Contributor) commented Mar 19, 2020

> Can this be its own sub-project when some of the data sources it requires would come from flux-core? For example, queue time?

As part of job-manager priority plugin development we would design an interface that would allow all known information to be shared, e.g. t_submit (queue time), primary priority, etc.

@grondo (Contributor) commented Mar 19, 2020

> If we decide to go with a unified database within flux-core, do we expect the user and account tables can be tracked there? Seems a bit monolithic...

No, I think the flux-core job-info db could be used to store job accounting information; the flux-accounting project would then house the user/account hierarchy and would query the job accounting db to update user banks, calculate historical usage to get fair-share priority, etc.

@chu11 (Member) commented Mar 19, 2020

> No, I think the flux-core job-info db could be used to store job accounting information; the flux-accounting project would then house the user/account hierarchy and would query the job accounting db to update user banks, calculate historical usage to get fair-share priority, etc.

Agreed. The job-info module's database is effectively storing job history for its own purposes. Anyone else who wants to read from it can do so at their own discretion.

But of course if the internal database changes, any scripts / fair share calculations, etc. would have to adjust. This is the risk of having just 1 job history db.

@cmoussa1 (Member, Author)

> But of course if the internal database changes, any scripts / fair share calculations, etc. would have to adjust. This is the risk of having just 1 job history db.

This is a good point. But as long as the core information needed for fair share calculation remains attainable, even if the interface to get the data changes, I think it should be okay.

@dongahn (Member) commented Mar 19, 2020

> But of course if the internal database changes, any scripts / fair share calculations, etc. would have to adjust. This is the risk of having just 1 job history db.
>
> This is a good point. But as long as the core information needed for fair share calculation remains attainable, even if the interface to get the data changes, I think it should be okay.

Does this call for an RFC for job history database schema, then?

@chu11 (Member) commented Mar 19, 2020

> Does this call for an RFC for job history database schema, then?

Maybe ... after the coffee time talk a few questions came up. I'm putting together a discussion in flux-core.

@dongahn (Member) commented Mar 19, 2020

Sorry I couldn't join. I was stuck working on a writeup.

@cmoussa1 (Member, Author)

I think we have pretty much settled on the design/implementation for calculating fairshare values now (a combination of the weighted tree library introduced in #65 and fetching and calculating job usage values from the job-archive DB from #79), so I can close this issue. I don't mind re-opening it if others feel otherwise.
