job manager: track average and maximum job wait time #5909

Open
garlick opened this issue Apr 22, 2024 · 1 comment

Comments

@garlick
Member

garlick commented Apr 22, 2024

Problem: average scheduler wait time could be a useful metric for evaluating the performance of different scheduling algorithms.

Tuning EASY-Backfilling Queues (Lelong et al., JSSPP 2017) uses average and maximum job wait time, in combination with job traces from the Parallel Workloads Archive, to evaluate various backfill scheduling optimizations.

Average and maximum wait time would be easy to add to the job manager's flux module stats output, where the wait time for any given job is just the time it spends in the ALLOC state. Since the job manager replays all stored jobs' eventlogs on restart, the stats could easily be kept up to date, with purged jobs dropping out of the statistics each time Flux restarts.
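A minimal sketch of the bookkeeping, written in Python for brevity. It assumes each eventlog is a list of entries with "name" and "timestamp" keys, and it approximates "time spent in ALLOC state" as the interval from the first "priority" event (job becomes eligible for scheduling) to the first "alloc" event; those default event names are an assumption, and WaitStats, wait_time and replay are hypothetical names. Only a count, a sum and a maximum need to be kept, so a replay on restart simply rebuilds them from whatever jobs are still stored.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping, Optional


@dataclass
class WaitStats:
    """Running count, sum and maximum of job wait times (seconds)."""
    count: int = 0
    total: float = 0.0
    maximum: float = 0.0

    def add(self, wait: float) -> None:
        self.count += 1
        self.total += wait
        self.maximum = max(self.maximum, wait)

    @property
    def average(self) -> float:
        # Avoid division by zero when no job has been allocated yet.
        return self.total / self.count if self.count else 0.0


def wait_time(eventlog: Iterable[Mapping], start_event: str = "priority",
              end_event: str = "alloc") -> Optional[float]:
    """Seconds from the first start_event to the first end_event, or None
    if either event is missing (e.g. the job is still waiting).

    The default event names are an assumption for illustration."""
    start: Optional[float] = None
    end: Optional[float] = None
    for entry in eventlog:
        name = entry["name"]
        if name == start_event and start is None:
            start = entry["timestamp"]
        elif name == end_event and end is None:
            end = entry["timestamp"]
    if start is None or end is None:
        return None
    return end - start


def replay(eventlogs: Iterable[List[Mapping]]) -> WaitStats:
    """Rebuild the stats from scratch, as a restart replay would."""
    stats = WaitStats()
    for log in eventlogs:
        wait = wait_time(log)
        if wait is not None:
            stats.add(wait)
    return stats


if __name__ == "__main__":
    logs = [
        [{"name": "submit", "timestamp": 100.0},
         {"name": "priority", "timestamp": 101.0},
         {"name": "alloc", "timestamp": 105.0}],
        [{"name": "submit", "timestamp": 100.0},
         {"name": "priority", "timestamp": 100.5},
         {"name": "alloc", "timestamp": 110.5}],
        # Still pending: no alloc event yet, so it is skipped.
        [{"name": "submit", "timestamp": 102.0},
         {"name": "priority", "timestamp": 102.0}],
    ]
    stats = replay(logs)
    print(stats.count, stats.average, stats.maximum)  # 2 7.0 10.0
```

Jobs that have not been allocated yet return None from wait_time and are skipped, so pending jobs neither lower the average nor count toward the maximum until their wait actually ends.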

@garlick
Member Author

garlick commented Apr 22, 2024

Another job manager metric that would be easy to capture, and could give us insight into the impact of things like partial release, is node-level resource utilization: e.g., the average fraction of time nodes spend idle, offline, running, or in system (where system includes time spent in CLEANUP state).
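A sketch of how such per-node fractions could be averaged, assuming (hypothetically) that each node's history has already been reduced to (state, start, end) intervals using the four categories above, with CLEANUP time folded into "system". Intervals are clipped to an observation window, and each node's per-state fraction is averaged across nodes; the function name utilization and the interval format are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (state, start, end) in seconds; state is one of the categories above:
# "idle", "offline", "running", or "system" (system includes CLEANUP time).
Interval = Tuple[str, float, float]


def utilization(nodes: Dict[str, List[Interval]],
                window_start: float, window_end: float) -> Dict[str, float]:
    """Average over nodes of the fraction of the window spent in each state."""
    span = window_end - window_start
    if span <= 0 or not nodes:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for intervals in nodes.values():
        for state, start, end in intervals:
            # Clip the interval to the observation window.
            overlap = min(end, window_end) - max(start, window_start)
            if overlap > 0:
                totals[state] += overlap / span
    # Divide by the node count so each node contributes equally.
    return {state: frac / len(nodes) for state, frac in totals.items()}


if __name__ == "__main__":
    nodes = {
        "node0": [("idle", 0, 20), ("running", 20, 90), ("system", 90, 100)],
        "node1": [("offline", 0, 50), ("idle", 50, 100)],
    }
    for state, frac in sorted(utilization(nodes, 0, 100).items()):
        print(f"{state}: {frac:.2f}")
```

Weighting every node equally matches "average fraction" as phrased above; a cluster with heterogeneous nodes might instead weight by core count, which would be a separate design choice.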
