gather and plot data about Slurm scheduling and job statistics
John Brunelle
Latest commit 314f9c5 Sep 23, 2014
Type Name Latest commit message Commit time
RPMS build version 0.0.2-fasrc04 with latest updates Sep 6, 2014
doc update rpm links from 0.0.2-fasrc03 to 0.0.2-fasrc04; minor doc Sep 23, 2014
etc formalizing whitespace reports Mar 31, 2014
lib/python/site-packages/slurmmon empty squeue output -> no jobs, not a parse error Sep 6, 2014
screenshots add whitespace report screenshot etc. Sep 9, 2014
usr/sbin
var/www add more colors to probe job graph, thanks @cinek810 Sep 22, 2014
.gitignore doc reorg Mar 30, 2014
COPYING added license Jan 23, 2014
ChangeLog correct ChangeLog Sep 22, 2014
README.md
build_rpms
slurmmon-0.0.1.tar.bz2 tweak dashboard title Jan 23, 2014
slurmmon-0.0.2.tar.bz2 build version 0.0.2-fasrc04 with latest updates Sep 6, 2014
slurmmon.spec build version 0.0.2-fasrc04 with latest updates Sep 6, 2014

README.md

Slurmmon is a system for gaining insight into Slurm and the jobs it runs. It's meant for cluster administrators looking to measure the effects of configuration changes and raise cluster utilization. Features include:

  • trending all the scheduler performance diagnostics (the numbers from sdiag)
  • measuring job turnaround time of probe jobs, as a bellwether of scheduling issues
  • creating daily whitespace reports -- identifying specific users and jobs with low utilization of their allocations (the jobs that lead to the dreaded whitespace gap in plots of total resources vs. used resources)
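Trending the sdiag numbers comes down to turning its text output into numeric metrics. The following is a minimal sketch of that step, assuming the output consists of `Label: integer` lines. The sample text is illustrative rather than captured from a real cluster, and `parse_diag` is a hypothetical helper, not part of slurmmon.

```python
import re

# Illustrative sample only (an assumption): real sdiag output has many more
# fields and its exact layout may vary between Slurm versions.
SAMPLE = """\
Data since: Mon Sep 01 10:00:00 2014
Server thread count: 3
Agent queue size:    0
Jobs submitted: 1520
Jobs started:   1498
"""

# A label without colons, then a colon, then a single integer value.
_LINE = re.compile(r"^\s*([^:]+?)\s*:\s*(-?\d+)\s*$")


def parse_diag(text):
    """Return {label: int} for every 'Label: integer' line; skip all others."""
    metrics = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            metrics[match.group(1)] = int(match.group(2))
    return metrics


if __name__ == "__main__":
    for label, value in sorted(parse_diag(SAMPLE).items()):
        print(label, value)
```

Lines whose value is not a plain integer, such as the timestamp line in the sample, are skipped rather than treated as errors, so each remaining entry can be handed to a metrics system as a number.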

Slurmmon is meant to run on a RHEL/CentOS/SL 6 based system and currently uses Ganglia for data collection and Apache/mod_python for reporting. The components are:

  • slurmmon-daemon -- the daemons that query Slurm and send data to Ganglia
  • slurmmon-ganglia -- the Ganglia custom reports that use PHP to stack raw RRD data
  • slurmmon-web -- a set of web pages that organize all the reports and relevant plots
  • slurmmon-python -- a general Python interface to Slurm, using dict-based I/O pipelines and lazy evaluation (now being replaced by dio and slyme)
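To give a feel for what "dict-based pipelines with lazy evaluation" can mean, here is a small sketch built on plain Python generators. It is not slurmmon's actual API: the function names, the field names, and the pipe-delimited sample rows are all assumptions made for illustration.

```python
# Hypothetical pipe-delimited job rows (an assumption, not real squeue output).
SAMPLE_LINES = [
    "101|alice|128",
    "102|bob|1",
    "",
    "103|carol|8",
]

FIELDS = ("job_id", "user", "cpus")


def parse_rows(lines, fields, sep="|"):
    """Lazily turn delimited text lines into dicts; blank lines are skipped."""
    for line in lines:
        line = line.strip()
        if line:
            yield dict(zip(fields, line.split(sep)))


def convert(rows, key, func):
    """Lazily apply func to one field of each dict, leaving the input dict untouched."""
    for row in rows:
        out = dict(row)
        out[key] = func(row[key])
        yield out


def where(rows, predicate):
    """Lazily keep only the dicts for which predicate is true."""
    return (row for row in rows if predicate(row))


if __name__ == "__main__":
    jobs = convert(parse_rows(SAMPLE_LINES, FIELDS), "cpus", int)
    big_jobs = where(jobs, lambda job: job["cpus"] >= 64)
    # Nothing has been parsed yet; iteration drives the whole pipeline.
    for job in big_jobs:
        print(job["job_id"], job["user"], job["cpus"])
```

Because each stage is a generator, rows are processed one at a time and only when something consumes the final stage. Empty input simply yields no dicts, so "no jobs" does not need to be handled as an error.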

See the doc directory for more information, specifically:

  • INSTALL for initial installation and setup
  • FAQ for answers to common questions and other details

Here is a screenshot of the basic diagnostic report from the production cluster at @fasrc:

slurmmon screenshot

It shows how something interesting happened on the 31st -- there was a spike in job turnaround and slurmctld agent queue size.


Here is an example daily whitespace (CPU waste) report:

slurmmon whitespace report screenshot

Of the jobs that completed that day, the top CPU-waster was sophia's: a case of mismatched Slurm -n (128) and mpirun -np (16). The -np option is unnecessary here, which makes this a user education opportunity. Many other jobs show the pattern of requesting many CPU cores but using only one. The job IDs are links to full details.
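One plausible way to quantify this kind of waste is to compare the core-seconds a job was allocated with the CPU time it actually consumed. The sketch below assumes that definition; the exact metric slurmmon reports is not stated here, `cpu_waste` is a hypothetical helper, and the one-hour runtime is made up for the example (only the 128 and 16 come from the case above).

```python
def cpu_waste(cores_allocated, elapsed_seconds, cpu_seconds_used):
    """Return (wasted_core_seconds, efficiency) for one job.

    Assumed definitions (not taken from slurmmon itself):
      allocated  = cores_allocated * elapsed_seconds
      wasted     = allocated - cpu_seconds_used, never below zero
      efficiency = cpu_seconds_used / allocated
    """
    allocated = cores_allocated * elapsed_seconds
    if allocated <= 0:
        return 0.0, 0.0
    wasted = max(allocated - cpu_seconds_used, 0)
    return float(wasted), cpu_seconds_used / float(allocated)


if __name__ == "__main__":
    # A job allocated 128 cores that kept only 16 of them busy for a
    # hypothetical hour of wall-clock time.
    elapsed = 3600
    wasted, efficiency = cpu_waste(128, elapsed, 16 * elapsed)
    print("wasted core-hours: %.0f" % (wasted / 3600))
    print("efficiency: %.1f%%" % (efficiency * 100))
```

Under these assumptions the job wastes 112 core-hours per hour of runtime, an efficiency of 12.5%. Sorting jobs by wasted core-seconds rather than by efficiency puts large, long-running offenders ahead of short single-core jobs.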


Here is a stack of plots covering our Slurm upgrade from 2.6.9 to 14.03.4, which took place around 10:00 a.m.:

slurm upgrade

It shows that, after the upgrade, backfill scheduler runs are much faster (top plot) and deeper (middle plot), and job throughput is higher (the slope of completed jobs in the bottom plot).