conradtchan/jobmon

A tool for users to see their live jobs on the OzSTAR supercomputer (https://supercomputing.swin.edu.au/monitor/)

A Python script runs periodically on the management node to collect statistics and write them to a JSON file, which is served to the browser by a minimal backend. Most of the processing is performed in the browser, which will allow user customisation further down the track.

Originally forked from https://github.com/plaguedbypenguins/bobMonitor, but now rewritten from the ground up, with an emphasis on providing information to users.

Setup

Jobmon is designed for use on the OzSTAR supercomputer (http://supercomputing.swin.edu.au), but it can be adapted for any computing cluster. Please get in touch if you would like to run this on your cluster. There will most likely be a lot of tweaking required to adapt it to your needs and I'm happy to point you in the right direction.

Every system is different, so some configuration is required to get jobmon to gather stats from your setup. backend/backend_base.py contains a base class, which should be used as a template for writing a class that interfaces with your system. backend/backend_ozstar.py defines a derived class that is specific to OzSTAR and serves as an example of how to set things up. OzSTAR uses the Slurm scheduler and InfluxDB to gather stats from the nodes.

The following methods are overridden to make calls to the various interfaces (pyslurm, influxdb):

  • cpu_usage
  • mem
  • swap
  • gpus
  • infiniband
  • lustre
  • jobfs
  • node_up
  • is_counted
  • n_cpus
  • n_gpus
  • hostnames
  • job_ids
  • job_name
  • job_username
  • job_ncpus
  • job_ngpus
  • job_state
  • job_layout
  • job_gpu_layout
  • job_time_limit
  • job_run_time
  • job_mem
  • job_mem_max
  • job_mem_request
  • core_usage
  • pre_update
  • calculate_backfill

You will need to make your own subclass. For example, create backend/backend_jarvis.py, then override these methods. This file should look like this:

from backend_base import BackendBase

class Backend(BackendBase):
  def cpu_usage(self):
    ...
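While porting, it can help to check which methods a subclass still inherits unchanged from the base class. The sketch below is not part of jobmon: it uses a stand-in BackendBase with only two of the methods listed above, and the argument list of mem is a placeholder (the real signature in backend/backend_base.py may differ). missing_overrides is a hypothetical helper name.

```python
class BackendBase:
    """Stand-in for the real base class; only two methods are shown."""

    def cpu_usage(self):
        raise NotImplementedError

    def mem(self):  # placeholder signature, assumed for illustration
        raise NotImplementedError


class Backend(BackendBase):
    """Example subclass that has only overridden cpu_usage so far."""

    def cpu_usage(self):
        return {}


def missing_overrides(cls, base, names):
    """Return the method names that cls still inherits unchanged from base."""
    return [name for name in names if getattr(cls, name) is getattr(base, name)]


remaining = missing_overrides(Backend, BackendBase, ["cpu_usage", "mem"])
print(remaining)  # ['mem']
```

Passing the full list of method names from above gives a quick to-do list for a new backend.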

backend/jobmon_config.py defines the configuration options for the back end. For example, if you've created backend/backend_jarvis.py, then you will need to set BACKEND = "jarvis".
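The naming convention suggests that the BACKEND value is mapped to a module called backend_<name>. One plausible way to do that is a dynamic import, sketched below. This is an assumption about the mechanism, not a copy of jobmon's code: load_backend is a hypothetical helper, and a stub module stands in for backend/backend_jarvis.py so the example runs on its own.

```python
import importlib
import sys
import types

BACKEND = "jarvis"  # as it would be set in backend/jobmon_config.py


def load_backend(name):
    """Import backend_<name> and return its Backend class (assumed convention)."""
    module = importlib.import_module(f"backend_{name}")
    return module.Backend


# Stand-in for backend/backend_jarvis.py so this sketch is self-contained.
stub = types.ModuleType("backend_jarvis")
stub.Backend = type("Backend", (), {})
sys.modules["backend_jarvis"] = stub

backend_class = load_backend(BACKEND)
print(backend_class.__name__)  # Backend
```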

Backend configuration

  • DATA_PATH: Location to write JSON files
  • FILE_NAME_PATTERN: File name pattern of snapshots (must include {:} for the timestamp)
  • FILE_NAME_HISTORY: File name of the history data file
  • FILE_NAME_BACKFILL: File name of the backfill data file
  • UPDATE_INTERVAL: Time between updating snapshots (seconds)
  • NODE_DEAD_TIMEOUT: How long a node should be non-responsive before marking as down
  • HISTORY_LENGTH: How much history to return in a list (seconds)
  • HISTORY_DELETE_AGE: The purge age for old history records (seconds)
  • CORE_COUNT_NODES: Which nodes contribute to the total count
  • COLUMN_ORDER_CPUS: CPUs that have column-ordered cores (row-ordered by default)
  • BF_NODES: Queues to display backfill for
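As a rough illustration, a few of these options might be set as follows. All values are hypothetical (only /var/spool/jobmon appears elsewhere in this README), and the way the pattern is joined to the data path is an assumption. The main point is that FILE_NAME_PATTERN is filled in with str.format, which is why it must contain {:}.

```python
import os

# Hypothetical values for backend/jobmon_config.py; adjust for your system.
DATA_PATH = "/var/spool/jobmon"
FILE_NAME_PATTERN = "snapshot_{:}.json.gz"  # {:} is replaced by the timestamp
UPDATE_INTERVAL = 30                        # seconds between snapshots
HISTORY_LENGTH = 3 * 24 * 3600              # seconds of history to return
HISTORY_DELETE_AGE = 7 * 24 * 3600          # seconds before old records are purged

# Example: build the path of the snapshot written at a given Unix timestamp.
timestamp = 1700000000
snapshot_path = os.path.join(DATA_PATH, FILE_NAME_PATTERN.format(timestamp))
print(snapshot_path)
```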

Frontend configuration

frontend/src/config.js contains the configuration options for the front end. You'll also want to use your own logo image: frontend/src/logo.png.

  • homepage: URL to the webpage of your computing cluster
  • address: Address of the API (script serving the JSON files)
  • apiVersion: Version number of the API
  • pageTitle: Title displayed at the top of the job monitor
  • fetchFrequency: How often to fetch data (should be the same as the backend update frequency)
  • fetchHistoryFrequency: How often to update the history array (used for displaying past statistics)
  • fetchBackfillFrequency: How often to update the backfill data
  • maintenanceAge: Trigger a maintenance message if snapshots are older than this amount of time
  • cpuKeys: Map of CPU usage types to an array index
  • historyDataCountInitial: Number of data points to load initially to generate charts
  • historyResolutionMultiplier: Factor to increase resolution by in successive refinements of chart data
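One way to read historyDataCountInitial and historyResolutionMultiplier is as a geometric refinement schedule: the chart first loads a small number of points, then repeatedly reloads with the count multiplied by the factor until all available points are shown. The sketch below illustrates that reading in Python (the front end itself is JavaScript). refinement_counts and the example numbers are hypothetical, and the actual front-end logic may differ.

```python
def refinement_counts(initial, multiplier, total):
    """Return the number of points requested at each refinement pass."""
    if initial < 1 or multiplier <= 1:
        raise ValueError("initial must be >= 1 and multiplier must be > 1")
    counts = []
    n = initial
    while n < total:
        counts.append(n)
        n *= multiplier
    counts.append(total)  # the final pass loads every available point
    return counts


schedule = refinement_counts(30, 3, 1000)
print(schedule)  # [30, 90, 270, 810, 1000]
```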

The thresholds for warnings can be tweaked:

  • warnSwap: Percentage of swap use
  • warnWait: Percentage of CPU wait time
  • warnUtil: Percentage of CPU utilisation (less than)
  • warnMem: Percentage of requested memory used
  • baseMem: Megabytes of memory per core not to count towards warning
  • baseMemSingle: Same as baseMem, but for the first core of the job
  • graceTime: Minutes to give jobs to get set up without warning
  • warningWindow: Seconds to scan for warnings
  • warningFraction: Only trigger a warning if greater than this fraction in the warningWindow have problems
  • terribleThreshold: Warning score threshold to mark jobs as "terrible"
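The interaction between warningWindow and warningFraction, and between baseMem and baseMemSingle, can be sketched as follows. This is an interpretation of the descriptions above written in Python for illustration, not the front end's actual JavaScript. The function names are hypothetical, and how the memory baseline is then combined with warnMem is not shown here.

```python
def should_warn(flags, warning_fraction):
    """Decide whether to warn, given one boolean per snapshot in the window.

    flags[i] is True if snapshot i inside the warning window had a problem.
    A warning is raised only if more than warning_fraction of them did.
    """
    if not flags:
        return False
    return sum(flags) / len(flags) > warning_fraction


def memory_baseline_mb(n_cores, base_mem, base_mem_single):
    """Memory (MB) excluded from the warning: the first core gets
    base_mem_single and every additional core gets base_mem."""
    if n_cores < 1:
        return 0
    return base_mem_single + base_mem * (n_cores - 1)


print(should_warn([True, True, True, False], 0.5))   # True (3/4 > 0.5)
print(should_warn([True, True, False, False], 0.5))  # False (2/4 is not > 0.5)
print(memory_baseline_mb(4, 100, 500))               # 800
```

Requiring a fraction of the window, rather than a single bad snapshot, avoids flagging jobs for short-lived spikes.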

The "homepage" property in package.json should be set to the URL of the job monitor page (which is most likely not the same as the computing cluster homepage).

Installation

Backend

Dependencies:

  • Python 3.9+

Dependencies for OzSTAR backend:

  • PySlurm
  • InfluxDB

Create a user to run the daemon

useradd -m jobmon

Copy the backend directory to /opt/jobmon or another directory of your choice. Alternatively, you may run the backend directly from the backend directory in the repository.

Make the data directory where JSON output is written to (set in the backend config)

mkdir /var/spool/jobmon
chown jobmon:jobmon /var/spool/jobmon

Edit the systemd service file. Add to or remove from the PYTHONPATH environment variable to suit the system being deployed to (InfluxDB and PySlurm are needed for OzSTAR). Modify WorkingDirectory if you have placed the backend somewhere other than /opt/jobmon. Copy the systemd service file to /etc/systemd/system/jobmon.service.

Install the cgi-bin Python scripts for serving the data.

cp cgi-bin/* /var/www/cgi-bin/

Frontend

Yarn (https://yarnpkg.com) is required to build the front end. It only needs to be installed on the development machine, not on the web server: once the .js files have been built, simply copy them to the web server.

Navigate to the frontend directory

cd frontend

Install dependencies

yarn install

Build the optimised production frontend

yarn build

Install the frontend by copying the contents of the build directory to the web server

cp -r build/* /var/www/html/jobmon

Running

Start the service

systemctl start jobmon

The backend writes gzipped JSON files to /var/spool/jobmon (or whatever DATA_PATH is set to in the backend config). These JSON files are read by the web app.
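To inspect one of these files by hand, Python's standard gzip and json modules are enough. The sketch below writes a tiny stand-in file to a temporary directory and reads it back. The file name and the "timestamp" key are made up for the example and do not reflect the real snapshot schema.

```python
import gzip
import json
import os
import tempfile


def read_snapshot(path):
    """Load a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Round trip with a stand-in file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "snapshot.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump({"timestamp": 1700000000}, f)
    data = read_snapshot(path)

print(data)  # {'timestamp': 1700000000}
```

On a live system, point read_snapshot at a file under the data directory instead.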

To test one cycle without writing anything to a file:

python jobmon.py test

