The front-end web interface provides users with real-time and historical monitoring information. The interface is powered by a backend server that collects data from various sources, including our own monitoring agents, the Slurm database, and the Bright server, from both the local (rusty) and remote (popeye) clusters.
The server saves the real-time data into time-series databases and backup files, and also offers tools that streamline the administrator's tasks, such as sending alerts during system health issues, suggesting possible solutions, predicting future workload, and more. The server is designed for high performance, using multiple processes and in-memory data caches to improve efficiency.
An instance of the web interface is accessible at http://mon7:8126.
The monitoring tool relies on the following data sources, which must be accessible:
Slurm commands must be executable on the node where the tool runs.
The Bright monitoring interface must be accessible. Its configuration settings are stored under the "bright" key in config/config.json.
For data to be reported to the MQTT server, the host monitoring daemon cluster_host_mon.py must be installed on all nodes. The MQTT configuration settings are stored under the "mqtt" key in config/config.json.
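To illustrate what such a host monitoring daemon does, the sketch below assembles a per-host report and serializes it for publication. The field names and topic layout here are assumptions for illustration only; the actual schema is defined by cluster_host_mon.py.

```python
import json
import socket
import time

def build_host_report():
    """Assemble a hypothetical per-host monitoring payload.

    The field names are illustrative samples, not the actual schema
    used by cluster_host_mon.py.
    """
    return {
        "hostname": socket.gethostname(),
        "timestamp": int(time.time()),
        "load1": 0.42,           # 1-minute load average (sample value)
        "mem_used_kb": 1048576,  # memory in use (sample value)
    }

payload = json.dumps(build_host_report())
# An MQTT client (e.g. paho-mqtt) would then publish this payload to a
# host-specific topic, using the broker address configured under the
# "mqtt" key in config/config.json.
```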
Our monitoring data is stored in InfluxDB, a time-series database; its configuration settings are stored under the "influxdb" key in config/config.json.
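A small sketch of how a component might read that "influxdb" section and fall back to defaults. The field names (host, port, db) and the default values are assumptions for illustration; check config/config.json for the actual keys.

```python
import json

# Hypothetical field names and defaults; the real keys live under
# "influxdb" in config/config.json.
DEFAULTS = {"host": "localhost", "port": 8086, "db": "slurmdb"}

def influxdb_settings(config):
    """Merge the 'influxdb' section of a parsed config with defaults."""
    settings = dict(DEFAULTS)
    settings.update(config.get("influxdb", {}))
    return settings

# In the monitoring tool this dict would come from
# json.load(open("config/config.json")); a literal stands in here.
sample_config = {"influxdb": {"host": "mon7"}}
settings = influxdb_settings(sample_config)
```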
On Linux, download and install the RPM package:
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.1.x86_64.rpm
sudo yum install influxdb-1.8.1.x86_64.rpm
By default, InfluxDB uses the following network ports:
- TCP port 8086 for client-server communication over InfluxDB's HTTP API
- TCP port 8088 for the RPC service used by backup and restore
The configuration file is located at /etc/influxdb/influxdb.conf, and it can be customized to specify the port and directories where the data is saved.
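For example, the HTTP port and the storage directories can be changed in /etc/influxdb/influxdb.conf. The values below are illustrative defaults for InfluxDB 1.8, not site-specific settings:

```toml
[http]
  # Port for client-server communication over the HTTP API
  bind-address = ":8086"

[data]
  # Where time-series data and the write-ahead log are stored
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
```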
Start the service:
service influxdb start
On the InfluxDB server, we can set up a monthly cron job to run the backup command:
0 2 1 * * /usr/bin/influxd backup -portable /mnt/home/yliu/ceph/influxdb/backup
In some situations, you may need to reinstall and reconfigure InfluxDB. Once that is done, restore the data with:
influxd restore -portable /mnt/home/yliu/ceph/influxdb/backup
After the data has been restored, restart InfluxDB.
## Environment setup
Use the module system to load the required packages and libraries:
module add slurm gcc/11.2.0 python/3.10
Create a Python virtual environment and install the required packages in it:
cd <dir>
python -m venv --system-site-packages env_slurm22_p310
source ./env_slurm22_p310/bin/activate
pip install -r requirements.txt
Once the Python virtual environment is set up, we install pyslurm within it.
wget https://github.com/PySlurm/pyslurm/archive/refs/tags/v22.5.1.tar.gz
tar -xzvf v22.5.1.tar.gz
Or,
git clone https://github.com/PySlurm/pyslurm.git
For the latest release information, please visit the following URL: https://github.com/PySlurm/pyslurm/releases.
Modify pyslurm/pyslurm.pyx to include two additional job attributes, 'state_reason' and 'gres_detail':

#Yanbin: add state_reason (around line 2139 of pyslurm.pyx)
if self._record.state_desc:
    Job_dict['state_reason'] = self._record.state_desc.decode("UTF-8").replace(" ", "_")
else:
    Job_dict['state_reason'] = slurm.stringOrNone(
        slurm.slurm_job_reason_string(
            <slurm.job_state_reason>self._record.state_reason
        ), ''
    )

#Yanbin: add gres_detail (around line 2192 of pyslurm.pyx)
gres_detail = []
for x in range(min(self._record.num_nodes, self._record.gres_detail_cnt)):
    gres_detail.append(slurm.stringOrNone(self._record.gres_detail_str[x], ''))
Job_dict[u'gres_detail'] = gres_detail
Then build and install pyslurm against your Slurm installation:
cd <pyslurm_source_dir>
python setup.py --slurm-lib=$SLURM_ROOT/lib64 --slurm-inc=$SLURM_ROOT/include build
python setup.py --slurm-lib=$SLURM_ROOT/lib64 --slurm-inc=$SLURM_ROOT/include install
Clone the SlurmUtil repository to your local machine:
git clone https://github.com/flatironinstitute/SlurmUtil.git
The configuration can be done via both the shell script "StartSlurmMqtMonitoring_mon7" and the configuration file "config/config.json".
Run
StartSlurmMqtMonitoring_mon7
The script starts programs defined in "cmds" such as
declare -a cmds=("python ${ScriptDir}/sm_app.py" "python ${ScriptDir}/mqttMonStream.py" "python ${ScriptDir}/mqttMon2Influx.py" "ssh -i /mnt/home/yliu/.ssh/id_sdsc -N -R 8126:localhost:8126 popeye-login2.sdsc.edu" "python ${ScriptDir}/brightRelay.py")
The command "python ${ScriptDir}/sm_app.py" starts a web server at http://localhost:${port}, where "port" is configured in "config/config.json".
The command "python ${ScriptDir}/mqttMonStream.py" launches an MQTT client that receives monitoring data from an MQTT server, sends the data to web servers, and saves it in files. The configuration of it can be found under the "mqtt" key in the "config/config.json" configuration file."
The command "python ${ScriptDir}/mqttMon2Influx.py" launches another MQTT client that receives monitoring data from an MQTT server, formats the data and sends it to a InfluxDB server. The configuration of it can be found under the "influxdb" key in the "config/config.json" configuration file."
The command "python ${ScriptDir}/brightRelay.py" executes a proxy server that queries and caches data from a bright server.
The command "ssh -i /mnt/home/yliu/.ssh/id_sdsc -N -R 8126:localhost:8126 popeye-login2.sdsc.edu" establishes an SSH forwarding channel that enables the receipt of data from a remote cluster (popeye) where an instance of mqttMonStream.py is running.
The script and configuration file can be customized to initiate a subset of the processes mentioned earlier. For example, on popeye, "cmds" is defined as:
declare -a cmds=("python ${ScriptDir}/mqttMonStream.py -c config/config_popeye.json" "python ${ScriptDir}/mqttMon2Influx.py -c config/config_popeye.json")
Log files are generated and serve as a valuable resource for troubleshooting in case any issues arise.
If you need to restart, make sure to clean up any remaining processes before re-running the same command.
. ./StartSlurmMqtMonitoring_mon7
Run the daily.sh script once a day to save data ahead of time for improved efficiency, for example via cron:
00 07 * * * . /mnt/home/yliu/projects/slurm/utils/daily.sh > /mnt/home/yliu/projects/slurm/utils/daily_$(date +%Y-%m-%d).log 2>&1
The web server is built using CherryPy; an example instance runs at http://mon7:8126/. The set of user interfaces includes:
- Summary (http://${webserver}:8126/): a tabular summary of the Slurm worker nodes, jobs, and users.
- Host Utils (http://${webserver}:8126/utilHeatmap): a heatmap of worker nodes' and GPUs' utilization.
- Pending Jobs (http://${webserver}:8126/pending): a table of pending jobs and related cluster partition information.
- Sunburst Graph (http://${webserver}:8126/sunburst): a sunburst graph of the Slurm accounts, users, jobs, and worker nodes.
- File Usage (http://${webserver}:8126/usageGraph): a chart of users' file count and byte usage.
- Bulletin Board (http://${webserver}:8126/bulletinboard): a set of tables including running jobs and allocated nodes with low resource utilization, running jobs with unbalanced resource usage, errors reported from different components of the system, etc.
- Report (http://${webserver}:8126/report): generates reports of the cluster's resource usage.
- Forecast (http://${webserver}:8126/forecast): forecasts future cluster usage.
- Settings (http://${webserver}:8126/settings): modifies the settings that control the display of data.
- Search (http://${webserver}:8126/search): searches the Slurm entities' information.
Through the links embedded in these user interfaces, you can also view the detailed information and resource usage of a specific worker node, job, user, partition, and so on.
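As a toy illustration of the Forecast page's task, projecting future usage from history, here is a naive moving-average extrapolation. The numbers are made up and this is not the tool's actual forecasting model:

```python
def moving_average_forecast(history, window=3, steps=2):
    """Forecast future values by repeatedly appending the mean of the
    last `window` observations (a naive baseline, for illustration only)."""
    values = list(history)
    for _ in range(steps):
        values.append(sum(values[-window:]) / window)
    return values[len(history):]

# e.g. daily node-hours used over the past week (sample numbers)
usage = [100, 120, 110, 130, 125, 135, 128]
forecast = moving_average_forecast(usage, window=3, steps=2)
```

A production forecaster would typically account for trend and weekly seasonality, but the shape of the problem, history in, projected values out, is the same.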