Overview of the ATOM monitoring framework (installation, components, and documentation)
Reducing energy consumption is a leading design constraint for current and future HPC systems. Aside from investing in energy-efficient hardware, optimizing applications is key to substantially reducing the energy consumption of HPC clusters. Software developers, however, are usually in the dark when it comes to the energy consumption of their applications; HPC clusters rarely provide capabilities to monitor energy consumption at a fine-grained level. Predicting the energy consumption of specific applications is even more difficult when the allocated hardware resources vary from one execution to the next. To lower the hurdle of energy-aware development, we present ATOM, a light-weight neAr-real Time mOnitoring fraMework.
We maintain a public database that holds monitoring data for experiments performed on our EXCESS cluster, or submitted via the RESTful API service as a post-analysis step. Feel free to explore our data here.
- Fundamental set of diverse plugins (from network monitoring and CPU performance to embedded system monitoring)
- Offers stable update intervals for individual plugins down to 20 ms
- Agent-based architecture where a light-weight agent is installed on each target system to be monitored
- Each agent can be configured individually to report system-specific metric data
- Sophisticated RESTful service to allow data exchange: API
- Framework can be used stand-alone, in combination with the HPC resource manager TORQUE, or linked to the runtime system StarPU
- Network monitoring: infiniband
- Memory usage: meminfo
- Embedded system support (Movidius): movidius
- Embedded system support (ACME): acme
- Virtual memory statistics: vmstat
- GPU support: Nvidia GPUs
- Standard performance metrics: PAPI-C
- Power and energy monitoring: RAPL
- CPU temperature data: sensors
For a more detailed introduction to plugins, please read our introductory page.
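The feature list mentions individually configured agents and per-plugin update rates. As a rough illustration only (this is not ATOM's agent code, which this overview does not show), the Python sketch below simulates how an agent could interleave plugins that sample at different intervals. The plugin names are taken from the list above; the intervals and the `schedule` helper are hypothetical.

```python
import heapq


def schedule(plugins, duration_ms):
    """Return (time_ms, plugin_name) sampling events in chronological order.

    `plugins` maps a plugin name to its sampling interval in milliseconds.
    Time is simulated, so nothing sleeps and nothing is actually measured.
    """
    # Each heap entry is (next_due_time_ms, plugin_name).
    queue = [(0, name) for name in sorted(plugins)]
    heapq.heapify(queue)
    events = []
    while queue:
        due, name = heapq.heappop(queue)
        if due >= duration_ms:
            continue  # outside the simulated window: stop rescheduling
        events.append((due, name))
        heapq.heappush(queue, (due + plugins[name], name))
    return events


# Hypothetical per-agent configuration: sampling interval in ms per plugin.
config = {"meminfo": 50, "sensors": 100, "vmstat": 20}
events = schedule(config, duration_ms=100)

for time_ms, name in events:
    print(f"{time_ms:4d} ms  sample {name}")
```

A priority queue keeps the loop simple: the plugin with the earliest due time is always served next, so plugins with short intervals do not have to wait for slower ones.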
Each component of the monitoring framework comes with its own setup and start scripts. We also offer a convenient installation of a small testbed composed of three virtual machines (one server, two workers) using Vagrant and Ansible. For more information, please refer to our README. If you have all dependencies installed, running
git clone https://github.com/excess-project/monitoring-setup-ansible.git
cd monitoring-setup-ansible
vagrant up
starts the setup process.
The API can be used to send metric data to, and retrieve it from, the monitoring database through the monitoring server. We have compiled a set of resources to get you up to speed:
- HTML-based documentation
- Sample API clients written in C and Python
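To give a feel for what sending a metric sample over a RESTful interface can look like, here is a minimal Python sketch using only the standard library. The base URL, the `/metrics/<experiment>` route, and the field names in the sample are assumptions made up for illustration; the real routes and payload format are defined in the HTML-based documentation and the sample clients listed above.

```python
import json
import urllib.request


def build_metric_request(base_url, experiment_id, sample):
    """Build (but do not send) a POST request carrying one metric sample."""
    body = json.dumps(sample).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/metrics/{experiment_id}",  # hypothetical route
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical sample; the field names are illustrative, not ATOM's schema.
sample = {"host": "node01", "type": "meminfo", "mem_used": 42.5}
request = build_metric_request("http://localhost:3000", "demo-experiment", sample)

print(request.get_method(), request.full_url)
# Sending requires a reachable monitoring server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes the payload easy to inspect before pointing the client at a real server.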