monitoring-framework

Overview of the ATOM monitoring framework (installation, components, and documentation)

Introduction

Reducing energy consumption is a leading design constraint of current and future HPC systems. Aside from investing in energy-efficient hardware, optimizing applications is key to substantially reducing the energy consumption of HPC clusters. Software developers, however, are usually in the dark when it comes to the energy consumption of their applications; HPC clusters rarely provide capabilities to monitor energy consumption at a fine-grained level. Predicting the energy consumption of specific applications is even more difficult when the allocated hardware resources vary with each execution. To lower the hurdle of energy-aware development, we present ATOM, a light-weight neAr-real Time mOnitoring fraMework.

Experiment Database

We maintain a public database that holds monitoring data for experiments performed on our EXCESS cluster, or submitted via the RESTful API service as a post-analysis step. Feel free to explore our data here.

Features

Fundamental set of diverse plugins (ranging from network monitoring and CPU performance to embedded system monitoring)
Offers stable update intervals for individual plugins down to 20 ms
Agent-based architecture where light-weight agents are installed on each target system to be monitored
Each agent can be configured individually to report system-specific metric data
Sophisticated RESTful service to allow data exchange: API
Framework can be used stand-alone, in combination with the HPC resource manager TORQUE, or linked to the runtime system StarPU
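
To illustrate the idea of one agent driving several plugins at individual update intervals, here is a minimal sketch in Python. It is not ATOM's agent code: the function name `simulate_agent`, the simulated clock, and the placeholder sample values are assumptions made for this example only.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# A plugin is described here by (interval in ms, sampling function).
# This pairing is an assumption for illustration, not ATOM's plugin interface.
Plugin = Tuple[int, Callable[[], dict]]


def simulate_agent(plugins: Dict[str, Plugin], duration_ms: int) -> List[Tuple[int, str, dict]]:
    """Sample every plugin at its own interval over a simulated time span.

    Returns a list of (timestamp_ms, plugin_name, sample) tuples in time order.
    """
    # Priority queue of (next due time in ms, plugin name); all plugins start at t = 0.
    queue = [(0, name) for name in sorted(plugins)]
    heapq.heapify(queue)

    samples = []
    while queue and queue[0][0] < duration_ms:
        due_ms, name = heapq.heappop(queue)
        interval_ms, sample = plugins[name]
        samples.append((due_ms, name, sample()))
        # Reschedule relative to the due time (not the finish time) to avoid drift.
        heapq.heappush(queue, (due_ms + interval_ms, name))
    return samples


# Placeholder sampling functions; a real plugin would read hardware counters.
example_plugins = {
    "rapl": (20, lambda: {"power_watts": 0.0}),
    "meminfo": (100, lambda: {"mem_used_percent": 0.0}),
}
```

Calling `simulate_agent(example_plugins, duration_ms=100)` yields five `rapl` samples and one `meminfo` sample. Rescheduling from the due time rather than from the moment a sample finishes is one simple way to keep intervals stable.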

Plugins

Network monitoring: infiniband
Memory usage: meminfo
Embedded system support (Movidius): movidius
Embedded system support (ACME): acme
Virtual memory statistics: vmstat
GPU support: Nvidia GPUs
Standard performance metrics: PAPI-C
Power and energy monitoring: RAPL
CPU temperature data: sensors

For a more detailed introduction to plugins, please read our introductory page.
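
As a rough illustration of what a memory-usage plugin such as meminfo might collect, the following Python sketch parses text in the Linux `/proc/meminfo` format. It assumes that this file is the data source; the helper names `parse_meminfo` and `memory_used_percent` are hypothetical and do not come from the ATOM code base.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse '/proc/meminfo'-style text into a mapping of field name to integer value.

    Most values are reported in kB; the unit suffix is dropped here.
    """
    metrics = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'Key: value' pairs
        fields = rest.split()
        if not fields or not fields[0].isdigit():
            continue  # skip empty or non-numeric values
        metrics[key.strip()] = int(fields[0])
    return metrics


def memory_used_percent(metrics: Dict[str, int]) -> float:
    """Share of memory in use, derived from the MemTotal and MemAvailable fields."""
    total = metrics["MemTotal"]
    available = metrics["MemAvailable"]
    return 100.0 * (total - available) / total


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read and parse the meminfo file (Linux only)."""
    with open(path, "r", encoding="ascii") as handle:
        return parse_meminfo(handle.read())
```

On a Linux host, `memory_used_percent(read_meminfo())` gives a single value that an agent could report at each update interval.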

Get Started

Each component of the monitoring framework comes with its own setup and start scripts. We also offer a convenient installation of a small testbed composed of three virtual machines (one server, two workers) using Vagrant and Ansible. For more information, please refer to our README. If you have all dependencies installed, a simple

git clone https://github.com/excess-project/monitoring-setup-ansible.git
cd monitoring-setup-ansible
vagrant up

starts the setup process.
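
Before running the commands above, it can help to confirm that the tools they rely on are available. The Python sketch below checks only the three tools named in this section (git, Vagrant, Ansible); the complete dependency list is in the setup README and may include more. The helper `missing_tools` is a hypothetical name introduced for this example.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tools named in this section; the setup README may list further dependencies.
REQUIRED_TOOLS = ("git", "vagrant", "ansible")


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All listed tools were found on the PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; by default the check uses `shutil.which` from the standard library.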

RESTful API

The API can be used to send metric data to, and retrieve it from, the monitoring database through the monitoring server. We have compiled a set of resources to get you up to speed:

HTML-based documentation
Sample API clients written in C and Python
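
For orientation, sending a metric sample to a RESTful service generally amounts to an HTTP POST with a JSON body. The Python sketch below only builds such a request with the standard library. The base URL, the `/metrics/<experiment_id>` path, and the payload fields are hypothetical placeholders, not the documented ATOM routes; please take the real endpoints from the HTML-based documentation.

```python
import json
import urllib.request


def build_metric_request(base_url: str, experiment_id: str, sample: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying one metric sample.

    NOTE: the '/metrics/<experiment_id>' path is a placeholder assumption,
    not the route defined by the ATOM API.
    """
    url = f"{base_url.rstrip('/')}/metrics/{experiment_id}"
    body = json.dumps(sample).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and payload, for illustration only.
example_request = build_metric_request(
    "http://localhost:3000/",
    "example-experiment",
    {"type": "meminfo", "mem_used_percent": 42.5},
)
```

Passing the request to `urllib.request.urlopen` would send it. Building and sending are kept separate here so that the payload can be inspected without a running server.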
