Overview of the EXCESS monitoring framework (installation, components, and documentation)

monitoring-framework

Overview of the ATOM monitoring framework (installation, components, and documentation)

Introduction

Reducing energy consumption is a leading design constraint of current and future HPC systems. Aside from investing in energy-efficient hardware, optimizing applications is key to substantially reducing the energy consumption of HPC clusters. Software developers, however, are usually in the dark when it comes to the energy consumption of their applications; HPC clusters rarely provide capabilities to monitor energy consumption at a fine-grained level. Predicting the energy consumption of specific applications is even more difficult when the allocated hardware resources vary with each execution. To lower the hurdle of energy-aware development, we present ATOM, a light-weight neAr-real Time mOnitoring fraMework.

Experiment Database

We maintain a public database that holds monitoring data for experiments performed on our EXCESS cluster, or submitted via the RESTful API service as a post-analysis step. Feel free to explore our data here.

Features

  • Fundamental set of diverse plugins (ranging from network monitoring and CPU performance to embedded system monitoring)
  • Offers stable update intervals as short as 20 ms for individual plugins
  • Agent-based architecture in which a light-weight agent is installed on each target system to be monitored
  • Each agent can be configured individually to report system-specific metric data
  • Sophisticated RESTful service to allow data exchange: API
  • Framework can be used stand-alone, in combination with the HPC resource manager TORQUE, or linked to the runtime system StarPU
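
To illustrate what a stable update interval means for a plugin, here is a minimal Python sketch of a fixed-interval sampling loop. It is not ATOM's agent code: the function `sample_at_fixed_interval`, its parameters, and the shape of the returned readings are all hypothetical and chosen only for illustration. The idea shown is to schedule each sample against an absolute deadline, so that the time spent reading a metric does not accumulate as drift.

```python
import time
from typing import Callable, Dict, List


def sample_at_fixed_interval(
    read_metric: Callable[[], float],
    interval_s: float,
    samples: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, float]]:
    """Collect `samples` readings spaced `interval_s` seconds apart.

    Hypothetical helper, not part of ATOM. Each sleep is computed from an
    absolute deadline rather than a fixed pause, so the cost of reading the
    metric does not shift later samples.
    """
    readings: List[Dict[str, float]] = []
    next_deadline = clock()
    for _ in range(samples):
        readings.append({"timestamp": clock(), "value": read_metric()})
        next_deadline += interval_s
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)
    return readings


if __name__ == "__main__":
    # Assumed example: five samples at a 20 ms interval, using a constant
    # stand-in for a real metric source.
    for reading in sample_at_fixed_interval(lambda: 1.0, 0.020, 5):
        print(reading)
```

The `clock` and `sleep` parameters exist only so that the loop can be exercised without real waiting; a per-agent configuration could supply a different `interval_s` for each plugin.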

Plugins

For a more detailed introduction to plugins, please read our introductory page.

Get Started

Each component of the monitoring framework comes with its own setup and start scripts. We also offer a convenient installation of a small testbed composed of three virtual machines (one server, two workers) using Vagrant and Ansible. For more information, please refer to our README. If all dependencies are installed, the following commands start the setup process:

git clone https://github.com/excess-project/monitoring-setup-ansible.git
cd monitoring-setup-ansible
vagrant up
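
Before running the commands above, it can help to confirm that the required tools are on the `PATH`. The following Python sketch is one way to do that. The tool list (`git`, `vagrant`, `ansible`) is an assumption derived from the commands and tools named in this section; the setup repository's README remains the authoritative list of dependencies. The helper `missing_tools` is hypothetical.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumed tool list; consult the setup README for the real requirements.
    missing = missing_tools(["git", "vagrant", "ansible"])
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All listed tools were found on the PATH.")
```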

RESTful API

The API can be used to send metric data to, and retrieve it from, the monitoring database through the monitoring server. We have compiled a set of resources to get you up to speed: