Moneo

Description

Moneo is a distributed GPU system monitor for AI workflows.

Moneo orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU/node systems, providing useful insight into workflow- and system-level characterization.

Metrics

There are five categories of metrics that Moneo monitors (a quick spot-check example follows the list):

  1. GPU Counters
    • Compute/Memory Utilization
    • SM and Memory Clock frequency
    • Temperature
    • Power
    • ECC Counts (Nvidia)
    • GPU Throttling (Nvidia)
    • XID code (Nvidia)
  2. GPU Profiling Counters
    • SM Activity
    • Memory DRAM Activity
    • NVLink Activity
    • PCIe Rate
  3. InfiniBand Network Counters
    • IB TX/RX rate
    • IB Port errors
    • IB Link Flap
  4. CPU Counters
    • Utilization
    • Clock frequency
  5. Memory
    • Utilization
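
If you want to spot-check a few of these GPU counters outside of Moneo, DCGM's dcgmi tool can sample them directly. A minimal sketch, assuming DCGM is installed and that the usual DCGM field IDs apply (203 = GPU utilization, 150 = GPU temperature, 155 = power usage; verify the IDs against your DCGM version):

# sample GPU utilization, temperature, and power once per second, five times
dcgmi dmon -e 203,150,155 -c 5
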
Grafana Dashboards
  1. Menu: List of available dashboards.

    Note: When viewing GPU dashboards make sure to note whether you are using Nvidia or AMD GPU nodes and select the proper dashboard.

  2. Cluster View: contains min, max, average across devices for GPU/IB metrics per VM.

  3. GPU Device Counters: Detailed view of node level GPU counters.

  4. GPU Profiling Counters: Node level profiling metrics require additional overhead which may affect workload performance. Tensor, FP16, FP32, and FP64 activity are disabled by default but can be switched on by CLI command.

  5. InfiniBand Network Counters: Detailed view of node level IB network metrics.

  6. Node View: Detailed view of node level CPU, Memory, and Network metrics.

Minimum Requirements

  • python >=3.7 installed
  • docker installed
  • ansible installed (python module)
  • OS Support:
    • Ubuntu 18.04, 20.04
    • AlmaLinux 8.6
  • Nvidia architectures supported (only for Nvidia GPU monitoring):
    • Volta
    • Ampere

Setup

Run the following commands on a dev box (could be one of the master/worker nodes or a local node):

# get the code
git clone https://github.com/Azure/Moneo.git
cd Moneo

# install dependencies
python3 -m pip install ansible
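
To confirm the minimum requirements are met on the dev box, a quick check with standard version flags (nothing Moneo-specific):

# verify the prerequisites
python3 --version   # should report >= 3.7
docker --version
ansible --version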

Configuration

Prepare a config file host.ini listing all master/worker nodes. Here's an example:

[master]
192.168.0.100

[worker]
192.168.0.100
192.168.0.101
192.168.0.110

[all:vars]
ansible_user=username
ansible_ssh_private_key_file=/path/to/key
ansible_ssh_common_args='-o StrictHostKeyChecking=no'

If you have already configured passwordless SSH, the [all:vars] section can be skipped.

Please refer to the Ansible Inventory docs for more complex cases.
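
Before deploying, you can verify that Ansible can reach every node listed in host.ini. This uses the standard Ansible ping module rather than a Moneo command:

# confirm SSH connectivity to all master/worker nodes
ansible -i host.ini all -m ping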

Usage

Moneo CLI

To make deploying and shutting down easier, we provide the Moneo CLI.

It can be accessed as follows:

  • python3 moneo.py --help

CLI Usage

  • python3 moneo.py [-d/--deploy] [-c HOST_INI] {manager,workers,full}
  • python3 moneo.py [-s/--shutdown] [-c HOST_INI] {manager,workers,full}
  • python3 moneo.py [-j JOB_ID ] [-c HOST_INI]
  • e.g. python3 moneo.py -d -c ./host.ini full
Flag | Options/arguments | Description
-d, --deploy | none | Deploy option. Requires the config file to be specified (e.g. -c host.ini) or to be present in the Moneo directory.
-s, --shutdown | none | Shutdown option. Requires the config file to be specified (e.g. -c host.ini) or to be present in the Moneo directory.
-c, --host_ini | path + file name | Filepath and name of the Ansible config file. The default is host.ini in the Moneo directory.
-j, --job_id | job ID | Job ID for filtering metrics by job group. A host.ini file is required. Cannot be specified during deployment or shutdown.
-p, --profiler_metrics | none | Enable profiling metrics (Tensor Core, FP16, FP32, FP64 activity). Profiling metrics incur additional overhead on compute nodes.
-f, --fork_processes | number of processes | Number of processes used to deploy/shutdown/update Moneo. Increasing the process count can reduce latency when deploying to a large number of nodes. Default is 16.
-r, --container | none | Deploy the Moneo worker inside a container. Supported platform: {nvidia}
{manager,workers,full} | | Type of deployment/shutdown. Choices: {manager,workers,full}. Default: full.
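
For example, using the flags above (the host file path and job ID are illustrative):

# deploy the full stack with profiling metrics enabled
python3 moneo.py -d -p -c ./host.ini full

# tag subsequently collected metrics with a job ID (run after deployment)
python3 moneo.py -j 1234 -c ./host.ini

# shut everything down
python3 moneo.py -s -c ./host.ini full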

Access the Portal

The Prometheus and Grafana services will be started on master nodes after deployment. You can access the Grafana portal to visualize collected metrics.

There are several cases based on the networking configuration:

  • If the master node has a public IP address or domain, you can access the portal through http://master-ip-or-domain:3000 directly.

    For example, if you are deploying for Azure VM or VMSS, you can associate a public IP address to the master node, then create a fully qualified domain name (FQDN) for it.

  • If the master node does not have a public IP address to access, e.g., the VMSS is created behind a load balancer, you will need to create a proxy to access it.

    For example, you can create a SOCKS5 proxy at socks5://localhost:1080 through ssh -D 1080 -p PORT USER@IP. Then install Proxy SwitchyOmega in the Edge/Chrome browser and configure the proxy with protocol socks5, server localhost, and port 1080 for all schemes. You will then be able to navigate the portal using the master node's hostname at http://master-hostname:3000.

  • Default Grafana access:

    • username: azure
    • password: azure

    This can be changed in the "src/master/grafana/grafana.env" file.
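
Once the portal should be reachable, a quick sanity check that Grafana is serving on port 3000 (replace the host with your master node's address or hostname):

# expect an HTTP 200 or a redirect to the Grafana login page
curl -I http://master-ip-or-domain:3000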

User Docs

Known Issues

  • NVIDIA exporter may conflict with DCGMI

    There are two modes for DCGM: embedded mode and standalone mode.

    If DCGM is started in embedded mode (e.g., nv-hostengine -n, using the no-daemon option -n), the exporter will use the DCGM agent while DCGMI may return an error.

    It is recommended to start DCGM in standalone mode as a daemon, so that multiple clients such as the exporter and DCGMI can interact with DCGM at the same time, according to NVIDIA.

    Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility and lowest maintenance cost to users.
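
    A minimal sketch of running DCGM in standalone mode and confirming that a second client can reach it (standard DCGM commands; your install may manage nv-hostengine through a systemd service instead):

    # start the DCGM host engine as a standalone daemon (no -n flag)
    sudo nv-hostengine

    # verify another client (dcgmi) can talk to the same engine
    dcgmi discovery -l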

  • Moneo will attempt to install a tested version of DCGM if it is not present on the worker nodes. However, this step is skipped if DCGM is already installed, and in some instances the installed DCGM may be too old.

    This may cause the Nvidia exporter to fail. In this case it is recommended that DCGM be upgraded to at least version 2.4.4. To view which exporters are running on a worker, run ps -eaf | grep python3
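
    To check which DCGM version a worker has, assuming your dcgmi supports --version and that the package name below (the one NVIDIA commonly uses on Ubuntu) matches your install:

    # DCGM version as reported by the CLI
    dcgmi --version

    # or check the installed package on Ubuntu
    dpkg -l | grep datacenter-gpu-manager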

Troubleshooting

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
