# Metric Proxy

Metric Proxy is a Prometheus Aggregating Push Gateway designed for use in the ADMIRE project. It acts as an instrumentation proxy, collecting and aggregating metrics from various sources to provide a centralized view for monitoring and analysis.

## Install Rust

:::info Do this step only the first time you install. :::

Simply run:

```sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Answer "1" (proceed with installation) to the prompt, or customize accordingly
```
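
Once the installer finishes, you can verify the toolchain; rustup updates your shell profile, so in an already-open shell you may need to load its environment first:

```sh
# Load the environment set up by rustup (bash/zsh)
source "$HOME/.cargo/env"
# Both commands should print a version
rustc --version
cargo --version
```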

## Compile Proxy V2

### Dependencies

You need the following (see the quick check below):

- a Python interpreter in your environment (`python` or `python3`)
- `mpicc` in your environment (any MPI implementation, but the same one used by your application)
- Rust installed
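
You can quickly check that everything is available (a minimal sanity check; the exact output of `mpicc --version` depends on your MPI distribution):

```sh
# Each command should succeed and print a path or a version
command -v python3 || command -v python
mpicc --version
cargo --version
```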

### Get Sources

```sh
git clone https://github.com/besnardjb/proxy_v2.git
```

### Compile

Enter the source directory with `cd ./proxy_v2` and run `./install.sh [PREFIX]`.

:::info You may need to tweak the OpenSSL directory using `export OPENSSL_DIR="/usr/local/ssl"` if there is an issue compiling the OpenSSL crate. :::

## TL;DR

```sh
git clone https://github.com/besnardjb/proxy_v2.git
cd proxy_v2
# Install the proxy (requires mpicc in the path)
./install.sh $HOME/metric-proxy
# Add the prefix to your path
export PATH=$HOME/metric-proxy/bin:$PATH
# Run the server (and keep it running)
proxy-v2
# Run the client in another shell
proxy_run -j testls -- ls -R /
```

Then open http://localhost:1337.

## Proxy Design

The new proxy (also called V2) shares several design approaches with the first iteration, with the following changes:

- The proxy now includes an internal web GUI (by default at http://localhost:1337)
- The proxy can build a reduction tree for metrics
- Data are tracked on a per-job basis in real time
- The proxy now tracks node state (node-level metrics)

## Running the Proxy

### Running Manually

The `proxy_run` command is used to run instrumented programs. It assumes that a metric proxy is running on each of the target nodes.

The help output is as follows:

```
Usage: proxy_run [OPTIONS] [-- <COMMAND>...]

Arguments:
  [COMMAND]...  A command to run (passed after --)

Options:
  -l, --listexporters            List detected exporters
  -e, --exporter <EXPORTER>      List of exporters to activate
  -j, --jobid <JOBID>            Optional JOBID (MPI/SLURM may generate one automatically)
  -u, --unixsocket <UNIXSOCKET>  Optional path to proxy UNIX Socket
  -h, --help                     Print help
  -V, --version                  Print version
```

A basic usage is:

```sh
# MPIRUN
mpirun -np 8 proxy_run -- ./myprogram -XXX
# srun
srun -n 8 proxy_run -- ./myprogram -XXX
```
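
Since `proxy_run` only talks to an already-running proxy, on a multi-node allocation each node needs its own `proxy-v2` instance before the instrumented job starts. A hypothetical SLURM sketch (the exact launch mechanism and startup timing depend on your site):

```sh
# Start one proxy per allocated node in the background (hypothetical sketch)
srun -N "$SLURM_NNODES" --ntasks-per-node=1 proxy-v2 &
sleep 2  # give the proxies time to start listening
# Then run the instrumented program through proxy_run
srun -n 8 proxy_run -- ./myprogram -XXX
```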

The metric proxy now supports jobs: you can force a job ID using the -j flag or rely on the SLURM/mpirun defaults. Note that a job may have no jobid in some circumstances and will thus not appear in the job interface at http://localhost:1337/jobs.html.

For example, if you run:

```sh
proxy_run -j testjob -- yes
```

the job list now shows a testjob entry; you can then click on testjob to see the metrics flowing in.

:::info Both job metrics and node-level metrics are collated in these views, to make it easier to relate node performance to job performance. Note that node-level metrics are reported as-is, regardless of a possible partial allocation. :::

### Note on Job-Related Data Endpoints

Unlike the previous proxy, which exposed only a Prometheus endpoint, we have reworked our approach to expose more structured data for each job.

Job management offers the following endpoints. The job list is returned as:

```json
[
    {
        "jobid": "main",
        "command": "Sum of all Jobs",
        "size": 0,
        "nodelist": "",
        "partition": "",
        "cluster": "",
        "run_dir": "",
        "start_time": 0,
        "end_time": 0
    },
    {
        "jobid": "Node: deneb",
        "command": "Sum of all Jobs running on deneb",
        "size": 0,
        "nodelist": "deneb",
        "partition": "",
        "cluster": "",
        "run_dir": "",
        "start_time": 0,
        "end_time": 0
    }
]
```
Per-job snapshots additionally carry the counters attached to each job (the same layout as exposed at http://localhost:1337/job/?job=main):

```json
[
{
    "desc": {
        "jobid": "main",
        "command": "Sum of all Jobs",
        "size": 0,
        "nodelist": "",
        "partition": "",
        "cluster": "",
        "run_dir": "",
        "start_time": 0,
        "end_time": 0
    },
    "counters": [
        {
            "name": "proxy_cpu_total",
            "doc": "Number of tracked CPUs by individual proxies",
            "ctype": {
                "Gauge": {
                    "min": 8,
                    "max": 8,
                    "hits": 1,
                    "total": 8
                }
            }
        },
        ...
    ]
},
{
    "desc": {
        "jobid": "Node: deneb",
        "command": "Sum of all Jobs running on deneb",
        "size": 0,
        "nodelist": "deneb",
        "partition": "",
        "cluster": "",
        "run_dir": "",
        "start_time": 0,
        "end_time": 0
    },
    "counters": [
        {
            "name": "proxy_network_receive_packets_total{interface=\"docker0\"}",
            "doc": "Total number of packets received on the given device",
            "ctype": {
                "Counter": {
                    "value": 0
                }
            }
        },
        ...
    ]
}
]
```
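
Because these endpoints return plain JSON, they are easy to consume from scripts. A minimal sketch, assuming jq is installed (adjust the jq path if the endpoint returns a list of snapshots as in the example above):

```sh
# Fetch the snapshot of the aggregated "main" pseudo-job
# and print its counter names (assumes a single {desc, counters} object)
curl -s "http://localhost:1337/job/?job=main" | jq -r '.counters[].name'
```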

## Setting Alarms

You may set alarms to track values; see the example GUI at http://127.0.0.1:1337/alarms.html.

We propose the following endpoints. The alarm list is returned per target job; for example, with one alarm raised:

```json
{
  "main": [
    {
      "name": "My Alarm",
      "metric": "proxy_cpu_load_average_percent",
      "operator": {
        "More": 33
      },
      "current": 51.56660318374634,
      "active": true,
      "pretty": "My Alarm : proxy_cpu_load_average_percent (Average load on all the CPUs) = 51.56660318374634 (Min: 51.56660318374634, Max : 51.56660318374634, Hits: 1, Total : 51.56660318374634) GAUGE > 33"
    }
  ]
}
```

A second example with two alarms, where "My Alarm" is registered but not currently raised ("active": false):

```json
{
  "main": [
    {
      "name": "My Other Alarm",
      "metric": "proxy_cpu_load_average_percent",
      "operator": {
        "Less": 33
      },
      "current": 6.246700286865234,
      "active": true,
      "pretty": "My Other Alarm : proxy_cpu_load_average_percent (Average load on all the CPUs) = 6.246700286865234 (Min: 6.246700286865234, Max : 6.246700286865234, Hits: 1, Total : 6.246700286865234) GAUGE < 33"
    },
    {
      "name": "My Alarm",
      "metric": "proxy_cpu_load_average_percent",
      "operator": {
        "More": 33
      },
      "current": 6.246700286865234,
      "active": false,
      "pretty": "My Alarm : proxy_cpu_load_average_percent (Average load on all the CPUs) = 6.246700286865234 (Min: 6.246700286865234, Max : 6.246700286865234, Hits: 1, Total : 6.246700286865234) GAUGE > 33"
    }
  ]
}
```

:::info
The /alarms/add endpoint takes a JSON object:

```json
{
    "name": "My Alarm",
    "target": "main",
    "metric": "proxy_cpu_load_average_percent",
    "operation": ">",
    "value": 33
}
```

You can send it with curl:

```sh
curl -s http://localhost:1337/alarms/add \
  -H "Content-Type: application/json" \
  -d '{ "name": "My Alarm", "target": "main", "metric": "proxy_cpu_load_average_percent", "operation": ">", "value": 33 }'
```

The operation can be "<", ">", or "=" with respect to the value.
:::
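
For instance, the "My Other Alarm" entry shown in the listing above can be created with the "<" operation:

```sh
# Raise an alarm when the average CPU load drops below 33%
curl -s http://localhost:1337/alarms/add \
  -H "Content-Type: application/json" \
  -d '{ "name": "My Other Alarm", "target": "main", "metric": "proxy_cpu_load_average_percent", "operation": "<", "value": 33 }'
```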

:::info
The /alarms/del endpoint removes an alarm. It accepts two forms:

- Using the GET protocol with the following arguments:

  - targetjob: name of the job
  - name: name of the alarm

  Example with curl:

  ```sh
  # Be careful with the & in bash/sh
  curl "http://localhost:1337/alarms/del?targetjob=main&name=My%20Alarm"
  ```

- Using the POST protocol, send the following JSON:

  ```json
  {
      "target": "main",
      "name": "My Alarm"
  }
  ```

  Example with curl:

  ```sh
  curl -s http://localhost:1337/alarms/del \
    -H "Content-Type: application/json" \
    -d '{"target": "main", "name": "My Alarm"}'
  ```
:::

## Scanning Finished Jobs (Profiles)

As exposed in the example GUI, the following JSON endpoints are provided for manipulating profiles (final snapshots of jobs). For example, the profile list is grouped by command:

```json
{
    "./command_a ": [
            {
                "jobid": "test2",
                "command": "./command_a ",
                "size": 1,
                "nodelist": "",
                "partition": "",
                "cluster": "",
                "run_dir": "/XXX/proxy_v2/client",
                "start_time": 1699020416,
                "end_time": 1699020421
            }
        ],
    "./command_b ": [
            {
                "jobid": "test1",
                "command": "./command_b ",
                "size": 1,
                "nodelist": "",
                "partition": "",
                "cluster": "",
                "run_dir": "/XXX/proxy_v2/client",
                "start_time": 1699020317,
                "end_time": 1699020398
            }
        ]
}
```

http://127.0.0.1:1337/get?jobid=XXX returns a given profile; the layout is identical to the job JSON snapshot exposed at http://localhost:1337/job/?job=main.
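
For example, to retrieve and pretty-print the profile of the test2 job from the listing above (assuming jq is installed):

```sh
curl -s "http://127.0.0.1:1337/get?jobid=test2" | jq .
```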

## Adding New Scrapes using /join

It is possible to request a proxy to scrape a given target. Currently the following targets are supported:

- Another proxy: you may pass the URL of another proxy to have it collected by the current proxy
- A Prometheus exporter: its /metrics endpoint will be harvested (currently only counters and gauges are handled). Prometheus scrapes are aggregated only in "main" and in the node-specific job.

Only GET requests are supported, using the to argument. For example, http://localhost:1337/join?to=localhost:9100 adds the node exporter running on localhost (classically at http://localhost:9100), and the proxy then scrapes its metrics.

You can get the list of current scrapes at http://localhost:1337/join/list, which returns JSON such as:

```json
[
  {
    "target_url": "http://localhost:9100/metrics",
    "ttype": "Prometheus",
    "period": 5,
    "last_scrape": 1699010032
  },
  {
    "target_url": "/system",
    "ttype": "System",
    "period": 5,
    "last_scrape": 1699010032
  }
]
```
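
For example, attaching a local node exporter and then checking the scrape list (endpoints as described above; jq is only used for pretty-printing):

```sh
# Ask the proxy to scrape the node exporter on localhost:9100
curl "http://localhost:1337/join?to=localhost:9100"
# Verify that the target is now registered
curl -s "http://localhost:1337/join/list" | jq .
```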

## Acknowledgments

This project has received funding from the European Union's Horizon 2020 JTI-EuroHPC research and innovation programme under Grant Agreement number 956748.
