EdgeConnect Telemetry API Demo

Important

THIS DEMO IS NOT MEANT TO BE A PRODUCTION MONITORING SOLUTION

This example code serves to demonstrate the possibilities in retrieving telemetry data from Orchestrator and EdgeConnect and visualizing it. The demo is simple to run; however, modifying and/or incorporating components of this work into your own solution requires a greater understanding of Python, Docker, InfluxDB, Grafana, and Redis.

Not all of these components are necessarily required for developing your own solution: there may be pieces of data you don't need to collect, additional data you want to add, or alternative tools you'd rather substitute for task queuing, database, and visualization or alerting needs.

Note

The code referenced in this document and all published examples with pyedgeconnect are available from the GitHub repository within the examples folder. Each example script contains logic to authenticate to the Orchestrator as documented in the authentication example.

Clone the repository and download the examples with:

$ git clone https://github.com/aruba/pyedgeconnect.git

Overview and System Prerequisites

This example code is meant to launch an external telemetry collection and visualization demonstration for EdgeConnect SDWAN appliances.

A user can run this demo on any host system with Docker containers and point it towards an Orchestrator instance for EdgeConnect SDWAN and instantly begin visualizing many different pieces of data from discovered appliances. No knowledge of the underlying systems, Python, or other open-source software solutions leveraged in this demo is required simply to run it.

For those looking for examples on how to collect this data, all of the Python scripts used in the ec-telemetry and ec-telemetry-worker containers can be reviewed and all the API endpoints used to collect the data in the demo are documented in this overview.

If you want to get right to running it, jump to Required Configuration To Run Demo

To skip to example screenshots and dashboard overviews, jump to Dashboards

Note

References for open-source tools used in this demo:

Architecture Overview

Demo Application Architecture

EdgeConnect Telemetry Demo Architecture

Demo Directory Structure

├── app
│   ├── ec-telemetry
│   │   ├── ec_discover.py
│   │   ├── ec_telemetry.py
│   │   ├── limit_appliances.json
│   │   └── requirements.txt
│   ├── grafana
│   │   ├── dashboards
│   │   │   ├── boost_utilization.json
│   │   │   ├── demo_monitor.json
│   │   │   ├── ec_appliance.json
│   │   │   └── wan_utilization.json
│   │   └── provisioning
│   │       ├── dashboards
│   │       │   └── dashboard_provider.yml
│   │       └── datasources
│   │           └── datasources.yml
│   ├── grafana_render
│   │   └── config.json
│   └── prometheus
│       └── prometheus.yml
├── demo_cleanup.sh
├── docker-compose.monitor.yml
├── docker-compose.yml
├── Dockerfile
└── example.env

System & Environment Requirements

Orchestrator & EdgeConnect

Due to the data retrieval methods and formats used for statistics collection, the demo requires Orchestrator and EdgeConnect to be on 9.1 releases. The data model for collecting bandwidth licensing levels is written for the current tiered bandwidth model; appropriate numeric values will not be displayed for legacy Base & Plus licenses.

Supported Versions
Orchestrator    EdgeConnect ECOS
9.1.3+          9.1.1.3+

Note

The API endpoint for collecting CPU metrics from an appliance requires read/write access to the appliance. As such, the specified user for the containers to log in to the EdgeConnect appliances needs to have an admin role rather than monitor.

Demo Host & Docker

The system running the demo must have Docker Engine and Docker Compose installed. The demo was tested using Docker Engine 20.10.17 and Docker Compose 1.29.2.

This host system must have IP reachability to Orchestrator and EdgeConnect appliances. The ec-telemetry-worker containers will attempt to connect to the EdgeConnect management addresses as known by Orchestrator.

See below links for installing Docker Engine and Docker Compose:

There are many guides online for installing Docker & Docker Compose on different systems both from Docker directly as well as different providers, blogs, etc.

CPU & Memory

The CPU and memory utilized by the demo will vary based on scale. The InfluxDB container will use the most as it's memory-intensive, and it will grow further based on the volume of data stored and queries being performed against it.

The other scale factor is the number of ec-telemetry-worker containers chosen to be deployed. Each will consume 25-100MB of memory. The worker containers have resource limits of 0.50 CPU and 300M Memory per replica in the included Docker Compose file and can be modified if necessary. Note that overly constraining CPU or memory for the worker nodes will impact the telemetry collection.

In the included example.env file, the InfluxDB bucket data retention for the demo is set to 3 days (3d). This can easily be altered to the user's preference to store data for longer or shorter periods following the syntax of h for hours, d for days, w for weeks, etc.

Data Granularity & Accuracy

While the minute stats data are stored on an appliance for approximately 24 hours, other metrics included in this demo are gathered at the point in time when the appliance is collected. While the discovery container runs every minute to queue appliances to the ec-telemetry-worker containers, there is no guarantee that the worker containers will complete the task queue before the next round by the discovery container.

Factors contributing to completing the appliance queue within a minute include, but are not limited to, latency to reach appliances, the number of deployed ec-telemetry-worker containers to work on the queue, and overall host system resources where this demo is deployed determining the maximum number of ec-telemetry-worker containers.

Note

Data manipulated in the dashboard queries included in this demo (bytes to bits, bits to megabits, summing or multiplying values, etc.) is best effort, but Orchestrator should be consulted as an authoritative source.

Resource Utilization Examples

Demo VM Resource Configuration
             OS             CPU      Memory
On-Prem VM   Ubuntu 20.04   2 vCPU   4GB

Equivalent cloud instances would be an AWS t2.medium or Azure Standard_B2s.

All system utilization stats below are based on monitoring of the on-premise Ubuntu VM and are not guaranteed performance metrics. Note that larger datasets and/or time windows being queried will impact performance, as these are per-minute statistics.

For example, a single statistic for one appliance in a 3-day window would return 4,320 datapoints (3 days × 1,440 minutes per day).

Demo VM Observed Utilization
                                   16 workers / 16 appliances*       4 workers / 6 appliances
CPU                                20-25%, ~500-600GHz               10-13%,
System Memory                      3GB                               2GB
Container Memory                   ~1-2.5GB                          ~1-1.6GB
Disk I/O                           Max ~7-8MB/s, average ~0.3MB/s    tbd
Docker Images                      ~2.4GB                            ~2.4GB
Docker Volumes                     ~1G with 3d of data**             ~800M with 3d of data**
Total Disk Utilization (incl OS)   ~30GB                             ~30GB

* ~1,000 total tunnels being captured in this demo environment

** 3d is default value in example.env file for bucket retention on Influx and the value for Prometheus data retention in docker-compose.monitor.yml

Note

The above estimates include ~7% of cpu and ~400M of memory being used by cAdvisor and Prometheus monitoring the stack itself. These containers are optional for monitoring the container performance and only deployed with the additional docker-compose.monitor.yml file.

Utilization can vary greatly by use; higher interaction with the Grafana dashboards or an increased refresh rate leads to more queries and increased utilization for the Grafana and InfluxDB containers.

Containers & Versioning

Demo Containers
Container               Base Image        Host-mapped Ports   Intra-Container Ports
ec-telemetry            Ubuntu 20.04 **   --                  --
redis                   4.0.6-alpine      --                  6379
ec-telemetry-worker *   Ubuntu 20.04 **   --                  --
influxdb                2.4.0             8086                8086
grafana                 9.0.4             3000                3000
grafana-renderer        3.6.1             --                  8081

Optional Monitoring Containers
Container               Base Image        Host-mapped Ports   Intra-Container Ports
redis_q_dashboard       Ubuntu 20.04 **   9181                9181
cadvisor                cadvisor          8080                8080
prometheus              prometheus        9090                9090

Note

All containers in docker-compose.yml and docker-compose.monitor.yml are configured for restart unless-stopped

* Replica instances of the ec-telemetry-worker container are determined by the environment variable $WORKER_COUNT. All other containers are single instance.

** All Ubuntu images are built from the Dockerfile in the root directory of the demo folder. In addition to the base image, iputils-ping, dnsutils, Python3, pip, and the following additional Python packages are installed (as documented in app/ec-telemetry/requirements.txt):

pyedgeconnect
influxdb_client == 1.30.0
click >=7,<8
redis == 4.3.4
rq == 1.11.0
rq-dashboard == 0.6.1

Collection Process Logic

Discover Appliances from Orchestrator & Queue Collection

  1. ec_discover.py runs on the ec-telemetry container, and first checks if InfluxDB and Orchestrator are reachable (if not retries every 3 seconds)
  2. If reachable, proceeds to connect to Orchestrator and retrieves list of appliances with pyedgeconnect.Orchestrator.get_appliances
  3. Performs additional API calls to Orchestrator for interface label mapping, overlay configuration, and appliance licensing status/values

    • pyedgeconnect.Orchestrator.get_all_interface_labels
    • pyedgeconnect.Orchestrator.get_all_overlays_config
    • pyedgeconnect.Orchestrator.get_portal_licensed_appliances
  4. Checks limit_appliances.json file to limit telemetry collection to specified hostnames
  5. The output from the additional API calls to Orchestrator is added to each appliance item in the appliance list
  6. Each appliance object in the appliance list is enqueued as a job to the Redis Queue (a minimal sketch of this discovery/queue loop follows this list)
  7. The process repeats every 60 seconds
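
Below is a minimal sketch of that discovery and queueing loop, assuming the environment variable names from example.env. It is not the demo's actual ec_discover.py: the worker function path (ec_telemetry.collect_appliance_telemetry) and the keys attached to each appliance item are illustrative placeholders, and the reachability checks are omitted for brevity.

# Hypothetical sketch of the discovery/queue loop -- not the demo's
# ec_discover.py. Environment variable names match example.env.
import os
import time

from pyedgeconnect import Orchestrator
from redis import Redis
from rq import Queue

orch = Orchestrator(
    os.environ["ORCH_URL"],
    api_key=os.environ["ORCH_API_KEY"],
    verify_ssl=False,
)
queue = Queue(connection=Redis(host="redis", port=6379))

while True:
    appliances = orch.get_appliances()
    labels = orch.get_all_interface_labels()
    overlays = orch.get_all_overlays_config()
    licenses = orch.get_portal_licensed_appliances()

    for appliance in appliances:
        # Attach the shared Orchestrator data to each appliance item
        # (key names here are assumptions for illustration)
        appliance["interface_labels"] = labels
        appliance["overlays"] = overlays
        appliance["licenses"] = licenses
        # "ec_telemetry.collect_appliance_telemetry" is a placeholder
        # for the worker function defined in ec_telemetry.py
        queue.enqueue("ec_telemetry.collect_appliance_telemetry", appliance)

    # Repeat the discovery/queue cycle every 60 seconds
    time.sleep(60)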

Collect Data from EdgeConnect Appliances & Write to Database

Note

All tasks below are performed on an instance of the ec-telemetry-worker container for each appliance item they take from the Redis Queue

  1. Write appliance metadata and reachability status to InfluxDB
  2. Check if appliance management IP is reachable via ping (a response code of 0 is successful, 256 indicates unreachable)
  3. If ping succeeds, login to the appliance with pyedgeconnect.EdgeConnect.login
  4. If login succeeds, begin telemetry collection with the following API calls:

    • pyedgeconnect.EdgeConnect.get_appliance_stats_minute_range
    • pyedgeconnect.EdgeConnect.get_appliance_stats_minute_file
    • pyedgeconnect.EdgeConnect.get_appliance_deployment
    • pyedgeconnect.EdgeConnect.get_appliance_memory
    • pyedgeconnect.EdgeConnect.get_appliance_cpu
    • pyedgeconnect.EdgeConnect.get_appliance_subnets
    • pyedgeconnect.EdgeConnect.get_appliance_alarms
    • pyedgeconnect.EdgeConnect.get_appliance_all_tunnel_ids
    • pyedgeconnect.EdgeConnect.get_appliance_all_bonded_tunnel_ids
    • pyedgeconnect.EdgeConnect.get_appliance_all_3rdparty_tunnel_ids
    • pyedgeconnect.EdgeConnect.get_appliance_disk_usage
    • pyedgeconnect.EdgeConnect.get_appliance_system_info
    • pyedgeconnect.EdgeConnect.get_appliance_realtime_stats
  5. On completion of tasks, or if exceptions occur in communicating with the appliance, the worker will log out of the appliance via pyedgeconnect.EdgeConnect.logout
  6. All the data gathered from the appliance is written to the respective measurements in the InfluxDB database (a condensed sketch of this worker flow follows this list)
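
Below is a heavily condensed sketch of a worker task along those lines, assuming the appliance dictionary carries "ip" and "hostName" keys and using an illustrative measurement/field layout. The demo's ec_telemetry.py collects far more data and handles errors in greater detail.

# Hypothetical worker task -- a heavily condensed stand-in for
# ec_telemetry.py. Dictionary keys, measurement names, and fields
# are assumptions for illustration.
import os

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS
from pyedgeconnect import EdgeConnect


def collect_appliance_telemetry(appliance: dict) -> None:
    db = InfluxDBClient(
        url=os.environ["DB_URL"],
        token=os.environ["DB_TOKEN"],
        org=os.environ["DB_ORG"],
    )
    write_api = db.write_api(write_options=SYNCHRONOUS)

    # os.system() returns 0 when the ping succeeds, 256 when it fails
    timeout = os.environ.get("EC_TIMEOUT", "2")
    if os.system(f"ping -c 1 -W {timeout} {appliance['ip']}") != 0:
        return

    ec = EdgeConnect(appliance["ip"], verify_ssl=False)
    if not ec.login(os.environ["EC_USER"], os.environ["EC_PW"]):
        return

    try:
        memory = ec.get_appliance_memory()
        point = (
            Point("appliance_memory")  # assumed measurement name
            .tag("hostname", appliance["hostName"])
            .field("used_memory", memory.get("used", 0))  # assumed field
        )
        write_api.write(bucket=os.environ["DB_BUCKET"], record=point)
    finally:
        # Always log out, even if a collection call raises an exception
        ec.logout()
        db.close()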

Visualize Data in Grafana

  • The Grafana container comes pre-populated with multiple dashboards to visualize the data collected. The JSON for the dashboards is in the app/grafana/dashboards directory
  • The dashboards perform Flux queries against the InfluxDB datasource to visualize the appliance data (an example of running a similar query from Python is sketched after this list)
  • For more information on working with the InfluxDB Flux query language visit https://docs.influxdata.com/flux/
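
Below is an illustrative example of running a simple Flux query through the influxdb_client Python library. The measurement and tag names ("appliance_memory" and "hostname") are assumptions for illustration, not necessarily the schema used by the demo's dashboards.

# Hypothetical Flux query run through the Python client; measurement
# and tag names are illustrative, not the demo's actual schema.
import os

from influxdb_client import InfluxDBClient

flux = """
from(bucket: "ECTELEM")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "appliance_memory")
  |> filter(fn: (r) => r.hostname == "EC-01")
"""

client = InfluxDBClient(
    url="http://localhost:8086",  # http://influxdb:8086 from inside the containers
    token=os.environ["DB_TOKEN"],
    org=os.environ.get("DB_ORG", "ECTELEM"),
)
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_value())
client.close()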

Required Configuration To Run Demo

Note

If data is not populating in the Dashboards, check the recent log file ec-telemetry.log in app/logging on the demo host. Here you can find log messages related to the data collection worker containers.

Environment Variables

Warning

If fully populated, the .env file could contain sensitive token information. To refrain from committing token values to the local file, system environment values will take precedence over those referenced in the .env file. The host system can be set up with environment variables for ORCH_API_KEY, EC_PW, DB_TOKEN, and DB_PW, or any of the other environment variables referenced below for the containers, without including them in the included .env file.

Docker-compose will automatically export variables in a file named .env. Existing environment variables of the host system will always take precedence. Reference the values included in the example.env file as shown below.

Docker-compose can use an alternate file with the --env-file option e.g., $ docker-compose --env-file example.env up -d

# It is recommended to not store sensitive auth/token information
# in a static file, the below values are only present to document
# the environment variable keys the containers will reference.
# Environment variables on the host itself will take precedence
# over any included here

# Orch and EdgeConnect Connection Details
ORCH_URL=<orchestrator_ip_or_fqdn>
ORCH_API_KEY=<orch_api_key>
EC_USER=<ec_username>
EC_PW=<edgeconnect_password>
# Token for Grafana/InfluxDB Communication
# DB_USER and DB_PW used to login to both Grafana/InfluxDB
DB_TOKEN=<database_token>
DB_USER=admin
# DB_PW must be > 8 characters to meet minimum requirements
DB_PW=<database_password>

# Number of replica containers for ec-telemetry-worker
WORKER_COUNT=4

# OPTIONAL SETTINGS, KEEP DEFAULTS IF UNSURE

# Project name and local logging settings
COMPOSE_PROJECT_NAME=EC-Telemetry
LOG_LEVEL=WARNING
LOG_MAX_BYTES=1000000
LOG_MAX_BACKUPS=5
# Timeout to check reachability to EdgeConnect
EC_TIMEOUT=2
# Database Connection Details
DB_ORG=ECTELEM
DB_URL=http://influxdb:8086
DB_BUCKET=ECTELEM
DB_RETENTION=3d
# OPTIONAL CUSTOM DNS SERVER FOR CONTAINERS
# To use, un-comment the `dns:` option for containers in docker-compose.yml
# DNS_SERVER=8.8.8.8

Note

The API endpoint for collecting CPU metrics from an appliance requires read/write access to the appliance. As such, the specified user for the containers to log in to the EdgeConnect appliances needs to have an admin role rather than monitor.

Limit Appliances JSON File

By default, the telemetry script will gather data from the first four reachable appliances returned by pyedgeconnect.Orchestrator.get_appliances and write the hostnames to the file limit_appliances.json (located in app/ec-telemetry/limit_appliances.json).

To specify which appliances to collect telemetry for in advance, add the hostnames to the list appliance_subset in the limit_appliances.json file.

Default contents of limit_appliances.json file:

{
    "appliance_subset": [],
    "example": [
        "Hostname-01",
        "SITE-HQ-02",
        "MyAppliance-A"
    ],
    "comment": "List of appliance hostnames to limit telemetry collection to"
}

Example contents below where the telemetry would only be collected from two appliances REAL-HOSTNAME-01 and REAL-HOSTNAME-02 as shown under the appliance_subset key:

{
    "appliance_subset": [
      "REAL-HOSTNAME-01",
      "REAL-HOSTNAME-02"
    ],
    "example": [
        "Hostname-01",
        "SITE-HQ-02",
        "MyAppliance-A"
    ],
    "comment": "List of appliance hostnames to limit telemetry collection to"
}

The ec-telemetry directory is copied to the containers on build and is not a mounted volume. If you wish to edit the limit_appliances.json file to collect telemetry from different appliances, either modify the copy within the ec-telemetry container or modify the local file and rebuild the containers with the --build option.
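
As a minimal sketch (not the demo's exact implementation), the subset filter could be applied as follows, assuming the Orchestrator appliance list items carry a hostName key:

# Hypothetical helper applying the appliance_subset filter; the
# "hostName" key is an assumption about the Orchestrator data.
import json


def filter_appliances(appliances, path="limit_appliances.json"):
    with open(path) as f:
        limits = json.load(f)
    subset = limits.get("appliance_subset", [])
    if not subset:
        # Empty list: fall back to the default behavior
        return appliances
    return [a for a in appliances if a.get("hostName") in subset]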

Logging & Monitoring

The Python scripts for the ec-telemetry and ec-telemetry-worker containers will write to a centralized local log file located at app/logging/ec-telemetry.log on the host running the demo.

Check this file when running the demo to see the containers are properly reaching Orchestrator, InfluxDB, and the EdgeConnect appliances.

The message levels and max file size will be determined by the environment variables outlined in the Environment Variables section. These default to a logging level of ERROR, a max file size of 1MB, and a rotating maximum of 5 log files, as documented in the .env file.

LOG_LEVEL=ERROR
LOG_MAX_BYTES=1000000
LOG_MAX_BACKUPS=5

This will capture errors in the telemetry collection from Orchestrator and target appliances.

Example log messages below for two appliances, EC-01 and EC-02:

2022-09-12 22:03:33,033 - ERROR - cid: de11e3d0706e - EC-01 @ 10.41.2.125 - 2.8259: FAILED TO LOGIN
2022-09-12 22:03:34,271 - ERROR - cid: 1aa54222ea6b - EC-02 @ 10.41.0.70 - 4.0705: FAILED TO PING APPLIANCE -- ping response code: 256

The structure of the log messages is as follows:

<date & time> - <log severity> - cid: <container id> - <EdgeConnect Hostname> @ <EdgeConnect IP> - <process timer>: <log message>
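
A minimal sketch of a rotating logger driven by those environment variables might look like the following; the exact handler and formatter setup in the demo scripts may differ.

# Hypothetical logger setup using the demo's environment variables;
# the demo scripts may configure their handlers differently.
import logging
import os
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "ec-telemetry.log",  # the demo writes this under app/logging on the host
    maxBytes=int(os.environ.get("LOG_MAX_BYTES", 1000000)),
    backupCount=int(os.environ.get("LOG_MAX_BACKUPS", 5)),
)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
)
logger = logging.getLogger("ec-telemetry")
logger.setLevel(os.environ.get("LOG_LEVEL", "ERROR"))
logger.addHandler(handler)

logger.error("EC-01 @ 10.41.2.125 - FAILED TO LOGIN")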

Monitoring the container performance/utilization can be performed with the additional docker-compose.monitor.yml file as noted below which will launch cAdvisor, Prometheus, and the data will be visualized in an included dashboard in the Grafana instance.

In docker-compose.monitor.yml the included option for Prometheus data retention period is set to 3 days as seen in --storage.tsdb.retention=3d.

Run the demo

Once you've filled out the required details in your .env file or configured respective environment variables on the host system to point the demo towards an Orchestrator and EdgeConnect appliances, the following commands will start the demo infrastructure and begin telemetry collection.

$ docker-compose up -d

Note

If the user running the demo is not part of the docker group and the docker commands must be run with sudo, host environment variables will not be passed to the sub-process. To pass them, add the -E flag to sudo, e.g., $ sudo -E docker-compose up -d

Host environment variables and an environment variables file can still be combined as follows: $ sudo -E docker-compose --env-file example.env up -d where overlapping host environment variables take precedence over those in the example.env file.

If you make changes to the scripts or files in the app/ec-telemetry directory (e.g. limit_appliances.json) you'll want to have the containers rebuilt with the --build option.

$ docker-compose up -d --build

There is an optional compose file including cAdvisor and Prometheus to monitor the resource utilization of the containers themselves. A dashboard is also included in the grafana container to visualize this data. To run the additional containers you can use the following syntax to specify multiple compose files

$ docker-compose -f docker-compose.yml -f docker-compose.monitor.yml up -d

To stop all containers

$ docker-compose stop

To stop all telemetry collection workers but keep other containers running to continue to explore collected data

$ docker container stop $(docker container ls -q --filter name="ec-telemetry*")

To stop all containers and clean up volumes use the included demo_cleanup.sh script

$ bash demo_cleanup.sh

Telemetry Documentation and Dashboards

Stats Collected / Processed in this Demo

  1. Orchestrator to Appliance state/reachability over time
  2. Minute Stats Telemetry
    • tunnel stats
    • flow stats
    • boost stats
    • interface stats
    • drops stats (filtered for certain drop types)
    • interface overlay stats
  3. Current memory use
  4. Current CPU use (returns 5sec intervals for whole minute)
  5. Current route table size
  6. Number of active tunnels by type (underlay, overlay, third party)
  7. Number of total down tunnels (max of 1000 logged per type)
  8. Retrieve deployment - get configured WAN BW on interfaces
  9. Static info -- HW Model, SW Ver, License, System BW
  10. Alarm counts by type (Critical, Major, etc.)
  11. Disk utilization
  12. System info to retrieve appliance uptime
  13. Realtime stats to calculate realtime packets per second at the time of collection

Important

Alarms are collected directly from the appliance, so if alarms have been modified or disabled in Orchestrator, that will not be reflected in the data on this dashboard. For example, if the tunnel software version mis-match alarm is disabled in Orchestrator but the appliance locally has 4 minor alarms for tunnel software version mis-match, the dashboard will show 4 minor alarms while Orchestrator will show 0.

Note

There are additional minute stats files downloaded from the pyedgeconnect.EdgeConnect.get_appliance_stats_minute_file API endpoint, and more continue to be added. While this demo does not use that data, more detail about the returned data for this function is available on DevHub https://developer.arubanetworks.com/edgeconnect/docs/examples

Dashboards

When navigating the included dashboards, most panels have an info icon in the upper left corner with additional details about the panel. Click the (i) for a description of the data being presented.

EdgeConnect Telemetry

Review many detailed metrics and statistics for a single appliance.

There are many sections of data on this dashboard, including overall System stats, WAN and LAN utilization analysis, data over overlays, tunnel counts and distribution, overlay tunnel pre & post FEC & POC, as well as a dashboard variable to look at details for a specific tunnel from the selected appliance.

Dashboard variables
Variable Chained to Description
Database Bucket -- Lists discovered buckets in InfluxDB
Appliance Database Bucket Choose appliance to populate dashboard data
Peer Tunnel to Analyze Appliance Visualize tunnel details in Tunnel Analysis row
WAN Labels * Appliance Repeats WAN label graphs WAN Interfaces row
LAN Interfaces * Appliance Repeats LAN interface graphs LAN Interfaces row

* These dashboard variables are hidden as they are meant to be left to the All value

EdgeConnect Telemetry Dashboard

System WAN BW For All Appliances

System BW utilization graph for all appliances compared to deployment values & license levels. This does not currently support legacy Base & Plus licensing. Appliances with Unlimited licenses will not have a utilization % calculated against the license, but all appliances will calculate utilization % against the deployment system values.

This can help identify outliers of appliances that are running at high utilization against their configured bandwidths or licenses.

The two lower histograms help identify where the majority of appliances are transmitting as a max value within the specified time window, as well as their 95th percentile distribution in the same period.

Dashboard variables
Variable          Chained to        Description
Database Bucket   --                Lists discovered buckets in InfluxDB
Appliance         Database Bucket   Leave All to visualize data for all appliances

WAN Utilization Dashboard

Boost Utilization for All Appliances

Cumulative Boost utilization for all appliances. One heatmap shows a normalized view of % utilization of Boost used vs. the amount configured, and a second heatmap of absolute values shows which appliances are using the most.

Notably, the pie chart shows the cumulative utilization of Boost for all appliances in the time period, while the time series graph can quickly pinpoint outlying appliances that experienced insufficient Boost in the time period.

Dashboard variables
Variable          Chained to        Description
Database Bucket   --                Lists discovered buckets in InfluxDB
Appliance         Database Bucket   Leave All to visualize data for all appliances

Boost Utilization Dashboard

Docker Monitoring

Monitor resource utilization of the Docker containers for the demo, including total and per-container views of memory, CPU, and network activity.

Docker Utilization Dashboard

How To Render/Export Dashboard Screenshots

With dashboards sometimes including a large number of panels that would not fit on a screen, capturing screenshots of example data is difficult. The grafana-renderer container is included to help render an image of a full dashboard or panel.

Render a dashboard or panel to a PNG image from cURL:

1. Create an API key or Service Account with token on Grafana (viewer permissions suffice)

2. Use the following syntax in a cURL command from a system that can reach the grafana interface

$ curl "http://${DOCKER_HOST}:3000/render/d/${DASHBOARD_UID}/${DASHBOARD_NAME}?orgId=1&var-db_bucket=ECTELEM&var-ec_hostname=${APPLIANCE_HOSTNAME}&var-tunnel_analysis_peer=${PEER_TUNNEL_NAME}&var-wan_labels=All&var-lan_interfaces=All&from=now-1h&to=now&width=1640&height=5400" -H "Authorization: Bearer ${GRAFANA_API_TOKEN}" --compressed > ${FILENAME}

Example with values filled in

$ curl "http://my.docker.lab:3000/render/d/ec-appliance/edgeconnect-telemetry?orgId=1&var-db_bucket=ECTELEM&var-ec_hostname=SantaClara1-Jaurigui&var-tunnel_analysis_peer=to_WEST2-AWS_RECREATIONAL&var-wan_labels=All&var-lan_interfaces=All&from=now-1h&to=now&width=1640&height=5400" -H 'Authorization: Bearer abc123456789' --compressed > EC_DASHBOARD.png
3. The rendered dashboard will be downloaded as a PNG file (a Python equivalent of the cURL call is sketched below)
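
For reference, a rough Python equivalent of the cURL call above, using the requests library with the same placeholder host, dashboard, and token values, could look like the following sketch.

# Hypothetical equivalent of the cURL render call using requests;
# the host, dashboard UID/name, and token are placeholder values.
import requests

params = {
    "orgId": 1,
    "var-db_bucket": "ECTELEM",
    "var-ec_hostname": "SantaClara1-Jaurigui",
    "var-tunnel_analysis_peer": "to_WEST2-AWS_RECREATIONAL",
    "var-wan_labels": "All",
    "var-lan_interfaces": "All",
    "from": "now-1h",
    "to": "now",
    "width": 1640,
    "height": 5400,
}
resp = requests.get(
    "http://my.docker.lab:3000/render/d/ec-appliance/edgeconnect-telemetry",
    params=params,
    headers={"Authorization": "Bearer abc123456789"},
    timeout=120,
)
with open("EC_DASHBOARD.png", "wb") as f:
    f.write(resp.content)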

Render a dashboard or panel to a PNG image from the GUI:

Replacing appropriate values of the Grafana URL, dashboard ID and name, along with any required dashboard variables, use the following syntax in the browser to navigate to

http://${DOCKER_HOST}:3000/render/d/${DASHBOARD_UID}/${DASHBOARD_NAME}?orgId=1&var-db_bucket=ECTELEM&var-ec_hostname=${APPLIANCE_HOSTNAME}&var-tunnel_analysis_peer=${PEER_TUNNEL_NAME}&var-wan_labels=All&var-lan_interfaces=All&from=now-1h&to=now&width=1640&height=5400