RFC: Providing useful Ansible metrics with ara #187

Closed
dmsimard opened this issue Nov 14, 2020 · 6 comments

@dmsimard
Contributor

dmsimard commented Nov 14, 2020

o/

Thank you for your interest!

TL;DR: ARA Records Ansible playbooks and we can use this playbook data to provide various metrics. Would these be useful to you? How would you use them? Would you change anything? Please check the examples below, try them out and let us know what you think.

Use cases

At a high level, some of the things I expect this could be useful for:

  • general purpose time-series monitoring (e.g., fancy grafana dashboards backed by prometheus, influxdb, etc.)
  • monitoring for changes or failures (maybe you want an alert that tells you whenever there is a change or failure)
  • troubleshooting performance issues or bottlenecks in playbooks (maybe your playbook is slow because of a particular task)
  • identifying whether failures are occurring in particular places (e.g., there's this one playbook, task or host that fails much more than the others)
  • calculating the rate of success/change/failure for each playbook, task or host
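
As a rough illustration of the last use case, here is a minimal Python sketch of how a success/failure rate could be derived from per-run statuses. The `path` and `status` keys and the sample records are assumptions made up for this example, not actual ara output.

```python
from collections import Counter, defaultdict


def status_rates(records, key="path"):
    """Return {group: {status: fraction}} for a list of dict records."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[key]][record["status"]] += 1
    rates = {}
    for group, statuses in counts.items():
        total = sum(statuses.values())
        rates[group] = {status: n / total for status, n in statuses.items()}
    return rates


# Hypothetical sample data, not real ara output
sample = [
    {"path": "site.yml", "status": "completed"},
    {"path": "site.yml", "status": "failed"},
    {"path": "site.yml", "status": "completed"},
    {"path": "deploy.yml", "status": "completed"},
]
rates = status_rates(sample)
```

Passing `key="name"` (or any other field present in the records) would group on that field instead.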

The proposal and implementation

It's not just about code, it's also about UX and providing the right data in the right format.

The patches landing these new metrics commands have not been merged yet; you can find them in Gerrit: https://review.opendev.org/#/q/project:%255Erecordsansible/ara.*+topic:metrics

If you'd like to install from source including these patches to try it out:

git clone https://github.com/ansible-community/ara
cd ara
# Check out the latest patch from https://review.opendev.org/#/c/761034/
git fetch https://review.opendev.org/recordsansible/ara refs/changes/34/761034/5 && git checkout FETCH_HEAD

# install ansible and ara, including api server dependencies
python3 -m venv /tmp/ara
source /tmp/ara/bin/activate
# quote the extras so shells such as zsh don't try to expand the brackets
pip install ".[server]" ansible

# enable ansible callback plugin and run a playbook
export ANSIBLE_CALLBACK_PLUGINS=$(python3 -m ara.setup.callback_plugins)
ansible-playbook -i 'localhost,' -c local tests/integration/benchmark.yaml \
  -e benchmark_host_count=10 \
  -e benchmark_task_count=10

# get metrics (see more info and examples below)
ara playbook metrics
ara task metrics
ara host metrics

# get metrics in json, yaml or csv
ara playbook metrics -f json
ara task metrics -f yaml
ara host metrics -f csv

These commands are safe to use against your own deployment, or you can try them out against https://api.demo.recordsansible.org:

export ARA_API_CLIENT=http
export ARA_API_SERVER=https://api.demo.recordsansible.org

ara playbook metrics
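
For anyone who would rather consume the data from a script, here is a minimal Python sketch that wraps the same commands and environment variables. The helper names (`build_command`, `build_env`, `fetch_metrics`) are made up for this example, and it assumes the json formatter prints a list of objects keyed by column name.

```python
import json
import os
import subprocess


def build_command(kind, limit=None):
    """Build the argv for `ara <kind> metrics -f json`."""
    if kind not in ("playbook", "task", "host"):
        raise ValueError(f"unknown metrics kind: {kind}")
    command = ["ara", kind, "metrics", "-f", "json"]
    if limit is not None:
        command += ["--limit", str(limit)]
    return command


def build_env(server=None, base=None):
    """Copy the environment and point ara at an API server if one is given."""
    env = dict(os.environ if base is None else base)
    if server is not None:
        env["ARA_API_CLIENT"] = "http"
        env["ARA_API_SERVER"] = server
    return env


def fetch_metrics(kind, server=None, limit=None):
    """Run the ara CLI and return its JSON output as Python objects."""
    result = subprocess.run(
        build_command(kind, limit),
        env=build_env(server),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

With ara installed, `fetch_metrics("playbook", server="https://api.demo.recordsansible.org")` would run the equivalent of the commands above and hand back the parsed rows.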

Examples:

# Return metrics about more than the last 1000 playbooks
ara playbook metrics --limit 10000

# Return playbook metrics in json or csv
ara playbook metrics -f json
ara playbook metrics -f csv

# Return metrics about playbooks matching a (full or partial) path
ara playbook metrics --path site.yml

# Return metrics for playbooks matching a label
ara playbook metrics --label "check:False"

# Return additional metrics without truncating paths
ara playbook metrics --long

[Image: ara-playbook-metrics]

Help:

$ ara playbook metrics --help
usage: ara playbook metrics [-h] [-f {csv,json,table,value,yaml}] [-c COLUMN]
                            [--quote {all,minimal,none,nonnumeric}]
                            [--noindent] [--max-width <integer>] [--fit-width]
                            [--print-empty] [--sort-column SORT_COLUMN]
                            [--client <client>] [--server <url>]
                            [--timeout <seconds>] [--username <username>]
                            [--password <password>] [--insecure]
                            [--aggregate {name,path}] [--label <label>]
                            [--name <name>] [--path <path>]
                            [--status <status>] [--long] [--order <order>]
                            [--limit <limit>]

Provides metrics about playbooks

optional arguments:
  -h, --help            show this help message and exit
  --client <client>     API client to use, defaults to ARA_API_CLIENT or
                        'offline'
  --server <url>        API server endpoint if using http client, defaults to
                        ARA_API_SERVER or 'http://127.0.0.1:8000'
  --timeout <seconds>   Timeout for requests to API server, defaults to
                        ARA_API_TIMEOUT or 30
  --username <username>
                        API server username for authentication, defaults to
                        ARA_API_USERNAME or None
  --password <password>
                        API server password for authentication, defaults to
                        ARA_API_PASSWORD or None
  --insecure            Ignore SSL certificate validation, defaults to
                        ARA_API_INSECURE or False
  --aggregate {name,path}
                        Aggregate playbooks by path or by name. Defaults to
                        path.
  --label <label>       List playbooks matching the provided label
  --name <name>         List playbooks matching the provided name (full or
                        partial)
  --path <path>         List playbooks matching the provided path (full or
                        partial)
  --status <status>     List playbooks matching a specific status
                        ('completed', 'running', 'failed')
  --long                Don't truncate paths and include additional fields:
                        name, plays, files, records
  --order <order>       Orders playbooks by a field ('id', 'created',
                        'updated', 'started', 'ended', 'duration')
                        Defaults to '-started' descending so the most recent
                        playbook is at the top.
                        The order can be reversed by omitting the '-': ara
                        playbook list --order=started
  --limit <limit>       Returns the first <limit> determined by the ordering.
                        Defaults to ARA_CLI_LIMIT or 1000.

output formatters:
  output formatter options

  -f {csv,json,table,value,yaml}, --format {csv,json,table,value,yaml}
                        the output format, defaults to table
  -c COLUMN, --column COLUMN
                        specify the column(s) to include, can be repeated to
                        show multiple columns
  --sort-column SORT_COLUMN
                        specify the column(s) to sort the data (columns
                        specified first have a priority, non-existing columns
                        are ignored), can be repeated

CSV Formatter:
  --quote {all,minimal,none,nonnumeric}
                        when to include quotes, defaults to nonnumeric

json formatter:
  --noindent            whether to disable indenting the JSON

table formatter:
  --max-width <integer>
                        Maximum display width, <1 to disable. You can also use
                        the CLIFF_MAX_TERM_WIDTH environment variable, but the
                        parameter takes precedence.
  --fit-width           Fit the table to the display width. Implied if --max-
                        width greater than 0. Set the environment variable
                        CLIFF_FIT_WIDTH=1 to always enable
  --print-empty         Print empty table if there is no data to show.
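
A note on reading the csv output back: since `--quote` defaults to nonnumeric, Python's `csv.QUOTE_NONNUMERIC` reader mode is a natural match, and it converts unquoted fields to floats on the way in. A minimal sketch, where the column names and values in the sample are assumptions rather than real output:

```python
import csv
import io


def read_metrics_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    # With QUOTE_NONNUMERIC, the reader turns every unquoted field into a float
    reader = csv.reader(io.StringIO(text, newline=""), quoting=csv.QUOTE_NONNUMERIC)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


# Hypothetical sample, shaped like nonnumeric-quoted CSV
sample = '"path","count"\n"/srv/site.yml",3\n"/srv/deploy.yml",12\n'
rows = read_metrics_csv(sample)
```

Note that counts come back as floats (`3.0`) in this mode, so cast them with `int()` if that matters downstream.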

ara task metrics

Examples:

# Return metrics about more than the last 1000 tasks
ara task metrics --limit 10000

# Return task metrics in json or csv
ara task metrics -f json
ara task metrics -f csv

# Don't truncate paths and include additional task status fields
ara task metrics --long

# Return metrics about tasks from a specific playbook
ara task metrics --playbook 9001

# Return metrics for tasks matching a (full or partial) path
ara task metrics --path ansible-role-foo

# Only return metrics about a specific action
ara task metrics --action package

# Return metrics for tasks matching a name
ara task metrics --name apache

# Return metrics about the longest tasks and then sort them by total duration
ara task metrics --order=-duration --sort-column duration_total

# Aggregate metrics by task name rather than action
ara task metrics --aggregate name

# Aggregate metrics by task file rather than action
ara task metrics --aggregate path
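
To make the idea behind `--aggregate` concrete, here is a minimal Python sketch of grouping tasks by a key and summarizing their durations. It is not the actual implementation: the sample tasks are made up, durations are assumed to be plain seconds, and `duration_avg` and `duration_max` are assumed column names for illustration.

```python
from collections import defaultdict


def aggregate(tasks, by="action"):
    """Group tasks by a key and summarize their durations in seconds."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task[by]].append(task["duration"])
    return {
        key: {
            "count": len(durations),
            "duration_total": sum(durations),
            "duration_avg": sum(durations) / len(durations),
            "duration_max": max(durations),
        }
        for key, durations in groups.items()
    }


# Hypothetical tasks, not real ara output
tasks = [
    {"action": "package", "name": "Install apache", "path": "roles/web/tasks/main.yml", "duration": 4.0},
    {"action": "package", "name": "Install nginx", "path": "roles/proxy/tasks/main.yml", "duration": 6.0},
    {"action": "debug", "name": "Print facts", "path": "roles/web/tasks/main.yml", "duration": 1.0},
]
by_action = aggregate(tasks)
by_path = aggregate(tasks, by="path")
```

The same tasks give different rows depending on the key: two rows by action here, and two different rows by path.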

[Image: ara-task-metrics]

Help:

$ ara task metrics --help
usage: ara task metrics [-h] [-f {csv,json,table,value,yaml}] [-c COLUMN]
                        [--quote {all,minimal,none,nonnumeric}] [--noindent]
                        [--max-width <integer>] [--fit-width] [--print-empty]
                        [--sort-column SORT_COLUMN] [--client <client>]
                        [--server <url>] [--timeout <seconds>]
                        [--username <username>] [--password <password>]
                        [--insecure] [--aggregate {action,name,path}]
                        [--playbook <playbook_id>] [--status <status>]
                        [--name <name>] [--path <path>] [--action <action>]
                        [--long] [--order <order>] [--limit <limit>]

Provides metrics about actions in tasks

optional arguments:
  -h, --help            show this help message and exit
  --client <client>     API client to use, defaults to ARA_API_CLIENT or
                        'offline'
  --server <url>        API server endpoint if using http client, defaults to
                        ARA_API_SERVER or 'http://127.0.0.1:8000'
  --timeout <seconds>   Timeout for requests to API server, defaults to
                        ARA_API_TIMEOUT or 30
  --username <username>
                        API server username for authentication, defaults to
                        ARA_API_USERNAME or None
  --password <password>
                        API server password for authentication, defaults to
                        ARA_API_PASSWORD or None
  --insecure            Ignore SSL certificate validation, defaults to
                        ARA_API_INSECURE or False
  --aggregate {action,name,path}
                        Aggregate tasks by action, name or path. Defaults to
                        action.
  --playbook <playbook_id>
                        Filter for tasks for a specified playbook id
  --status <status>     Filter for tasks matching a specific status
                        ('completed', 'expired', 'running' or 'unknown')
  --name <name>         Filter for tasks matching the provided name (full or
                        partial)
  --path <path>         Filter for tasks matching the provided path (full or
                        partial)
  --action <action>     Filter for tasks matching a specific action/ansible
                        module (ex: 'debug', 'package', 'set_fact')
  --long                Don't truncate paths and include additional status
                        fields: completed, running, expired, unknown
  --order <order>       Orders tasks by a field ('id', 'created', 'updated',
                        'started', 'ended', 'duration')
                        Defaults to '-started' descending so the most recent
                        task is at the top.
                        The order can be reversed by omitting the '-': ara
                        task metrics --order=started
                        This influences the API request, not the ordering of
                        the metrics.
  --limit <limit>       Return metrics for the first <limit> determined by the
                        ordering. Defaults to ARA_CLI_LIMIT or 1000.

output formatters:
  output formatter options

  -f {csv,json,table,value,yaml}, --format {csv,json,table,value,yaml}
                        the output format, defaults to table
  -c COLUMN, --column COLUMN
                        specify the column(s) to include, can be repeated to
                        show multiple columns
  --sort-column SORT_COLUMN
                        specify the column(s) to sort the data (columns
                        specified first have a priority, non-existing columns
                        are ignored), can be repeated

CSV Formatter:
  --quote {all,minimal,none,nonnumeric}
                        when to include quotes, defaults to nonnumeric

json formatter:
  --noindent            whether to disable indenting the JSON

table formatter:
  --max-width <integer>
                        Maximum display width, <1 to disable. You can also use
                        the CLIFF_MAX_TERM_WIDTH environment variable, but the
                        parameter takes precedence.
  --fit-width           Fit the table to the display width. Implied if --max-
                        width greater than 0. Set the environment variable
                        CLIFF_FIT_WIDTH=1 to always enable
  --print-empty         Print empty table if there is no data to show.
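
Since `--order` and `--limit` decide which tasks are fetched rather than how the metrics are sorted, here is a small Python simulation of that selection step. In practice the selection happens in the API request; the local sort below and the sample tasks (with integer `started` values) are simplifications assumed for the example.

```python
def parse_order(order):
    """Split an order string such as '-started' into (field, descending)."""
    if order.startswith("-"):
        return order[1:], True
    return order, False


def select_tasks(tasks, order="-started", limit=1000):
    """Return the first `limit` tasks according to `order`."""
    field, descending = parse_order(order)
    ranked = sorted(tasks, key=lambda task: task[field], reverse=descending)
    return ranked[:limit]


# Hypothetical tasks, not real ara output
tasks = [
    {"id": 1, "started": 100, "duration": 30.0},
    {"id": 2, "started": 200, "duration": 1.0},
    {"id": 3, "started": 300, "duration": 2.0},
]
most_recent = [t["id"] for t in select_tasks(tasks, "-started", limit=2)]
longest = [t["id"] for t in select_tasks(tasks, "-duration", limit=2)]
```

With a limit of 2, ordering by `-started` selects tasks 3 and 2 while `-duration` selects tasks 1 and 3, so the metrics are computed over different sets of tasks. Sorting the resulting rows is then a separate step handled by `--sort-column`.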

ara host metrics

Examples:

# Return metrics about more than the last 1000 hosts
ara host metrics --limit 10000

# Return host metrics in json or csv
ara host metrics -f json
ara host metrics -f csv

# Return metrics for hosts matching a name
ara host metrics --name localhost

# Return metrics for hosts involved in a specific playbook
ara host metrics --playbook 9001

# Return metrics only for hosts with changed, failed or unreachable results
ara host metrics --with-changed
ara host metrics --with-failed
ara host metrics --with-unreachable

# Return metrics only for hosts without changed, failed or unreachable results
ara host metrics --without-changed
ara host metrics --without-failed
ara host metrics --without-unreachable

[Image: ara-host-metrics]

Help:

$ ara host metrics --help
usage: ara host metrics [-h] [-f {csv,json,table,value,yaml}] [-c COLUMN]
                        [--quote {all,minimal,none,nonnumeric}] [--noindent]
                        [--max-width <integer>] [--fit-width] [--print-empty]
                        [--sort-column SORT_COLUMN] [--client <client>]
                        [--server <url>] [--timeout <seconds>]
                        [--username <username>] [--password <password>]
                        [--insecure] [--name <name>]
                        [--playbook <playbook_id>]
                        [--with-changed | --without-changed]
                        [--with-failed | --without-failed]
                        [--with-unreachable | --without-unreachable]
                        [--order <order>] [--limit <limit>]

Provides metrics about hosts

optional arguments:
  -h, --help            show this help message and exit
  --client <client>     API client to use, defaults to ARA_API_CLIENT or
                        'offline'
  --server <url>        API server endpoint if using http client, defaults to
                        ARA_API_SERVER or 'http://127.0.0.1:8000'
  --timeout <seconds>   Timeout for requests to API server, defaults to
                        ARA_API_TIMEOUT or 30
  --username <username>
                        API server username for authentication, defaults to
                        ARA_API_USERNAME or None
  --password <password>
                        API server password for authentication, defaults to
                        ARA_API_PASSWORD or None
  --insecure            Ignore SSL certificate validation, defaults to
                        ARA_API_INSECURE or False
  --name <name>         Filter for hosts matching the provided name (full or
                        partial)
  --playbook <playbook_id>
                        Filter for hosts for a specified playbook id
  --with-changed        Filter for hosts with changed results
  --without-changed     Filter out hosts without changed results
  --with-failed         Filter for hosts with failed results
  --without-failed      Filter out hosts without failed results
  --with-unreachable    Filter for hosts with unreachable results
  --without-unreachable
                        Filter out hosts without unreachable results
  --order <order>       Orders hosts by a field ('id', 'created', 'updated',
                        'name')
                        Defaults to '-updated' descending so the most recent
                        host is at the top.
                        The order can be reversed by omitting the '-': ara
                        host list --order=updated
                        This influences the API request, not the ordering of
                        the metrics.
  --limit <limit>       Return metrics for the first <limit> determined by the
                        ordering. Defaults to ARA_CLI_LIMIT or 1000.

output formatters:
  output formatter options

  -f {csv,json,table,value,yaml}, --format {csv,json,table,value,yaml}
                        the output format, defaults to table
  -c COLUMN, --column COLUMN
                        specify the column(s) to include, can be repeated to
                        show multiple columns
  --sort-column SORT_COLUMN
                        specify the column(s) to sort the data (columns
                        specified first have a priority, non-existing columns
                        are ignored), can be repeated

CSV Formatter:
  --quote {all,minimal,none,nonnumeric}
                        when to include quotes, defaults to nonnumeric

json formatter:
  --noindent            whether to disable indenting the JSON

table formatter:
  --max-width <integer>
                        Maximum display width, <1 to disable. You can also use
                        the CLIFF_MAX_TERM_WIDTH environment variable, but the
                        parameter takes precedence.
  --fit-width           Fit the table to the display width. Implied if --max-
                        width greater than 0. Set the environment variable
                        CLIFF_FIT_WIDTH=1 to always enable
  --print-empty         Print empty table if there is no data to show.
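
The `[--with-changed | --without-changed]` pairs in the usage above are how argparse renders mutually exclusive options. Here is a minimal Python sketch of that pattern together with a matching filter. It is an illustration and not the actual ara code: the parser name, the `filter_hosts` helper and the host dictionaries with per-status counters are all assumptions.

```python
import argparse

RESULTS = ("changed", "failed", "unreachable")


def build_parser():
    """Build a parser with one --with-X/--without-X exclusive pair per result."""
    parser = argparse.ArgumentParser(prog="host-filter-sketch")
    for result in RESULTS:
        group = parser.add_mutually_exclusive_group()
        group.add_argument(f"--with-{result}", action="store_true")
        group.add_argument(f"--without-{result}", action="store_true")
    return parser


def filter_hosts(hosts, args):
    """Keep hosts that satisfy every --with-*/--without-* flag set in args."""
    selected = []
    for host in hosts:
        keep = True
        for result in RESULTS:
            if getattr(args, f"with_{result}") and host[result] == 0:
                keep = False
            if getattr(args, f"without_{result}") and host[result] > 0:
                keep = False
        if keep:
            selected.append(host)
    return selected


# Hypothetical hosts, not real ara output
hosts = [
    {"name": "web1", "changed": 2, "failed": 0, "unreachable": 0},
    {"name": "web2", "changed": 0, "failed": 1, "unreachable": 0},
    {"name": "db1", "changed": 0, "failed": 0, "unreachable": 1},
]
args = build_parser().parse_args(["--with-failed", "--without-changed"])
names = [host["name"] for host in filter_hosts(hosts, args)]
```

Passing both flags of a pair, such as `--with-failed --without-failed`, makes argparse exit with a usage error, while flags from different pairs can be combined freely.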
@dmsimard
Contributor Author

dmsimard commented Nov 14, 2020

Things on the to-do list based on feedback so far:

  • rename occurrences to count (saving 6 characters in width!)
  • actually test if the format of the data is appropriate to be consumed by something like a prometheus exporter
  • experiment with what the data visualization might look like in grafana
  • consider how we provide a duration average vs mean
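
For the exporter item, a minimal Python sketch of what rendering metrics rows in the Prometheus text exposition format could look like. The metric name `ara_task_count`, the `to_prometheus` helper and the sample rows are assumptions for illustration, every value is exposed as a gauge for simplicity, and the help text is not escaped.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_prometheus(rows, metric, value_key, label_keys, help_text):
    """Render rows as gauge samples in the Prometheus text format."""
    lines = [f"# HELP {metric} {help_text}", f"# TYPE {metric} gauge"]
    for row in rows:
        labels = ",".join(
            f'{key}="{escape_label(row[key])}"' for key in label_keys
        )
        lines.append(f"{metric}{{{labels}}} {float(row[value_key])}")
    return "\n".join(lines) + "\n"


# Hypothetical rows, not real ara output
rows = [{"action": "package", "count": 3}, {"action": "debug", "count": 12}]
text = to_prometheus(rows, "ara_task_count", "count", ["action"], "Number of tasks per action")
```

Text like this could be served by a small HTTP endpoint or written to a file for a textfile collector; whether the rows map cleanly onto labels and values is exactly what needs testing.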

@dmsimard
Contributor Author

A quick iteration this morning:

  • Changed occurrences to count
  • Added missing docs for ara playbook metrics and ara host metrics
  • Added a results column for ara task metrics because the API tells us how many results there are for a task

@paulfantom

> actually test if the format of the data is appropriate to be consumed by something like a prometheus exporter

Why not add prometheus metrics exposition into ara itself and reduce the complexity of the stack?

@dmsimard
Contributor Author

@paulfantom o/

I wouldn't say no if someone contributed a prometheus metrics exporter but the data has to come from somewhere.

The first objective is to make something that is simple and useful for humans, hence the CLI with pretty tables, querying, searching, ordering and filtering. The implementation is short and hopefully simple enough; for example, see ara task metrics: https://review.opendev.org/c/recordsansible/ara/+/760736/11/ara/cli/task.py

Then, the CLI framework that we happen to use (cliff) provides flags out of the box for exporting data to csv, json or yaml instead of pretty tables. The exercise is to see if and how we can leverage this to export data to a monitoring system -- prometheus is an example. If the data is not presented in a way that's useful to those systems, I'd like to know so we can see if there are opportunities to tweak it before it lands.

I'm no expert in data or metrics and so I appreciate the feedback, thanks :)

@dmsimard dmsimard added this to the 1.5.4 milestone Dec 18, 2020
@dmsimard
Contributor Author

I spent a limited amount of time exploring what a prometheus implementation might look like with the python client. I would have liked to kill two birds with one stone with the current CLI proposal, but it's specialized enough to warrant its own integration.

The proposed CLI implementation is useful enough for humans to ship as is and we can re-visit integration with monitoring and TSDB systems later.

@dmsimard
Contributor Author

The first iteration of metrics commands landed in 1.5.4. We can close this issue and revisit it if need be.

@dmsimard dmsimard unpinned this issue Jan 20, 2021