Merge pull request #514 from earthgecko/SNAB
flux - telegraf
earthgecko committed Feb 14, 2022
2 parents da2f5ed + 4b5cc93 commit dcfdddf
Showing 14 changed files with 1,062 additions and 560 deletions.
23 changes: 19 additions & 4 deletions docs/alerts.rst
@@ -218,16 +218,31 @@ an example of the alert json POST data:
"data": {
"alert": {
"timestamp": "1579638755",
"metric": "stats.sites.graphite_access_log.httpd.rpm.total",
"value": "75.0",
"expiry": "30",
"metric": "stats.statsd.processing_time",
"value": "6.25",
"expiry": "3600",
"source": "ionosphere",
"token": "None",
"full_duration": "604800"
"full_duration": "604800",
"id": "None",
"anomaly_id": "359355",
"anomalyScore": "0.8",
"3sigma_upper": 1.1540154218897731,
"3sigma_lower": -1.0334135700379212,
"3sigma_real_lower": 0,
"yhat_upper": 1.1540154218897731,
"yhat_lower": -1.0334135700379212,
"yhat_real_lower": 0
}
}
}
The above `id` field would be populated with the `external_alerter_id` if the
alert was related to a metric that was part of a :mod:`settings.EXTERNAL_ALERTS`
alert group.
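
As a rough illustration of consuming this POST data, the sketch below normalises an alert payload in Python. Most values arrive as strings, including the literal string "None", while the sigma and yhat bounds are already numeric. The `normalise_alert` helper and the choice of which fields to cast are assumptions made for illustration and are not part of Skyline.

```python
import json

# Hypothetical helper, not part of Skyline. Which fields to cast is an assumption.
INT_FIELDS = ('timestamp', 'expiry', 'full_duration', 'anomaly_id')
FLOAT_FIELDS = ('value', 'anomalyScore')


def normalise_alert(post_body):
    """Return the alert dict with 'None' strings and numeric strings converted."""
    raw = json.loads(post_body)['data']['alert']
    # The literal string 'None' is used for unset values such as token and id
    alert = {key: (None if value == 'None' else value) for key, value in raw.items()}
    for key in INT_FIELDS:
        if alert.get(key) is not None:
            alert[key] = int(alert[key])
    for key in FLOAT_FIELDS:
        if alert.get(key) is not None:
            alert[key] = float(alert[key])
    return alert


example = json.dumps({'data': {'alert': {
    'timestamp': '1579638755',
    'metric': 'stats.statsd.processing_time',
    'value': '6.25',
    'expiry': '3600',
    'source': 'ionosphere',
    'token': 'None',
    'full_duration': '604800',
    'id': 'None',
    'anomaly_id': '359355',
    'anomalyScore': '0.8',
    '3sigma_upper': 1.1540154218897731,
}}})
alert = normalise_alert(example)
print(alert['metric'], alert['value'], alert['id'])
```

A receiver could then treat `alert['id'] is None` as meaning the metric was not part of an external alert group.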


Failures
--------

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -108,7 +108,7 @@ def setup(app):
# The short X.Y version.
version = u'2.1'
# The full version, including alpha/beta/rc tags.
release = u'2.1.0-4416'
release = u'2.1.0-4432'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
77 changes: 74 additions & 3 deletions docs/flux.rst
@@ -9,8 +9,8 @@ submits them to Graphite, so they can be pickled to Skyline for analysis in near
real time via the normal Skyline pipeline.

Flux uses falcon the bare-metal web API framework for Python to serve the API
via gunicorn. The normal Apache/nginx reverse proxy Skyline vhost is used to
serve the /flux endpoint and proxy requests to flux.
via gunicorn. The normal nginx reverse proxy Skyline vhost is used to serve the
/flux endpoint and proxy requests to flux.

It is preferable to use the POST Flux endpoint to submit metrics so that the
Skyline flux API key can be encrypted via SSL in the POST data.
@@ -99,7 +99,7 @@ As below.

.. code-block:: python
ALLOWED_CHARS = ['+', '-', '%', '.', '_', '/', '=']
ALLOWED_CHARS = ['+', '-', '%', '.', '_', '/', '=', ':']
for char in string.ascii_lowercase:
ALLOWED_CHARS.append(char)
for char in string.ascii_uppercase:
@@ -125,6 +125,8 @@ POST requests which submit multiple metrics **should be limited** to a maximum
of **450 metrics per request**.

A successful POST will respond with no content and a 204 HTTP response code.
Unsuccessful requests will respond with a 4xx code and a json body with
additional information about why the request failed.
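
One way to respect the 450 metrics per request limit is to split a larger list of metrics into batches before POSTing each one. The following is a minimal sketch of that batching only. The `chunk_metrics` helper and the field names in the sample entries are hypothetical and are not taken from the Flux code.

```python
MAX_METRICS_PER_REQUEST = 450  # the limit recommended above


def chunk_metrics(metrics, size=MAX_METRICS_PER_REQUEST):
    """Yield successive lists of at most `size` metrics."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(metrics), size):
        yield metrics[start:start + size]


# Hypothetical metric entries used purely to demonstrate the batching
metrics = [
    {'metric': 'test.metric.%d' % i, 'timestamp': 1478021700, 'value': 1.0}
    for i in range(1000)
]
batches = list(chunk_metrics(metrics))
print([len(batch) for batch in batches])
```

Each batch would then be sent as its own POST, treating a 204 as success and inspecting the json body of any 4xx response for the reason it failed.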

Here is an example of the data a single metric POST requires and an example POST
request.
@@ -230,6 +232,75 @@ parameters as defined below:
# For example:
curl -v -u username:password "https://skyline.example.org/flux/metric_data?metric=vista.nodes.skyline-1.cpu.user&timestamp=1478021700&value=1.0&key=YOURown32charSkylineAPIkeySecret"
telegraf
--------

Flux can accept metrics directly from telegraf in the telegraf json format.
To configure telegraf to send metrics to Flux you can enable the `outputs.http`
plugin in telegraf as shown below.

Firstly a few things must be pointed out:

- Flux applies the same logic that is used by the telegraf Graphite template
pattern e.g. `template = "host.tags.measurement.field"` - https://github.com/influxdata/telegraf/tree/master/plugins/serializers/graphite
- Flux replaces `.` with `_` and `/` with `-` in any tags, measurements or
fields in which these characters are found.
- The additional `outputs.http.headers` **must** be specified.
- `content_encoding` must be set to gzip.
- There are a number of status codes on which telegraf should not resubmit
data. These ensure telegraf does not attempt to send the same bad data, or
too much data, over and over again.
- If there is a long network partition between telegraf agents and Flux, some
data may be dropped, but this is often preferable to a thundering herd
swamping I/O.
To use the telegraf Graphite template pattern the following options in the
telegraf `[agent]` configuration section are required for telegraf to add the
host to the tags:

.. code-block::
## Maximum number of unwritten metrics per output.
# Ideally change this from 10000 to 1000 to minimise thundering herd issues
# metric_buffer_limit = 10000
metric_buffer_limit = 1000
## Override default hostname, if empty use os.Hostname()
hostname = ""
## If set to true, do not set the "host" tag in the telegraf agent.
omit_hostname = false
Also add the `[[outputs.http]]` section to the telegraf config as below,
replacing `<YOUR_SKYLINE_HOST>` and `<settings.FLUX_SELF_API_KEY>`:

.. code-block::
[[outputs.http]]
## URL is the Skyline flux address to send metrics to
url = "https://<YOUR_SKYLINE_HOST>/flux/metric_data_post"
## Set a long timeout because if a network partition occurs between telegraf
## and flux, telegraf will buffer and batch the metrics, and 10s of 1000s of
## metrics can then be sent when the network partition is resolved. These can
## take a while to process, as often many other telegraf instances may have
## been partitioned at the same time, so a thundering herd is sent to flux.
timeout = "120s"
method = "POST"
data_format = "json"
use_batch_format = true
json_timestamp_units = "1s"
content_encoding = "gzip"
## A list of statuscodes (<200 or >300) upon which requests should not be retried
non_retryable_statuscodes = [400, 403, 409, 413, 499, 500, 502, 503]
[outputs.http.headers]
Content-Type = "application/json"
key = "<settings.FLUX_SELF_API_KEY>"
telegraf = "true"
## Optionally you can pass a prefix e.g.
# prefix = "telegraf"
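
To test the Flux endpoint without a running telegraf agent, a request resembling what this configuration produces can be built by hand. The sketch below only constructs the headers and the gzipped body and does not send anything. The `build_flux_telegraf_request` helper is hypothetical, and the `{"metrics": [...]}` batch wrapper reflects the telegraf json batch format as commonly documented, so treat it as an assumption and verify it against your telegraf version.

```python
import gzip
import json


def build_flux_telegraf_request(metrics, api_key):
    """Return (headers, body) for a gzipped telegraf style json batch POST."""
    # Assumption: use_batch_format wraps the metrics in a "metrics" list
    body = gzip.compress(json.dumps({'metrics': metrics}).encode('utf-8'))
    headers = {
        'Content-Type': 'application/json',
        'Content-Encoding': 'gzip',
        'key': api_key,
        'telegraf': 'true',
    }
    return headers, body


# Timestamps are in seconds, matching json_timestamp_units = "1s"
metrics = [{
    'name': 'cpu',
    'tags': {'host': 'web-1', 'cpu': 'cpu-total'},
    'fields': {'usage_idle': 98.2},
    'timestamp': 1644796800,
}]
headers, body = build_flux_telegraf_request(
    metrics, 'YOURown32charSkylineAPIkeySecret')
print(headers['Content-Encoding'], len(body))
```

The resulting headers and body could be passed to any HTTP client to POST to the `/flux/metric_data_post` URL.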
populate_metric endpoint
------------------------

56 changes: 23 additions & 33 deletions docs/installation.rst
@@ -20,7 +20,7 @@ and considerations relating to the following components:
- Graphite
- Redis
- MySQL
- Apache/nginx
- nginx
- memcached

However, these types of assumptions even if stated are not useful or helpful to
@@ -64,9 +64,9 @@ What the components do
fingerprints for matching and learning things that are not anomalous. mariadb
can run on the same host as Skyline or it can be remote. Running the DB
remotely will make the Skyline a little bit slower.
- Apache/nginx - Skyline serves the webapp via gunicorn and Apache/nginx
handles endpoint routing, SSL termination and basic http auth. Ideally the
reverse proxy should be run on the same host as Skyline.
- nginx - Skyline serves the webapp via gunicorn and nginx handles endpoint
routing, SSL termination and basic http auth. Ideally the reverse proxy
should be run on the same host as Skyline.
- memcached - caches Ionosphere MySQL data, memcached should ideally be run on
the same host as Skyline.

@@ -291,47 +291,36 @@ Skyline and dependencies install
cp /opt/skyline/github/skyline/etc/skyline.conf /etc/skyline/skyline.conf
vi /etc/skyline/skyline.conf # Set USE_PYTHON as appropriate to your setup
Apache reverse proxy
~~~~~~~~~~~~~~~~~~~~

- OPTIONAL but **recommended**, serving the Webapp via gunicorn with Apache or
nginx reverse proxy. Below highlights Apache but similar steps are required
with nginx.

- Setup Apache (httpd) and see the example configuration file in your cloned
directory ``/opt/skyline/github/skyline/etc/skyline.httpd.conf.d.example``
for nginx see ``/opt/skyline/github/skyline/etc/skyline.nginx.conf.d.example``
modify all the ``<YOUR_`` variables as appropriate for you environment - see
`Apache and gunicorn <webapp.html#apache-and-gunicorn>`__
- Create a SSL certificate and update the SSL configurations in the Skyline
Apache config (or your reverse proxy)

::

SSLCertificateFile "<YOUR_PATH_TO_YOUR_CERTIFICATE_FILE>"
SSLCertificateKeyFile "<YOUR_PATH_TO_YOUR_KEY_FILE>"
SSLCertificateChainFile "<YOUR_PATH_TO_YOUR_CHAIN_FILE_IF_YOU_HAVE_ONE_OTHERWISE_COMMENT_THIS_LINE_OUT>"

- Update your reverse proxy config with the X-Forwarded-Proto header.

::
nginx reverse proxy
~~~~~~~~~~~~~~~~~~~

RequestHeader set X-Forwarded-Proto "https"
The Webapp is served via gunicorn with nginx as a reverse proxy. The steps
below highlight the nginx resources and set up that are required.

- Install and set up nginx. You will also need the `htpasswd` program, which
depending on your distribution may be provided by `httpd-tools` on rpm based
distributions or `apache2-utils` on deb based distributions.
- Create the htpasswd password file. Modify the path/name here and in the
nginx config if you wish to use a different path or name.
- Add a user and password for HTTP authentication. The user does not have to
be admin, it can be anything, e.g.

.. code-block:: bash
htpasswd -c /etc/httpd/conf.d/.skyline_htpasswd admin
htpasswd -c /etc/nginx/conf.d/.skyline_htpasswd admin
.. note:: Ensure that the user and password in the nginx htpasswd file match
the user and password that you provide in `settings.py` for
:mod:`settings.WEBAPP_AUTH_USER` and :mod:`settings.WEBAPP_AUTH_USER_PASSWORD`

- Deploy your Skyline Apache or nginx configuration file and restart httpd or
nginx

- Create a SSL certificate to use in the SSL configuration in the nginx
configuration file.
- See the example configuration file in your cloned directory
``/opt/skyline/github/skyline/etc/skyline.nginx.conf.d.example`` and modify
all the ``<YOUR_`` variables as appropriate for your environment - see
`nginx and gunicorn <webapp.html#nginx-and-gunicorn>`__
- Deploy your Skyline nginx configuration file ready to restart nginx later
when the Skyline services are started.

Skyline database
~~~~~~~~~~~~~~~~
@@ -534,6 +523,7 @@ Starting and testing the Skyline installation
"bin/python${PYTHON_MAJOR_VERSION}" /opt/skyline/github/skyline/utils/seed_data.py
deactivate
- Restart nginx with the new config.
- Check the Skyline Webapp frontend on the Skyline machine's IP address and the
appropriate port depending whether you are serving it proxied or direct, e.g
``https://YOUR_SKYLINE_IP``. The ``horizon.test.pickle`` metric anomaly should
3 changes: 3 additions & 0 deletions docs/ionosphere.rst
@@ -747,6 +747,9 @@ programmatic decisions based on the data it is provided with, things a human
operator tells it are not anomalous. Ionosphere is an attempt to give Skyline
an Apollo Program refit. Enabling the pilots to take control, have inputs.

A benefit here is that the method does not suffer from the underspecification
that is prevalent in machine learning models.

For Humans
----------

51 changes: 22 additions & 29 deletions docs/webapp.rst
@@ -38,17 +38,15 @@ Deploying the Webapp
Originally the Webapp was deployed behind the simple Flask development server,
however for numerous reasons, this is less than ideal. Although the Webapp can
still be run with Flask only, the recommended way to run the Webapp is via
gunicorn, which can be HTTP proxied by Apache or nginx, etc. The gunicorn
Webapp can be exposed just like the Flask Webapp, but it is recommended to run
it HTTP proxied.
gunicorn, which is HTTP proxied by nginx. The gunicorn Webapp can be exposed
just like the Flask Webapp, but it is recommended to run it HTTP proxied.

Using a production grade HTTP application
-----------------------------------------

It must be noted and stated that you **should** consider running the Skyline
Webapp behind a production grade HTTP application, regardless of the
implemented basic security measures. Something like Apache or nginx serving the
Webapp via gunicorn.
implemented basic security measures, with nginx serving the Webapp via gunicorn.

This may seem like overkill, however there are a number of valid reasons for
this.
@@ -66,33 +64,28 @@ production environments. See http://flask.pocoo.org/docs/0.11/deploying/
In addition to that, considering that the Webapp has MySQL in the mix, this
element adds further reason to properly secure the environment.

There is potential for XSS and SQL injection via the Webapp, ensure TRUSTED
access only.
There may be potential for XSS and SQL injection via the Webapp, therefore
ensure TRUSTED access only.

Apache and gunicorn
-------------------
nginx and gunicorn
------------------

Although there are a number of options to run a production grade wsgi frontend,
the example here will document serving gunicorn via Apache reverse proxy with
authentication. Although Apache mod_wsgi may seem like the natural fit here, in
terms of virtualenv and Python ``make_altinstall``, gunicorn has much less
external dependencies. gunicorn can be easily installed and run in any
virtualenv, therefore it keeps it within the Skyline Python environment, rather
than offloading very complex Python and mod_wsgi compiles to the user,
orchestration and package management.

Apache nd ngiinx are common webservers and gunicorn can be handled within the
Skyline package and requirements.txt

See ``etc/skyline.httpd.conf.d.example`` or ``etc/skyline.nginx.conf.d.example``
for examples Apache aand ngnx conf.d configuration files to serve the Webapp via
gunicorn and reverse proxy on port 443 with basic HTTP authentication and
restricted IP access. Note that your username and password must match in both
the Apache htpasswd and the :mod:`settings.WEBAPP_AUTH_USER`/
:mod:`settings.WEBAPP_AUTH_USER_PASSWORD` contexts as Apache/nginx will
authenticate the user and forward on the authentication details to gunicorn for
the Webapp to also authenticate the user. Authentication is enabled by default
in ``settings.py``.
the example here will document serving gunicorn via nginx reverse proxy with
authentication.

nginx is a common webserver and gunicorn is handled within the Skyline package
and requirements.txt.

See ``etc/skyline.nginx.conf.d.example`` for an example of the nginx conf.d
configuration file to serve the Webapp via gunicorn and reverse proxy on port
443 with basic HTTP authentication and restricted IP access. The config is set
to read the htpasswd file from ``/etc/nginx/conf.d/.skyline_htpasswd``. Note
that your username and password must match in both the nginx htpasswd file and
the :mod:`settings.WEBAPP_AUTH_USER`/:mod:`settings.WEBAPP_AUTH_USER_PASSWORD`
contexts, as nginx will authenticate the user and forward on the authentication
details to gunicorn for the Webapp to also authenticate the user.
Authentication is enabled by default in ``settings.py``.
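
Because the same credentials are checked twice, first by nginx and then by the Webapp, a client only needs to send one standard HTTP Basic `Authorization` header. The sketch below shows how that header is formed. The `basic_auth_header` helper and the example password are hypothetical, and the user and password must be whatever you set in both the htpasswd file and ``settings.py``.

```python
import base64


def basic_auth_header(user, password):
    """Return the HTTP Basic Authorization header for the given credentials."""
    credentials = ('%s:%s' % (user, password)).encode('utf-8')
    token = base64.b64encode(credentials).decode('ascii')
    return {'Authorization': 'Basic %s' % token}


# Hypothetical credentials for illustration only
header = basic_auth_header('admin', 'example-password')
print(header['Authorization'].split(' ')[0])
```

If the two sets of credentials differ, a request can pass the nginx check and still be rejected by the Webapp.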

Securing the Webapp
===================
32 changes: 24 additions & 8 deletions etc/skyline.nginx.conf.d.example
@@ -3,8 +3,8 @@
#
# THIS IS AN EXAMPLE
# User variables are all prefixed with <YOUR_
# This example assumes you have this Skyline host binding to port 443
# and serving SSL on the machine
# This example assumes you only have this Skyline host binding to port 443
# and serving SSL on the machine and no other nginx sites being served.
#

upstream skyline-webapp {
@@ -32,7 +32,7 @@
listen 443 ssl http2;
listen [::]:443 ssl http2;

# nginx requires the certificate plus_intermediates be in the same single
# nginx requires the certificate plus intermediates be in the same single
# file with all certificates included.
ssl_certificate <YOUR_PATH_TO_YOUR_SIGNED_CERTIFICATE_PLUS_INTERMEDIATES_FILE>;
ssl_certificate_key <YOUR_PATH_TO_YOUR_KEY_FILE>;
@@ -80,6 +80,14 @@
gzip_proxied no-cache no-store private expired auth;
gzip_min_length 1000;

## A long timeout is set specifically for flux if it needs to ingest 1000s
## of metrics from any network partitioned telegraf collectors AND if the
## user makes any long queries, very large plots or many image renders via
## the webapp.
proxy_read_timeout 120s;
proxy_connect_timeout 120s;
proxy_send_timeout 120s;

# Allow authorized IPs to access /api without auth
location /api {
satisfy any;
@@ -95,14 +103,17 @@
proxy_pass http://skyline-webapp;
}

# Only required if you are posting metrics to Skyline via HTTP
## Only required if you are posting metrics to Skyline flux via HTTP
# location /flux/ {
# # You can uncomment this in order to turn authentication off
# ## You can uncomment this in order to turn authentication off
# # auth_basic "off";
# # allow all;
# # Limit the max body size as falcon does not, the json for 450 metrics
# # with LONG names and LONG values was 92K
# client_max_body_size 120K;
# ## Limit the max body size as falcon does not, the ungzipped json for 450
# ## metrics with LONG names and LONG values was 92K. However a partitioned
# ## telegraf collector can send 10s of 1000s of metrics in a single payload
# ## when the network partition is resolved so very large payloads can be
# ## expected.
# client_max_body_size 1024K;
#
# proxy_set_header Host $host;
# proxy_set_header Forwarded $proxy_add_forwarded;
@@ -111,6 +122,11 @@
# proxy_pass http://skyline-flux/;
# }


## To create the htpasswd password file
## htpasswd -c /etc/nginx/conf.d/.skyline_htpasswd <YOUR_WEBAPP_AUTH_USER>
## Enter <YOUR_WEBAPP_AUTH_USER_PASSWORD>

auth_basic "Authorization required";
auth_basic_user_file /etc/nginx/conf.d/.skyline_htpasswd;
