Merge pull request #548 from earthgecko/SNAB
Update docs
earthgecko committed Apr 30, 2022
2 parents f4d18d4 + 744f8b4 commit fa42ece
Showing 11 changed files with 188 additions and 59 deletions.
30 changes: 16 additions & 14 deletions docs/alerts.rst
@@ -8,7 +8,7 @@ alert settings. This is due to the two classes of alerts being different,
with Analyzer, Mirage and Ionosphere alerts being related to anomalies and
Boundary alerts being related to breaches of the static and dynamic thresholds
defined for Boundary metrics. Further to this there are alert related settings
-for each alert output route, namely smtp, slack, pagerduty and sms.
+for each alert output route, namely smtp, slack, pagerduty, http and sms.

Required smtp alerter for Analyzer and Mirage metrics
=====================================================
@@ -24,15 +24,17 @@ analysis of metrics that have no alerts configured for them makes no sense as no
one wants to know. Therefore only metrics in namespaces that are defined with a
stmp alerter in :mod:`settings.ALERTS` get analysed by Mirage and Ionosphere.
It is via the smtp alert tuple that metrics get configured to be Mirage metrics
-by declaring the SECOND_ORDER_RESOLUTION_HOURS in the tuple.
+by declaring the ``SECOND_ORDER_RESOLUTION_HOURS`` in the tuple.

-However if you do not want to be SMTP alerted, you can set the
-:mod:`settings.SMTP_OPTS` to `'no_email'` as shown in an example below, but you
-must still declare the namespace with a SMTP alert tuple in
+However if you do not want emails from the SMTP alerts, you can set the
+:mod:`settings.SMTP_OPTS` to ``'no_email'`` as shown in an example below, but you
+still **must declare** the namespace with a SMTP alert tuple in
:mod:`settings.ALERTS`

-The following example, we want to alert via Slack, your :mod:`settings.ALERTS`
-and :mod:`settings.SMTP_OPTS` would need to look like this.
+In the following example, we want to alert via Slack only and not receive email
+alerts, so your :mod:`settings.ALERTS` and :mod:`settings.SMTP_OPTS` would need
+to look like this. It is important to note that all smtp alerts **must** be
+defined before other alerts, e.g. slack.

.. code-block:: python
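
    # NOTE: the commit's actual example block is collapsed in this diff view.
    # What follows is a minimal sketch based on the documented ALERTS and
    # SMTP_OPTS formats - the 'stats' namespace, sender address and values
    # are illustrative assumptions, not the commit's own example.
    ALERTS = (
        # The smtp alert tuple must be defined first; the fourth value (168)
        # sets SECOND_ORDER_RESOLUTION_HOURS, making these Mirage metrics
        ('stats', 'smtp', 1800, 168),
        ('stats', 'slack', 1800),
    )
    SMTP_OPTS = {
        'sender': 'skyline@your_domain.com',
        'recipients': {
            # 'no_email' keeps the required smtp tuple declared without
            # sending any emails
            'stats': ['no_email'],
        },
    }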
@@ -97,7 +99,7 @@ Alert settings
==============

For each 3rd party alert service e.g. Slack, PagerDuty, http_alerters, there is
-a setting to enable the specific alerter which must be set to `True` to enable
+a setting to enable the specific alerter which must be set to ``True`` to enable
the alerter:

- :mod:`settings.SYSLOG_ENABLED`
@@ -109,8 +111,8 @@ the alerter:

Analyzer, Mirage and Ionosphere related alert settings (anomaly detection) are:

-- :mod:`settings.ENABLE_ALERTS` - must be set to `True` to enable alerting
-- :mod:`settings.ENABLE_FULL_DURATION_ALERTS` - should be set to `False` if
+- :mod:`settings.ENABLE_ALERTS` - must be set to ``True`` to enable alerting
+- :mod:`settings.ENABLE_FULL_DURATION_ALERTS` - should be set to ``False`` if
Mirage is enabled. If this is set to ``True`` Analyzer will alert
on all checks sent to Mirage, even if Mirage does not find them anomalous,
this is mainly for testing.
@@ -123,9 +125,9 @@ Analyzer, Mirage and Ionosphere related alert settings (anomaly detection) are:
- :mod:`settings.SYSLOG_OPTS` - can be used to change syslog settings
- :mod:`settings.HTTP_ALERTERS_OPTS` - must be defined if you want to push
alerts to a http endpoint
-- :mod:`settings.MIRAGE_ENABLE_ALERTS` - must be set to `True` to enable alerts
+- :mod:`settings.MIRAGE_ENABLE_ALERTS` - must be set to ``True`` to enable alerts
from Mirage
-- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to `True` if you want
+- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to ``True`` if you want
to send alerts via SMS. boto3 also needs to be set up and the AWS/IAM resource
that boto3 uses needs permissions to publish to AWS SNS. See the boto3
documentation - https://github.com/boto/boto3
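
As a rough illustration of the http alerter configuration referred to above, a
minimal :mod:`settings.HTTP_ALERTERS_OPTS` sketch might look like the following
- the alerter name, endpoint and token are illustrative assumptions, check the
settings.py docstring for the authoritative format:

.. code-block:: python

    HTTP_ALERTERS_OPTS = {
        'http_alerter-example': {
            'enabled': True,
            'endpoint': 'https://alerts.example.org/alerts',
            # None, or an API token string if the endpoint requires one
            'token': None,
        },
    }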
@@ -134,7 +136,7 @@ Analyzer, Mirage and Ionosphere related alert settings (anomaly detection) are:

Boundary related alert settings (static and dynamic thresholds) are:

-- :mod:`settings.BOUNDARY_ENABLE_ALERTS` - must be set to `True` to enable
+- :mod:`settings.BOUNDARY_ENABLE_ALERTS` - must be set to ``True`` to enable
alerting
- :mod:`settings.BOUNDARY_METRICS` - must be defined to enable checks and alerts
for Boundary
@@ -149,7 +151,7 @@ Boundary related alert settings (static and dynamic thresholds) are:
Slack
- :mod:`settings.BOUNDARY_HTTP_ALERTERS_OPTS` - must be defined if you want to
push alerts to a http endpoint
-- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to `True` if you want
+- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to ``True`` if you want
to send alerts via SMS. boto3 also needs to be set up and the AWS/IAM resource
that boto3 uses needs permissions to publish to AWS SNS. See the boto3
documentation - https://github.com/boto/boto3
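
For context, SMS alerting ultimately depends on boto3 being able to publish to
AWS SNS. A minimal illustration of such a publish is shown below (the region
and phone number are placeholders and this is not Skyline's actual code path):

.. code-block:: python

    import boto3

    # Requires AWS credentials with sns:Publish permission to be configured
    client = boto3.client('sns', region_name='eu-west-1')
    client.publish(
        PhoneNumber='+15550000000',
        Message='Skyline Boundary alert: example metric breached threshold')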
6 changes: 4 additions & 2 deletions docs/algorithms/custom-algorithms.rst
@@ -54,6 +54,8 @@ given the methods and modes that can be configured.
:mod:`settings.CUSTOM_ALGORITHMS`
---------------------------------

Custom algorithms are only available in analyzer, analyzer_batch and mirage.

Custom algorithms are defined in the :mod:`settings.CUSTOM_ALGORITHMS`
dictionary. The format and key values of the dictionary are shown in the
following **example**:
@@ -72,7 +74,7 @@ following **example**:
'run_before_3sigma': True,
'run_only_if_consensus': False,
'trigger_history_override': 0,
-'use_with': ['analyzer', 'analyzer_batch', 'mirage', 'crucible'],
+'use_with': ['analyzer', 'analyzer_batch', 'mirage'],
'debug_logging': False,
},
'last_same_hours': {
@@ -107,7 +109,7 @@ following **example**:
'run_before_3sigma': True,
'run_only_if_consensus': False,
'trigger_history_override': 0,
-'use_with': ['analyzer', 'crucible'],
+'use_with': ['analyzer'],
'debug_logging': True,
},
'skyline_matrixprofile': {
39 changes: 35 additions & 4 deletions docs/boundary.rst
@@ -31,11 +31,11 @@ Boundary currently has 3 defined algorithms:

Boundary is run as a separate process just like Analyzer, horizon and
mirage. It was not envisaged to analyze all your metrics, but rather
-your key metrics in additional dimension/s. If it was run across all of your
+your key metrics with additional analysis. If it was run across all of your
metrics it would probably be:

- VERY noisy
-- CPU intensive
+- VERY CPU intensive

If deployed on only key metrics it has a very low footprint (9 seconds on
150 metrics with 2 processes assigned) and a high return. If deployed as
@@ -54,8 +54,9 @@ Configuration and running Boundary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``settings.py`` has independent setting blocks and has detailed information
-on each setting in its docstring, the main difference from Analyzer being in
-terms of number of variables that have to be declared in the alert tuples, e.g:
+on each setting and its parameters in the docstring, the main difference from
+Analyzer being the number of variables that have to be declared in the
+alert tuples, e.g.:

.. code-block:: python
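
    # NOTE: the commit's actual example block is collapsed in this diff view.
    # A hedged sketch of a Boundary alert tuple, using the field order from
    # the example later on this page (metric name and values are illustrative):
    BOUNDARY_METRICS = (
        # (metric, algorithm, EXPIRATION_TIME, MIN_AVERAGE, MIN_AVERAGE_SECONDS,
        #  TRIGGER_VALUE, ALERT_THRESHOLD, ALERT_VIAS)
        ('metric1.example', 'detect_drop_off_cliff', 1800, 500, 3600, 0, 2, 'smtp'),
    )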
@@ -76,6 +77,36 @@ Boundary:
/opt/skyline/github/skyline/bin/boundary.d start
An example
~~~~~~~~~~

Here is an example of what you can use Boundary for. If you look at the graph
below, you can see that the minimum value is around 1000. Let us say that this
metric is a fairly reliable and important global metric, like the number of page
impressions per minute in your shop.

.. image:: images/boundary/boundary_example.png

We can configure Boundary to monitor this metric. Although you can use Boundary
to monitor any metric, it works best if you only monitor your important and
reliable global metrics with it.

.. code-block:: python

    ('example_org.shop.total.page_impressions', 'detect_drop_off_cliff', 1800, 800, 300, 0, 1, 'smtp|slack|pagerduty'),
    ('example_org.shop.total.page_impressions', 'less_than', 1800, 0, 0, 1000, 7, 'smtp|slack|pagerduty'),


The above :mod:`settings.BOUNDARY_METRICS` entries enable 2 algorithms to check
this metric every minute, :func:`.boundary_algorithms.detect_drop_off_cliff`
and :func:`.boundary_algorithms.less_than`. Although the ``less_than`` check
should normally be sufficient on its own, the ``detect_drop_off_cliff`` check
will ensure that nothing is missed. For instance, the metric could drop to
between 2 and 15 for 6 minutes, go back up to 1600, drop again for 5 minutes,
go back up again and enter a "flapping" state. There are instances where
``less_than`` may not fire, but ``detect_drop_off_cliff`` would.

detect\_drop\_off\_cliff algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1 change: 1 addition & 0 deletions docs/development/index.rst
@@ -13,6 +13,7 @@ Development
tsfresh
pytz


DRY
###

85 changes: 78 additions & 7 deletions docs/flux.rst
@@ -243,13 +243,24 @@ Firstly a few things must be pointed out:

- Flux applies the same logic that is used by the telegraf Graphite template
pattern e.g. `template = "host.tags.measurement.field"` - https://github.com/influxdata/telegraf/tree/master/plugins/serializers/graphite
-- Flux replaces ``.` for ``_`` and ``/`` for ``-`` in any tags, measurement or fields
+- Flux replaces ``.`` for ``_`` and ``/`` for ``-`` in any tags, measurement or fields
in which these characters are found.
- Some telegraf outputs send string/boolean values, by default Flux drops
metrics with string/boolean values, e.g.
``{'fields': {..., 'aof_last_bgrewrite_status': 'ok', ...}, 'name': 'redis', ...}``
One way to deal with string/boolean value metrics, if you want to record them
numerically as metrics, is to convert them in telegraf itself using the
appropriate telegraf processors and filters (see telegraf documentation). If
Flux receives string/boolean value metrics it will report which metrics were
dropped in the json body of an HTTP status code 207 response rather than the
normal 204 response. However string/boolean value metrics will not cause errors and
all other metrics with numerical values in the POST will be accepted.
- The additional `outputs.http.headers` **must** be specified.
- ``content_encoding`` must be set to gzip
- There are a number of status codes that telegraf should not resubmit data on;
these are to ensure telegraf does not attempt to send the same bad data or
-too much data over and over again.
+too much data over and over again (coming in the upcoming telegraf March
+2022 release).
- If there is a long network partition between telegraf agents and Flux,
sometimes some data may be dropped, but this can often be preferable to the
thundering herd swamping I/O
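
As a rough illustration of the character substitution described in the list
above, the effect on metric name components would be as follows (a sketch of
the described behaviour, not Flux's actual implementation):

.. code-block:: python

    # Flux substitutes characters that would break Graphite dotted namespaces:
    # '.' becomes '_' and '/' becomes '-'
    def sanitise(value: str) -> str:
        return value.replace('.', '_').replace('/', '-')

    sanitise('web01.example.org')  # -> 'web01_example_org'
    sanitise('/var/log')           # -> '-var-log'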
@@ -260,10 +271,6 @@ host to the tags:

.. code-block:: ini
-## Maximum number of unwritten metrics per output.
-# Ideally change this from 10000 to 1000 to minimise thundering herd issues
-# metric_buffer_limit = 10000
-metric_buffer_limit = 1000
## Override default hostname, if empty use os.Hostname()
hostname = ""
## If set to true, do no set the "host" tag in the telegraf agent.
@@ -291,7 +298,8 @@ Also add the ``[[outputs.http]]`` to the telegraf config as below replacing
json_timestamp_units = "1s"
content_encoding = "gzip"
## A list of statuscodes (<200 or >300) upon which requests should not be retried
-non_retryable_statuscodes = [400, 403, 409, 413, 499, 500, 502, 503]
+## Coming soon to a version of telegraf March 2022.
+# non_retryable_statuscodes = [400, 403, 409, 413, 499, 500, 502, 503]
[outputs.http.headers]
Content-Type = "application/json"
key = "<settings.FLUX_SELF_API_KEY>"
@@ -300,6 +308,69 @@ Also add the ``[[outputs.http]]`` to the telegraf config as below replacing
# prefix = "telegraf"
Ideally telegraf would be configured for optimum Flux performance, however
seeing as the ``[[outputs.http]]`` may simply be added to existing telegraf
instances, you may already have the following ``[agent]`` configuration options
tuned to how you like them and performing well for you. However the following
are commented suggestions for the optimal settings to send telegraf data to
Flux. Bear in mind that it is also possible to run another independent instance
of telegraf on the same machine; although this adds another overhead and more
collection processing, it does allow for isolation of your existing telegraf
and one specifically running for Flux. To reduce collection processing your
current telegraf instance could additionally send to ``[[outputs.file]]`` and
your Flux telegraf could use ``[[inputs.file]]``.

The following ``[agent]`` configuration options are recommended for sending
telegraf data to Flux.

.. code-block:: ini
# Configuration for telegraf agent
[agent]
## Default data collection interval for all inputs
## IDEALLY for Skyline and Flux change 10s to 60s
# interval = "10s"
interval = "60s"

## Rounds collection interval to 'interval'
## ie, if interval="10s" then always collect on :00, :10, :20, etc.
round_interval = true

## Telegraf will send metrics to outputs in batches of at most
## metric_batch_size metrics.
## This controls the size of writes that Telegraf sends to output plugins.
metric_batch_size = 1000

## Maximum number of unwritten metrics per output. Increasing this value
## allows for longer periods of output downtime without dropping metrics at the
## cost of higher maximum memory usage.
metric_buffer_limit = 10000

## Collection jitter is used to jitter the collection by a random amount.
## Each plugin will sleep for a random time within jitter before collecting.
## This can be used to avoid many plugins querying things like sysfs at the
## same time, which can have a measurable effect on the system.
## IDEALLY for your own devices change this from 0s to 5s
# collection_jitter = "0s"
collection_jitter = "5s"

## Collection offset is used to shift the collection by the given amount.
## This can be used to avoid many plugins querying constraint devices
## at the same time by manually scheduling them in time.
# collection_offset = "0s"

## Default flushing interval for all outputs. Maximum flush_interval will be
## flush_interval + flush_jitter
## IDEALLY for Skyline and Flux change 10s to 60s
# flush_interval = "10s"
flush_interval = "60s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
## IDEALLY for Skyline and Flux change 0s to 5s
flush_jitter = "5s"


populate_metric endpoint
------------------------

20 changes: 13 additions & 7 deletions docs/getting-data-into-skyline.rst
@@ -3,7 +3,7 @@ Getting data into Skyline
=========================

Firstly a note on time sync
-===========================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although it may seems obvious, it is important to note that any metrics
coming into Graphite and Skyline should come from synchronised sources.
Expand All @@ -15,7 +15,7 @@ algorithms. In terms of machine related metrics, normal production grade
time synchronisation will suffice.

Secondly a note on the reliability of metric data
-=================================================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are many ways to get data to Graphite and Skyline, however some are better
than others. The first and most important point is that your metric pipeline
@@ -34,13 +34,13 @@ your metric pipeline is fully TCP transported. statsd now has a TCP listener,
there is telegraf, sensu, etc.; there are lots of options.

Now getting the data in
-=======================
+~~~~~~~~~~~~~~~~~~~~~~~

You currently have a number of options to get data into Skyline, via the
Horizon, Vista and Flux services and via file upload:

Horizon - TCP pickles
-~~~~~~~~~~~~~~~~~~~~~
+=====================

Horizon was designed to support a stream of pickles from the Graphite
carbon-relay service, over port 2024 by default. Carbon relay is a
@@ -101,7 +101,7 @@ pack and pickle your data correctly (you'll need to look at the source
code for the exact protocol), you'll be able to stream to this listener.
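
A hedged sketch of such a client is shown below, assuming the standard Graphite
pickle protocol (a 4-byte big-endian length header followed by a pickled list
of ``(metric, (timestamp, value))`` tuples) and Horizon's default port 2024 -
verify the details against the Horizon source before relying on this:

.. code-block:: python

    import pickle
    import socket
    import struct
    import time

    # One datapoint for one metric, in the carbon pickle format
    metrics = [('test.metric', (int(time.time()), 1.0))]
    payload = pickle.dumps(metrics, protocol=2)
    header = struct.pack('!L', len(payload))
    with socket.create_connection(('127.0.0.1', 2024)) as sock:
        sock.sendall(header + payload)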

Horizon - UDP messagepack
-~~~~~~~~~~~~~~~~~~~~~~~~~
+=========================

Generally do not use this. It is UDP, but has not been removed.

Expand All @@ -113,8 +113,14 @@ as messagepack and send them on their way.
However a quick note on transporting any metrics data over UDP....
sorry, did you not get that?

Telegraf
========

Skyline Flux can ingest data from the influxdata Telegraf collector, see the
Flux section below and the `Flux <flux.html>`__ page.

Flux
-~~~~
+====

Metrics can be submitted to Flux via HTTP/S which feeds Graphite with pickles to
Skyline, see the `Flux <flux.html>`__ page.
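
For illustration, submitting a single metric to Flux over HTTP/S might look
like the following - the endpoint path and payload keys here are assumptions,
so check the `Flux <flux.html>`__ page for the authoritative API:

.. code-block:: python

    import requests

    data = {
        'metric': 'test.metric',
        'timestamp': 1650326400,  # a unix timestamp
        'value': 1.0,
        'key': '<settings.FLUX_SELF_API_KEY>',
    }
    # skyline.example.org is a placeholder for your Skyline host
    r = requests.post('https://skyline.example.org/flux/metric_data_post',
                      json=data, timeout=10)
    print(r.status_code)  # expect 204 on success as described on the Flux page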
@@ -125,7 +131,7 @@ upload_data to Flux
See the `upload_data to Flux <upload-data-to-flux.html>`__ page.

Vista
-~~~~~
+=====

Metrics to be fetched by Vista which submits to Flux, see the
`Vista <vista.html>`__ page.
6 changes: 5 additions & 1 deletion docs/ionosphere.rst
@@ -60,6 +60,10 @@ Overview
time series using |tsfresh| to "fingerprint" the data set.
- This allows the operator to fingerprint and profile what is not anomalous, but
normal and to be expected, even if 3sigma will always detect it as anomalous.
- Further to this, Ionosphere uses the trained data to create 1000s of motifs
(small patterns) of what normal is and supplements the feature profile
comparisons by evaluating the current potential anomaly against the 1000s of
trained normal motifs from the features profiles.
- Ionosphere also allows for the creation of layers rules which allow the
operator to define ad hoc boundaries and conditions that can be considered as
not anomalous for each features profile that is created.
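
Conceptually, the motif comparison amounts to checking small windows of the
current time series against windows extracted from trained, known-normal data.
The following is a conceptual sketch only, not Ionosphere's actual
implementation:

.. code-block:: python

    import numpy as np

    def best_motif_distance(current, trained, size=16):
        # All windows of length `size` from the trained, known-normal series
        motifs = np.lib.stride_tricks.sliding_window_view(np.asarray(trained), size)
        # Compare the most recent window of the current series to every motif
        target = np.asarray(current[-size:])
        return float(np.min(np.linalg.norm(motifs - target, axis=1)))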
@@ -316,7 +320,7 @@ The validate features profile page is useful for this. See
<a href="_static/ionosphere_demo/fp_validate_demo_page/validate_features_profiles.demo.html" target="_blank">Ionosphere validate features profile demo page</a>

.. warning:: An **important** note on learning. When you let Ionosphere learn
-you create a lot of work for yourself in terms of validating every learnt
+you create work for yourself in terms of validating every learnt
profile that Ionosphere learns. If Ionosphere learns badly and you do not
keep up to date with validating learnt features profiles, Ionosphere could end
up silencing genuine anomalies which you would **want** to be alerted on.
3 changes: 3 additions & 0 deletions docs/ionosphere_echo.rst
@@ -21,6 +21,9 @@ The Ionosphere echo analysis is run between the normal Mirage features profiles
analysis and the layers analysis, so the analysis pipeline with Ionosphere echo
enabled is as follows:

- Ionosphere finding similar matching motifs for existing
``SECOND_ORDER_RESOLUTION_SECONDS`` features profiles time series and
:mod:`settings.FULL_DURATION` features profiles
- Ionosphere comparison to Mirage ``SECOND_ORDER_RESOLUTION_SECONDS`` features
profiles and minmax scaled features profiles
- Ionosphere echo comparison to Mirage :mod:`settings.FULL_DURATION` features
