Merge pull request #548 from earthgecko/SNAB
Update docs
earthgecko committed Apr 30, 2022
2 parents f4d18d4 + 744f8b4 commit fa42ece
Showing 11 changed files with 188 additions and 59 deletions.
30 changes: 16 additions & 14 deletions docs/alerts.rst
@@ -8,7 +8,7 @@ alert settings. This is due to the two classes of alerts being different,
with Analyzer, Mirage and Ionosphere alerts being related to anomalies and
Boundary alerts being related to breaches of the static and dynamic thresholds
defined for Boundary metrics. Further to this there are alert related settings
-for each alert output route, namely smtp, slack, pagerduty and sms.
+for each alert output route, namely smtp, slack, pagerduty, http and sms.

Required smtp alerter for Analyzer and Mirage metrics
=====================================================
@@ -24,15 +24,17 @@ analysis of metrics that have no alerts configured for them makes no sense as no
one wants to know. Therefore only metrics in namespaces that are defined with a
stmp alerter in :mod:`settings.ALERTS` get analysed by Mirage and Ionosphere.
It is via the smtp alert tuple that metrics get configured to be Mirage metrics
-by declaring the SECOND_ORDER_RESOLUTION_HOURS in the tuple.
+by declaring the ``SECOND_ORDER_RESOLUTION_HOURS`` in the tuple.

-However if you do not want to be SMTP alerted, you can set the
-:mod:`settings.SMTP_OPTS` to `'no_email'` as shown in an example below, but you
-must still declare the namespace with a SMTP alert tuple in
+However if you do not want emails from the SMTP alerts, you can set the
+:mod:`settings.SMTP_OPTS` to ``'no_email'`` as shown in an example below, but you
+still **must declare** the namespace with a SMTP alert tuple in
:mod:`settings.ALERTS`

-The following example, we want to alert via Slack, your :mod:`settings.ALERTS`
-and :mod:`settings.SMTP_OPTS` would need to look like this.
+In the following example, we want to alert via Slack only and not receive email
+alerts, so your :mod:`settings.ALERTS` and :mod:`settings.SMTP_OPTS` would need
+to look like this. It is important to note that all smtp alerts **must** be
+defined before other alerts, e.g. slack.

.. code-block:: python
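
    # NOTE: the commit's actual example block is collapsed in this diff view.
    # What follows is a minimal sketch based on the documented ALERTS and
    # SMTP_OPTS formats - the 'stats' namespace, sender address and values
    # are illustrative assumptions, not the commit's own example.
    ALERTS = (
        # The smtp alert tuple must be defined first; the fourth value (168)
        # sets SECOND_ORDER_RESOLUTION_HOURS, making these Mirage metrics
        ('stats', 'smtp', 1800, 168),
        ('stats', 'slack', 1800),
    )
    SMTP_OPTS = {
        'sender': 'skyline@your_domain.com',
        'recipients': {
            # 'no_email' keeps the required smtp tuple declared without
            # sending any emails
            'stats': ['no_email'],
        },
    }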
@@ -97,7 +99,7 @@ Alert settings
==============

For each 3rd party alert service e.g. Slack, PagerDuty, http_alerters, there is
-a setting to enable the specific alerter which must be set to `True` to enable
+a setting to enable the specific alerter which must be set to ``True`` to enable
the alerter:

- :mod:`settings.SYSLOG_ENABLED`
@@ -109,8 +111,8 @@ the alerter:

Analyzer, Mirage and Ionosphere related alert settings (anomaly detection) are:

-- :mod:`settings.ENABLE_ALERTS` - must be set to `True` to enable alerting
-- :mod:`settings.ENABLE_FULL_DURATION_ALERTS` - should be set to `False` if
+- :mod:`settings.ENABLE_ALERTS` - must be set to ``True`` to enable alerting
+- :mod:`settings.ENABLE_FULL_DURATION_ALERTS` - should be set to ``False`` if
Mirage is enabled. If this is set to ``True`` Analyzer will alert
on all checks sent to Mirage, even if Mirage does not find them anomalous,
this is mainly for testing.
@@ -123,9 +125,9 @@ Analyzer, Mirage and Ionosphere related alert settings (anomaly detection) are:
- :mod:`settings.SYSLOG_OPTS` - can be used to change syslog settings
- :mod:`settings.HTTP_ALERTERS_OPTS` - must be defined if you want to push
alerts to a http endpoint
-- :mod:`settings.MIRAGE_ENABLE_ALERTS` - must be set to `True` to enable alerts
+- :mod:`settings.MIRAGE_ENABLE_ALERTS` - must be set to ``True`` to enable alerts
from Mirage
-- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to `True` if you want
+- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to ``True`` if you want
to send alerts via SMS. boto3 also needs to be set up and the AWS/IAM resource
that boto3 uses needs permissions to publish to AWS SNS. See the boto3
documentation - https://github.com/boto/boto3
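
As a rough illustration of the http alerter configuration referred to above, a
minimal :mod:`settings.HTTP_ALERTERS_OPTS` sketch might look like the following
- the alerter name, endpoint and token are illustrative assumptions, check the
settings.py docstring for the authoritative format:

.. code-block:: python

    HTTP_ALERTERS_OPTS = {
        'http_alerter-example': {
            'enabled': True,
            'endpoint': 'https://alerts.example.org/alerts',
            # None, or an API token string if the endpoint requires one
            'token': None,
        },
    }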
@@ -134,7 +136,7 @@ Analyzer, Mirage and Ionosphere related alert settings (anomaly detection) are:

Boundary related alert settings (static and dynamic thresholds) are:

-- :mod:`settings.BOUNDARY_ENABLE_ALERTS` - must be set to `True` to enable
+- :mod:`settings.BOUNDARY_ENABLE_ALERTS` - must be set to ``True`` to enable
alerting
- :mod:`settings.BOUNDARY_METRICS` - must be defined to enable checks and alerts
for Boundary
@@ -149,7 +151,7 @@ Boundary related alert settings (static and dynamic thresholds) are:
Slack
- :mod:`settings.BOUNDARY_HTTP_ALERTERS_OPTS` - must be defined if you want to
push alerts to a http endpoint
-- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to `True` if you want
+- :mod:`settings.AWS_SNS_SMS_ALERTS_ENABLED` - must be set to ``True`` if you want
to send alerts via SMS. boto3 also needs to be set up and the AWS/IAM resource
that boto3 uses needs permissions to publish to AWS SNS. See the boto3
documentation - https://github.com/boto/boto3
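
For context, SMS alerting ultimately depends on boto3 being able to publish to
AWS SNS. A minimal illustration of such a publish is shown below (the region
and phone number are placeholders and this is not Skyline's actual code path):

.. code-block:: python

    import boto3

    # Requires AWS credentials with sns:Publish permission to be configured
    client = boto3.client('sns', region_name='eu-west-1')
    client.publish(
        PhoneNumber='+15550000000',
        Message='Skyline Boundary alert: example metric breached threshold')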
6 changes: 4 additions & 2 deletions docs/algorithms/custom-algorithms.rst
@@ -54,6 +54,8 @@ given the methods and modes that can be configured.
:mod:`settings.CUSTOM_ALGORITHMS`
---------------------------------

Custom algorithms are only available in analyzer, analyzer_batch and mirage.

Custom algorithms are defined in the :mod:`settings.CUSTOM_ALGORITHMS`
dictionary. The format and key values of the dictionary are shown in the
following **example**:
@@ -72,7 +74,7 @@ following **example**:
'run_before_3sigma': True,
'run_only_if_consensus': False,
'trigger_history_override': 0,
-'use_with': ['analyzer', 'analyzer_batch', 'mirage', 'crucible'],
+'use_with': ['analyzer', 'analyzer_batch', 'mirage'],
'debug_logging': False,
},
'last_same_hours': {
@@ -107,7 +109,7 @@ following **example**:
'run_before_3sigma': True,
'run_only_if_consensus': False,
'trigger_history_override': 0,
-'use_with': ['analyzer', 'crucible'],
+'use_with': ['analyzer'],
'debug_logging': True,
},
'skyline_matrixprofile': {
39 changes: 35 additions & 4 deletions docs/boundary.rst
@@ -31,11 +31,11 @@ Boundary currently has 3 defined algorithms:

Boundary is run as a separate process just like Analyzer, horizon and
mirage. It was not envisaged to analyze all your metrics, but rather
-your key metrics in additional dimension/s. If it was run across all of your
+your key metrics with additional analysis. If it was run across all of your
metrics it would probably be:

- VERY noisy
-- CPU intensive
+- VERY CPU intensive

If deployed on only key metrics it has a very low footprint (9 seconds on
150 metrics with 2 processes assigned) and a high return. If deployed as
@@ -54,8 +54,9 @@ Configuration and running Boundary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``settings.py`` has independent setting blocks and has detailed information
-on each setting in its docstring, the main difference from Analyzer being in
-terms of number of variables that have to be declared in the alert tuples, e.g:
+on each setting and its parameters in the docstring, the main difference from
+Analyzer being the number of variables that have to be declared in the
+alert tuples, e.g.:

.. code-block:: python
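
    # NOTE: the commit's actual example block is collapsed in this diff view.
    # A hedged sketch of a Boundary alert tuple, using the field order from
    # the example later on this page (metric name and values are illustrative):
    BOUNDARY_METRICS = (
        # (metric, algorithm, EXPIRATION_TIME, MIN_AVERAGE, MIN_AVERAGE_SECONDS,
        #  TRIGGER_VALUE, ALERT_THRESHOLD, ALERT_VIAS)
        ('metric1.example', 'detect_drop_off_cliff', 1800, 500, 3600, 0, 2, 'smtp'),
    )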
@@ -76,6 +77,36 @@ Boundary:
/opt/skyline/github/skyline/bin/boundary.d start
An example
~~~~~~~~~~

Here is an example of what you can use Boundary for. If you look at the graph
below, you can see that the minimum value is around 1000. Let us say that this
metric is a fairly reliable and important global metric, like the number of page
impressions per minute in your shop.

.. image:: images/boundary/boundary_example.png

We can configure Boundary to monitor this metric. Although you can use Boundary
to monitor any metric, it works best if you only monitor your important and
reliable global metrics with it.

.. code-block:: python

    ('example_org.shop.total.page_impressions', 'detect_drop_off_cliff', 1800, 800, 300, 0, 1, 'smtp|slack|pagerduty'),
    ('example_org.shop.total.page_impressions', 'less_than', 1800, 0, 0, 1000, 7, 'smtp|slack|pagerduty'),


The above :mod:`settings.BOUNDARY_METRICS` entries enable 2 algorithms to check
this metric every minute, :func:`.boundary_algorithms.detect_drop_off_cliff`
and :func:`.boundary_algorithms.less_than`. Although the ``less_than`` check
should normally be sufficient on its own, the ``detect_drop_off_cliff`` check
will ensure that nothing is missed. For instance, the metric could drop to
between 2 and 15 for 6 minutes, go back up to 1600, drop again for 5 minutes,
go back up again and enter a "flapping" state. There are instances where
``less_than`` may not fire, but ``detect_drop_off_cliff`` would.

detect\_drop\_off\_cliff algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1 change: 1 addition & 0 deletions docs/development/index.rst
@@ -13,6 +13,7 @@ Development
tsfresh
pytz


DRY
###

85 changes: 78 additions & 7 deletions docs/flux.rst
@@ -243,13 +243,24 @@ Firstly a few things must be pointed out:

- Flux applies the same logic that is used by the telegraf Graphite template
pattern e.g. `template = "host.tags.measurement.field"` - https://github.com/influxdata/telegraf/tree/master/plugins/serializers/graphite
-- Flux replaces ``.` for ``_`` and ``/`` for ``-`` in any tags, measurement or fields
+- Flux replaces ``.`` for ``_`` and ``/`` for ``-`` in any tags, measurement or fields
in which these characters are found.
- Some telegraf outputs send string/boolean values, by default Flux drops
metrics with string/boolean values, e.g.
``{'fields': {..., 'aof_last_bgrewrite_status': 'ok', ...}, 'name': 'redis', ...}``
One way to deal with string/boolean value metrics, if you want to record them
numerically as metrics, is to convert them in telegraf itself using the
appropriate telegraf processors and filters (see telegraf documentation). If
Flux receives string/boolean value metrics it will report which metrics were
dropped in the json body of an HTTP status code 207 response rather than the
normal 204 response. However string/boolean value metrics will not cause errors and
all other metrics with numerical values in the POST will be accepted.
- The additional `outputs.http.headers` **must** be specified.
- ``content_encoding`` must be set to gzip
- There are a number of status codes that telegraf should not resubmit data on;
these are to ensure telegraf does not attempt to send the same bad data or
-too much data over and over again.
+too much data over and over again (coming in the upcoming telegraf March
+2022 release).
- If there is a long network partition between telegraf agents and Flux,
sometimes some data may be dropped, but this can often be preferable to the
thundering herd swamping I/O
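
As a rough illustration of the character substitution described in the list
above, the effect on metric name components would be as follows (a sketch of
the described behaviour, not Flux's actual implementation):

.. code-block:: python

    # Flux substitutes characters that would break Graphite dotted namespaces:
    # '.' becomes '_' and '/' becomes '-'
    def sanitise(value: str) -> str:
        return value.replace('.', '_').replace('/', '-')

    sanitise('web01.example.org')  # -> 'web01_example_org'
    sanitise('/var/log')           # -> '-var-log'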
@@ -260,10 +271,6 @@ host to the tags:

.. code-block:: ini
-## Maximum number of unwritten metrics per output.
-# Ideally change this from 10000 to 1000 to minimise thundering herd issues
-# metric_buffer_limit = 10000
-metric_buffer_limit = 1000
## Override default hostname, if empty use os.Hostname()
hostname = ""
## If set to true, do no set the "host" tag in the telegraf agent.
@@ -291,7 +298,8 @@ Also add the ``[[outputs.http]]`` to the telegraf config as below replacing
json_timestamp_units = "1s"
content_encoding = "gzip"
## A list of statuscodes (<200 or >300) upon which requests should not be retried
-non_retryable_statuscodes = [400, 403, 409, 413, 499, 500, 502, 503]
+## Coming soon to a version of telegraf March 2022.
+# non_retryable_statuscodes = [400, 403, 409, 413, 499, 500, 502, 503]
[outputs.http.headers]
Content-Type = "application/json"
key = "<settings.FLUX_SELF_API_KEY>"
@@ -300,6 +308,69 @@ Also add the ``[[outputs.http]]`` to the telegraf config as below replacing
# prefix = "telegraf"
Ideally telegraf would be configured for optimum Flux performance, however
seeing as the ``[[outputs.http]]`` may simply be added to existing telegraf
instances, you may already have the following ``[agent]`` configuration options
tuned to how you like them and performing well for you. However the following
are commented suggestions for the optimal settings to send telegraf data to
Flux. Bear in mind that it is also possible to run another independent instance
of telegraf on the same machine; although this adds another overhead and more
collection processing, it does allow for isolation of your existing telegraf
and one specifically running for Flux. To reduce collection processing your
current telegraf instance could additionally send to ``[[outputs.file]]`` and
your Flux telegraf could use ``[[inputs.file]]``.

The following ``[agent]`` configuration options are recommended for sending
telegraf data to Flux.

.. code-block:: ini
# Configuration for telegraf agent
[agent]
## Default data collection interval for all inputs
## IDEALLY for Skyline and Flux change 10s to 60s
# interval = "10s"
interval = "60s"

## Rounds collection interval to 'interval'
## ie, if interval="10s" then always collect on :00, :10, :20, etc.
round_interval = true

## Telegraf will send metrics to outputs in batches of at most
## metric_batch_size metrics.
## This controls the size of writes that Telegraf sends to output plugins.
metric_batch_size = 1000

## Maximum number of unwritten metrics per output. Increasing this value
## allows for longer periods of output downtime without dropping metrics at the
## cost of higher maximum memory usage.
metric_buffer_limit = 10000

## Collection jitter is used to jitter the collection by a random amount.
## Each plugin will sleep for a random time within jitter before collecting.
## This can be used to avoid many plugins querying things like sysfs at the
## same time, which can have a measurable effect on the system.
## IDEALLY for your own devices change this from 0s to 5s
# collection_jitter = "0s"
collection_jitter = "5s"

## Collection offset is used to shift the collection by the given amount.
## This can be used to avoid many plugins querying constraint devices
## at the same time by manually scheduling them in time.
# collection_offset = "0s"

## Default flushing interval for all outputs. Maximum flush_interval will be
## flush_interval + flush_jitter
## IDEALLY for Skyline and Flux change 10s to 60s
# flush_interval = "10s"
flush_interval = "60s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
## IDEALLY for Skyline and Flux change 0s to 5s
flush_jitter = "5s"


populate_metric endpoint
------------------------

20 changes: 13 additions & 7 deletions docs/getting-data-into-skyline.rst
@@ -3,7 +3,7 @@ Getting data into Skyline
=========================

Firstly a note on time sync
-===========================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although it may seems obvious, it is important to note that any metrics
coming into Graphite and Skyline should come from synchronised sources.
Expand All @@ -15,7 +15,7 @@ algorithms. In terms of machine related metrics, normal production grade
time synchronisation will suffice.

Secondly a note on the reliability of metric data
-=================================================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are many ways to get data to Graphite and Skyline, however some are better
than others. The first and most important point is that your metric pipeline
@@ -34,13 +34,13 @@ your metric pipeline is fully TCP transported. statsd now has a TCP listener,
there is telegraf, sensu, etc.; there are lots of options.

Now getting the data in
-=======================
+~~~~~~~~~~~~~~~~~~~~~~~

You currently have a number of options to get data into Skyline, via the
Horizon, Vista and Flux services and via file upload:

Horizon - TCP pickles
-~~~~~~~~~~~~~~~~~~~~~
+=====================

Horizon was designed to support a stream of pickles from the Graphite
carbon-relay service, over port 2024 by default. Carbon relay is a
@@ -101,7 +101,7 @@ pack and pickle your data correctly (you'll need to look at the source
code for the exact protocol), you'll be able to stream to this listener.
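
A hedged sketch of such a client is shown below, assuming the standard Graphite
pickle protocol (a 4-byte big-endian length header followed by a pickled list
of ``(metric, (timestamp, value))`` tuples) and Horizon's default port 2024 -
verify the details against the Horizon source before relying on this:

.. code-block:: python

    import pickle
    import socket
    import struct
    import time

    # One datapoint for one metric, in the carbon pickle format
    metrics = [('test.metric', (int(time.time()), 1.0))]
    payload = pickle.dumps(metrics, protocol=2)
    header = struct.pack('!L', len(payload))
    with socket.create_connection(('127.0.0.1', 2024)) as sock:
        sock.sendall(header + payload)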

Horizon - UDP messagepack
-~~~~~~~~~~~~~~~~~~~~~~~~~
+=========================

Generally do not use this. It is UDP, but has not been removed.

Expand All @@ -113,8 +113,14 @@ as messagepack and send them on their way.
However a quick note on transporting any metrics data over UDP....
sorry, did you not get that?

Telegraf
========

Skyline Flux can ingest data from the influxdata Telegraf collector, see the
Flux section below and the `Flux <flux.html>`__ page.

Flux
-~~~~
+====

Metrics can be submitted to Flux via HTTP/S which feeds Graphite with pickles to
Skyline, see the `Flux <flux.html>`__ page.
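
For illustration, submitting a single metric to Flux over HTTP/S might look
like the following - the endpoint path and payload keys here are assumptions,
so check the `Flux <flux.html>`__ page for the authoritative API:

.. code-block:: python

    import requests

    data = {
        'metric': 'test.metric',
        'timestamp': 1650326400,  # a unix timestamp
        'value': 1.0,
        'key': '<settings.FLUX_SELF_API_KEY>',
    }
    # skyline.example.org is a placeholder for your Skyline host
    r = requests.post('https://skyline.example.org/flux/metric_data_post',
                      json=data, timeout=10)
    print(r.status_code)  # expect 204 on success as described on the Flux page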
@@ -125,7 +131,7 @@ upload_data to Flux
See the `upload_data to Flux <upload-data-to-flux.html>`__ page.

Vista
-~~~~~
+=====

Metrics to be fetched by Vista which submits to Flux, see the
`Vista <vista.html>`__ page.
6 changes: 5 additions & 1 deletion docs/ionosphere.rst
@@ -60,6 +60,10 @@ Overview
time series using |tsfresh| to "fingerprint" the data set.
- This allows the operator to fingerprint and profile what is not anomalous, but
normal and to be expected, even if 3sigma will always detect it as anomalous.
- Further to this, Ionosphere uses the trained data to create 1000s of motifs
(small patterns) of what normal is and supplements the feature profile
comparisons by evaluating the current potential anomaly against the 1000s of
trained normal motifs from the features profiles.
- Ionosphere also allows for the creation of layers rules which allow the
operator to define ad hoc boundaries and conditions that can be considered as
not anomalous for each features profile that is created.
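
Conceptually, the motif comparison amounts to checking small windows of the
current time series against windows extracted from trained, known-normal data.
The following is a conceptual sketch only, not Ionosphere's actual
implementation:

.. code-block:: python

    import numpy as np

    def best_motif_distance(current, trained, size=16):
        # All windows of length `size` from the trained, known-normal series
        motifs = np.lib.stride_tricks.sliding_window_view(np.asarray(trained), size)
        # Compare the most recent window of the current series to every motif
        target = np.asarray(current[-size:])
        return float(np.min(np.linalg.norm(motifs - target, axis=1)))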
@@ -316,7 +320,7 @@ The validate features profile page is useful for this. See
<a href="_static/ionosphere_demo/fp_validate_demo_page/validate_features_profiles.demo.html" target="_blank">Ionosphere validate features profile demo page</a>

.. warning:: An **important** note on learning. When you let Ionosphere learn
-you create a lot of work for yourself in terms of validating every learnt
+you create work for yourself in terms of validating every learnt
profile that Ionosphere learns. If Ionosphere learns badly and you do not
keep up to date with validating learnt features profiles, Ionosphere could end
up silencing genuine anomalies which you would **want** to be alerted on.
3 changes: 3 additions & 0 deletions docs/ionosphere_echo.rst
@@ -21,6 +21,9 @@ The Ionosphere echo analysis is run between the normal Mirage features profiles
analysis and the layers analysis, so the analysis pipeline with Ionosphere echo
enabled is as follows:

- Ionosphere finding similar matching motifs for existing
``SECOND_ORDER_RESOLUTION_SECONDS`` features profiles time series and
:mod:`settings.FULL_DURATION` features profiles
- Ionosphere comparison to Mirage ``SECOND_ORDER_RESOLUTION_SECONDS`` features
profiles and minmax scaled features profiles
- Ionosphere echo comparison to Mirage :mod:`settings.FULL_DURATION` features
