
Merge pull request #223 from earthgecko/py3
upload_data - flux and webapp
earthgecko committed May 20, 2020
2 parents 64103a4 + 9b49b0d commit 57ab0f1
Showing 18 changed files with 2,576 additions and 36 deletions.
5 changes: 5 additions & 0 deletions dev-requirements.txt
@@ -519,3 +519,8 @@ urllib3==1.25.3
graphyte==1.6.0
statsd==3.3.0
falcon==2.0.0

# @added 20200517 - Feature #3550: flux.uploaded_data_worker
# Feature #3538: webapp - upload_data endpoint
xlrd==1.2.0
pandas_log==0.1.7; python_version >= '3.2'
26 changes: 26 additions & 0 deletions docs/analyzer.rst
@@ -121,6 +121,32 @@ to check if the anomalous time series is known to be NOT ANOMALOUS due to its
features profile matching a known NOT ANOMALOUS trained or learnt features
profile see `Ionosphere <ionosphere.html>`__.

analyzer_batch
==============

analyzer_batch is a submodule of Analyzer that runs as an independent service
with the sole purpose of handling and analysing metrics that are not streamed in
real time but are updated in batches, every x minutes or hour/s.
It is a "lite" and slightly modified version of analyzer that works in
conjunction with analyzer to handle batch metrics.
By default analyzer_batch and :mod:`settings.BATCH_PROCESSING` related settings
are disabled.

It should only be enabled if you have metrics that are received in infrequent
batches; metrics fed per minute do not require batch processing. For example,
if metric/s are sent to Skyline every 15 minutes with a data point for each
minute in the period, Analyzer's default analysis would only analyse the latest
data point against the metric time series data and not the 14 earlier data
points received since the last analysis, whereas analyzer_batch analyses them all.

It is not default Analyzer behaviour as it adds unnecessary computational
overhead on analysing real time metrics, therefore it is only implemented if
required.

analyzer_batch and :mod:`settings.BATCH_PROCESSING` need to be enabled if you
wish to use Flux to process uploaded data files, see
`Flux <flux.html>`__.
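
For illustration, the relevant settings in settings.py might look like the
following. The values shown here are illustrative assumptions only, not
defaults.

.. code-block:: python

    # Enable batch processing and declare which metric namespaces are
    # batch metrics (illustrative values only)
    BATCH_PROCESSING = True
    BATCH_PROCESSING_NAMESPACES = ['temp_monitoring.warehouse']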

Analyzer :mod:`settings.ALERTS`
===============================

51 changes: 49 additions & 2 deletions docs/flux.rst
@@ -9,8 +9,8 @@ submits them to Graphite, so they can be pickled to Skyline for analysis in near
real time via the normal Skyline pipeline.

Flux uses falcon the bare-metal web API framework for Python to serve the API
via gunicorn. The normal Apache reverse proxy Skyline vhost is used to serve
the /flux endpoint and proxy requests to flux.
via gunicorn. The normal Apache/nginx reverse proxy Skyline vhost is used to
serve the /flux endpoint and proxy requests to flux.

It is preferable to use the POST Flux endpoint to submit metrics so that the
Skyline flux API key can be encrypted via SSL in the POST data.
@@ -87,3 +87,50 @@ the Skyline node to connect to the Graphite node on this port.

The populate_metric_worker applies resampling at 1Min, but see Vista
populate_at_resolutions for more detailed information.

Process uploaded data
---------------------

Skyline Flux can be enabled to process data uploaded via the webapp and submit
the data to Graphite. This allows for the automated uploading and processing of
batched measurement data and reports into time series data which is analysed in
the normal Skyline workflow. An example use case would be an hourly report of
wind related metrics that has a reading every 5 minutes for the hour period,
for x number of stations. As long as the data is uploaded in an acceptable
format, it can be preprocessed by flux and submitted to Graphite.
The metric namespace/s need to be declared as batch processing metrics in
:mod:`settings.BATCH_PROCESSING_NAMESPACES` and :mod:`settings.BATCH_PROCESSING`
has to be enabled.

By default flux is not enabled to process uploaded data and the webapp is not
configured to accept uploaded data.

To enable Flux to process uploaded data the following settings need to be set
and services running (an illustrative, consolidated settings sketch follows
this list):

- analyzer_batch needs to be enabled and running, see `Analyzer - analyzer_batch <analyzer.html#analyzer_batch>`__.
- :mod:`settings.BATCH_PROCESSING` needs to be set to `True`
- The `parent_metric_namespace` or all the metric namespaces in question
  relating to the specific data being uploaded need to be declared in
  :mod:`settings.BATCH_PROCESSING_NAMESPACES`
- :mod:`settings.DATA_UPLOADS_PATH` is required
- :mod:`settings.WEBAPP_ACCEPT_DATA_UPLOADS` must be enabled
- :mod:`settings.FLUX_PROCESS_UPLOADS` must be enabled
- If the data is being uploaded via an automated process, curl, etc. the
  `parent_metric_namespace` needs a key set in the
  :mod:`settings.FLUX_UPLOADS_KEYS` dictionary e.g.

  .. code-block:: python

      FLUX_UPLOADS_KEYS = {
          'temp_monitoring.warehouse.2.012383': '484166bf-df66-4f7d-ad4a-9336da9ef620',
      }

- Optionally :mod:`settings.FLUX_SAVE_UPLOADS` and
  :mod:`settings.FLUX_SAVE_UPLOADS_PATH` can be used if you wish to save the
  uploaded data.
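
Putting the above together, a consolidated settings sketch could look like the
following. These values are illustrative assumptions only, substitute your own
paths, namespaces and keys.

.. code-block:: python

    # Illustrative values only
    BATCH_PROCESSING = True
    BATCH_PROCESSING_NAMESPACES = ['temp_monitoring.warehouse']
    DATA_UPLOADS_PATH = '/tmp/skyline/data_uploads'
    WEBAPP_ACCEPT_DATA_UPLOADS = True
    FLUX_PROCESS_UPLOADS = True
    FLUX_UPLOADS_KEYS = {
        'temp_monitoring.warehouse.2.012383': '484166bf-df66-4f7d-ad4a-9336da9ef620',
    }
    # Optional - keep a copy of the uploaded data that has been processed
    FLUX_SAVE_UPLOADS = True
    FLUX_SAVE_UPLOADS_PATH = '/opt/skyline/flux/saved_uploads'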

For specific details about the data formats and methods for uploading and
processing data files see the `upload_data to Flux <upload-data-to-flux.html>`__
page.
9 changes: 7 additions & 2 deletions docs/getting-data-into-skyline.rst
@@ -37,7 +37,7 @@ Now getting the data in
=======================

You currently have a number of options to get data into Skyline, via the
Horizon, Vista and Flux services:
Horizon, Vista and Flux services and via file upload:

Horizon - TCP pickles
~~~~~~~~~~~~~~~~~~~~~
@@ -112,9 +112,14 @@ sorry if you did not get that.
Flux
~~~~

Metrics to be submitted to Flux via HTTP/S which feeds Graphite which pickles to
Metrics to be submitted to Flux via HTTP/S which feeds Graphite with pickles to
Skyline, see the `Flux <flux.html>`__ page.

upload_data to Flux
~~~~~~~~~~~~~~~~~~~

See the `upload_data to Flux <upload-data-to-flux.html>`__ page.

Vista
~~~~~

14 changes: 14 additions & 0 deletions docs/installation.rst
@@ -180,13 +180,27 @@ Skyline directories
mkdir -p /opt/skyline/crucible/check
mkdir -p /opt/skyline/crucible/data
mkdir -p /opt/skyline/ionosphere/check
mkdir -p /opt/skyline/flux/processed_uploads
mkdir /etc/skyline
mkdir /tmp/skyline

.. note:: Ensure you provide the appropriate ownership and permissions to the
   above specified directories for the user you wish to run the Skyline process
   as.

.. code-block:: bash

    # Example using user and group Skyline
    chown skyline:skyline /var/log/skyline
    chown skyline:skyline /var/run/skyline
    chown skyline:skyline /var/dump
    chown -R skyline:skyline /opt/skyline/panorama
    chown -R skyline:skyline /opt/skyline/mirage
    chown -R skyline:skyline /opt/skyline/crucible
    chown -R skyline:skyline /opt/skyline/ionosphere
    chown -R skyline:skyline /opt/skyline/flux
    chown skyline:skyline /tmp/skyline

Skyline and dependencies install
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

12 changes: 10 additions & 2 deletions docs/ionosphere.rst
@@ -230,6 +230,14 @@ Skyline does not know all the contexts to the data, you do. Ionosphere lets
us teach Bob **that is not an earthquake!!!** and enables Bob to look and ask,
"Did Alice say this was not an earthquake, let me look".

Negative values
^^^^^^^^^^^^^^^

It needs to be noted that the current implementation of the algorithm is only
valid for positive time series. Any anomaly in a time series that is identified
as having a negative value in the specific time series period will not be
trainable.

"Create" or "Create and LEARN"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -255,7 +263,7 @@ to see the relevant Analyzer :mod:`settings.FULL_DURATION` or Mirage the
``SECOND_ORDER_RESOLUTION_HOURS`` data as not anomalous and **not** learn at
the fuller duration of the metric's ``learn_full_duration``.

You can teach Ionosphere badly, but to unteach it is just a click of the Disable
You can teach Ionosphere badly, but to "unteach" it is just a click of the Disable
features profile button.

Use Ionosphere learning sparingly, although it is a feature, it will also
@@ -268,7 +276,7 @@ fall into this category. Do not go overboard on learning, do it slowly and
incrementally. All LEARNT features profiles have the ability to be validated,
however this is not a hard requirement, unvalidated features profiles will still
be used in analysis, the validated flag is currently simply there to give the
operation a view of what LEARNT features profiles have not been assessed to
operator a view of what LEARNT features profiles have not been assessed to
determine their accuracy. Skyline will also LEARN from an unvalidated features
profile. Therefore the operator needs to review and validate or disable
features profiles every so often, this can be achieved when reviewing and
197 changes: 197 additions & 0 deletions docs/upload-data-to-flux.rst
@@ -0,0 +1,197 @@
==================================
upload_data to Flux - EXPERIMENTAL
==================================

**SIMPLE** data files can be uploaded via the /upload_data HTTP/S endpoint for
Flux to process and feed to Graphite. A number of things need to be enabled and
running to allow for processing data file uploads, and these are not enabled by
default.

For information regarding configuring Skyline to allow Flux to process uploaded
data files see the `Process uploaded data <flux.html#process-uploaded-data>`__
section on the Flux page.

Skyline currently allows for the uploading of the following format data files:

- csv (tested)
- xlsx (tested)

Seeing as data files can be large, the following archive formats are accepted:

- gz (tested)
- zip (tested)

A single file or archive can be uploaded, or many data files can be uploaded in
a single archive. An `info.json` must also be included in the archive, more
on that below.

So you could upload `data.csv`, `data.csv.gz` or `data.zip` with the `data.csv`
file inside the zip archive.
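
For example, a zip archive containing both the data file and its info file
could be created with a few lines of Python (a simple sketch, the file names
are the examples used on this page):

.. code-block:: python

    import zipfile

    # Package the data file and its info file into a single zip for upload
    with zipfile.ZipFile('data.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
        archive.write('data.csv')
        archive.write('info.json')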

Any files in an archive that are not of an allowed format are either not
extracted or are deleted. Try to submit one data file per upload as ordering has
not been tested.

Multiple data files should be uploaded and processed sequentially. At the
moment upload the oldest data file first and then, after validating the status
of the upload (the /api?upload_status link is returned on the Flux_frontend page
or in the json response), continue with the next upload, etc.

Skyline automates the conversion of SIMPLE columnar data into time series data,
however seeing as not all data is defined or created equally, some information
about the data needs to be passed with the data file/s to inform Skyline about
the metric structure. This makes setting up an upload process more work, but if
you are uploading frequently then it is one-off work.

To ensure that naive data in the datetime column can be handled, meaning date
times that are not timezone aware e.g. `16-05-2020 11:00`, you always need to
pass a valid pytz timezone as listed in `Timezones list for pytz version <development/pytz.html#timezones-list-for-pytz-version>`__.
A timezone aware timestamp specifies either the UTC offset or the timezone
itself, for example:

- 2020-05-16 14:00:00 -04:00
- 2020-05-16 07:46:25 BST 2020

However the timezone itself may not be a valid pytz timezone, so a valid pytz
timezone must always be passed.
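
As a minimal illustration of why a pytz timezone is required (this is an
illustration only, not Flux's actual implementation), a naive datetime from a
data file can be localised and converted to an epoch timestamp as follows:

.. code-block:: python

    from datetime import datetime

    import pytz

    # A naive datetime string as found in the uploaded data file
    naive = datetime.strptime('16-05-2020 11:00', '%d-%m-%Y %H:%M')
    # Localise it with the pytz timezone passed with the upload
    timezone = pytz.timezone('GMT')
    aware = timezone.localize(naive)
    # Convert to a UTC epoch timestamp suitable for submission to Graphite
    timestamp = int(aware.timestamp())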

In many instances it is possible that the column names and format in a data file
will not describe the measurements in terms of metric names or be in a suitable
metric name format. Take a csv example of a very normal type of data structure,
let's call it 2020-05-16-11.device_id.012383.csv

::

    Device ID,012383
    Serial Number,1234579853
    Location,Warehouse 2
    From,16/05/2020
    To,16/05/2020
    Date,Roof Temperature,Floor Temperature
    16-05-2020 11:00,45.72,22.78
    16-05-2020 11:15,45.94,22.92
    16-05-2020 11:30,46.13,22.98
    16-05-2020 11:45,46.34,23.06

This data informs us of the times and values, but it does not tell us what
metrics they should represent. Skyline also needs to be informed about the
header row and rows to ignore. An info file is used to inform Skyline how to
read and metric the data, take 2020-05-16-11.device_id.012383.info.json as an
example. Note that in processing, rows are 0-indexed.

info.json

::

    {
        "parent_metric_namespace": "temp_monitoring.warehouse.2.012383",
        "timezone": "GMT",
        "skip_rows": 5,
        "header_row": 0,
        "date_orientation": "rows",
        "columns_to_metrics": "date,roof,floor"
    }

Your date time column MUST be named date in the columns_to_metrics mapping.

For convenience you can also add two additional elements to the info.json:

- `"debug": "true"` which outputs additional information regarding the imported
dataframe in the flux.log to aid with debugging.
- `"dryrun": "true"` which runs through the processing but does not submit data
to Graphite.

This tells Skyline what the parent metric namespace should be, which would
result in the metrics:

- temp_monitoring.warehouse.2.012383.roof
- temp_monitoring.warehouse.2.012383.floor

xlsx files are 0-indexed, csv files are not 0-indexed.

It tells Skyline to ignore rows 1, 2, 3, 4, 5 (but if it were 0-indexed skip_rows
would be set to 4).
It tells Skyline to use row 0 as the header row, e.g. the column names. Note that
if you use skip_rows your header row must be 0.
It tells Skyline how to map the column names to metric names. A one to one
mapping is required for every column. Once again, your date time column MUST be
named date in the columns_to_metrics mapping.
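
To make skip_rows, header_row and the columns_to_metrics mapping concrete, a
rough sketch of how such a file could be read and mapped with pandas follows.
This is an illustration only, not the flux uploaded_data_worker implementation.

.. code-block:: python

    import pandas as pd

    parent_metric_namespace = 'temp_monitoring.warehouse.2.012383'
    columns_to_metrics = ['date', 'roof', 'floor']

    # skip_rows=5 drops the Device ID, Serial Number, Location, From and To
    # rows, header_row=0 then uses the first remaining row as the column names
    df = pd.read_csv('2020-05-16-11.device_id.012383.csv', skiprows=5, header=0)
    df.columns = columns_to_metrics

    # Every non-date column becomes parent_metric_namespace.<column name>
    for column in ['roof', 'floor']:
        metric = '%s.%s' % (parent_metric_namespace, column)
        for _, row in df.iterrows():
            print(metric, row['date'], row[column])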

Only alphanumeric chars and '.', '_', '-' are allowed in the metric name, e.g.
the parent_metric_namespace and columns_to_metrics that you pass.
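
A simple check of that rule could look like the following. Note the pattern is
an assumption illustrating the rule, not the exact validation Skyline performs.

.. code-block:: python

    import re

    # Only alphanumeric characters and '.', '_', '-' are allowed
    def valid_metric_name(metric):
        return bool(re.match(r'^[A-Za-z0-9._-]+$', metric))

    valid_metric_name('temp_monitoring.warehouse.2.012383.roof')  # True
    valid_metric_name('warehouse 2/roof temperature')             # False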

Requirements of the data file: the data file must have a header row.

The required information elements in the POST variables are:

- key (str)
- parent_metric_namespace (str)
- archive (str) - gz, zip or tar_gz
- format (str)
- skip_rows ('none' or int)
- header_row (int)
- date_orientation (str) - currently only 'rows' is supported
- data_file (required in the post variables)
- columns_to_metrics (str) - comma separated list of names (no spaces)
- data_file (file)
- info_file (file)
- json_response

An example of how to POST the above csv and info.json with curl would be as
follows. Note that in this instance you would need your
:mod:`settings.FLUX_UPLOADS_KEYS` to be set with:

.. code-block:: python

    FLUX_UPLOADS_KEYS = {
        'temp_monitoring.warehouse.2.012383': '484166bf-df66-4f7d-ad4a-9336da9ef620',
    }

The curl request:

.. code-block:: bash

    curl \
        -F "key=484166bf-df66-4f7d-ad4a-9336da9ef620" \
        -F "timezone=GMT" \
        -F "parent_metric_namespace=temp_monitoring.warehouse.2.012383" \
        -F "archive=none" \
        -F "format=csv" \
        -F "skip_rows=5" \
        -F "header_row=0" \
        -F "date_orientation=rows" \
        -F "columns_to_metrics=date,roof,floor" \
        -F "data_file=@<FULL_PATH_TO_FILE>/2020-05-16-11.device_id.012383.csv" \
        -F "info_file=@<FULL_PATH_TO_FILE>/info.json" \
        -F "json_response=true" \
        https://$SKYLINE_HOST/upload_data
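
The same upload expressed with the Python requests library, simply as an
equivalent of the curl call above (SKYLINE_HOST is a placeholder, requests is
not required by Skyline):

.. code-block:: python

    import requests

    files = {
        'data_file': open('2020-05-16-11.device_id.012383.csv', 'rb'),
        'info_file': open('info.json', 'rb'),
    }
    data = {
        'key': '484166bf-df66-4f7d-ad4a-9336da9ef620',
        'timezone': 'GMT',
        'parent_metric_namespace': 'temp_monitoring.warehouse.2.012383',
        'archive': 'none',
        'format': 'csv',
        'skip_rows': '5',
        'header_row': '0',
        'date_orientation': 'rows',
        'columns_to_metrics': 'date,roof,floor',
        'json_response': 'true',
    }
    response = requests.post('https://SKYLINE_HOST/upload_data',
                             data=data, files=files)
    print(response.status_code, response.text)
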
Vista
~~~~~

Metrics to be fetched by Vista which submits to Flux, see the
`Vista <vista.html>`__ page.

Adding a Listener
=================

If neither of these listeners is acceptable, it's easy enough to extend
them. Add a method in listen.py and add a line in the horizon-agent that
points to your new listener.

:mod:`settings.FULL_DURATION`
=============================

Once you get real data flowing through your system, the Analyzer will be
able to start analyzing for anomalies.

.. note:: Do not expect to see anomalies or anything in the Webapp immediately
after starting the Skyline services. Realistically :mod:`settings.FULL_DURATION`
should have been passed, before you begin to assess any triggered anomalies,
after all :mod:`settings.FULL_DURATION` is the baseline. Although not all
algorithms utilize all the :mod:`settings.FULL_DURATION` data points, some do
and some use only 1 hour's worth. However the Analyzer log should still report
values in the exception stats, reporting how many metrics were boring, too
short, etc as soon as it is getting data for metrics that Horizon is populating
into Redis.
2 changes: 2 additions & 0 deletions requirements.txt
@@ -52,3 +52,5 @@ urllib3==1.25.3
graphyte==1.6.0
statsd==3.3.0
falcon==2.0.0
xlrd==1.2.0
pandas_log==0.1.7; python_version >= '3.2'
