
Merge pull request #223 from earthgecko/py3
upload_data - flux and webapp
earthgecko committed May 20, 2020
2 parents 64103a4 + 9b49b0d commit 57ab0f1
Showing 18 changed files with 2,576 additions and 36 deletions.
5 changes: 5 additions & 0 deletions dev-requirements.txt
@@ -519,3 +519,8 @@ urllib3==1.25.3
graphyte==1.6.0
statsd==3.3.0
falcon==2.0.0

# @added 20200517 - Feature #3550: flux.uploaded_data_worker
# Feature #3538: webapp - upload_data endpoint
xlrd==1.2.0
pandas_log==0.1.7; python_version >= '3.2'
26 changes: 26 additions & 0 deletions docs/analyzer.rst
@@ -121,6 +121,32 @@ to check if the anomalous time series is known to be NOT ANOMALOUS due to its
features profile matching a known NOT ANOMALOUS trained or learnt features
profile see `Ionosphere <ionosphere.html>`__.

analyzer_batch
==============

analyzer_batch is a submodule of Analyzer that runs as an independent service
with the sole purpose of handling and analysing metrics that are not streamed in
real time but are updated in batches, every x minutes or hour/s.
It is a "lite" and slightly modified version of analyzer that works in
conjunction with analyzer to handle batch metrics.
By default analyzer_batch and :mod:`settings.BATCH_PROCESSING` related settings
are disabled.

It should only be enabled if you have metrics that are received in infrequent
batches; metrics fed per minute do not require batch processing. For example,
if metric/s are sent to Skyline every 15 minutes with a data point for each
minute in the period, Analyzer's default analysis would only analyse the latest
data point against the metric time series data and not the 14 earlier data
points received since the last analysis, whereas analyzer_batch analyses them all.

It is not default Analyzer behaviour as it adds unnecessary computational
overhead on analysing real time metrics, therefore it is only implemented if
required.

analyzer_batch and :mod:`settings.BATCH_PROCESSING` need to be enabled if you
wish to use Flux to process uploaded data files, see
`Flux <flux.html>`__.
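
For illustration, the relevant settings in settings.py might look like the
following. The values shown here are illustrative assumptions only, not
defaults.

.. code-block:: python

    # Enable batch processing and declare which metric namespaces are
    # batch metrics (illustrative values only)
    BATCH_PROCESSING = True
    BATCH_PROCESSING_NAMESPACES = ['temp_monitoring.warehouse']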

Analyzer :mod:`settings.ALERTS`
===============================

51 changes: 49 additions & 2 deletions docs/flux.rst
@@ -9,8 +9,8 @@ submits them to Graphite, so they can be pickled to Skyline for analysis in near
real time via the normal Skyline pipeline.

Flux uses falcon the bare-metal web API framework for Python to serve the API
via gunicorn. The normal Apache reverse proxy Skyline vhost is used to serve
the /flux endpoint and proxy requests to flux.
via gunicorn. The normal Apache/nginx reverse proxy Skyline vhost is used to
serve the /flux endpoint and proxy requests to flux.

It is preferable to use the POST Flux endpoint to submit metrics so that the
Skyline flux API key can be encrypted via SSL in the POST data.
@@ -87,3 +87,50 @@ the Skyline node to connect to the Graphite node on this port.

The populate_metric_worker applies resampling at 1Min, but see Vista
populate_at_resolutions for more detailed information.

Process uploaded data
---------------------

Skyline Flux can be enabled to process data uploaded via the webapp and submit
the data to Graphite. This allows for the automated uploading and processing of
batched measurement data and reports into time series data which is analysed in
the normal Skyline workflow. An example use case would be an hourly report of
wind related metrics that has a reading every 5 minutes for the hour period,
for x number of stations. As long as the data is uploaded in an acceptable
format, it can be preprocessed by flux and submitted to Graphite.
The metric namespace/s need to be declared as batch processing metrics in
:mod:`settings.BATCH_PROCESSING_NAMESPACES` and :mod:`settings.BATCH_PROCESSING`
has to be enabled.

By default flux is not enabled to process uploaded data and the webapp is not
configured to accept uploaded data.

To enable Flux to process uploaded data the following settings need to be set
and services running (an illustrative, consolidated settings sketch follows
this list):

- analyzer_batch needs to be enabled and running, see `Analyzer - analyzer_batch <analyzer.html#analyzer_batch>`__.
- :mod:`settings.BATCH_PROCESSING` needs to be set to `True`
- The `parent_metric_namespace` or all the metric namespaces in question
  relating to the specific data being uploaded need to be declared in
  :mod:`settings.BATCH_PROCESSING_NAMESPACES`
- :mod:`settings.DATA_UPLOADS_PATH` is required
- :mod:`settings.WEBAPP_ACCEPT_DATA_UPLOADS` must be enabled
- :mod:`settings.FLUX_PROCESS_UPLOADS` must be enabled
- If the data is being uploaded via an automated process, curl, etc. the
  `parent_metric_namespace` needs a key set in the
  :mod:`settings.FLUX_UPLOADS_KEYS` dictionary e.g.

  .. code-block:: python

      FLUX_UPLOADS_KEYS = {
          'temp_monitoring.warehouse.2.012383': '484166bf-df66-4f7d-ad4a-9336da9ef620',
      }

- Optionally :mod:`settings.FLUX_SAVE_UPLOADS` and
  :mod:`settings.FLUX_SAVE_UPLOADS_PATH` can be used if you wish to save the
  uploaded data.
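
Putting the above together, a consolidated settings sketch could look like the
following. These values are illustrative assumptions only, substitute your own
paths, namespaces and keys.

.. code-block:: python

    # Illustrative values only
    BATCH_PROCESSING = True
    BATCH_PROCESSING_NAMESPACES = ['temp_monitoring.warehouse']
    DATA_UPLOADS_PATH = '/tmp/skyline/data_uploads'
    WEBAPP_ACCEPT_DATA_UPLOADS = True
    FLUX_PROCESS_UPLOADS = True
    FLUX_UPLOADS_KEYS = {
        'temp_monitoring.warehouse.2.012383': '484166bf-df66-4f7d-ad4a-9336da9ef620',
    }
    # Optional - keep a copy of the uploaded data that has been processed
    FLUX_SAVE_UPLOADS = True
    FLUX_SAVE_UPLOADS_PATH = '/opt/skyline/flux/saved_uploads'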

For specific details about the data formats and methods for uploading and
processing data files see the `upload_data to Flux <upload-data-to-flux.html>`__
page.
9 changes: 7 additions & 2 deletions docs/getting-data-into-skyline.rst
@@ -37,7 +37,7 @@ Now getting the data in
=======================

You currently have a number of options to get data into Skyline, via the
Horizon, Vista and Flux services:
Horizon, Vista and Flux services and via file upload:

Horizon - TCP pickles
~~~~~~~~~~~~~~~~~~~~~
@@ -112,9 +112,14 @@ sorry if you did not get that.
Flux
~~~~

Metrics to be submitted to Flux via HTTP/S which feeds Graphite which pickles to
Metrics to be submitted to Flux via HTTP/S which feeds Graphite with pickles to
Skyline, see the `Flux <flux.html>`__ page.

upload_data to Flux
~~~~~~~~~~~~~~~~~~~

See the `upload_data to Flux <upload-data-to-flux.html>`__ page.

Vista
~~~~~

14 changes: 14 additions & 0 deletions docs/installation.rst
@@ -180,13 +180,27 @@ Skyline directories
mkdir -p /opt/skyline/crucible/check
mkdir -p /opt/skyline/crucible/data
mkdir -p /opt/skyline/ionosphere/check
mkdir -p /opt/skyline/flux/processed_uploads
mkdir /etc/skyline
mkdir /tmp/skyline

.. note:: Ensure you provide the appropriate ownership and permissions to the
   above specified directories for the user you wish to run the Skyline process
   as.

.. code-block:: bash

    # Example using user and group Skyline
    chown skyline:skyline /var/log/skyline
    chown skyline:skyline /var/run/skyline
    chown skyline:skyline /var/dump
    chown -R skyline:skyline /opt/skyline/panorama
    chown -R skyline:skyline /opt/skyline/mirage
    chown -R skyline:skyline /opt/skyline/crucible
    chown -R skyline:skyline /opt/skyline/ionosphere
    chown -R skyline:skyline /opt/skyline/flux
    chown skyline:skyline /tmp/skyline

Skyline and dependencies install
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

12 changes: 10 additions & 2 deletions docs/ionosphere.rst
@@ -230,6 +230,14 @@ Skyline does not know all the contexts to the data, you do. Ionosphere lets
us teach Bob **that is not an earthquake!!!** and enables Bob to look and ask,
"Did Alice say this was not an earthquake, let me look".

Negative values
^^^^^^^^^^^^^^^

It needs to be noted that the current implementation of the algorithm is only
valid for positive time series. Any anomaly in a time series that is identified
as having a negative value in the specific time series period will not be
trainable.

"Create" or "Create and LEARN"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -255,7 +263,7 @@ to see the relevant Analyzer :mod:`settings.FULL_DURATION` or Mirage the
``SECOND_ORDER_RESOLUTION_HOURS`` data as not anomalous and **not** learn at
the fuller duration of the metric's ``learn_full_duration``.

You can teach Ionosphere badly, but to unteach it is just a click of the Disable
You can teach Ionosphere badly, but to "unteach" it is just a click of the Disable
features profile button.

Use Ionosphere learning sparingly, although it is a feature, it will also
@@ -268,7 +276,7 @@ fall into this category. Do not go overboard on learning, do it slowly and
incrementally. All LEARNT features profiles have the ability to be validated,
however this is not a hard requirement, unvalidated features profiles will still
be used in analysis, the validated flag is currently simply there to give the
operation a view of what LEARNT features profiles have not been assessed to
operator a view of what LEARNT features profiles have not been assessed to
determine their accuracy. Skyline will also LEARN from an unvalidated features
profile. Therefore the operator needs to review and validate or disable
features profiles every so often, this can be achieved when reviewing and
197 changes: 197 additions & 0 deletions docs/upload-data-to-flux.rst
@@ -0,0 +1,197 @@
==================================
upload_data to Flux - EXPERIMENTAL
==================================

**SIMPLE** data files can be uploaded via the /upload_data HTTP/S endpoint for
Flux to process and feed to Graphite. A number of things need to be enabled and
running to allow for processing data file uploads, and these are not enabled by
default.

For information regarding configuring Skyline to allow Flux to process uploaded
data files see the `Process uploaded data <flux.html#process-uploaded-data>`__
section on the Flux page.

Skyline currently allows for the uploading of the following format data files:

- csv (tested)
- xlsx (tested)

Seeing as data files can be large, the following archive formats are accepted:

- gz (tested)
- zip (tested)

A single file or archive can be uploaded, or many data files can be uploaded in
a single archive. An `info.json` must also be included in the archive, more
on that below.

So you could upload `data.csv`, `data.csv.gz` or `data.zip` with the `data.csv`
file inside the zip archive.
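
For example, a zip archive containing both the data file and its info file
could be created with a few lines of Python (a simple sketch, the file names
are the examples used on this page):

.. code-block:: python

    import zipfile

    # Package the data file and its info file into a single zip for upload
    with zipfile.ZipFile('data.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
        archive.write('data.csv')
        archive.write('info.json')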

Any files in an archive that are not of an allowed format are either not
extracted or are deleted. Try to submit one data file per upload as ordering has
not been tested.

Multiple data files should be uploaded and processed sequentially. At the
moment upload the oldest data file first and then, after validating the status
of the upload (the /api?upload_status link is returned on the Flux_frontend page
or in the json response), continue with the next upload, etc.

Skyline automates the conversion of SIMPLE columnar data into time series data,
however seeing as not all data is defined or created equally, some information
about the data needs to be passed with the data file/s to inform Skyline about
the metric structure. This makes setting up an upload process more work, but if
you are uploading frequently then it is one-off work.

To ensure that naive data in the datetime column can be handled, meaning date
times that are not timezone aware e.g. `16-05-2020 11:00`, you always need to
pass a valid pytz timezone as listed in `Timezones list for pytz version <development/pytz.html#timezones-list-for-pytz-version>`__.
A timezone aware timestamp specifies either the UTC offset or the timezone
itself, for example:

- 2020-05-16 14:00:00 -04:00
- 2020-05-16 07:46:25 BST 2020

However the timezone itself may not be a valid pytz timezone, so a valid pytz
timezone must always be passed.
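
As a minimal illustration of why a pytz timezone is required (this is an
illustration only, not Flux's actual implementation), a naive datetime from a
data file can be localised and converted to an epoch timestamp as follows:

.. code-block:: python

    from datetime import datetime

    import pytz

    # A naive datetime string as found in the uploaded data file
    naive = datetime.strptime('16-05-2020 11:00', '%d-%m-%Y %H:%M')
    # Localise it with the pytz timezone passed with the upload
    timezone = pytz.timezone('GMT')
    aware = timezone.localize(naive)
    # Convert to a UTC epoch timestamp suitable for submission to Graphite
    timestamp = int(aware.timestamp())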

In many instances it is possible that the column names and format in a data file
will not describe the measurements in terms of metric names or be in a suitable
metric name format. Take a csv example of a very normal type of data structure,
let's call it 2020-05-16-11.device_id.012383.csv

::

    Device ID,012383
    Serial Number,1234579853
    Location,Warehouse 2
    From,16/05/2020
    To,16/05/2020
    Date,Roof Temperature,Floor Temperature
    16-05-2020 11:00,45.72,22.78
    16-05-2020 11:15,45.94,22.92
    16-05-2020 11:30,46.13,22.98
    16-05-2020 11:45,46.34,23.06

This data informs us of the times and values, but it does not tell us what
metrics they should represent. Skyline also needs to be informed about the
header row and rows to ignore. An info file is used to inform Skyline how to
read and metric the data, take 2020-05-16-11.device_id.012383.info.json as an
example. Note that in processing, rows are 0-indexed.

info.json

::

    {
        "parent_metric_namespace": "temp_monitoring.warehouse.2.012383",
        "timezone": "GMT",
        "skip_rows": 5,
        "header_row": 0,
        "date_orientation": "rows",
        "columns_to_metrics": "date,roof,floor"
    }

Your date time column MUST be named date in the columns_to_metrics mapping.

For convenience you can also add two additional elements to the info.json:

- `"debug": "true"` which outputs additional information regarding the imported
dataframe in the flux.log to aid with debugging.
- `"dryrun": "true"` which runs through the processing but does not submit data
to Graphite.

This tells Skyline what the parent metric namespace should be, which would
result in the metrics:

- temp_monitoring.warehouse.2.012383.roof
- temp_monitoring.warehouse.2.012383.floor

xlsx files are 0-indexed, csv files are not 0-indexed.

It tells Skyline to ignore rows 1, 2, 3, 4, 5 (but if it were 0-indexed skip_rows
would be set to 4).
It tells Skyline to use row 0 as the header row, e.g. the column names. Note that
if you use skip_rows your header row must be 0.
It tells Skyline how to map the column names to metric names. A one to one
mapping is required for every column. Once again, your date time column MUST be
named date in the columns_to_metrics mapping.
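
To make skip_rows, header_row and the columns_to_metrics mapping concrete, a
rough sketch of how such a file could be read and mapped with pandas follows.
This is an illustration only, not the flux uploaded_data_worker implementation.

.. code-block:: python

    import pandas as pd

    parent_metric_namespace = 'temp_monitoring.warehouse.2.012383'
    columns_to_metrics = ['date', 'roof', 'floor']

    # skip_rows=5 drops the Device ID, Serial Number, Location, From and To
    # rows, header_row=0 then uses the first remaining row as the column names
    df = pd.read_csv('2020-05-16-11.device_id.012383.csv', skiprows=5, header=0)
    df.columns = columns_to_metrics

    # Every non-date column becomes parent_metric_namespace.<column name>
    for column in ['roof', 'floor']:
        metric = '%s.%s' % (parent_metric_namespace, column)
        for _, row in df.iterrows():
            print(metric, row['date'], row[column])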

Only alphanumeric chars and '.', '_', '-' are allowed in the metric name, e.g.
the parent_metric_namespace and columns_to_metrics that you pass.
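
A simple check of that rule could look like the following. Note the pattern is
an assumption illustrating the rule, not the exact validation Skyline performs.

.. code-block:: python

    import re

    # Only alphanumeric characters and '.', '_', '-' are allowed
    def valid_metric_name(metric):
        return bool(re.match(r'^[A-Za-z0-9._-]+$', metric))

    valid_metric_name('temp_monitoring.warehouse.2.012383.roof')  # True
    valid_metric_name('warehouse 2/roof temperature')             # False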

Requirements of the data file: the data file must have a header row.

The required information elements in the POST variables are:

- key (str)
- parent_metric_namespace (str)
- archive (str) - gz, zip or tar_gz
- format (str)
- skip_rows ('none' or int)
- header_row (int)
- date_orientation (str) - currently only 'rows' is supported
- data_file (required in the post variables)
- columns_to_metrics (str) - comma separated list of names (no spaces)
- data_file (file)
- info_file (file)
- json_response

An example of how to POST the above csv and info.json with curl would be as
follows. Note that in this instance you would need your
:mod:`settings.FLUX_UPLOADS_KEYS` to be set with:

.. code-block:: python

    FLUX_UPLOADS_KEYS = {
        'temp_monitoring.warehouse.2.012383': '484166bf-df66-4f7d-ad4a-9336da9ef620',
    }

The curl request:

.. code-block:: bash

    curl \
        -F "key=484166bf-df66-4f7d-ad4a-9336da9ef620" \
        -F "timezone=GMT" \
        -F "parent_metric_namespace=temp_monitoring.warehouse.2.012383" \
        -F "archive=none" \
        -F "format=csv" \
        -F "skip_rows=5" \
        -F "header_row=0" \
        -F "date_orientation=rows" \
        -F "columns_to_metrics=date,roof,floor" \
        -F "data_file=@<FULL_PATH_TO_FILE>/2020-05-16-11.device_id.012383.csv" \
        -F "info_file=@<FULL_PATH_TO_FILE>/info.json" \
        -F "json_response=true" \
        https://$SKYLINE_HOST/upload_data
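
The same upload expressed with the Python requests library, simply as an
equivalent of the curl call above (SKYLINE_HOST is a placeholder, requests is
not required by Skyline):

.. code-block:: python

    import requests

    files = {
        'data_file': open('2020-05-16-11.device_id.012383.csv', 'rb'),
        'info_file': open('info.json', 'rb'),
    }
    data = {
        'key': '484166bf-df66-4f7d-ad4a-9336da9ef620',
        'timezone': 'GMT',
        'parent_metric_namespace': 'temp_monitoring.warehouse.2.012383',
        'archive': 'none',
        'format': 'csv',
        'skip_rows': '5',
        'header_row': '0',
        'date_orientation': 'rows',
        'columns_to_metrics': 'date,roof,floor',
        'json_response': 'true',
    }
    response = requests.post('https://SKYLINE_HOST/upload_data',
                             data=data, files=files)
    print(response.status_code, response.text)
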
Vista
~~~~~

Metrics to be fetched by Vista which submits to Flux, see the
`Vista <vista.html>`__ page.

Adding a Listener
=================

If neither of these listeners is acceptable, it's easy enough to extend
them. Add a method in listen.py and add a line in the horizon-agent that
points to your new listener.

:mod:`settings.FULL_DURATION`
=============================

Once you get real data flowing through your system, the Analyzer will be
able to start analyzing for anomalies.

.. note:: Do not expect to see anomalies or anything in the Webapp immediately
after starting the Skyline services. Realistically :mod:`settings.FULL_DURATION`
should have been passed, before you begin to assess any triggered anomalies,
after all :mod:`settings.FULL_DURATION` is the baseline. Although not all
algorithms utilize all the :mod:`settings.FULL_DURATION` data points, some do
and some use only 1 hour's worth. However the Analyzer log should still report
values in the exception stats, reporting how many metrics were boring, too
short, etc as soon as it is getting data for metrics that Horizon is populating
into Redis.
2 changes: 2 additions & 0 deletions requirements.txt
@@ -52,3 +52,5 @@ urllib3==1.25.3
graphyte==1.6.0
statsd==3.3.0
falcon==2.0.0
xlrd==1.2.0
pandas_log==0.1.7; python_version >= '3.2'
