Merge pull request #59 from bird-house/add-alert-handling

Monitoring: add alert rules and alert handling (deduplicate, group, route, silence, inhibit).

This is a follow-up to the previous PR #56 that added the monitoring itself.

Added a collection of cAdvisor and Node-exporter alert rules found at https://awesome-prometheus-alerts.grep.to/rules, with a few fixes for errors in the rules and some tweaking to reduce false-positive alarms (see the list of commits). It is a great collection of ready-made sample rules to hit the ground running and learn the PromQL query language along the way.

![2020-07-08-090953_474x1490_scrot](https://user-images.githubusercontent.com/11966697/86926000-8b086c80-c0ff-11ea-92d0-6f5ccfe2b8e1.png)

Added Alertmanager to handle the alerts (deduplicate, group, route, silence, inhibit). Currently the only notification route configured is email, but Alertmanager is able to route alerts to Slack and to any generic service accepting webhooks.

![2020-07-08-091150_1099x669_scrot](https://user-images.githubusercontent.com/11966697/86926213-cd31ae00-c0ff-11ea-8b2a-d33803ad3d5d.png)

![2020-07-08-091302_1102x1122_scrot](https://user-images.githubusercontent.com/11966697/86926276-dc186080-c0ff-11ea-9377-bda03b69640e.png)

This is an initial attempt at alerting. There are several ways to tweak the system without changing the code:

* To add more Prometheus alert rules, volume-mount more `*.rules` files to the prometheus container.
* To disable existing Prometheus alert rules, add more Alertmanager inhibition rules using `ALERTMANAGER_EXTRA_INHIBITION` via the `env.local` file.
* Other possible Alertmanager configs via `env.local`: `ALERTMANAGER_EXTRA_GLOBAL`, `ALERTMANAGER_EXTRA_ROUTES`, `ALERTMANAGER_EXTRA_RECEIVERS`.

What more could be done after this initial attempt:

* Possibly add more graphs to the Grafana dashboard, since we now have alerts on metrics that have no matching Grafana graph. Graphs are useful for historical trends and correlation with other metrics, so they are not required if we do not need trends and correlation.
* Only basic metrics are being collected currently. We could collect more useful metrics, like SMART status, and alert when a disk is failing.
* The autodeploy mechanism can hook into this monitoring system to report pass/fail status and execution duration, with alerting for problems. Then we can also correlate any CPU, memory or disk I/O spike with the autodeploy runs and have a trace of previous autodeploy executions.

I had to test these alerts directly in prod, both to tweak them for fewer false-positive alerts and to debug rules that were not working, so these changes are already in prod! This also tests the SMTP server on the network.

See rules on the Prometheus side: http://pavics.ouranos.ca:9090/rules, http://medus.ouranos.ca:9090/rules

Manage alerts on the Alertmanager side: http://pavics.ouranos.ca:9093/#/alerts, http://medus.ouranos.ca:9093/#/alerts

Part of issue #12
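
As a rough illustration of the `ALERTMANAGER_EXTRA_INHIBITION` knob mentioned above, here is a minimal sketch of an `env.local` fragment. It assumes that `env.local` is sourced by a shell and that the value is pasted as-is under Alertmanager's `inhibit_rules:` key; neither the exact indentation the template expects nor the alert names below are confirmed by this PR, so check `env.local.example` before using it.

```shell
# Hypothetical env.local fragment; env.local is assumed to be sourced by a shell.
# ASSUMPTION: the value is inserted verbatim under `inhibit_rules:` in the
# generated alertmanager.yml, hence the YAML list syntax inside the string.
# The alert names are placeholders, not rules confirmed to ship with this PR.
export ALERTMANAGER_EXTRA_INHIBITION="
- source_match:
    alertname: ExampleHostOutOfMemory
  target_match:
    alertname: ExampleContainerMemoryUsage
  equal: ['instance']
"
```

With a rule of this shape, Alertmanager mutes notifications for the target alert while the source alert is firing, but only for alerts that share the same `instance` label value. The rule still evaluates on the Prometheus side; only the notification is suppressed.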
bird-house · Jul 11, 2020 · 582dd9c
2 parents 775c3b3 + 4f9aa2d
Showing 15 changed files with 977 additions and 69 deletions.
diff --git a/birdhouse/README.md b/birdhouse/README.md
@@ -35,20 +35,11 @@ below and the variable `AUTODEPLOY_EXTRA_REPOS` in
 
 The automatic deployment of the PAVICS platform, of the Jupyter tutorial
 notebooks and of the automatic deployment mechanism itself can all be
-enabled and configured in the `env.local` file (a copy from
-[`env.local.example`](env.local.example)).
+enabled by following the instructions [here](components/README.rst#scheduler).
 
-* Add `./components/scheduler` to `EXTRA_CONF_DIRS`.
-* Set `AUTODEPLOY_EXTRA_REPOS`, `AUTODEPLOY_DEPLOY_KEY_ROOT_DIR`,
-  `AUTODEPLOY_PLATFORM_FREQUENCY`, `AUTODEPLOY_NOTEBOOK_FREQUENCY` as
-  desired, full documentation in [`env.local.example`](env.local.example).
-* Run once [`fix-write-perm`](deployment/fix-write-perm), see doc in script.
-
-Resource usage monitoring (CPU, memory, ..) for the host and each of the containers
-can be enabled by enabling the `./components/monitoring` in `env.local` file.
-
-* Add `./components/monitoring` to `EXTRA_CONF_DIRS`.
-* Change `GRAFANA_ADMIN_PASSWORD` value.
+Resource usage monitoring (CPU, memory, ...) and alerting for the host and each
+of the containers can be enabled by following the instructions
+[here](components/README.rst#monitoring).
 
 To launch all the containers, use the following command:
 ```
@@ -94,62 +85,6 @@ postgres instance. See [`scripts/create-wps-pgsql-databases.sh`](scripts/create-
 * Click "Add User".
 
 
-## Mostly automated unattended continuous deployment
-
-***NOTE***: this section about automated unattended continuous deployment is
-superseded by the new `./components/scheduler` that can be entirely
-enabled/disabled via the `env.local` file.  See the part about automatic
-deployment of the PAVICS platform in the "Docker instructions" section
-above for how to configure it.
-
-Automated unattended continuous deployment means if code change in the checkout
-of this repo, on the same currently checkout branch (ex: config changes,
-`docker-compose.yml` changes) a deployment will be performed automatically
-without human intervention.
-
-The trigger for the deployment is new code change on the server on the current
-branch (PR merged, push).  New code change locally will not trigger deployment
-so local development workflow is also supported.
-
-Note: there are still cases where a human intervention is needed.  See note in
-script [`deployment/deploy.sh`](deployment/deploy.sh).
-
-Configure logrotate for all following automations to prevent disk full:
-```
-deployment/install-logrotate-config .. $USER
-```
-
-To enable continuous deployment of PAVICS:
-
-```
-deployment/install-automated-deployment.sh .. $USER [daily|5-mins]
-# read the script for more options/details
-```
-
-If you want to manually force a deployment of PAVICS (note this might not use
-latest version of deploy.sh script):
-```
-deployment/deploy.sh .
-# read the script for more options/details
-```
-
-To enable continuous deployment of tutorial Jupyter notebooks:
-
-```
-deployment/install-deploy-notebook .. $USER
-# read the script for more details
-```
-
-To trigger tutorial Jupyter notebooks deploy manually:
-```
-# configure logrotate before because this script will log to
-# /var/log/PAVICS/notebookdeploy.log
-
-deployment/trigger-deploy-notebook
-# read the script for more details
-```
-
-
 ## Vagrant instructions
 
 Vagrant allows us to quickly spin up a VM to easily reproduce the runtime

diff --git a/birdhouse/components/README.rst b/birdhouse/components/README.rst
@@ -0,0 +1,233 @@
+#################
+PAVICS Components
+#################
+
+
+.. contents::
+
+
+Scheduler
+=========
+
+This component provides automated unattended continuous deployment for the
+"PAVICS stack", for the tutorial notebooks on the Jupyter environment and for the
+automated deployment itself.
+
+It can also be used to schedule other tasks on the PAVICS physical host.
+
+Everything is dockerized: the deployment runs inside a container that will
+update all other containers.
+
+Automated unattended continuous deployment means that if code changes in the
+remote repo, on the branch that is currently checked out (ex: config changes,
+``docker-compose.yml`` changes), a deployment will be performed automatically
+without human intervention.
+
+The trigger for the deployment is a new code change on the server on the current
+branch (PR merged, push). New local code changes will not trigger a deployment,
+so a local development workflow is also supported.
+
+Multiple remote repos are supported so the "PAVICS stack" can be made of
+multiple checkouts for modularity and extensibility.  The autodeploy will
+trigger if any of the checkouts (configured in ``AUTODEPLOY_EXTRA_REPOS``) is
+not up-to-date with its remote repo.
+
+A suggested "PAVICS stack" is made of at least 2 repos: this repo and another
+private repo containing the source-controlled ``env.local`` file and any other
+docker-compose overrides, for true infrastructure-as-code.
+
+Note: there are still cases where human intervention is needed. See the note in
+script deploy.sh_.
+
+
+Usage
+-----
+
+Given the unattended nature, there is no UI.  Logs are used to keep a trace.
+
+- ``/var/log/PAVICS/autodeploy.log`` is for the PAVICS deployment.
+
+- ``/var/log/PAVICS/notebookdeploy.log`` is for the tutorial notebooks deployment.
+
+- logrotate is enabled for ``/var/log/PAVICS/*.log`` to avoid filling up the
+  disk.  Any new ``.log`` files in that folder will get logrotate for free.
+
+
+How to Enable the Component
+---------------------------
+
+- Edit ``env.local`` (a copy of env.local.example_)
+
+  - Add "./components/scheduler" to ``EXTRA_CONF_DIRS``.
+  - Set ``AUTODEPLOY_EXTRA_REPOS``, ``AUTODEPLOY_DEPLOY_KEY_ROOT_DIR``,
+    ``AUTODEPLOY_PLATFORM_FREQUENCY``, ``AUTODEPLOY_NOTEBOOK_FREQUENCY`` as desired,
+    full documentation in env.local.example_.
+  - Run once fix-write-perm_, see doc in script.
+
+
+Old way to deploy the automatic deployment
+------------------------------------------
+
+Superseded by the new ``scheduler`` component.  Kept for reference only.
+
+Deploying the old way does not need the ``scheduler`` component but loses the
+ability for the autodeploy system to update itself.
+
+Configure logrotate for all following automations to prevent a full disk::
+
+  deployment/install-logrotate-config .. $USER
+
+To enable continuous deployment of PAVICS::
+
+  deployment/install-automated-deployment.sh .. $USER [daily|5-mins]
+  # read the script for more options/details
+
+If you want to manually force a deployment of PAVICS (note this might not use
+the latest version of the deploy.sh script)::
+
+  deployment/deploy.sh .
+  # read the script for more options/details
+
+To enable continuous deployment of tutorial Jupyter notebooks::
+
+  deployment/install-deploy-notebook .. $USER
+  # read the script for more details
+
+To trigger the tutorial Jupyter notebooks deployment manually::
+
+  # configure logrotate before because this script will log to
+  # /var/log/PAVICS/notebookdeploy.log
+
+  deployment/trigger-deploy-notebook
+  # read the script for more details
+
+Migrating to the new mechanism requires manual deletion of all the artifacts
+created by the old install scripts: ``sudo rm /etc/cron.d/PAVICS-deploy
+/etc/cron.hourly/PAVICS-deploy-notebooks /etc/logrotate.d/PAVICS-deploy
+/usr/local/sbin/triggerdeploy.sh``.  The two mechanisms cannot co-exist.
+
+
+Comparison between the old and new autodeploy mechanism
+-------------------------------------------------------
+
+Maximum backward-compatibility has been kept with the old install scripts style:
+
+* Still logs to the same existing log files under ``/var/log/PAVICS``.
+* The old single ssh deploy key is still compatible, but the new mechanism allows for a different ssh deploy key for each extra repo (again, public repos should use the https clone path to avoid dealing with ssh deploy keys in the first place).
+* Old install scripts are kept and can still deploy the old way.
+
+Features missing in old install scripts or how the new mechanism improves on the old install scripts:
+
+* Autodeploy of the autodeploy itself!  This is the biggest win.  Previously, if the ``triggerdeploy.sh`` or ``PAVICS-deploy-notebooks`` scripts changed, they had to be redeployed manually, which was very annoying.  Now they are volume-mounted in, so they are fresh on each run.
+* ``env.local`` now drives absolutely everything; source control that file and we have a true DevOps pipeline.
+* Configurable platform and notebook autodeploy frequency.  Previously, this meant manually editing the generated cron file, which is less ideal.
+* Does not need any support on the local host other than ``docker`` and ``docker-compose``.  ``cron/logrotate/git/ssh`` versions are all locked down in the docker images used by the autodeploy.  Recall that previously we had to deal with a git version that was too old on some hosts.
+* Each cron job runs in its own docker image, meaning the runtime environment is traceable and reproducible.
+* The newly introduced scheduler component is made extensible so other jobs can be added to it as well (ex: backup), via ``env.local``, which should be source controlled, meaning all surrounding maintenance-related tasks can also be traceable and reproducible.
+
+
+Monitoring
+==========
+
+This component provides monitoring and alerting for the PAVICS physical host
+and containers.
+
+The Prometheus stack is used:
+
+* Node-exporter to collect host metrics.
+* cAdvisor to collect container metrics.
+* Prometheus to scrape metrics, to store them and to query them.
+* AlertManager to manage alerts: deduplicate, group, route, silence, inhibit.
+* Grafana to provide a visualization dashboard for the metrics.
+
+
+Usage
+-----
+
+- Grafana to view metric graphs: http://PAVICS_FQDN:3001/d/pf6xQMWGz/docker-and-system-monitoring
+- Prometheus alert rules: http://PAVICS_FQDN:9090/rules
+- AlertManager to manage alerts: http://PAVICS_FQDN:9093
+
+
+How to Enable the Component
+---------------------------
+
+- Edit ``env.local`` (a copy of env.local.example_)
+
+  - Add "./components/monitoring" to ``EXTRA_CONF_DIRS``
+  - Set ``GRAFANA_ADMIN_PASSWORD`` to log in to Grafana
+  - Set ``ALERTMANAGER_ADMIN_EMAIL_RECEIVER`` for receiving alerts
+  - Set ``SMTP_SERVER`` for sending alerts
+  - Optionally set
+
+    - ``ALERTMANAGER_EXTRA_GLOBAL`` to further configure AlertManager
+    - ``ALERTMANAGER_EXTRA_ROUTES`` to add routes beyond email notification
+    - ``ALERTMANAGER_EXTRA_INHIBITION`` to inhibit (mute) alerts from selected rules
+    - ``ALERTMANAGER_EXTRA_RECEIVERS`` to add receivers beyond the admin emails
+
+
+Grafana Dashboard
+-----------------
+
+.. image:: grafana-dashboard.png
+
+For the host, using Node-exporter to collect metrics:
+
+- uptime
+- number of containers
+- used disk space
+- used memory, available memory, used swap memory
+- load
+- cpu usage
+- in and out network traffic
+- disk I/O
+
+For each container, using cAdvisor to collect metrics:
+
+- in and out network traffic
+- cpu usage
+- memory and swap memory usage
+- disk usage
+
+Useful visualisation features:
+
+- zoom in on one graph and all other graphs update to match the same "time range", so we can correlate events
+- view each graph independently for more details
+- mousing over a data point shows its value at that moment
+
+
+Prometheus Alert Rules
+----------------------
+
+.. image:: prometheus-alert-rules.png
+
+
+AlertManager for Alert Dashboard and Silencing
+----------------------------------------------
+
+.. image:: alertmanager-dashboard.png
+.. image:: alertmanager-silence-alert.png
+
+
+Customizing the Component
+-------------------------
+
+- To add more Grafana dashboards, volume-mount more ``*.json`` files to the
+  grafana container.
+
+- To add more Prometheus alert rules, volume-mount more ``*.rules`` files to
+  the prometheus container.
+
+- To disable existing Prometheus alert rules, add more Alertmanager inhibition
+  rules using ``ALERTMANAGER_EXTRA_INHIBITION`` via ``env.local`` file.
+
+- Other possible Alertmanager configs via ``env.local``:
+  ``ALERTMANAGER_EXTRA_GLOBAL``, ``ALERTMANAGER_EXTRA_ROUTES`` (can route to
+  Slack or other services accepting webhooks), ``ALERTMANAGER_EXTRA_RECEIVERS``.
+
+
+
+
+.. _env.local.example: ../env.local.example
+.. _fix-write-perm: ../deployment/fix-write-perm
+.. _deploy.sh: ../deployment/deploy.sh
diff --git a/birdhouse/components/alertmanager-dashboard.png b/birdhouse/components/alertmanager-dashboard.png
diff --git a/birdhouse/components/alertmanager-silence-alert.png b/birdhouse/components/alertmanager-silence-alert.png
diff --git a/birdhouse/components/grafana-dashboard.png b/birdhouse/components/grafana-dashboard.png
diff --git a/birdhouse/components/monitoring/.gitignore b/birdhouse/components/monitoring/.gitignore
@@ -1,3 +1,5 @@
 prometheus.yml
 grafana_datasources.yml
 grafana_dashboards.yml
+alertmanager.yml
+prometheus.rules