Merge pull request #59 from bird-house/add-alert-handling
Monitoring: add alert rules and alert handling (deduplicate, group, route, silence, inhibit).

This is a follow-up to the previous PR #56, which added the monitoring itself.

Added the cAdvisor and Node-exporter collections of alert rules found here https://awesome-prometheus-alerts.grep.to/rules, with a few fixes for errors in the rules and some tweaking to reduce false-positive alarms (see the list of commits). It is a great collection of ready-made sample rules to hit the ground running and to learn the PromQL query language along the way.

![2020-07-08-090953_474x1490_scrot](https://user-images.githubusercontent.com/11966697/86926000-8b086c80-c0ff-11ea-92d0-6f5ccfe2b8e1.png)

Added Alertmanager to handle the alerts (deduplicate, group, route, silence, inhibit). Currently the only notification route configured is email, but Alertmanager can also route alerts to Slack and to any generic service accepting webhooks.

![2020-07-08-091150_1099x669_scrot](https://user-images.githubusercontent.com/11966697/86926213-cd31ae00-c0ff-11ea-8b2a-d33803ad3d5d.png)

![2020-07-08-091302_1102x1122_scrot](https://user-images.githubusercontent.com/11966697/86926276-dc186080-c0ff-11ea-9377-bda03b69640e.png)

This is an initial attempt at alerting.  There are several ways to tweak the system without changing the code:

* To add more Prometheus alert rules, volume-mount more `*.rules` files to the prometheus container.
* To disable existing Prometheus alert rules, add more Alertmanager inhibition rules using `ALERTMANAGER_EXTRA_INHIBITION` via the `env.local` file (a sketch of such an inhibition rule follows this list).
* Other possible Alertmanager configs via `env.local`: `ALERTMANAGER_EXTRA_GLOBAL`, `ALERTMANAGER_EXTRA_ROUTES`, `ALERTMANAGER_EXTRA_RECEIVERS`.
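
To make the inhibition option concrete, here is a minimal, hypothetical sketch of the YAML payload `ALERTMANAGER_EXTRA_INHIBITION` could carry, assuming the value is injected verbatim under the `inhibit_rules:` section of the generated `alertmanager.yml`; the alert names are illustrative only, not actual rule names from this PR:

```yaml
# Hypothetical inhibition rule: while the host-level memory alert is firing,
# mute the per-container memory alert on the same instance to avoid duplicate noise.
- source_match:
    alertname: HostOutOfMemory
  target_match:
    alertname: ContainerMemoryUsage
  equal: ['instance']
```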

What more could be done after this initial attempt:

* Possibly add more graphs to the Grafana dashboard, since we now have alerts on metrics that have no matching Grafana graph. Graphs are useful for historical trends and correlation with other metrics, so they are not required if we do not need trends and correlation.

* Only basic metrics are collected currently. We could collect more useful metrics, such as SMART status, and alert when a disk is failing.

* The autodeploy mechanism can hook into this monitoring system to report pass/fail status and execution duration, with alerting for problems. We could then correlate any CPU, memory or disk I/O spikes with autodeploy runs and keep a trace of previous autodeploy executions.

I had to test these alerts directly in prod, both to tune them for fewer false-positive alerts and to debug rules that were not working, so these changes are already deployed in prod! This also tested the SMTP server on the network.

See rules on Prometheus side: http://pavics.ouranos.ca:9090/rules, http://medus.ouranos.ca:9090/rules

Manage alerts on Alertmanager side: http://pavics.ouranos.ca:9093/#/alerts, http://medus.ouranos.ca:9093/#/alerts

Part of issue #12
tlvu committed Jul 11, 2020
2 parents 775c3b3 + 4f9aa2d commit 582dd9c
Showing 15 changed files with 977 additions and 69 deletions.
73 changes: 4 additions & 69 deletions birdhouse/README.md
@@ -35,20 +35,11 @@ below and the variable `AUTODEPLOY_EXTRA_REPOS` in

The automatic deployment of the PAVICS platform, of the Jupyter tutorial
notebooks and of the automatic deployment mechanism itself can all be
enabled and configured in the `env.local` file (a copy from
[`env.local.example`](env.local.example)).
enabled by following instructions [here](components/README.rst#scheduler).

* Add `./components/scheduler` to `EXTRA_CONF_DIRS`.
* Set `AUTODEPLOY_EXTRA_REPOS`, `AUTODEPLOY_DEPLOY_KEY_ROOT_DIR`,
`AUTODEPLOY_PLATFORM_FREQUENCY`, `AUTODEPLOY_NOTEBOOK_FREQUENCY` as
desired, full documentation in [`env.local.example`](env.local.example).
* Run once [`fix-write-perm`](deployment/fix-write-perm), see doc in script.

Resource usage monitoring (CPU, memory, ..) for the host and each of the containers
can be enabled by enabling the `./components/monitoring` in `env.local` file.

* Add `./components/monitoring` to `EXTRA_CONF_DIRS`.
* Change `GRAFANA_ADMIN_PASSWORD` value.
Resource usage monitoring (CPU, memory, ..) and alerting for the host and each
of the containers can be enabled by following instructions
[here](components/README.rst#monitoring).

To launch all the containers, use the following command:
```
@@ -94,62 +85,6 @@ postgres instance. See [`scripts/create-wps-pgsql-databases.sh`](scripts/create-
* Click "Add User".


## Mostly automated unattended continuous deployment

***NOTE***: this section about automated unattended continuous deployment is
superseded by the new `./components/scheduler` that can be entirely
enabled/disabled via the `env.local` file. See the part about automatic
deployment of the PAVICS platform in the "Docker instructions" section
above for how to configure it.

Automated unattended continuous deployment means if code change in the checkout
of this repo, on the same currently checkout branch (ex: config changes,
`docker-compose.yml` changes) a deployment will be performed automatically
without human intervention.

The trigger for the deployment is new code change on the server on the current
branch (PR merged, push). New code change locally will not trigger deployment
so local development workflow is also supported.

Note: there are still cases where a human intervention is needed. See note in
script [`deployment/deploy.sh`](deployment/deploy.sh).

Configure logrotate for all following automations to prevent disk full:
```
deployment/install-logrotate-config .. $USER
```

To enable continuous deployment of PAVICS:

```
deployment/install-automated-deployment.sh .. $USER [daily|5-mins]
# read the script for more options/details
```

If you want to manually force a deployment of PAVICS (note this might not use
latest version of deploy.sh script):
```
deployment/deploy.sh .
# read the script for more options/details
```

To enable continuous deployment of tutorial Jupyter notebooks:

```
deployment/install-deploy-notebook .. $USER
# read the script for more details
```

To trigger tutorial Jupyter notebooks deploy manually:
```
# configure logrotate before because this script will log to
# /var/log/PAVICS/notebookdeploy.log
deployment/trigger-deploy-notebook
# read the script for more details
```


## Vagrant instructions

Vagrant allows us to quickly spin up a VM to easily reproduce the runtime
233 changes: 233 additions & 0 deletions birdhouse/components/README.rst
@@ -0,0 +1,233 @@
#################
PAVICS Components
#################


.. contents::


Scheduler
=========

This component provides automated unattended continuous deployment for the
"PAVICS stack", for the tutorial notebooks on the Jupyter environment and for the
automated deployment itself.

It can also be used to schedule other tasks on the PAVICS physical host.

Everything is dockerized; the deployment runs inside a container that will
update all the other containers.

Automated unattended continuous deployment means that when code changes land in
the remote repo, on the branch matching the currently checked out branch (ex:
config changes, ``docker-compose.yml`` changes), a deployment is performed
automatically without human intervention.

The trigger for the deployment is a new code change on the server on the current
branch (PR merged, push). A code change made locally will not trigger a
deployment, so the local development workflow is also supported.

Multiple remote repos are supported so the "PAVICS stack" can be made of
multiple checkouts for modularity and extensibility. The autodeploy will
trigger if any of the checkouts (configured in ``AUTODEPLOY_EXTRA_REPOS``) is
not up-to-date with its remote repo.

A suggested "PAVICS stack" is made of at least 2 repos: this repo and another
private repo containing the source-controlled ``env.local`` file and any other
docker-compose overrides, for true infrastructure-as-code.

Note: there are still cases where a human intervention is needed. See note in
script deploy.sh_.


Usage
-----

Given the unattended nature, there is no UI. Logs are used to keep a trace of each run.

- ``/var/log/PAVICS/autodeploy.log`` is for the PAVICS deployment.

- ``/var/log/PAVICS/notebookdeploy.log`` is for the tutorial notebooks deployment.

- logrotate is enabled for ``/var/log/PAVICS/*.log`` to avoid filling up the
  disk. Any new ``.log`` file in that folder will get logrotated for free.


How to Enable the Component
---------------------------

- Edit ``env.local`` (a copy of env.local.example_)

- Add "./components/scheduler" to ``EXTRA_CONF_DIRS``.
- Set ``AUTODEPLOY_EXTRA_REPOS``, ``AUTODEPLOY_DEPLOY_KEY_ROOT_DIR``,
``AUTODEPLOY_PLATFORM_FREQUENCY``, ``AUTODEPLOY_NOTEBOOK_FREQUENCY`` as desired,
full documentation in env.local.example_.
- Run fix-write-perm_ once; see the doc in the script.


Old way to deploy the automatic deployment
------------------------------------------

Superseded by the new ``scheduler`` component. Kept for reference only.

Doing it this old way does not need the ``scheduler`` component, but loses the
ability for the autodeploy system to update itself.

Configure logrotate for all following automations to prevent disk full::

deployment/install-logrotate-config .. $USER

To enable continuous deployment of PAVICS::

deployment/install-automated-deployment.sh .. $USER [daily|5-mins]
# read the script for more options/details

If you want to manually force a deployment of PAVICS (note this might not use
the latest version of the deploy.sh script)::

deployment/deploy.sh .
# read the script for more options/details

To enable continuous deployment of tutorial Jupyter notebooks::

deployment/install-deploy-notebook .. $USER
# read the script for more details

To trigger tutorial Jupyter notebooks deploy manually::

# configure logrotate before because this script will log to
# /var/log/PAVICS/notebookdeploy.log

deployment/trigger-deploy-notebook
# read the script for more details

Migrating to the new mechanism requires manual deletion of all the artifacts
created by the old install scripts: ``sudo rm /etc/cron.d/PAVICS-deploy
/etc/cron.hourly/PAVICS-deploy-notebooks /etc/logrotate.d/PAVICS-deploy
/usr/local/sbin/triggerdeploy.sh``. The two mechanisms cannot co-exist at the same time.


Comparison between the old and new autodeploy mechanism
-------------------------------------------------------

Maximum backward-compatibility has been kept with the old install scripts style:

* Still log to the same existing log files under ``/var/log/PAVICS``.
* Old single ssh deploy key is still compatible, but the new mechanism allows for a different ssh deploy key for each extra repo (again, public repos should use the https clone path to avoid dealing with ssh deploy keys in the first place).
* Old install scripts are kept and can still deploy the old way.

Features missing in the old install scripts, or how the new mechanism improves on them:

* Autodeploy of the autodeploy itself! This is the biggest win. Previously, if the ``triggerdeploy.sh`` or ``PAVICS-deploy-notebooks`` scripts changed, they had to be deployed manually, which was very annoying. Now they are volume-mounted in, so they are fresh on each run.
* ``env.local`` now drives absolutely everything; source control that file and we have a true DevOps pipeline.
* Configurable platform and notebook autodeploy frequencies. Previously, this meant manually editing the generated cron file, which was less ideal.
* Does not need any support on the local host other than ``docker`` and ``docker-compose``. The ``cron/logrotate/git/ssh`` versions are all locked down in the docker images used by the autodeploy. Recall that previously we had to deal with a git version that was too old on some hosts.
* Each cron job runs in its own docker image, meaning the runtime environment is traceable and reproducible.
* The newly introduced scheduler component is extensible, so other jobs (ex: backup) can be added to it as well via ``env.local``, which should be source-controlled, meaning all surrounding maintenance-related tasks can also be traceable and reproducible.


Monitoring
==========

This component provides monitoring and alerting for the PAVICS physical host
and containers.

The Prometheus stack is used:

* Node-exporter to collect host metrics.
* cAdvisor to collect container metrics.
* Prometheus to scrape the metrics, to store them and to query them (a scrape-config sketch follows this list).
* AlertManager to manage alerts: deduplicate, group, route, silence, inhibit.
* Grafana to provide visualization dashboard for the metrics.
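
As a rough orientation only, a minimal Prometheus scrape configuration for the
two exporters could look like the sketch below, with assumed service names and
default ports; the actual ``prometheus.yml`` shipped by this component may differ::

    # Minimal sketch of scrape jobs for the two exporters
    # (assumed service names and default exporter ports).
    scrape_configs:
      - job_name: node-exporter        # host metrics
        static_configs:
          - targets: ['node-exporter:9100']
      - job_name: cadvisor             # per-container metrics
        static_configs:
          - targets: ['cadvisor:8080']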


Usage
-----

- Grafana to view metric graphs: http://PAVICS_FQDN:3001/d/pf6xQMWGz/docker-and-system-monitoring
- Prometheus alert rules: http://PAVICS_FQDN:9090/rules
- AlertManager to manage alerts: http://PAVICS_FQDN:9093


How to Enable the Component
---------------------------

- Edit ``env.local`` (a copy of env.local.example_)

- Add "./components/monitoring" to ``EXTRA_CONF_DIRS``
- Set ``GRAFANA_ADMIN_PASSWORD`` to log in to Grafana
- Set ``ALERTMANAGER_ADMIN_EMAIL_RECEIVER`` for receiving alerts
- Set ``SMTP_SERVER`` for sending alerts
- Optionally set (a sketch of example payloads follows this list)

  - ``ALERTMANAGER_EXTRA_GLOBAL`` to further configure AlertManager
  - ``ALERTMANAGER_EXTRA_ROUTES`` to add notification routes other than email
  - ``ALERTMANAGER_EXTRA_INHIBITION`` to mute existing alert rules
  - ``ALERTMANAGER_EXTRA_RECEIVERS`` to add receivers beyond the admin email
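
To make these optional variables more concrete, here is a minimal, hypothetical
sketch of the YAML fragments they could carry, assuming each value is injected
verbatim into the corresponding section of the generated ``alertmanager.yml``;
the webhook URL, channel and severity label are placeholders, not values from
this repository::

    # Possible ALERTMANAGER_EXTRA_RECEIVERS payload: a Slack receiver
    # (placeholder webhook URL and channel).
    - name: slack-ops
      slack_configs:
        - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
          channel: '#pavics-alerts'

    # Possible ALERTMANAGER_EXTRA_ROUTES payload: also send critical alerts
    # to the Slack receiver above, in addition to the default email route.
    - receiver: slack-ops
      match:
        severity: critical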


Grafana Dashboard
-----------------

.. image:: grafana-dashboard.png

For the host, using Node-exporter to collect metrics:

- uptime
- number of containers
- used disk space
- used memory, available memory, used swap memory
- load
- cpu usage
- in and out network traffic
- disk I/O

For each container, using cAdvisor to collect metrics:

- in and out network traffic
- cpu usage
- memory and swap memory usage
- disk usage

Useful visualisation features:

- zoom in on one graph and all the other graphs update to match the same "time range", so we can correlate events
- view each graph independently for more details
- mousing over a data point shows its value at that moment


Prometheus Alert Rules
----------------------

.. image:: prometheus-alert-rules.png


AlertManager for Alert Dashboard and Silencing
----------------------------------------------

.. image:: alertmanager-dashboard.png
.. image:: alertmanager-silence-alert.png


Customizing the Component
-------------------------

- To add more Grafana dashboards, volume-mount more ``*.json`` files to the
  grafana container.

- To add more Prometheus alert rules, volume-mount more ``*.rules`` files to
  the prometheus container (a sketch of such a rules file follows this list).

- To disable existing Prometheus alert rules, add more Alertmanager inhibition
rules using ``ALERTMANAGER_EXTRA_INHIBITION`` via ``env.local`` file.

- Other possible Alertmanager configs via ``env.local``:
``ALERTMANAGER_EXTRA_GLOBAL``, ``ALERTMANAGER_EXTRA_ROUTES`` (can route to
Slack or other services accepting webhooks), ``ALERTMANAGER_EXTRA_RECEIVERS``.
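
For illustration, an extra ``*.rules`` file that could be volume-mounted into
the prometheus container might look like the minimal sketch below (standard
Prometheus 2.x rules file format; the alert name, expression and threshold are
hypothetical, not rules shipped with this component). How the mount is wired up
(e.g. a docker-compose override) is left to the usual customization mechanism
of the stack::

    # example-extra.rules -- hypothetical additional alert rule file
    groups:
      - name: example-extra
        rules:
          - alert: HostDiskWillFillIn4Hours
            # Linear prediction of free disk space based on the last hour of data.
            expr: 'predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}[1h], 4 * 3600) < 0'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Disk predicted to fill within 4 hours on {{ $labels.instance }}"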




.. _env.local.example: ../env.local.example
.. _fix-write-perm: ../deployment/fix-write-perm
.. _deploy.sh: ../deployment/deploy.sh
Binary file added birdhouse/components/alertmanager-dashboard.png
Binary file added birdhouse/components/alertmanager-silence-alert.png
Binary file added birdhouse/components/grafana-dashboard.png
2 changes: 2 additions & 0 deletions birdhouse/components/monitoring/.gitignore
@@ -1,3 +1,5 @@
prometheus.yml
grafana_datasources.yml
grafana_dashboards.yml
alertmanager.yml
prometheus.rules
