Merge pull request #59 from bird-house/add-alert-handling
Monitoring: add alert rules and alert handling (deduplicate, group, route, silence, inhibit).

This is a follow-up to the previous PR #56 that added the monitoring itself.

Added the cAdvisor and Node-exporter collection of alert rules found at https://awesome-prometheus-alerts.grep.to/rules, with a few fixes for errors in the rules and some tweaks to reduce false-positive alarms (see list of commits). It is a great collection of ready-made sample rules to hit the ground running and learn the PromQL query language along the way.

![2020-07-08-090953_474x1490_scrot](https://user-images.githubusercontent.com/11966697/86926000-8b086c80-c0ff-11ea-92d0-6f5ccfe2b8e1.png)

Added Alertmanager to handle the alerts (deduplicate, group, route, silence, inhibit). Currently the only notification route configured is email, but Alertmanager is able to route alerts to Slack and to any generic service accepting webhooks.

![2020-07-08-091150_1099x669_scrot](https://user-images.githubusercontent.com/11966697/86926213-cd31ae00-c0ff-11ea-8b2a-d33803ad3d5d.png)

![2020-07-08-091302_1102x1122_scrot](https://user-images.githubusercontent.com/11966697/86926276-dc186080-c0ff-11ea-9377-bda03b69640e.png)

This is an initial attempt at alerting. There are several ways to tweak the system without changing the code:

* To add more Prometheus alert rules, volume-mount more *.rules files to the prometheus container.
* To disable existing Prometheus alert rules, add more Alertmanager inhibition rules using `ALERTMANAGER_EXTRA_INHIBITION` via the `env.local` file.
* Other possible Alertmanager configs via `env.local`: `ALERTMANAGER_EXTRA_GLOBAL, ALERTMANAGER_EXTRA_ROUTES, ALERTMANAGER_EXTRA_RECEIVERS`.

What more could be done after this initial attempt:

* Possibly add more graphs to the Grafana dashboard, since we now have alerts on metrics that have no matching Grafana graph. Graphs are useful for historical trends and correlation with other metrics, so they are not required if we do not need trends and correlation.
* Only basic metrics are being collected currently. We could collect more useful metrics like SMART status and alert when a disk is failing.
* The autodeploy mechanism can hook into this monitoring system to report pass/fail status and execution duration, with alerting for problems. Then we can also correlate any CPU, memory or disk I/O spike with autodeploy runs and keep a trace of previous autodeploy executions.

I had to test these alerts directly in prod to tweak them for fewer false-positive alerts and to debug rules that were not working, so these changes are already in prod! This also tests the SMTP server on the network.

See rules on the Prometheus side: http://pavics.ouranos.ca:9090/rules, http://medus.ouranos.ca:9090/rules

Manage alerts on the Alertmanager side: http://pavics.ouranos.ca:9093/#/alerts, http://medus.ouranos.ca:9093/#/alerts

Part of issue #12
Showing 15 changed files with 977 additions and 69 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,233 @@ | ||
################# | ||
PAVICS Components | ||
################# | ||
|
||
|
||
.. contents:: | ||
|
||
|
||
Scheduler | ||
========= | ||
|
||
This component provides automated unattended continuous deployment for the | ||
"PAVICS stack", for the tutorial notebooks on the Jupyter environment and for the | ||
automated deployment itself. | ||
|
||
It can also be used to schedule other tasks on the PAVICS physical host. | ||
|
||
Everything is dockerized: the deployment runs inside a container that | ||
updates all other containers. | ||
|
||
Automated unattended continuous deployment means that if code changes in the | ||
remote repo on the currently checked-out branch (ex: config changes, | ||
``docker-compose.yml`` changes), a deployment is performed automatically | ||
without human intervention. | ||
|
||
The trigger for the deployment is a new code change on the server for the | ||
current branch (PR merged, push). Local code changes do not trigger a | ||
deployment, so a local development workflow is also supported. | ||
|
||
Multiple remote repos are supported so the "PAVICS stack" can be made of | ||
multiple checkouts for modularity and extensibility. The autodeploy will | ||
trigger if any of the checkouts (configured in ``AUTODEPLOY_EXTRA_REPOS``) is | ||
not up-to-date with its remote repo. | ||
|
||
A suggested "PAVICS stack" is made of at least 2 repos, this repo and another | ||
private repo containing the source controlled ``env.local`` file and any other | ||
docker-compose override for true infrastructure-as-code. | ||
|
||
Note: there are still cases where a human intervention is needed. See note in | ||
script deploy.sh_. | ||
|
||
|
||
Usage | ||
----- | ||
|
||
Given the unattended nature, there is no UI. Logs are used to keep trace. | ||
|
||
- ``/var/log/PAVICS/autodeploy.log`` is for the PAVICS deployment. | ||
|
||
- ``/var/log/PAVICS/notebookdeploy.log`` is for the tutorial notebooks deployment. | ||
|
||
- logrotate is enabled for ``/var/log/PAVICS/*.log`` to avoid filling up the | ||
disk. Any new ``.log`` files in that folder will get logrotate for free. | ||
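
Since the logs are the only trace of what the autodeploy did, a small helper to print the tail of both files can be handy. This is only a sketch: the function name ``show_deploy_logs`` and its arguments are hypothetical and not part of this repo; only the log paths come from the list above.

```shell
# Hypothetical helper (not part of this repo): print the last lines of each
# deployment log. The log directory defaults to the one documented above.
# The body runs in a subshell "( ... )" so its variables do not leak.
show_deploy_logs() (
    log_dir="${1:-/var/log/PAVICS}"
    lines="${2:-20}"
    for log_file in "$log_dir/autodeploy.log" "$log_dir/notebookdeploy.log"; do
        if [ -r "$log_file" ]; then
            echo "==> $log_file <=="
            tail -n "$lines" "$log_file"
        else
            echo "missing or unreadable: $log_file"
        fi
    done
)
```

Calling ``show_deploy_logs`` without arguments reads from ``/var/log/PAVICS``; a different directory and line count can be passed, for example ``show_deploy_logs /tmp/logs 50``.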
|
||
|
||
How to Enable the Component | ||
--------------------------- | ||
|
||
- Edit ``env.local`` (a copy of env.local.example_) | ||
|
||
- Add "./components/scheduler" to ``EXTRA_CONF_DIRS``. | ||
- Set ``AUTODEPLOY_EXTRA_REPOS``, ``AUTODEPLOY_DEPLOY_KEY_ROOT_DIR``, | ||
``AUTODEPLOY_PLATFORM_FREQUENCY``, ``AUTODEPLOY_NOTEBOOK_FREQUENCY`` as desired, | ||
full documentation in env.local.example_. | ||
- Run once fix-write-perm_, see doc in script. | ||
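
As an illustration, the scheduler-related part of ``env.local`` might look like the sketch below. All paths are placeholders, and the ``export`` style and space-separated list format are assumptions to be checked against env.local.example_. The two frequency variables are left out on purpose because their value format is documented only in that file.

```shell
# Sketch of scheduler-related settings in env.local.
# All values are illustrative assumptions; see env.local.example for the
# documented format and defaults.

# Append the scheduler component to any directories already configured
# (assumes a space-separated list).
export EXTRA_CONF_DIRS="${EXTRA_CONF_DIRS:+$EXTRA_CONF_DIRS }./components/scheduler"

# Hypothetical checkout of a private config repo to keep up to date.
export AUTODEPLOY_EXTRA_REPOS="/path/to/private-config-repo"

# Hypothetical directory holding the ssh deploy keys.
export AUTODEPLOY_DEPLOY_KEY_ROOT_DIR="/path/to/deploy-keys"

# AUTODEPLOY_PLATFORM_FREQUENCY and AUTODEPLOY_NOTEBOOK_FREQUENCY are not
# set here: use the value format documented in env.local.example.
```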
|
||
|
||
Old way to deploy the automatic deployment | ||
------------------------------------------ | ||
|
||
Superseded by this new ``scheduler`` component. Keeping for reference only. | ||
|
||
Doing it the old way does not need the ``scheduler`` component but loses the | ||
ability for the autodeploy system to update itself. | ||
|
||
Configure logrotate for all following automations to prevent disk full:: | ||
|
||
deployment/install-logrotate-config .. $USER | ||
|
||
To enable continuous deployment of PAVICS:: | ||
|
||
deployment/install-automated-deployment.sh .. $USER [daily|5-mins] | ||
# read the script for more options/details | ||
|
||
If you want to manually force a deployment of PAVICS (note this might not use | ||
latest version of deploy.sh script):: | ||
|
||
deployment/deploy.sh . | ||
# read the script for more options/details | ||
|
||
To enable continuous deployment of tutorial Jupyter notebooks:: | ||
|
||
deployment/install-deploy-notebook .. $USER | ||
# read the script for more details | ||
|
||
To trigger tutorial Jupyter notebooks deploy manually:: | ||
|
||
# configure logrotate before because this script will log to | ||
# /var/log/PAVICS/notebookdeploy.log | ||
|
||
deployment/trigger-deploy-notebook | ||
# read the script for more details | ||
|
||
Migrating to the new mechanism requires manual deletion of all the artifacts | ||
created by the old install scripts: ``sudo rm /etc/cron.d/PAVICS-deploy | ||
/etc/cron.hourly/PAVICS-deploy-notebooks /etc/logrotate.d/PAVICS-deploy | ||
/usr/local/sbin/triggerdeploy.sh``. The two mechanisms cannot coexist. | ||
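
Before running the ``rm`` above, it may help to see which of the old artifacts are actually present on the host. The following is a minimal sketch: the function name ``list_old_autodeploy_artifacts`` and its optional root-prefix argument are hypothetical, and only the four paths come from this document.

```shell
# Hypothetical helper (not part of this repo): list which artifacts of the
# old install scripts are still present. The optional first argument is a
# root prefix for a dry run against a test directory; it defaults to the
# real filesystem root. The subshell body keeps variables from leaking.
list_old_autodeploy_artifacts() (
    root="${1:-}"
    for path in \
        /etc/cron.d/PAVICS-deploy \
        /etc/cron.hourly/PAVICS-deploy-notebooks \
        /etc/logrotate.d/PAVICS-deploy \
        /usr/local/sbin/triggerdeploy.sh
    do
        if [ -e "$root$path" ]; then
            echo "$root$path"
        fi
    done
)
```

An empty output means nothing from the old mechanism is left on the host.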
|
||
|
||
Comparison between the old and new autodeploy mechanism | ||
------------------------------------------------------- | ||
|
||
Maximum backward-compatibility has been kept with the old install scripts style: | ||
|
||
* Still log to the same existing log files under ``/var/log/PAVICS``. | ||
* The old single ssh deploy key is still compatible, but the new mechanism allows a different ssh deploy key for each extra repo (again, public repos should use the https clone path to avoid dealing with ssh deploy keys in the first place). | ||
* Old install scripts are kept and can still deploy the old way. | ||
|
||
Features missing in the old install scripts, or how the new mechanism improves on them: | ||
|
||
* Autodeploy of the autodeploy itself! This is the biggest win. Previously, if the ``triggerdeploy.sh`` or ``PAVICS-deploy-notebooks`` script changed, it had to be redeployed manually, which was very annoying. Now they are volume-mounted in, so they are fresh on each run. | ||
* ``env.local`` now drives absolutely everything: source control that file and we have a true DevOps pipeline. | ||
* Configurable platform and notebook autodeploy frequency. Previously, this meant manually editing the generated cron file, which was less ideal. | ||
* No support is needed on the local host other than ``docker`` and ``docker-compose``. ``cron/logrotate/git/ssh`` versions are all locked down in the docker images used by the autodeploy. Recall that previously we had to deal with a git version too old on some hosts. | ||
* Each cron job runs in its own docker image, meaning the runtime environment is traceable and reproducible. | ||
* The newly introduced scheduler component is extensible, so other jobs (ex: backup) can be added to it as well via ``env.local``, which should be source controlled, meaning all surrounding maintenance tasks can also be traceable and reproducible. | ||
|
||
|
||
Monitoring | ||
========== | ||
|
||
This component provides monitoring and alerting for the PAVICS physical host | ||
and containers. | ||
|
||
Prometheus stack is used: | ||
|
||
* Node-exporter to collect host metrics. | ||
* cAdvisor to collect container metrics. | ||
* Prometheus to scrape metrics, to store them and to query them. | ||
* AlertManager to manage alerts: deduplicate, group, route, silence, inhibit. | ||
* Grafana to provide visualization dashboard for the metrics. | ||
|
||
|
||
Usage | ||
----- | ||
|
||
- Grafana to view metric graphs: http://PAVICS_FQDN:3001/d/pf6xQMWGz/docker-and-system-monitoring | ||
- Prometheus alert rules: http://PAVICS_FQDN:9090/rules | ||
- AlertManager to manage alerts: http://PAVICS_FQDN:9093 | ||
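
To quickly confirm the three services are reachable, their health endpoints can be probed from the command line. This sketch assumes plain ``http``, the ports listed above, and the standard health endpoints of Grafana (``/api/health``), Prometheus and Alertmanager (``/-/healthy``). The function names are hypothetical and not part of this repo.

```shell
# Hypothetical helpers (not part of this repo).

# Print the health-check URL of each monitoring service for a given host.
# Ports come from the Usage list above; the paths are the standard health
# endpoints of each tool.
monitoring_health_urls() {
    echo "http://$1:3001/api/health"   # Grafana
    echo "http://$1:9090/-/healthy"    # Prometheus
    echo "http://$1:9093/-/healthy"    # Alertmanager
}

# Probe each URL with curl and report OK or FAIL per service.
check_monitoring() {
    monitoring_health_urls "$1" | while read -r url; do
        if curl -fsS -o /dev/null --max-time 5 "$url"; then
            echo "OK   $url"
        else
            echo "FAIL $url"
        fi
    done
}
```

Usage: ``check_monitoring PAVICS_FQDN``, replacing ``PAVICS_FQDN`` with the real host name.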
|
||
|
||
How to Enable the Component | ||
--------------------------- | ||
|
||
- Edit ``env.local`` (a copy of env.local.example_) | ||
|
||
- Add "./components/monitoring" to ``EXTRA_CONF_DIRS`` | ||
- Set ``GRAFANA_ADMIN_PASSWORD`` to login to Grafana | ||
- Set ``ALERTMANAGER_ADMIN_EMAIL_RECEIVER`` for receiving alerts | ||
- Set ``SMTP_SERVER`` for sending alerts | ||
- Optionally set | ||
|
||
- ``ALERTMANAGER_EXTRA_GLOBAL`` to further configure AlertManager | ||
- ``ALERTMANAGER_EXTRA_ROUTES`` to add more routes than email notification | ||
- ``ALERTMANAGER_EXTRA_INHIBITION`` to disable rules from firing | ||
- ``ALERTMANAGER_EXTRA_RECEIVERS`` to add more receivers than the admin emails | ||
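
A minimal ``env.local`` fragment for the required settings could look like the sketch below. Every value is a placeholder, and the ``host:port`` form of ``SMTP_SERVER`` and the space-separated ``EXTRA_CONF_DIRS`` list are assumptions to be checked against env.local.example_. The optional ``ALERTMANAGER_EXTRA_*`` variables are omitted because their expected content is not described here.

```shell
# Sketch of monitoring-related settings in env.local.
# All values are placeholders; see env.local.example for the documented
# format.

# Append the monitoring component to any directories already configured
# (assumes a space-separated list).
export EXTRA_CONF_DIRS="${EXTRA_CONF_DIRS:+$EXTRA_CONF_DIRS }./components/monitoring"

# Password used to log in to Grafana as admin.
export GRAFANA_ADMIN_PASSWORD="change-me"

# Address that receives the alert emails.
export ALERTMANAGER_ADMIN_EMAIL_RECEIVER="admin@example.com"

# SMTP server used to send the alert emails (host:port form is an assumption).
export SMTP_SERVER="smtp.example.com:25"
```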
|
||
|
||
Grafana Dashboard | ||
----------------- | ||
|
||
.. image:: grafana-dashboard.png | ||
|
||
For host, using Node-exporter to collect metrics: | ||
|
||
- uptime | ||
- number of containers | ||
- used disk space | ||
- used memory, available memory, used swap memory | ||
- load | ||
- cpu usage | ||
- in and out network traffic | ||
- disk I/O | ||
|
||
For each container, using cAdvisor to collect metrics: | ||
|
||
- in and out network traffic | ||
- cpu usage | ||
- memory and swap memory usage | ||
- disk usage | ||
|
||
Useful visualisation features: | ||
|
||
- zooming in on one graph updates all other graphs to the same time range, so events can be correlated | ||
- view each graph independently for more details | ||
- hovering over a data point shows the value at that moment | ||
|
||
|
||
Prometheus Alert Rules | ||
---------------------- | ||
|
||
.. image:: prometheus-alert-rules.png | ||
|
||
|
||
AlertManager for Alert Dashboard and Silencing | ||
---------------------------------------------- | ||
|
||
.. image:: alertmanager-dashboard.png | ||
.. image:: alertmanager-silence-alert.png | ||
|
||
|
||
Customizing the Component | ||
------------------------- | ||
|
||
- To add more Grafana dashboards, volume-mount more ``*.json`` files to the | ||
grafana container. | ||
|
||
- To add more Prometheus alert rules, volume-mount more ``*.rules`` files to | ||
the prometheus container. | ||
|
||
- To disable existing Prometheus alert rules, add more Alertmanager inhibition | ||
rules using ``ALERTMANAGER_EXTRA_INHIBITION`` via ``env.local`` file. | ||
|
||
- Other possible Alertmanager configs via ``env.local``: | ||
``ALERTMANAGER_EXTRA_GLOBAL``, ``ALERTMANAGER_EXTRA_ROUTES`` (can route to | ||
Slack or other services accepting webhooks), ``ALERTMANAGER_EXTRA_RECEIVERS``. | ||
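
As an example of an extra ``*.rules`` file, the sketch below writes one alert in the standard Prometheus rule-file format. The file name, group name, alert name and 10% threshold are all illustrative assumptions. The path where the file must be mounted inside the prometheus container is not stated here and has to be taken from the component's docker-compose configuration.

```shell
# Write a hypothetical extra alert rule file in the standard Prometheus
# rule-file format. File name, group name, alert name and threshold are
# illustrative assumptions, not rules shipped with this component.
rules_file="${RULES_FILE:-./extra-alerts.rules}"

# The quoted 'EOF' keeps the shell from expanding $labels in the template.
cat > "$rules_file" <<'EOF'
groups:
  - name: extra-alerts
    rules:
      - alert: RootFilesystemAlmostFull
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem has less than 10% space left on {{ $labels.instance }}"
EOF

echo "wrote $rules_file"
```

If ``promtool`` is available, ``promtool check rules ./extra-alerts.rules`` can validate the file before it is mounted.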
|
||
|
||
|
||
|
||
.. _env.local.example: ../env.local.example | ||
.. _fix-write-perm: ../deployment/fix-write-perm | ||
.. _deploy.sh: ../deployment/deploy.sh |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
prometheus.yml | ||
grafana_datasources.yml | ||
grafana_dashboards.yml | ||
alertmanager.yml | ||
prometheus.rules |