Need to add some monitoring and notification to the automated deployment system and PAVICS in general #12

Open
tlvu opened this issue Jan 29, 2020 · 7 comments

Comments

tlvu (Collaborator) commented Jan 29, 2020

Migrated from old PAVICS https://github.com/Ouranosinc/PAVICS/issues/140

Automated deployment was triggered but not performed on boreas because of

```
++ git status -u --porcelain
+ '[' '!' -z '?? birdhouse/old_docker-compose.override.yml_18062019' ']'
+ echo 'ERROR: unclean repo'
ERROR: unclean repo
+ exit 1
```

We need a system to monitor the logs and send a notification if there are any errors. This log-file error monitoring and notification can later be generalized to watch any system, so each system is not forced to reimplement monitoring and notification itself.
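As a rough illustration of the idea (a minimal sketch only; the log path, recipient and use of the `mail` command are assumptions, not the actual PAVICS setup), such a watcher could start out as simple as a cron job:

```sh
#!/bin/sh
# Hypothetical log watcher: email a notification when the autodeploy log
# contains errors.  Path and recipient below are placeholders.
LOG=/var/log/PAVICS/autodeploy.log
RECIPIENT=admin@example.com

if grep -q 'ERROR' "$LOG"; then
    grep 'ERROR' "$LOG" | mail -s "PAVICS autodeploy error on $(hostname)" "$RECIPIENT"
fi
```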

This problem has triggered this issue https://github.com/Ouranosinc/PAVICS/issues/176

There are basically 4 types of monitoring that I think we need:

  • Monitor system-wide resource usage (CPU, RAM, disk, I/O, processes, ...): we already have this one.

  • Monitor per-container resource usage (CPU, RAM, disk, I/O, processes, ...): we already have this one.

  • Monitor application logs for errors and unauthorized access: we do not have this one. Useful for proactively catching errors instead of waiting for users to log bugs.

  • Monitor the end-to-end workflow of all deployed applications to ensure they work properly together (no config errors): we partially have this one, with tutorial notebooks being tested by Jenkins daily. Unfortunately, not all apps have associated notebooks, or the notebooks exist but have problems running non-interactively under Jenkins.

@tlvu tlvu changed the title Need to add some monitoring and notification to the automated deployment system Need to add some monitoring and notification to the automated deployment system and PAVICS in general May 1, 2020
tlvu added a commit that referenced this issue Jul 2, 2020
Monitoring for host and each docker container.

![Screenshot_2020-06-19 Docker and system monitoring - Grafana](https://user-images.githubusercontent.com/11966697/85206384-c7f6f580-b2ef-11ea-848d-46490eb95886.png)

For the host, using Node-exporter to collect metrics:
* uptime
* number of containers
* used disk space
* used memory, available memory, used swap memory
* load
* cpu usage
* in and out network traffic 
* disk I/O

For each container, using cAdvisor to collect metrics:
* in and out network traffic
* cpu usage
* memory and swap memory usage
* disk usage

Useful visualisation features:
* zooming in on one graph updates all the other graphs to the same time range, so we can correlate events
* each graph can be viewed independently for more details
* hovering over a data point shows its value at that moment

Prometheus is used as the time series DB and Grafana is used as the visualization dashboard.

Node-exporter, cAdvisor and Prometheus are exposed so another Prometheus on the network can also scrape those same metrics and perform other analysis if required.
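For example, an external Prometheus could add scrape jobs along these lines (hostnames are placeholders; 9100 and 8080 are the usual Node-exporter and cAdvisor ports, the actual exposed ports may differ):

```yaml
# prometheus.yml on the external Prometheus (illustrative targets)
scrape_configs:
  - job_name: 'pavics-node-exporter'
    static_configs:
      - targets: ['pavics-host.example.com:9100']
  - job_name: 'pavics-cadvisor'
    static_configs:
      - targets: ['pavics-host.example.com:8080']
```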

The whole monitoring stack is a separate component, so users are not forced to enable it if another monitoring system is already in place. Enabling this monitoring stack is done via the `env.local` file, like all other components.
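For instance, something along these lines in `env.local` (from memory; the exact variable name and component path may differ in the repo):

```sh
# env.local (illustrative): enable the optional monitoring component
EXTRA_CONF_DIRS="$EXTRA_CONF_DIRS ./components/monitoring"
```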

The Grafana dashboard is taken from https://grafana.com/grafana/dashboards/893 with many fixes (see commits), since most of the metric names have changed over time. Still, it was much quicker to hit the ground running this way than to learn the Prometheus query language and Grafana visualization options from scratch, especially since there are lots of exposed metrics and we had to filter out which ones are relevant to graph. So starting from a broken dashboard was still a big win. Grafana has a big collection of existing, though probably unmaintained, dashboards we can leverage.

So this is a first draft for monitoring.  Many things are uncertain, will need tweaking, or are missing:
* We will probably have to add more metrics, or remove some that turn out to be irrelevant; with time we will see.
* We will probably have to tweak the scrape interval and the retention time to keep the disk storage requirement reasonable; again, we'll see with time.
* Alerting is missing. With all these pretty graphs, we are not going to look at them all day; we need some kind of alerting mechanism.

Test system: http://lvupavicsmaster.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m, user: admin, passwd: the default passwd

Also tested on Medus: http://medus.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m (on Medus, a full yum update was needed to get a new kernel and a new Docker engine for cAdvisor to work properly).

Part of issue #12
tlvu added a commit that referenced this issue Jul 11, 2020
Monitoring: add alert rules and alert handling (deduplicate, group, route, silence, inhibit).

This is a follow-up to the previous PR #56 that added the monitoring itself.

Added the cAdvisor and Node-exporter collections of alert rules found here: https://awesome-prometheus-alerts.grep.to/rules, with a few fixes for errors in the rules and some tweaking to reduce false-positive alarms (see list of commits).  It is a great collection of ready-made sample rules to hit the ground running, while learning the PromQL query language along the way.
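For illustration, the rules in that collection look roughly like this (a simplified disk-space rule; the threshold and labels here are just examples):

```yaml
groups:
  - name: node-exporter
    rules:
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left)"
```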

![2020-07-08-090953_474x1490_scrot](https://user-images.githubusercontent.com/11966697/86926000-8b086c80-c0ff-11ea-92d0-6f5ccfe2b8e1.png)

Added Alertmanager to handle the alerts (deduplicate, group, route, silence, inhibit).  Currently the only notification route configured is email, but Alertmanager is able to route alerts to Slack and any generic service accepting webhooks.
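The email route boils down to an Alertmanager configuration roughly like this (SMTP host and addresses are placeholders, not the actual config):

```yaml
# alertmanager.yml (illustrative)
global:
  smtp_smarthost: 'smtp.example.com:25'
  smtp_from: 'alertmanager@example.com'
  smtp_require_tls: false

route:
  receiver: email-admin
  group_by: ['alertname', 'instance']

receivers:
  - name: email-admin
    email_configs:
      - to: 'admin@example.com'
```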

![2020-07-08-091150_1099x669_scrot](https://user-images.githubusercontent.com/11966697/86926213-cd31ae00-c0ff-11ea-8b2a-d33803ad3d5d.png)

![2020-07-08-091302_1102x1122_scrot](https://user-images.githubusercontent.com/11966697/86926276-dc186080-c0ff-11ea-9377-bda03b69640e.png)

This is an initial attempt at alerting.  There are several ways to tweak the system without changing the code:

* To add more Prometheus alert rules, volume-mount more `*.rules` files to the Prometheus container (see the sketch after this list).
* To disable existing Prometheus alert rules, add more Alertmanager inhibition rules using `ALERTMANAGER_EXTRA_INHIBITION` via the `env.local` file.
* Other possible Alertmanager configs via `env.local`: `ALERTMANAGER_EXTRA_GLOBAL`, `ALERTMANAGER_EXTRA_ROUTES`, `ALERTMANAGER_EXTRA_RECEIVERS`.
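As a sketch of the first point (the service name, file name and in-container rules path are hypothetical, not the actual compose layout):

```yaml
# docker-compose.override.yml (hypothetical paths)
services:
  prometheus:
    volumes:
      - ./custom-alerts.rules:/etc/prometheus/custom-alerts.rules:ro
```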

What more could be done after this initial attempt:

* Possibly add more graphs to the Grafana dashboard, since we now have alerts on metrics that have no matching Grafana graph. Graphs are useful for historical trends and correlation with other metrics, so they are not required if we do not need trends and correlation.

* Only basic metrics are being collected currently.  We could collect more useful metrics like SMART status and alert when a disk is failing.

* The autodeploy mechanism can hook into this monitoring system to report pass/fail status and execution duration, with alerting for problems.  Then we can also correlate CPU, memory, or disk I/O spikes with autodeploy runs and keep a trace of previous autodeploy executions.
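One common pattern for that (a sketch, assuming the Node-exporter textfile collector is enabled and writable at the path shown; metric names are made up for illustration) is to have the autodeploy script drop its status into a `.prom` file:

```sh
# End of a bash autodeploy script (hypothetical paths and metric names).
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
STATUS=$?              # exit status of the deployment step just run
DURATION=$SECONDS      # bash: seconds elapsed since the script started

cat > "${TEXTFILE_DIR}/autodeploy.prom" <<EOF
autodeploy_last_run_timestamp_seconds $(date +%s)
autodeploy_last_run_duration_seconds ${DURATION}
autodeploy_last_run_success $([ "${STATUS}" -eq 0 ] && echo 1 || echo 0)
EOF
```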

I had to test these alerts directly in prod to tweak them for fewer false positives and to debug rules that were not working, so these changes are already in prod!  This also exercises the SMTP server on the network.

See rules on Prometheus side: http://pavics.ouranos.ca:9090/rules, http://medus.ouranos.ca:9090/rules

Manage alerts on Alertmanager side: http://pavics.ouranos.ca:9093/#/alerts, http://medus.ouranos.ca:9093/#/alerts

Part of issue #12
huard (Collaborator) commented Aug 30, 2023

Solutions to convert logs into Prometheus metrics:

huard (Collaborator) commented Aug 30, 2023

Managed to get Fluentd to parse Nginx logs.

1. Build a Docker image with the Prometheus plugin:

```dockerfile
FROM fluent/fluentd:edge

# Use root account to use apk
USER root

# The RUN below installs the fluent-plugin-prometheus gem;
# customize the list of plugins to install as needed
RUN apk add --no-cache --update --virtual .build-deps \
        sudo build-base ruby-dev \
 && sudo gem install fluent-plugin-prometheus \
 && sudo gem sources --clear-all \
 && apk del .build-deps \
 && rm -rf /tmp/* /var/tmp/* /usr/lib/ruby/gems/*/cache/*.gem

# COPY fluent.conf /fluentd/etc/
COPY entrypoint.sh /bin/

USER fluent
```

2. Configure `fluent.conf`:

```
<source>
    @type tail
    path /var/log/nginx/access_file.log.*
    # pos_file /var/log/td-agent/nginx-access-file.pos  # Store tail position across restarts
    # follow_inodes true  # Without this parameter, file rotation causes log duplication.
    refresh_interval 2
    <parse>
      @type regexp
      expression /^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] \"(?<method>\w+)(?:\s+(?<path>[^\"]*?)(?:\s+\S*)?)?\" (?<status_code>[^ ]*) (?<size>[^ ]*)(?:\s"(?<referer>[^\"]*)") "(?<agent>[^\"]*)" (?<urt>[^ ]*)$/
      time_format %Y-%m-%dT%H:%M:%S%:z
      keep_time_key true
      types size:integer,urt:float
    </parse>
    tag nginx
</source>

<filter nginx>
  @type grep
  <regexp>
    key path
    pattern /\/twitcher\/ows\/proxy\/thredds\/(dodsC|fileserver)\//
  </regexp>
</filter>

<filter nginx>
  @type parser
  key_name path
  reserve_data true
  <parse>
    @type regexp
    expression /.*?\/thredds\/(?<tds_service>[^\/]+)(?:\/(?<dataset>[^\?]*))(?:\?(?<request>[^\=]+))?(?:=(?<params>.*))?/
  </parse>
</filter>


<filter nginx>
    @type prometheus

  <metric>
    name nginx_size_bytes_total
    type counter
    desc nginx bytes sent
    key size
  </metric>

  <metric>
    name nginx_thredds_transfer_size_kb
    type counter
    desc THREDDS data transferred [kb]
    key size
    <labels>
      remote ${remote}
      tds_service ${tds_service}
      dataset ${dataset}
    </labels>
  </metric>
</filter>

# expose metrics in prometheus format

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
  interval 2
  <labels>
    hostname ${hostname}
  </labels>
</source>
```

Then launch the container. It will tail the logs and parse any new log lines as they are added. The `pos_file` option stores the position of the last read so ingestion resumes correctly across restarts.
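For reference, launching it could look roughly like this (image and container names are placeholders; the mount paths match the Dockerfile and config above, and 24231 is the metrics port configured in `fluent.conf`):

```sh
# Build the image from the Dockerfile above, then run it with the Nginx logs
# and fluent.conf mounted in; /metrics is served on port 24231.
docker build -t fluentd-prometheus .
docker run -d --name fluentd-nginx \
  -v /var/log/nginx:/var/log/nginx:ro \
  -v "$(pwd)/fluent.conf:/fluentd/etc/fluent.conf:ro" \
  -p 24231:24231 \
  fluentd-prometheus
```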

huard (Collaborator) commented Aug 31, 2023

For the record, the installation of the service was straightforward, but its configuration was painful. The documentation is confusing (the same names mean different things in different contexts) and the regexps use Ruby syntax, which is slightly different from Python syntax. The error messages are, however, clear enough to fix issues as they arise.

tlvu (Collaborator, Author) commented Aug 31, 2023

@fmigneault I have a vague memory that canarie-api parses the nginx/access_file.log? Any problem if the format of that file changes to JSON? It's easier to parse JSON when we need to ingest the logs to extract metrics.

fmigneault (Collaborator) commented

@tlvu
Yes. It parses the logs using this regex: https://github.com/Ouranosinc/CanarieAPI/blob/master/canarieapi/logparser.py#L53
We would need some option and a small code edit to handle the JSON format instead.

tlvu (Collaborator, Author) commented Sep 1, 2023

@fmigneault Finally, I think Nginx allows writing the same log to several files, so I'll keep the existing log file intact, write the JSON format to a different log file, and then we can parse that other file. This is a more flexible and backward-compatible solution. It will probably result in a new optional component so the solution can be re-used by other organizations deploying PAVICS.
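A sketch of what that could look like on the Nginx side (the JSON field list and file name are illustrative, not the final format):

```nginx
# Define a JSON log format alongside the existing one (illustrative fields).
log_format json_combined escape=json
  '{'
    '"remote_addr":"$remote_addr",'
    '"time_local":"$time_local",'
    '"request":"$request",'
    '"status":"$status",'
    '"body_bytes_sent":"$body_bytes_sent",'
    '"http_referer":"$http_referer",'
    '"http_user_agent":"$http_user_agent",'
    '"request_time":"$request_time"'
  '}';

# Keep the existing access_log directive as-is and add a second copy in JSON.
access_log /var/log/nginx/access_file.json.log json_combined;
```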

huard (Collaborator) commented Apr 5, 2024

This looks promising:
https://github.com/martin-helmich/prometheus-nginxlog-exporter

mishaschwartz added a commit that referenced this issue May 14, 2024
## Overview

This version of canarie-api permits running the proxy (nginx) container
independently of the canarie-api application. This makes it easier to
monitor the logs of canarie-api and proxy containers simultaneously and
allows for the configuration files for canarie-api to be mapped to the
canarie-api containers where appropriate.

## Changes

**Non-breaking changes**
- New component version canarie-api:1.0.0

**Breaking changes**

## Related Issue / Discussion

- Resolves [issue id](url)

## Additional Information

Links to other issues or sources.

- This might make parsing the nginx logs slightly easier as well, which could help with #12 and #444

## CI Operations

<!--
The test suite can be run using a different DACCS config with
``birdhouse_daccs_configs_branch: branch_name`` in the PR description.
To globally skip the test suite regardless of the commit message use
``birdhouse_skip_ci`` set to ``true`` in the PR description.
Note that using ``[skip ci]``, ``[ci skip]`` or ``[no ci]`` in the
commit message will override ``birdhouse_skip_ci`` from the PR
description.
-->

birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false