This was initially built for Marathon 0.8.0, hence we don't use the event bus.
$ marathon-alerts --help
Usage of marathon-alerts:
      --alerts-suppress-duration duration             Suppress alerts for this duration once notified (default 30m0s)
      --check-interval duration                       Check runs periodically on this interval (default 30s)
      --check-min-healthy-critical-threshold value    Min Healthy instances check fail threshold (default 0.5)
      --check-min-healthy-warn-threshold value        Min Healthy instances check warning threshold (default 0.75)
      --check-min-instances-critical-threshold value  Min Instances check fail threshold (default 0.5)
      --check-min-instances-warn-threshold value      Min Instances check warning threshold (default 0.75)
      --debug                                         Enable debug mode. More counters for now.
      --pid string                                    File to write PID file (default "PID")
      --slack-channel string                          #Channel / @User to post the alert (defaults to webhook configuration)
      --slack-owner string                            Comma list of owners who should be alerted on the post
      --slack-webhook string                          Comma list of Slack webhooks to post the alert
      --uri string                                    Marathon URI to connect
An example invocation looks like this:
$ marathon-alerts --uri http://marathon1:8080,marathon2:8080 \
    --slack-webhook https://hooks.slack.com/services/..../ \
    --slack-owner ashwanthkumar,slackbot
Apart from the startup flags, the functionality can be controlled at an app level using labels in the app specification. The following table explains the properties and their usage.
|alerts.enabled||Controls if the alerts for the app should be enabled or disabled. Defaults - true||false|
|alerts.checks.subscribe||Comma separated list of checks that need to be run. Defaults - all||all|
|alerts.routes||Routes different checks to different notifiers based on their level. See the section below on Routes to understand how you can add routes to your apps. Defaults - "*/warning/*;*/critical/*;*/resolved/*"|
|alerts.min-healthy.critical.threshold||Failure threshold for min-healthy check. Defaults -
|alerts.min-healthy.warn.threshold||Warning threshold for min-healthy check. Defaults -
|alerts.min-instances.critical.threshold||Failure threshold for min-instances check. Defaults -
|alerts.min-instances.warn.threshold||Warning threshold for min-instances check. Defaults -
|alerts.slack.webhook||Comma separated list of Slack webhooks to send slack notifications. Overrides -
|alerts.slack.channel||#Channel / @User to post the alert into. Overrides -
|alerts.slack.owners||Comma separated list of users who should be tagged in the alert. Overrides -
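To illustrate how per-app labels could be combined with the startup flags, here is a small Go sketch. It assumes that a missing or unparsable label falls back to the global value, which is an assumption rather than confirmed behavior; appSettings and resolve are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// appSettings is a hypothetical container for values resolved from labels.
type appSettings struct {
	enabled  bool
	checks   []string
	critical float64 // min-healthy critical threshold
}

// resolve reads settings from an app's labels. Labels that are missing or
// cannot be parsed keep their default (assumed fallback behavior).
func resolve(labels map[string]string, globalCritical float64) appSettings {
	s := appSettings{enabled: true, checks: []string{"all"}, critical: globalCritical}
	if v, ok := labels["alerts.enabled"]; ok {
		if b, err := strconv.ParseBool(v); err == nil {
			s.enabled = b
		}
	}
	if v, ok := labels["alerts.checks.subscribe"]; ok && v != "" {
		s.checks = strings.Split(v, ",")
	}
	if v, ok := labels["alerts.min-healthy.critical.threshold"]; ok {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			s.critical = f
		}
	}
	return s
}

func main() {
	labels := map[string]string{
		"alerts.enabled":                        "true",
		"alerts.checks.subscribe":               "min-healthy,min-instances",
		"alerts.min-healthy.critical.threshold": "0.4",
	}
	// 0.5 stands in for the value of --check-min-healthy-critical-threshold.
	fmt.Printf("%+v\n", resolve(labels, 0.5))
}
```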
We collect some metrics internally in marathon-alerts. They're dumped periodically to STDERR. The following table lists the metrics and their usage.
|alerts-suppressed-cleaned||Number of alerts we cleaned up because their suppress duration expired.|
|marathon-all-apps-response-time||Response time of Marathon's /v2/apps API call|
|notifications-total||Total number of notifications we sent from AlertManager to NotificationManager|
|notifications-warning||Number of Warning check notifications we sent from AlertManager to NotificationManager|
|notifications-critical||Number of Critical check notifications we sent from AlertManager to NotificationManager|
|notifications-resolved||Number of Pass (aka Resolved) check notifications we sent from AlertManager to NotificationManager|
|notifications-rate||Meter metric that denotes the rate at which notifications are being sent|
Apart from the standard metrics above, we also collect several other metrics, mostly for debugging purposes. You can enable these metrics by running marathon-alerts with the --debug flag.
|alerts-suppressed-called||Number of times we called AlertManager.cleanUpSupressedAlerts()|
|alerts-process-check-called||Number of times we called AlertManager.processCheck()|
|alerts-manager-stopped||Number of times we called AlertManager.Stop()|
|apps-checker-stopped||Number of times we called AppChecker.Stop()|
|apps-checker-marathon-all-apps-api||Number of times we called Marathon's /v2/apps API|
|apps-checker-alerts-sent||Number of checks we sent to AlertManager from AppChecker|
|apps-checker-check-<name>||Number of checks identified by <name> we sent to AlertManager|
|apps-checker-app-<id>||Number of checks for an app identified by <id> we sent to AlertManager|
|apps-checker-<id>-<name>||Number of checks identified by <name> for an app identified by <id> we sent to AlertManager|
|notifications-warning-rate||Meter metric that denotes the rate at which warning notifications are being sent|
|notifications-critical-rate||Meter metric that denotes the rate at which critical notifications are being sent|
|notifications-resolved-rate||Meter metric that denotes the rate at which resolved notifications are being sent|
From v0.3.0-RC7 onwards, different check alerts can be routed to different notifiers. You can control the routes on a per-app basis using the alerts.routes label. The value should have the following format: <check-name>/<check-level>/<notifier-name>
- Check name and Notifier names can be glob patterns. No complicated regex allowed as of now.
- Check level has to be one of warning / pass / critical / resolved.
- Multiple routes can be defined by separating them using ; (semicolon).
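The rules above can be sketched in Go as follows. This is not the project's actual parser. It assumes each route has three slash-separated fields in the order check name, check level, notifier name (inferred from the default route "*/warning/*"), that routes are separated by semicolons, and that the level is compared literally. Glob matching uses the standard library's path.Match, and the notifier names are made up for the example.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// route is one check/level/notifier triple (hypothetical type).
type route struct {
	check, level, notifier string
}

// parseRoutes splits a routes value on ';' and each route on '/'.
func parseRoutes(spec string) ([]route, error) {
	var routes []route
	for _, part := range strings.Split(spec, ";") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		fields := strings.Split(part, "/")
		if len(fields) != 3 {
			return nil, fmt.Errorf("invalid route %q: want check/level/notifier", part)
		}
		routes = append(routes, route{fields[0], fields[1], fields[2]})
	}
	return routes, nil
}

// matches applies glob matching to the check and notifier names
// and an exact comparison to the level.
func (r route) matches(check, level, notifier string) bool {
	checkOK, err1 := path.Match(r.check, check)
	notifierOK, err2 := path.Match(r.notifier, notifier)
	return err1 == nil && err2 == nil && checkOK && notifierOK && r.level == level
}

// shouldRoute reports whether any route sends this check result to the notifier.
func shouldRoute(routes []route, check, level, notifier string) bool {
	for _, r := range routes {
		if r.matches(check, level, notifier) {
			return true
		}
	}
	return false
}

func main() {
	routes, err := parseRoutes("min-healthy/critical/pagerduty;*/warning/slack")
	if err != nil {
		panic(err)
	}
	fmt.Println(shouldRoute(routes, "min-healthy", "critical", "pagerduty"))
	fmt.Println(shouldRoute(routes, "min-instances", "warning", "slack"))
	fmt.Println(shouldRoute(routes, "min-instances", "critical", "slack"))
}
```

With these two routes, critical min-healthy results go only to the notifier named pagerduty, and warnings from any check go only to slack.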
The default routes, if none are specified, are "*/warning/*;*/critical/*;*/resolved/*". This routes the warning, critical, and resolved notifications of every check to all available notifiers.
Binaries are available here.
We've a sample
marathon.json.conf that we use in our organization along with
To build from source, you need
glide tool in
$ cd $GOPATH/src
$ mkdir -p github.com/ashwanthkumar/marathon-alerts
$ git clone https://github.com/ashwanthkumar/marathon-alerts.git github.com/ashwanthkumar/marathon-alerts
$ cd github.com/ashwanthkumar/marathon-alerts
$ make setup # Downloads the required dependencies
$ make test # Runs the test
$ make build # Builds the distribution specific binary
- min-healthy - Minimum % of Task instances that should be healthy, else this check is fired.
- min-instances - Minimum % of Task instances that should be healthy or staged, else this check is fired.
- max-instances - If the number of instances goes beyond some % of the pre-defined max limit.
- suspended - If the service was suspended by mistake or unintentionally. min-healthy doesn't catch suspended services today.
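As a rough illustration of how a threshold check such as min-healthy could classify an app, here is a Go sketch. The function name, the strict less-than comparison, and the handling of zero instances are assumptions. The 0.75 and 0.5 values are the warn and critical defaults from the command-line flags.

```go
package main

import "fmt"

// minHealthyLevel compares the healthy fraction of an app's instances
// against the warn and critical thresholds. Hypothetical helper.
func minHealthyLevel(healthy, total int, warn, critical float64) string {
	if total == 0 {
		// Assumption: an app with no instances (for example a suspended one)
		// passes, which is in line with min-healthy not catching suspended services.
		return "pass"
	}
	ratio := float64(healthy) / float64(total)
	switch {
	case ratio < critical:
		return "critical"
	case ratio < warn:
		return "warning"
	default:
		return "pass"
	}
}

func main() {
	fmt.Println(minHealthyLevel(2, 10, 0.75, 0.5)) // 20% healthy
	fmt.Println(minHealthyLevel(6, 10, 0.75, 0.5)) // 60% healthy
	fmt.Println(minHealthyLevel(8, 10, 0.75, 0.5)) // 80% healthy
}
```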
- Pager Duty
If you have any feature requests or issues, please open a GitHub issue. We accept PRs. Fork away!