Monitoring: Incubator #35

rofe · 2020-02-13T18:13:19Z

Overview

Currently, when a brand-new service is deployed, it is hooked up directly to one of our live PagerDuty alert policies and reports against our public production SLA in Statuspage. If monitoring then fails because something isn't quite right yet, or because the service is still under construction, the result is a PagerDuty incident and a hit to our SLA.

Details

This setup has two major issues. First, something going wrong in a still new, and most likely not yet relevant, service can send our on-call engineer on a wild goose chase. Second, we shoot ourselves in the foot by artificially lowering our Delivery (4 nines) or Publishing (3 nines) availability and thereby eating into our error budget. Afterwards, manual cleanup is required to undo the damage.
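To put the error-budget concern into numbers, here is a rough sketch. It assumes that "4 nines" and "3 nines" mean 99.99% and 99.9% availability, and it assumes a 30-day measurement window; neither the window nor the function name comes from this issue.

```python
# Sketch only: the 30-day window and the helper name are assumptions,
# not taken from the actual SLA definition.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day window


def error_budget_minutes(nines: int, window_minutes: int = WINDOW_MINUTES) -> float:
    """Return the downtime allowed by an availability target of `nines` nines."""
    # N nines means an allowed unavailability of 10^-N of the window.
    return window_minutes / 10 ** nines


delivery_budget = error_budget_minutes(4)    # about 4.32 minutes
publishing_budget = error_budget_minutes(3)  # about 43.2 minutes
```

Under these assumptions, a single false outage of a few minutes from a half-finished service could use up the entire Delivery budget for the window.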

Proposed Actions

I think we should add an intermediate safety step between development/testing of a service and hooking it up to our "armed" production monitoring setup. I propose the following workflow:

The default monitoring config has an incubator flag.
When a service is deployed for the first time, it is added to a dedicated Incubator page in Statuspage. This page can be public or require authentication.
In case of an error, New Relic only notifies the Statuspage component and the #helix-escalations Slack channel for visibility; it does not trigger any PagerDuty incidents yet.
Once confidence in the service is sufficient, the developer removes the incubator flag from the monitoring config.
This moves the service to the configured "armed" alert policy in New Relic and to its Statuspage component group. Any outages recorded during the incubator period are erased.
From then on, service failures rightfully trigger PagerDuty and affect our SLA.
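The workflow above boils down to one routing decision driven by the incubator flag. The following is a minimal sketch of that decision, not the implementation merged for this issue. All class, field, and target names are hypothetical, and whether Slack is still notified in armed mode is not stated here, so the sketch leaves it out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoringConfig:
    """Hypothetical monitoring config; field names are illustrative only."""
    service: str
    alert_policy: str       # the "armed" alert policy in New Relic
    component_group: str    # the production Statuspage component group
    incubator: bool = True  # the default config carries the incubator flag


@dataclass
class Routing:
    statuspage_target: str  # where the Statuspage component lives
    notify: List[str]       # channels informed when a check fails
    affects_sla: bool       # whether failures count against the public SLA


def resolve_routing(cfg: MonitoringConfig) -> Routing:
    """Decide where failures of a service are reported."""
    if cfg.incubator:
        # Incubator mode: visibility only, no PagerDuty incident, no SLA impact.
        return Routing(
            statuspage_target="Incubator",
            notify=["statuspage", "slack:#helix-escalations"],
            affects_sla=False,
        )
    # Armed mode: failures page the on-call engineer and affect the SLA.
    return Routing(
        statuspage_target=cfg.component_group,
        notify=["statuspage", f"pagerduty:{cfg.alert_policy}"],
        affects_sla=True,
    )


new_service = MonitoringConfig("my-service", "Delivery Policy", "Delivery")
incubating = resolve_routing(new_service)

new_service.incubator = False  # the developer removes the flag
armed = resolve_routing(new_service)
```

Making `incubator` default to `True` mirrors the proposal that a freshly deployed service starts out in the safe state and has to be armed deliberately.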


rofe · 2020-03-06T08:21:09Z

We have to slightly adapt the solution on the Statuspage side: because components and groups cannot be truly hidden, I propose creating a dedicated page for incubator components to separate them from their production cousins.

rofe · 2020-03-10T11:01:26Z

Note: using a dedicated incubator page in Statuspage is optional and only needed in case incubator components should remain hidden.

adobe-bot · 2020-03-11T16:31:55Z

🎉 This issue has been resolved in version 1.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

rofe self-assigned this Feb 14, 2020

rofe transferred this issue from adobe/helix-home Feb 21, 2020

rofe added the enhancement label Feb 21, 2020

rofe mentioned this issue Mar 6, 2020

feat(monitoring): add incubator mode #38

Merged

rofe closed this as completed in #38 Mar 11, 2020

adobe-bot added the released label Mar 11, 2020

rofe mentioned this issue Mar 11, 2020

Use incubator mode for monitoring by default adobe/franklin-service#128

Closed

rofe mentioned this issue Mar 19, 2020

helix-post-deploy orb to support incubator flag #41

Closed
