Overview
Currently, when a brand-new service is deployed, it is hooked up directly to one of our live PagerDuty alert policies and reports against our public production SLA in Statuspage. Monitoring can easily fail at this stage because something isn't quite right yet or the service is still under construction. The result is a PagerDuty incident and a hit to our SLA.
Details
This setup has two major issues. First, a failure in a still new and most likely not yet relevant service can send our on-call engineer on a wild goose chase. Second, we shoot ourselves in the foot by artificially lowering the measured availability of Delivery (4 nines) or Publishing (3 nines), thereby eating into our error budget. Manual cleanup is then required to repair the record.
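To put the error-budget concern into numbers, here is a rough sketch of the arithmetic. It assumes a 30-day measurement window, which is an assumption and not something stated in this issue, and the function names are made up for illustration.

```python
# Assumption: availability is measured over a 30-day window.
# The actual SLA window used for Delivery and Publishing may differ.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes


def error_budget_minutes(nines: int, window_minutes: int = WINDOW_MINUTES) -> float:
    """Allowed downtime in minutes for a target of `nines` nines of availability."""
    allowed_failure_ratio = 10.0 ** -nines  # 4 nines -> 0.0001 (99.99% target)
    return window_minutes * allowed_failure_ratio


def budget_consumed(outage_minutes: float, nines: int) -> float:
    """Fraction of the window's error budget used up by a single outage."""
    return outage_minutes / error_budget_minutes(nines)


delivery_budget = error_budget_minutes(4)    # about 4.32 minutes
publishing_budget = error_budget_minutes(3)  # about 43.2 minutes
```

Under that assumption, a single 10-minute false outage reported by a half-finished service would use more than twice the whole Delivery budget for the window.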
Proposed Actions
I think we should add an intermediate safety step between development/testing of a service and hooking it up to our "armed" production monitoring setup. I propose the following workflow:
The default monitoring config has an incubator flag.
When a service gets deployed for the first time, it is added to a dedicated Incubator page in Statuspage. This page can be public or require authentication.
In case of an error, New Relic only informs the Statuspage component and the #helix-escalations Slack channel for visibility, but won't trigger any PagerDuty incidents yet.
Once confidence in the service is sufficient, the developer removes the incubator flag from the monitoring config.
This moves the service to the configured "armed" alert policy in New Relic and to its production component group in Statuspage. Any outages recorded during the incubation period are erased.
From now on, service failures rightfully trigger PagerDuty and affect our SLA.
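The workflow above could be sketched roughly as follows. This only illustrates the routing decision and is not the actual monitoring tooling: the class names, field names, and the `resolve_routing` function are all hypothetical, and no New Relic, Statuspage, or PagerDuty API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MonitoringConfig:
    """Hypothetical shape of a service's monitoring config."""
    service: str
    alert_policy: str       # the "armed" New Relic alert policy
    component_group: str    # the production Statuspage component group
    incubator: bool = True  # the default config carries the incubator flag


@dataclass(frozen=True)
class Routing:
    """Where failures of a service are reported."""
    statuspage_page: str
    component_group: Optional[str]
    alert_policy: Optional[str]
    slack_channel: Optional[str]
    trigger_pagerduty: bool
    counts_toward_sla: bool


def resolve_routing(config: MonitoringConfig) -> Routing:
    if config.incubator:
        # Incubating: visible on the dedicated Incubator page and in Slack,
        # but no PagerDuty incident and no effect on the production SLA.
        return Routing(
            statuspage_page="incubator",
            component_group=None,
            alert_policy=None,
            slack_channel="#helix-escalations",
            trigger_pagerduty=False,
            counts_toward_sla=False,
        )
    # Armed: use the configured alert policy and component group.
    # The issue does not say whether Slack is notified here, so it is left unset.
    return Routing(
        statuspage_page="production",
        component_group=config.component_group,
        alert_policy=config.alert_policy,
        slack_channel=None,
        trigger_pagerduty=True,
        counts_toward_sla=True,
    )
```

Making `incubator=True` the default means a newly deployed service cannot page anyone by accident. Arming it is an explicit step taken by the developer.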
We have to adapt the solution slightly on the Statuspage side: because components and groups cannot be truly hidden, I propose creating a dedicated page for incubator components to separate them from their production cousins.