Monitoring: Incubator #35

Closed
rofe opened this issue Feb 13, 2020 · 3 comments · Fixed by #38
Labels
enhancement New feature or request released

Comments

@rofe
Collaborator

rofe commented Feb 13, 2020

Overview

Currently, when a brand-new service is deployed, it is hooked up directly to one of our live PagerDuty alert policies and reports against our public production SLA in Statuspage. If monitoring then fails because something isn't quite right yet, or because the service is still under construction, the result is a PagerDuty incident and a hit to our SLA.

Details

This setup has 2 major issues. First, something going wrong in a new and most likely still irrelevant service can send our on-call engineer on a wild goose chase. Second, we shoot ourselves in the foot by artificially lowering our Delivery (4 nines) or Publishing (3 nines) availability, thereby eating into our error budget. On top of that, the mess has to be cleaned up manually.
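
To put the error budget in numbers, here is a rough back-of-the-envelope sketch. It assumes that "4 nines" and "3 nines" mean 99.99% and 99.9% availability and that the budget is measured over a 30-day window. The window length and the function name are assumptions for illustration, not something taken from our SLA definition.

```python
def error_budget_minutes(nines: int, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target of N nines.

    Assumption: the budget is evaluated over a rolling window of
    `window_days` days (30 is chosen here purely for illustration).
    """
    unavailability = 10 ** -nines            # e.g. 4 nines -> 0.0001
    window_minutes = window_days * 24 * 60   # 30 days -> 43200 minutes
    return window_minutes * unavailability


delivery_budget = error_budget_minutes(4)    # roughly 4.3 minutes
publishing_budget = error_budget_minutes(3)  # roughly 43.2 minutes
```

Under these assumptions, a single false alarm of a few minutes from an unfinished service would already consume most of the Delivery budget for the window.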

Proposed Actions

I think we should add an intermediate safety step between development/testing of a service and hooking it up to our "armed" production monitoring setup. I propose the following workflow:

  1. The default monitoring config has an incubator flag.
  2. When a service gets deployed for the first time, it is added to a dedicated Incubator page in Statuspage. This page can be public or require authentication.
  3. In case of an error, New Relic only informs the Statuspage component and the #helix-escalations Slack channel for visibility, but won't trigger any PagerDuty incidents yet.
  4. Once confidence in the service is sufficient, the developer removes the incubator flag from the monitoring config.
  5. This moves the service to the configured "armed" alert policy in New Relic and Statuspage component group. Any outages that occurred during the incubation period will be erased.
  6. From now on, service failures rightfully trigger PagerDuty and affect our SLA.
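
The workflow above could be sketched roughly as follows. This is only an illustration in Python, not the actual implementation: `MonitoringConfig`, `notification_targets`, `graduate` and the target strings are hypothetical names, and whether Slack stays notified once a service is armed is left open here.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MonitoringConfig:
    """Hypothetical monitoring config for one service."""
    service: str
    alert_policy: str        # the "armed" New Relic alert policy
    component_group: str     # the production Statuspage component group
    incubator: bool = True   # step 1: the flag is set by default


def notification_targets(config: MonitoringConfig) -> List[str]:
    """Return who gets informed when the service's monitor fails."""
    if config.incubator:
        # Step 3: visibility only, no PagerDuty incident yet.
        return ["statuspage", "slack:#helix-escalations"]
    # Step 6: armed services page the on-call engineer and affect the SLA.
    return ["statuspage", "pagerduty"]


def graduate(config: MonitoringConfig) -> MonitoringConfig:
    """Steps 4 and 5: drop the incubator flag to arm the service."""
    return replace(config, incubator=False)


# Example with made-up names:
new_service = MonitoringConfig("example-service", "Example Policy", "Delivery")
armed_service = graduate(new_service)
```

Keeping the flag on by default means a developer has to take an explicit step before a service can page anyone, which is the safety net this proposal is after.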
@rofe rofe self-assigned this Feb 14, 2020
@rofe rofe transferred this issue from adobe/helix-home Feb 21, 2020
@rofe rofe added the enhancement New feature or request label Feb 21, 2020
@rofe
Collaborator Author

rofe commented Mar 6, 2020

We have to slightly adapt the solution on the Statuspage side: because components and groups cannot be truly hidden, I propose creating a dedicated page for incubator components to separate them from their production cousins.

@rofe
Collaborator Author

rofe commented Mar 10, 2020

Note: using a dedicated incubator page in Statuspage is optional and only needed in case incubator components should remain hidden.
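
As a minimal sketch of what "optional" could mean in practice: if no incubator page is configured, incubator components simply end up on the regular page. The function and parameter names below are hypothetical and chosen only for illustration.

```python
from typing import Optional


def statuspage_page_for(incubator: bool,
                        production_page_id: str,
                        incubator_page_id: Optional[str] = None) -> str:
    """Pick the Statuspage page a component should be created on.

    Assumption: a dedicated incubator page is only used when one has
    been configured; otherwise everything lives on the production page.
    """
    if incubator and incubator_page_id:
        return incubator_page_id
    return production_page_id
```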

@rofe rofe closed this as completed in #38 Mar 11, 2020
@adobe-bot

🎉 This issue has been resolved in version 1.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
