Awesome SRE

You want your computer systems to run well, but what "well" means is subjective: it depends on the nature of the system and your goals for it.

Most of the time, the primary motivation for companies is to create profit for the owners and shareholders.

The definition of running well will therefore be a derivative of the business model objectives.

"Hope is not a strategy."

1. Site Reliability Engineering

2. SRE Culture

3. DevOps

4. Monitoring and Observability

My Awesome Observability Repo ;-)

5. Alerting

My Awesome Observability Repo ;-)

6. Incident Response and Post-Mortem

A collection of post-mortems
A collection of postmortem templates
Our incident postmortem template - Hosted Graphite postmortem template.
Postmortem exercise
Squadcast - Experience the journey from On-Call to SRE.
PagerDuty - Your platform for digital operations management.
VictorOps - VictorOps is now Splunk On-Call.
Splunk On-Call - Helps developers, DevOps, and operations teams make on-call suck less while reducing mean time to acknowledge and restore outages.
OpsGenie - On-call and alert management to keep services always on.
AlertOps - Transform real-time operational intelligence into automated incident response.
Blameless - The Blameless SRE Platform empowers engineering and DevOps teams through incidents, retrospectives, and detecting interesting patterns, given the right data.
OnPage - Incident alert management system with a secure smartphone app, enabling response teams to get the most out of their digital technology investments.
PagerTree - Intelligent alert routing for the modern team.
Cabot - Get alerted when services go down or metrics go crazy.
xMatters - Automate operations workflows, ensure applications are always working, and deliver remarkable products at scale with the xMatters service reliability platform.
Derdack Enterprise Alert - Enterprise Alert Notification Software.
Bigpanda - AIOps Event Correlation and Automation platform enables Tech Ops teams to keep the digital economy running.
OpenDuty - An incident escalation tool, similar to PagerDuty (no longer maintained).
ngDesk - ngDesk includes support, sales, asset management, marketing and pager in an all-in-one application that is ready to go and easy to use.
Geneos - Real-time monitoring for all your environments in one platform.
FireHydrant - Gives teams the tools to maintain service catalogs, respond to incidents, communicate through status pages, and learn with retrospectives.
Rootly - The fastest way to declare an incident.

7. On-Call

8. Chaos Engineering

My Awesome Chaos Repo ;-)

9. Automation

10. Performance

11. Tools

SLO Generator - Tool to compute and export Service Level Objectives (SLOs), Error Budgets and Burn Rates, using configurations written in YAML (or JSON) format.
SLO Computer - SLOs, Error windows and alerts are complicated. Here's an attempt to make it easy.
SLO Tracker - A simple but effective way to track SLOs and error budgets. SLO Tracker can be integrated with a few alerting tools via webhook to receive SLO-violating incidents.
SLO exporter - Computes standardized Service Level Indicator (SLI) and Service Level Objectives (SLO) metrics based on events coming from various data sources.
Pyrra - Making SLOs with Prometheus manageable, accessible, and easy to use for everyone.
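
The tools above all revolve around the same arithmetic: an SLO target implies an error budget, and the burn rate measures how quickly that budget is being spent. The sketch below is a minimal illustration of that relationship, not the implementation used by any of the listed tools; the function names `error_budget` and `burn_rate` and the event-count inputs are assumptions made for the example.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail under the SLO (e.g. 0.999 -> about 0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(good_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being consumed exactly as fast as the
    SLO allows; values above 1.0 mean it will run out before the window ends.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    error_rate = 1.0 - good_events / total_events
    return error_rate / error_budget(slo_target)


if __name__ == "__main__":
    # Hypothetical numbers: 9,980 good requests out of 10,000 against a 99.9% SLO.
    print(f"error budget: {error_budget(0.999):.4f}")
    print(f"burn rate:    {burn_rate(9980, 10000, 0.999):.2f}")
```

With these hypothetical numbers the service fails 0.2% of requests against a 0.1% budget, so the burn rate comes out at roughly 2: the budget is being spent about twice as fast as the SLO permits.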

Contents

1. Site Reliability Engineering

2. SRE Culture

3. DevOps

4. Monitoring and Observability

5. Alerting

6. Incident Response and Post-Mortem

7. On-Call

8. Chaos Engineering

9. Automation

10. Performance

11. Tools

12. Books

13. References

14. License

15. Contributing

adriannovegil/awesome-sre
