# GCP DevOps/SRE Cert Notes

## Being a good DevOps Engineer means understanding BOPT
<ins>BOPT</ins>: Business, Organization, Process/Techniques, Technology/Tools. 
* Business - External forces and revenue creation
* Organization - Internal forces and structure
* Process/Techniques - Human considerations
* Technology/Tools - Cloud services and the nuts and bolts of doing the job

---

## What is Site Reliability Engineering (SRE)/DevOps

<ins>DevOps</ins>: developed in 2008 by Andrew Shafer and Patrick Debois. They defined it as "a software engineering culture and practice, that aims at unifying software development and software operation."
* DevOps is generally visualized as an infinity loop 

<ins>Site Reliability Engineering (SRE)</ins>: "what happens when a software engineer is tasked with what used to be called operations" - Ben Treynor (2003), founder of Google's Site Reliability Team

### Key pillars of DevOps & SRE
* Idea 1
    * DevOps: Reduce organization silos
        * Bridge teams together
        * Increase communication
        * Shared company vision
    * SRE: Shared ownership, tooling, and techniques for developers and operations
* Idea 2
    * DevOps: Accept failure as normal
        * Try to anticipate failures, but understand that incidents are bound to occur
        * Failures help the team learn
    * SRE: No-fault postmortems and tracking of unmet SLOs
        * Don't fail the same way twice
        * Track incidents (SLIs)
        * Map SLIs to objectives (SLOs)
* Idea 3
    * DevOps: Implement gradual change
        * Small updates are better than large ones
        * Small updates are easier to review and roll back
    * SRE: Reduce the cost of failure
        * Use limited "canary" rollouts so that an issue affects the fewest users possible
        * Automate where possible
* Idea 4
    * DevOps: Leverage tooling & automation
        * Reduces manual tasks, which frees up time for other work
        * CI/CD pipelines are the heart of this
        * Fosters speed and consistency
    * SRE: Automate this year's job away
        * Automation is a force multiplier
        * Automation centralizes mistakes and makes it easier to respond to issues
* Idea 5
    * DevOps: Measure everything
        * Critical gauge of success or failure
        * CI/CD needs full monitoring
        * DevOps stresses synthetic, proactive monitoring (i.e. simulated user behavior)
    * SRE: Measure toil and reliability
        * Key to SLOs and SLAs
        * Aim to reduce repetitive manual labor (toil) in order to increase available engineering time
        * Ensure measurement over time in order to view trends

### Why "reliability"
* An unstable service likely indicates a variety of issues
* Reliability is the absence of errors, and it must be attended to at all times, not just when there is an issue

### 3 Goals of SRE
1. Define availability (SLO)
2. Determine level of availability (SLI)
3. Detail what happens when availability fails (SLA)

---

## Service Level Indicators (SLI)

* <ins>Service Level Indicator (SLI)</ins>: "A carefully defined quantitative measure of some aspect of the level of service that is provided". SLIs are metrics over time that define "reliability" - they are specific to a user journey such as request/response, data processing, or storage, that show how well a service is doing.
    * Example SLIs:
        * <ins>Request latency</ins>: how long it takes to return a response to a request
        * <ins>Failure rate</ins>: the fraction of all requests received that fail: (unsuccessful requests/all requests)
        * <ins>Batch Throughput</ins>: proportion of time that the data processing rate is greater than a threshold
    * Example SLIs for a Request/Response:
        * <ins>Availability</ins>: proportion of valid requests served successfully
        * <ins>Latency</ins>: proportion of valid requests served faster than a certain threshold
        * <ins>Quality</ins>: proportion of valid requests served maintaining quality
    * Example SLIs for Data Processing:
        * <ins>Freshness</ins>: proportion of valid data updated more recently than a threshold
        * <ins>Correctness</ins>: proportion of valid data producing correct output
        * <ins>Throughput</ins>: proportion of time where the data processing rate is faster than a threshold
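
The request/response SLIs above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Request` record, its field names, the "status below 500 counts as success" rule, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code returned to the client
    latency_ms: float    # time taken to serve the request
    valid: bool = True   # False for traffic excluded from the SLI (e.g. health checks)


def availability_sli(requests: list[Request]) -> float:
    """Percentage of valid requests served successfully (assumed: status < 500)."""
    valid = [r for r in requests if r.valid]
    if not valid:
        raise ValueError("no valid requests in the window")
    good = sum(1 for r in valid if r.status < 500)
    return 100 * good / len(valid)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of valid requests served faster than threshold_ms."""
    valid = [r for r in requests if r.valid]
    if not valid:
        raise ValueError("no valid requests in the window")
    good = sum(1 for r in valid if r.latency_ms < threshold_ms)
    return 100 * good / len(valid)


# Hypothetical measurement window: five valid requests and one excluded one.
window = [
    Request(200, 120.0),
    Request(200, 280.0),
    Request(500, 900.0),
    Request(200, 450.0),
    Request(404, 60.0),
    Request(200, 5.0, valid=False),
]

print(availability_sli(window))               # 80.0
print(latency_sli(window, threshold_ms=300))  # 60.0
```

Both functions filter to valid requests first, so excluded traffic never inflates or deflates the result.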

### Google's 4 Golden Signals (SLIs)
* Latency: The time it takes for your service to fulfill a request
* Errors: The rate at which your service fails
* Traffic: How much demand is directed at your service
* Saturation: A measure of how close to fully utilized the service's resources are

### Simple SLI Aggregation Equation

$SLI = \left( \frac{\text{good events}}{\text{valid events}} \right) \times 100$
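
As a worked example with hypothetical numbers: if a service receives 10,000 valid requests in a measurement window and serves 9,985 of them successfully, the equation gives:

$SLI = \left( \frac{9985}{10000} \right) \times 100 = 99.85\%$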

### SLI Best Practice
1. Limit the number of SLIs you choose
    * 3-5 per user journey; too many increase the burden on the operator and can lead to contradictory data
2. Reduce complexity
    * Define simple, discrete SLIs that can be calculated with little compute
3. Prioritize journeys
    * Identify user-centric events/journeys and prioritize those over developer-centric ones
4. Aggregate similar SLIs
    * Collect data over time and turn it into a rate, average, or percentile
5. Bucket to distinguish response classes
    * Not all requests are the same (some may be people, bots, or background apps)
6. Collect data at load balancer
    * It's the most efficient method of data collection and it's the closest to the user's experience
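
Practices 4 and 5 (aggregate into a percentile, bucket by response class) could look roughly like the sketch below. The bucket names, the sample latencies, and the choice of the nearest-rank percentile method are assumptions for illustration; monitoring tools may compute percentiles differently.

```python
import math
from collections import defaultdict


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def p95_by_bucket(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Group (client_class, latency_ms) samples and compute the p95 latency per class."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for client_class, latency_ms in samples:
        buckets[client_class].append(latency_ms)
    return {name: percentile(vals, 95) for name, vals in buckets.items()}


# Hypothetical latency samples tagged with the class of client that sent them.
samples = [
    ("human", 120.0),
    ("human", 180.0),
    ("human", 950.0),
    ("bot", 40.0),
    ("bot", 60.0),
]

print(p95_by_bucket(samples))
```

Keeping the buckets separate stops fast bot or background traffic from masking slow responses seen by people.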


---

## Service Level Objectives (SLO)
* <ins>Service Level Objectives</ins>: "SLOs specify a target level for the reliability of your service" - Steven Thurgood and David Ferguson, The Site Reliability Workbook. Your SLOs should always target less than 100% reliability: getting near 100% is very expensive and technically complex, and users generally don't need 100% to have an acceptable experience.
    * SLOs are measured by SLIs and can be a single target value or a range of values
    * SLOs are agreed-upon bounds regarding how often SLIs must be met. For example:
        * SLI = 95% of site page requests have latency < 300ms over the last 5 minutes
        * SLO = the 95th percentile homepage latency SLI will be met 99.9% of the time over the next year
    * SLOs need buy-in across the entire organization
    * Make your SLOs achievable and base them on past performance; if there is no historical data, collect some first
    * Reminder that measurement does not equal user satisfaction
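
To make the SLI/SLO relationship concrete, here is a small sketch that checks a series of measurement windows against an SLO target. It assumes each window has already been reduced to a pass/fail result (did the SLI meet its threshold in that window?); the function names and the sample counts are hypothetical.

```python
def slo_compliance(window_results: list[bool]) -> float:
    """Percentage of measurement windows in which the SLI met its threshold."""
    if not window_results:
        raise ValueError("no measurement windows")
    return 100 * sum(window_results) / len(window_results)


def meets_slo(window_results: list[bool], target_percent: float) -> bool:
    """True if the observed compliance is at or above the SLO target."""
    return slo_compliance(window_results) >= target_percent


# Hypothetical data: 10,000 five-minute windows, 8 of which missed the latency threshold.
results = [True] * 9992 + [False] * 8

print(slo_compliance(results))   # 99.92
print(meets_slo(results, 99.9))  # True
```

With a 99.9% target, 10,000 windows leave room for at most 10 misses, so 8 misses still meets the objective while 11 would not.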

### Aspirational SLOs
* Typically higher than achievable/standard SLOs
* Set a reasonable target and begin measuring 
* Compare user feedback with SLOs and ensure they align


---

## Service Level Agreement (SLA)

* <ins>Service Level Agreement (SLA)</ins>: