# Multi-Cloud Public Outage Collector

This notebook collects **public outage data** from:
- AWS (JSON status history, with a placeholder record when that feed is unavailable)
- Azure (HTML status history page)
- GCP (JSON)

In [14]:
!uv pip install requests python-dotenv beautifulsoup4 lxml openai


The value specified in an AutoRun registry key could not be parsed.
Using Python 3.12.12 environment at: C:\Users\vishalkc2\Documents\Sample_Projects\ed_donner\llm_engineering\.venv
Audited 5 packages in 25ms


In [15]:
from datetime import datetime, timezone
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display
from openai import OpenAI


In [16]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


In [17]:
def now_iso():
    return datetime.now(timezone.utc).isoformat()


In [18]:
import json
import re
from bs4 import BeautifulSoup

In [19]:
import requests

def fetch_aws_incidents():
    url = "https://status.aws.amazon.com/history.json"

    try:
        resp = requests.get(
            url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=20
        )
        resp.raise_for_status()
        data = resp.json()
    except (requests.RequestException, ValueError):
        # The request failed or the body was not JSON:
        # AWS no longer exposes public history JSON
        return [{
            "cloud": "aws",
            "service": "all",
            "region": "global",
            "title": "AWS does not expose public historical outage data via JSON APIs",
            "status": "unsupported",
            "start_time": None,
            "end_time": None,
            "details": (
                "AWS Health historical events require either "
                "1) AWS Health API with Business/Enterprise support, or "
                "2) JavaScript execution in a browser context."
            ),
            "source_url": "https://health.aws.amazon.com/health/status"
        }]

    incidents = []

    for date, events in data.items():
        for e in events:
            incidents.append({
                "cloud": "aws",
                "service": e.get("service"),
                "region": e.get("region") or "global",
                "title": e.get("summary"),
                "status": e.get("status"),
                "start_time": date,
                "end_time": None,
                "details": e.get("description"),
                "source_url": "https://health.aws.amazon.com/health/status"
            })

    return incidents


In [None]:
def fetch_azure_incidents():
    url = "https://azure.status.microsoft/en-us/status/history/"
    resp = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=20
    )
    resp.raise_for_status()  # fail loudly instead of parsing an error page
    html = resp.text

    soup = BeautifulSoup(html, "lxml")
    results = []

    # Each month block
    for month_header in soup.select("div.month-title-container h2"):
        month_text = month_header.get_text(strip=True)

        wrapper = (
            month_header
            .find_parent("div")
            .find_next_sibling("div", class_="month-incident-container-wrapper")
        )
        if not wrapper:
            continue

        # Each incident row
        for row in wrapper.select("div.row"):
            day_el = row.select_one(".incident-history-day")
            title_el = row.select_one(".incident-history-title")
            tracking_el = row.select_one(".incident-history-tracking-id")
            body_el = row.select_one(".incident-history-collapse .card-body")

            if not all([day_el, title_el, tracking_el, body_el]):
                continue

            # Normalize date
            day = day_el.get_text(strip=True)
            try:
                date_obj = datetime.strptime(
                    f"{month_text} {day}", "%B %Y %d"
                )
                iso_date = date_obj.date().isoformat()
            except ValueError:
                iso_date = None

            title = title_el.get_text(strip=True)
            tracking_id = (
                tracking_el.get_text(strip=True)
                .replace("Tracking ID:", "")
                .strip()
            )

            raw_html = str(body_el)
            clean_text = body_el.get_text("\n", strip=True)

            # Extract PIR sections
            sections = {}
            current = None

            def norm(s):
                return re.sub(r"\s+", " ", s.lower())

            for el in body_el.find_all(["strong", "p", "li"]):
                text = el.get_text(strip=True)
                key = norm(text)

                if "what happened" in key:
                    current = "what_happened"
                    sections[current] = []
                elif "what went wrong" in key:
                    current = "what_went_wrong"
                    sections[current] = []
                elif "how did we respond" in key:
                    current = "how_did_we_respond"
                    sections[current] = []
                elif "how are we making" in key:
                    current = "mitigation"
                    sections[current] = []
                elif "how can customers" in key:
                    current = "customer_guidance"
                    sections[current] = []
                elif current:
                    sections[current].append(text)

            sections = {k: "\n".join(v) for k, v in sections.items()}

            results.append({
                "cloud": "azure",
                "month": month_text,
                "date": iso_date,
                "tracking_id": tracking_id,
                "title": title,
                "text": clean_text,
                "sections": sections,
                "raw_html": raw_html,
                "source_url": url
            })

    return results


In [22]:
def fetch_gcp_incidents():
    url = "https://status.cloud.google.com/incidents.json"
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()  # surface HTTP errors before decoding JSON
    data = resp.json()

    incidents = []
    for inc in data:
        incidents.append({
            "cloud": "gcp",
            "service": ", ".join(inc.get("services", [])),
            "region": ", ".join(inc.get("currently_affected_locations", [])) or "global",
            "title": inc.get("external_desc"),
            "status": inc.get("status"),
            "start_time": inc.get("begin"),
            "end_time": inc.get("end"),
            "url": inc.get("uri")
        })

    return incidents
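
The three fetchers above return records with different keys: AWS and GCP use `start_time`/`end_time`, Azure uses `date` and `text`, and GCP stores its link under `url` rather than `source_url`. A minimal sketch of mapping them onto one shape is shown below. The helper name `normalize_incident` and the choice of target keys are assumptions for illustration, not part of the original notebook.

```python
def normalize_incident(record):
    """Map a provider-specific record onto one shared set of keys.

    Hypothetical helper: missing fields simply become None.
    """
    return {
        "cloud": record.get("cloud"),
        "service": record.get("service"),
        "region": record.get("region"),
        "title": record.get("title"),
        "status": record.get("status"),
        # Azure rows carry a single "date" instead of start/end times.
        "start_time": record.get("start_time") or record.get("date"),
        "end_time": record.get("end_time"),
        # AWS uses "details", Azure uses "text"; GCP has neither.
        "details": record.get("details") or record.get("text"),
        # GCP stores its link under "url" rather than "source_url".
        "source_url": record.get("source_url") or record.get("url"),
    }


# Illustrative records shaped like the fetcher outputs (values are made up).
azure_row = {"cloud": "azure", "date": "2026-01-10", "title": "Power event",
             "text": "PIR body", "source_url": "https://example.invalid/azure"}
gcp_row = {"cloud": "gcp", "title": "Elevated errors",
           "start_time": "2025-12-01T00:00:00+00:00", "url": "incidents/abc"}

normalized = [normalize_incident(r) for r in (azure_row, gcp_row)]
```

With a shared shape, downstream code (or the model) does not need to know which provider a record came from to find its date or link.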


In [None]:
def get_cloud_outages_json():
    incidents = []
    incidents.extend(fetch_aws_incidents())
    incidents.extend(fetch_azure_incidents())
    incidents.extend(fetch_gcp_incidents())

    payload = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "source": "public-status-pages",
        "incident_count": len(incidents),
        "incidents": incidents
    }

    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
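
`get_cloud_outages_json` serializes every field, including the Azure `raw_html`, which repeats what `text` already holds and inflates the prompt sent to the model. The sketch below shows one way to slim the payload first. The function name `compact_payload` and the 2000-character limit are assumptions chosen for illustration.

```python
import json


def compact_payload(outage_json, max_text_chars=2000):
    """Drop raw_html and cap long text fields (hypothetical helper)."""
    payload = json.loads(outage_json)
    slim = []
    for inc in payload.get("incidents", []):
        # Copy every field except the raw HTML duplicate.
        inc = {k: v for k, v in inc.items() if k != "raw_html"}
        text = inc.get("text")
        if isinstance(text, str) and len(text) > max_text_chars:
            inc["text"] = text[:max_text_chars] + " [truncated]"
        slim.append(inc)
    payload["incidents"] = slim
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False)


# Illustrative input shaped like the notebook's payload (values are made up).
sample_json = json.dumps({
    "incident_count": 1,
    "incidents": [{"cloud": "azure", "text": "x" * 50, "raw_html": "<p>x</p>"}],
})
compact = json.loads(compact_payload(sample_json, max_text_chars=10))
```

Truncating `text` loses detail from long post-incident reviews, so the limit is a trade-off between prompt size and summary quality.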


In [25]:
outage_json = get_cloud_outages_json()
outage_json


'{"generated_at":"2026-01-23T11:59:29.060209+00:00","source":"public-status-pages","incident_count":21,"incidents":[{"cloud":"aws","service":"all","region":"global","title":"AWS does not expose public historical outage data via JSON APIs","status":"unsupported","start_time":null,"end_time":null,"details":"AWS Health historical events require either 1) AWS Health API with Business/Enterprise support, or 2) JavaScript execution in a browser context.","source_url":"https://health.aws.amazon.com/health/status"},{"cloud":"azure","month":"January 2026","date":"2026-01-10","tracking_id":"XM22-5_G","title":"Preliminary Post Incident Review (PIR) – Power event impacting multiple services in AZ01 – West US 2","text":"This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within\xa014 days)\xa0we\xa0will publish a Final PIR with\xa0additional\xa0details.\\nDuring this incident, we temporarily used our public Azure Status page because it\

In [None]:
system_prompt = """
You are a snarky assistant that analyzes the outages of 3 different cloud providers and provides suggestions on which cloud provider I should choose based on those outages.
Provide a short, snarky, humorous summary, ignoring text that might be navigation related.
Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.
"""

In [41]:
user_prompt_prefix = """
Here are the outages from the last 90 days for AWS, Azure and GCP. If a cloud service provider does not give enough outage detail, take that into account as well.
In some cases the cloud provider (e.g. AWS) does publish details, but the data is loaded dynamically from the server and so cannot be scraped from the web page. It can still be read directly in a browser by clicking through the individual events; no payment is needed.
Provide a short summary of the outages and your recommendation on which service provider to choose and why.
If the data includes major and minor outages and outage durations, summarize these too.

"""

In [35]:
def messages_for(outage_data):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_prefix + outage_data}
    ]
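
The user prompt describes the data as covering 90 days, but none of the fetchers filters by date, so the model may receive older incidents too. A minimal sketch of a 90-day filter follows. The names `parse_when` and `within_last_days` are hypothetical, and the sketch assumes dates arrive as ISO 8601 strings (date-only for Azure, full timestamps for GCP). Records without a parsable date, such as the AWS placeholder, are kept on purpose.

```python
from datetime import datetime, timedelta, timezone


def parse_when(value):
    """Parse an ISO 8601 date or timestamp; return None if that fails."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Treat naive values (e.g. date-only strings) as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def within_last_days(incidents, days=90, now=None):
    """Keep incidents that started within `days`, plus undated ones."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    kept = []
    for inc in incidents:
        when = parse_when(inc.get("start_time") or inc.get("date"))
        if when is None or when >= cutoff:
            kept.append(inc)
    return kept


# Illustrative records (values are made up).
records = [
    {"cloud": "aws", "start_time": None},
    {"cloud": "azure", "date": "2026-01-10"},
    {"cloud": "gcp", "start_time": "2025-05-20T10:00:00+00:00"},
]
fixed_now = datetime(2026, 1, 23, tzinfo=timezone.utc)
recent = within_last_days(records, days=90, now=fixed_now)
```

Passing `now` explicitly keeps the filter reproducible; in the notebook it could default to the current time.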

In [36]:
def summarize(outage_data):
    openai = OpenAI()
    response = openai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages_for(outage_data)
    )
    return response.choices[0].message.content

In [30]:
summarize(outage_json)

'# Cloud Outage Snark-Off: Who’s the least outage-y?\n\n## AWS\n- **Outage transparency?** Nope. AWS treats its historical outage data like the crown jewels – locked behind Business/Enterprise support paywalls or sneaky JavaScript browser gymnastics.\n- **Incident count:** 21 (but details? Nada from the JSON feed.)\n- **Summary:** AWS outages might be happening, but you’ll have to be in the VIP club to know the juicy details. Otherwise, it\'s all smoke and mirrors.\n\n## Azure\n- **Incident count:** Multiple detailed incidents over the last 90 days.\n- **Major Outages:**\n  - **Jan 2026 West US 2 power event:** ~7.5 hours outage affecting a slew of services due to an emergency power-off safety system tripping. Compute/storage back by ~2 hours, but networking residual issues lasted till ~7.5 hours later.\n  - **Dec 2025 PIM API failures:** ~10.5 hours of elevated errors and timeouts tied to database connection exhaustion after a config change.\n  - **Dec 2025 ARM key rotation hiccups (t

In [37]:
def display_summary(outage_data):
    summary = summarize(outage_data)
    display(Markdown(summary))

In [42]:
display_summary(outage_json)

# Cloud Outage Snark & Recommendations

## AWS  
- AWS doesn't share its historical outage data via API or static means. You gotta play detective in your browser with JavaScript ninja skills and have a Business/Enterprise support plan.  
- So basically, AWS is like that friend who "forgot" to answer your texts about last night's party (outages). The data is there, but you'll have to go spelunking through their UI.  
- Incident count is 21, but no concrete downtime or affected services details publicly given. Sneaky!

---

## Azure  
Azure put on a mini soap opera of incidents from November 2025 to January 2026. Here's the gist:  

1. **Power Outage (Jan 10-11, 2026, West US 2 AZ01)**  
   - Power-off safety system freaked out, killing power to racks.  
   - Services down: VMs, Storage, Databricks, Synapse, SQL DB, Redis, Cosmos DB & more.  
   - Downtime: ~7.5 hours (17:50 UTC 10 Jan to 01:23 UTC 11 Jan).  
   - Residual VM creation/updates affected until manual fix at 01:23.  
   - Response time: Fast detection, but recovery took time because of network layer (SLB) complexity.  

2. **Entra Privileged Identity Management API Failures (Dec 22, 2025)**  
   - Deployment overload = CPU spike + exhausted DB connections.  
   - Duration roughly 10+ hours (08:05 to 18:30 UTC).  
   - Rolled-back config, scaled out resources, restarted DB.  

3. **ARM (Azure Resource Manager) Service Failures (Dec 8, 2025) & China regions**  
   - Automated key rotation gone wild, causing auth failures globally in ARM management plane.  
   - Outage lasted ~3 hours (11:04-14:13 EST in US Gov regions; ~9.5 hours in China region).  
   - Service management via portal, CLI, APIs was down or flaky.  

4. **Thermal Event in West Europe (Nov 5-6, 2025)**  
   - Voltage sag shut down cooling units, racks overheated and storage scaled down forcibly.  
   - Recovery took ~9.5 hours (16:53 Nov 5 to 02:25 Nov 6 UTC).  
   - Extensive data consistency checks slowed restoration.  

5. **Azure Front Door (AFD) Global Outage due to config bug (Oct 29-30, 2025)**  
   - Config metadata incompatibility caused data plane crashes and DNS resolution errors.  
   - Impact: 8+ hours outage (15:41 Oct 29 to 00:05 Oct 30 UTC).  
   - Recovery via manual rollout of fixed config; some portal services still affected longer.  

**Azure takeaway:** They provide detailed post-mortems, full timelines, root causes, mitigations, and customer guidance. Confidence points for transparency and thoroughness - they own their messes and plan fixes.

---

## GCP  
- Outages are listed mostly as titles with start and end time, no juicy root causes or affected services specifics.  
- Examples:  
  - 9+ major incidents over 90 days, lasting from couple hours to over a day; e.g., compute engine issues (May 20, 2025, ~8.5hrs), elevated error rates in various regions.   
  - Titles mention API failures, latency spikes, service unavailability but zero detailed explanations.  
- A cloud outage page is a mystery novel with no plot, just a list of "We noticed this, it lasted that long."  
- No public post incident reviews, no root cause, no customer guidance. It's like "Trust us, stuff happened."  

---

# Final Snarky Summary & Recommendation

| Provider | Transparency | Outage Severity / Duration | Incident Detail | Cool Factor | Recommend? |
|----------|--------------|----------------------------|-----------------|-------------|------------|
| **AWS**  | Wall of silence + browser hacking | Unknown (21 incidents, no details) | Nada in API or JSON, only in-browser JS detective work | Very secretive, Houdini-level data tricks | Nah, unless you're a JS spelunker with $$$ |
| **Azure**| Full drama series, in-depth PIRs with timelines, causes, mitigations | Multiple multi-hour incidents, up to ~9.5 hrs outage on some incidents | Detailed, candid, and actionable, plus customer guidance | Transparency and accountability champs | Yes, if you want to know what went wrong and how it's fixed; slightly higher downtime overall but they *own* it |
| **GCP**  | Minimal, vague incident titles with no root causes or fixes documented | Many (~9) multi-hour outages, no detail | Evasive, vague, no customer advice or postmortem | The "we had issues" whisperer | Meh, unless you like to live in mystery |

---

# **Recommendation:**  

If you want a cloud that openly admits when it borked and tells you what exactly to do next, go **Azure**. Their outages are transparent, root problems are explained, and they provide timeline and remediation plans. Sure, they have several multi-hour outages, but at least you can make informed decisions and plan around them.

**AWS** is like the secretive sibling who knows they messed up but makes you earn the info with complicated JavaScript dance routines — not fun unless you have deep pockets and lots of patience.

**GCP** feels like the elusive fog of cloud outages — many events, but no clues and no answers, just "stuff happened." If you like mystery and suspense, that's your pick.

---

# TL;DR:  
Azure spills the tea, AWS hides it in JS code, GCP whispers "yeah, something broke."

Choose **Azure** if you want to avoid blind spots in your cloud outage knowledge.  
Choose **AWS** if you like a challenge and don’t mind paying for it.  
Choose **GCP** if you enjoy guessing games about outage causes.  

---

# Final note:  
Clouds will sometimes rain, but at least Azure shows you the clouds. The others? They keep their skies suspiciously clear in public view. ☁️🌩️🕵️‍♂️