# 03 — Formal Blackout Detection & Labeling

This notebook takes the cleaned time × detector speed panel from
`01_load_and_clean.ipynb` and produces reproducible labels for:

- **Per-detector blackouts**: contiguous runs of missing readings for a single detector.
- **Network-level blackouts**: time periods when at least a threshold fraction of detectors (`THRESH_NET`, 10% here) is missing simultaneously.

We save two tables:

- `data/blackout_events_detectors.parquet`
- `data/blackout_events_network.parquet`

In [18]:
import pandas as pd
import numpy as np
from pathlib import Path

In [19]:
# -----------------------------------------
# Parameters for blackout definitions
# -----------------------------------------
STEP_MINUTES = 5    # data are on a 5-minute grid

# Per-detector blackout: run of NaNs of this length or longer
MIN_LEN_DET = 2     # 2 steps = 10 minutes

# Network-level blackout: fraction of detectors missing
THRESH_NET = 0.10   # ≥10% missing at the same time
MIN_LEN_NET = 2     # must persist for at least 2 steps

print("Per-detector blackout: ≥", MIN_LEN_DET * STEP_MINUTES, "minutes of NaN")
print("Network-level blackout: missing frac ≥", THRESH_NET,
      "for ≥", MIN_LEN_NET * STEP_MINUTES, "minutes")

Per-detector blackout: ≥ 10 minutes of NaN
Network-level blackout: missing frac ≥ 0.1 for ≥ 10 minutes


In [20]:
# -----------------------------------------
# Load cleaned time × detector panel
# -----------------------------------------
try:
    wide = pd.read_parquet("data/seattle_loop_clean.parquet")
    print("Loaded clean panel from parquet.")
except Exception as e:
    print("Parquet load failed, falling back to pickle. Error was:")
    print(" ", e)
    wide = pd.read_pickle("data/seattle_loop_clean.pkl")
    print("Loaded clean panel from pickle.")

print("Wide panel shape:", wide.shape)
print("Time span:", wide.index.min(), "→", wide.index.max())

Loaded clean panel from parquet.
Wide panel shape: (105120, 147)
Time span: 2015-01-01 00:00:00 → 2015-12-31 23:55:00
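
The step-to-minute conversions later in this notebook rely on the panel sitting on a regular 5-minute grid with no dropped or duplicated timestamps. A minimal sketch of a sanity check follows; `is_regular_grid` is a hypothetical helper that is not part of the original notebook, and it is demonstrated on a made-up index rather than on `wide`.

```python
import pandas as pd

def is_regular_grid(index, step_minutes=5):
    """Return True if `index` is exactly a gap-free grid with the given step."""
    # Build the grid we would expect between the first and last timestamp.
    expected = pd.date_range(index.min(), index.max(), freq=f"{step_minutes}min")
    return len(index) == len(expected) and index.equals(expected)

# Made-up index for illustration only.
toy_index = pd.date_range("2015-01-01 00:00", periods=4, freq="5min")
print(is_regular_grid(toy_index))            # complete grid
print(is_regular_grid(toy_index.delete(2)))  # one timestamp removed
```

On the real panel the call would be `is_regular_grid(wide.index, STEP_MINUTES)`, assuming `wide.index` is a `DatetimeIndex`.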


In [21]:
# Boolean missingness matrix (True = missing)
missing = wide.isna()   # Create a DataFrame where each cell is True if the original value is NaN
mask = missing.to_numpy()
timestamps = wide.index.to_numpy()
T, D = mask.shape
detectors = wide.columns.to_numpy()

print(f"T (time steps): {T},  D (detectors): {D}")
print("Overall missing fraction:", mask.mean())

T (time steps): 105120,  D (detectors): 147
Overall missing fraction: 0.051772771513476014


## 1. Helper: streak finder

We reuse the same `find_streaks` utility as in the EDA notebook:  
given a boolean array, it returns contiguous runs where the value is `True`.

In [22]:
def find_streaks(bool_array):
    """
    Find start and end indices of True streaks in a 1D boolean array.

    Returns
    -------
    list of (start_idx, end_idx) pairs, inclusive.
    """
    out, start = [], None   # Initialize output list and streak start index
    
    for i, v in enumerate(bool_array):
        if v and start is None:
            start = i   # Begin a new streak
        elif not v and start is not None:
            out.append((start, i - 1))   # End of a streak → save (start, end)
            start = None   # Reset for the next potential streak
    if start is not None:
        out.append((start, len(bool_array) - 1))   # Handle a streak that goes till the end
    return out   # Return list of (start, end) index pairs
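
The Python loop above visits every element, which can be slow when it is called once per detector on 105,120 time steps. One possible alternative, sketched here under the assumption that the input is a 1D boolean array, finds run boundaries with `np.diff`. `find_streaks_np` is a hypothetical name and is not used elsewhere in this notebook.

```python
import numpy as np

def find_streaks_np(bool_array):
    """Vectorized sketch: inclusive (start, end) index pairs of True runs."""
    arr = np.asarray(bool_array, dtype=bool)
    # Pad with False on both sides so runs touching either edge still
    # produce a rising and a falling transition.
    padded = np.concatenate(([False], arr, [False])).astype(np.int8)
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)      # False -> True transitions
    ends = np.flatnonzero(diff == -1) - 1   # True -> False transitions, made inclusive
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Made-up input for illustration.
example = np.array([False, True, True, False, True, False, True, True, True])
streaks_example = find_streaks_np(example)
print(streaks_example)  # [(1, 2), (4, 4), (6, 8)]
```

The output format matches `find_streaks`, so the two should be interchangeable in the cells below.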

## 2. Per-detector blackout events

Definition: for a given detector $d$, a **blackout** is a maximal contiguous run  
of missing values (NaN) of length ≥ `MIN_LEN_DET` time steps.

We now:

1. Scan each detector’s missingness vector.
2. Find all missing streaks using `find_streaks`.
3. Keep only those with length ≥ `MIN_LEN_DET`.
4. Store results as a table with `(detector, start, end, len_steps, len_minutes)`.
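
To make the four steps concrete, here is a small worked example on a made-up two-detector panel. It is only a sketch: the detector names and values are invented, and it labels runs with a pandas `shift`/`cumsum` idiom instead of `find_streaks` so that it runs on its own.

```python
import numpy as np
import pandas as pd

STEP_MINUTES = 5   # same values as the parameter cell
MIN_LEN_DET = 2

# Made-up panel: 6 time steps x 2 hypothetical detectors.
idx = pd.date_range("2015-01-01 00:00", periods=6, freq="5min")
toy = pd.DataFrame(
    {
        "det_a": [1.0, np.nan, np.nan, np.nan, 2.0, 3.0],
        "det_b": [np.nan, 1.0, 2.0, np.nan, np.nan, 3.0],
    },
    index=idx,
)

rows = []
for det in toy.columns:
    is_missing = toy[det].isna()
    # A new run starts wherever the missing flag differs from the previous step.
    run_id = (is_missing != is_missing.shift()).cumsum()
    for _, run in is_missing.groupby(run_id):
        # Keep only runs of missing values that are long enough.
        if run.iloc[0] and len(run) >= MIN_LEN_DET:
            rows.append({
                "detector": det,
                "start": run.index[0],
                "end": run.index[-1],
                "len_steps": len(run),
                "len_minutes": len(run) * STEP_MINUTES,
            })

toy_events = pd.DataFrame(rows)
print(toy_events)
```

`det_a` yields one 3-step blackout (00:05 to 00:15). For `det_b`, the single missing reading at 00:00 is dropped because it is shorter than `MIN_LEN_DET`, and the 2-step run from 00:15 to 00:20 is kept.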

In [23]:
rows_det = []    # List to collect blackout metadata for each detector


# Loop through each detector (column)
for j, det in enumerate(detectors):
    col_mask = mask[:, j]   # True when reading is missing for detector j
    streaks = find_streaks(col_mask)    # Identify (start, end) indices of missing-value streaks

    for s, e in streaks:
        L_steps = e - s + 1    # Length of blackout in time steps
        if L_steps < MIN_LEN_DET:
            continue   # Skip streaks shorter than blackout threshold (e.g., <2 steps)

        # Record valid blackout information for this detector
        rows_det.append({
            "detector": det,
            "start": timestamps[s],   # Start time of blackout
            "end": timestamps[e],     # End time of blackout
            "len_steps": L_steps,     # Length in time steps
            "len_minutes": L_steps * STEP_MINUTES,   # Convert to minutes using 5-min step size
        })

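# Convert collected blackout records into a DataFrame.
# Passing the column names explicitly keeps sort_values from raising a
# KeyError if no blackout was found and rows_det is empty.
blackouts_det = (
    pd.DataFrame(
        rows_det,
        columns=["detector", "start", "end", "len_steps", "len_minutes"],
    )
      .sort_values(["detector", "start"])
      .reset_index(drop=True)
)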

print("Per-detector blackouts:")
print("  total events:", len(blackouts_det))  # Total blackout streaks detected
print("  detectors with ≥1 blackout:", blackouts_det["detector"].nunique())   # How many unique detectors had at least one blackout

blackouts_det.head()   # Preview the first few blackout events

Per-detector blackouts:
  total events: 942
  detectors with ≥1 blackout: 146


|   | detector | start | end | len_steps | len_minutes |
|---|---|---|---|---|---|
| 0 | 005es15036 | 2015-01-16 00:10:00 | 2015-01-17 11:40:00 | 427 | 2135 |
| 1 | 005es15036 | 2015-09-28 20:55:00 | 2015-09-28 21:55:00 | 13 | 65 |
| 2 | 005es15036 | 2015-11-17 19:20:00 | 2015-11-17 20:45:00 | 18 | 90 |
| 3 | 005es15125 | 2015-01-16 00:10:00 | 2015-01-17 11:40:00 | 427 | 2135 |
| 4 | 005es15125 | 2015-09-28 20:55:00 | 2015-09-28 21:55:00 | 13 | 65 |


## 3. Network-level blackouts

Definition: a **network-level blackout** is a time interval where the fraction of  
detectors missing is at least `THRESH_NET` and this condition holds for at least  
`MIN_LEN_NET` consecutive time steps.

This is intended to capture events such as cabinet-level or system-wide outages, where many sensors fail together.
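
As an illustration of this definition, the sketch below applies both conditions to a made-up 6 × 10 missingness mask. The mask values and the `toy_*` names are invented for the example. It is self-contained, so it detects runs with a padded `np.diff` instead of calling `find_streaks`.

```python
import numpy as np

THRESH_NET = 0.10   # same values as the parameter cell
MIN_LEN_NET = 2

# Made-up mask: 6 time steps x 10 detectors, True = missing.
toy_mask = np.zeros((6, 10), dtype=bool)
toy_mask[1, :2] = True   # 20% missing at t=1
toy_mask[2, :3] = True   # 30% missing at t=2
toy_mask[4, :2] = True   # 20% missing at t=4, but only for one step

toy_frac = toy_mask.mean(axis=1)    # fraction missing per time step
toy_flag = toy_frac >= THRESH_NET   # condition 1: enough detectors missing

# Locate runs of flagged steps via rising and falling edges.
padded = np.concatenate(([False], toy_flag, [False])).astype(np.int8)
edges = np.diff(padded)
starts = np.flatnonzero(edges == 1)
ends = np.flatnonzero(edges == -1) - 1   # inclusive end indices

toy_events = [
    {
        "start_idx": int(s),
        "end_idx": int(e),
        "len_steps": int(e - s + 1),
        "missing_frac_max": float(toy_frac[s:e + 1].max()),
        "missing_frac_mean": float(toy_frac[s:e + 1].mean()),
    }
    for s, e in zip(starts, ends)
    if e - s + 1 >= MIN_LEN_NET          # condition 2: persists long enough
]
print(toy_events)
```

Only the run covering steps 1 to 2 qualifies. The isolated spike at step 4 meets the fraction threshold but not the minimum duration, so it is not labeled.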

In [24]:
# Fraction of detectors missing at each time
missing_frac_time = mask.mean(axis=1)   # shape (T,)

# Identify time steps where the fraction of detectors missing meets or exceeds the threshold
global_mask = missing_frac_time >= THRESH_NET   # Boolean mask for global blackout condition
streaks_net = find_streaks(global_mask)         # Find contiguous intervals of global blackouts

rows_net = []   # List to store metadata for each network-wide blackout event


# Loop through each blackout streak
for s, e in streaks_net:
    L_steps = e - s + 1   # Duration in steps
    if L_steps < MIN_LEN_NET:
        continue  # Skip short intervals below minimum threshold

    frac_slice = missing_frac_time[s:e+1]    # Subset of missing fraction during the blackout


    # Record metadata for each valid blackout interval
    rows_net.append({
        "start": timestamps[s],        # Start time of blackout
        "end": timestamps[e],          # End time of blackout
        "len_steps": L_steps,          # Length in 5-minute steps
        "len_minutes": L_steps * STEP_MINUTES,   # Length in minutes
        "missing_frac_start": float(missing_frac_time[s]),    # Missing fraction at start
        "missing_frac_max": float(frac_slice.max()),          # Peak missing fraction during blackout
        "missing_frac_mean": float(frac_slice.mean()),        # Average missing fraction during blackout
    })


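# Create a DataFrame from the collected network-level blackout records.
# Passing the column names explicitly keeps sort_values from raising a
# KeyError if no interval qualified and rows_net is empty.
blackouts_net = (
    pd.DataFrame(
        rows_net,
        columns=["start", "end", "len_steps", "len_minutes",
                 "missing_frac_start", "missing_frac_max", "missing_frac_mean"],
    )
      .sort_values("start")
      .reset_index(drop=True)
)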

print("Network-level blackouts:")
print("  total intervals:", len(blackouts_net))   # Total multi-detector blackout events
if len(blackouts_net):
    print("  median duration (minutes):",
          blackouts_net["len_minutes"].median())  # Median length of global blackouts

blackouts_net.head()   # Display first few rows of the global blackout log

Network-level blackouts:
  total intervals: 25
  median duration (minutes): 180.0


|   | start | end | len_steps | len_minutes | missing_frac_start | missing_frac_max | missing_frac_mean |
|---|---|---|---|---|---|---|---|
| 0 | 2015-01-16 00:10:00 | 2015-01-17 11:40:00 | 427 | 2135 | 0.115646 | 0.115646 | 0.111106 |
| 1 | 2015-01-17 22:00:00 | 2015-01-18 00:25:00 | 30 | 150 | 0.102041 | 0.102041 | 0.102041 |
| 2 | 2015-01-18 01:55:00 | 2015-01-18 04:55:00 | 37 | 185 | 0.115646 | 0.142857 | 0.134216 |
| 3 | 2015-01-18 22:00:00 | 2015-01-19 00:55:00 | 36 | 180 | 0.102041 | 0.102041 | 0.102041 |
| 4 | 2015-02-07 20:50:00 | 2015-02-08 00:50:00 | 49 | 245 | 0.122449 | 0.142857 | 0.13661 |


In [25]:
# -----------------------------------------
# 4. Save labels to disk
# -----------------------------------------
Path("data").mkdir(exist_ok=True)

det_path = Path("data") / "blackout_events_detectors.parquet"
net_path = Path("data") / "blackout_events_network.parquet"

blackouts_det.to_parquet(det_path, index=False)
blackouts_net.to_parquet(net_path, index=False)

print("Saved blackout event tables to:")
print("  -", det_path.resolve())
print("  -", net_path.resolve())


Saved blackout event tables to:
  - C:\Users\asust\Desktop\Modeling Information Blackouts\data\blackout_events_detectors.parquet
  - C:\Users\asust\Desktop\Modeling Information Blackouts\data\blackout_events_network.parquet


## 5. Summary

We have now turned the qualitative idea of “blackouts” into precise labels:

- **Per-detector blackouts**: all contiguous NaN runs ≥ 10 minutes, with start/end
  timestamps and durations stored in `blackout_events_detectors.parquet`.

- **Network-level blackouts**: all intervals where at least 10% of detectors are
  missing for ≥ 10 minutes, with summary statistics on the missing fraction stored in
  `blackout_events_network.parquet`.

These tables will serve as ground truth for later steps, where we (i) filter or
down-weight chronically unreliable detectors, and (ii) build models that jointly
predict traffic and the occurrence of blackout events.
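
As a hedged sketch of how step (i) might start, the per-detector table can be aggregated into total downtime per detector. `summarize_downtime` is a hypothetical helper, and the events below are made up. On the real data the input would be `blackouts_det`, or the table read back from `blackout_events_detectors.parquet`.

```python
import pandas as pd

def summarize_downtime(blackouts_det, total_minutes):
    """Per-detector event count, total blackout minutes, and share of the window."""
    summary = (
        blackouts_det.groupby("detector")["len_minutes"]
        .agg(n_events="count", total_minutes="sum")
        .sort_values("total_minutes", ascending=False)
    )
    summary["downtime_frac"] = summary["total_minutes"] / total_minutes
    return summary

# Made-up events for two hypothetical detectors (not rows from the real table).
toy_events = pd.DataFrame({
    "detector": ["det_a", "det_a", "det_b"],
    "len_minutes": [2135, 65, 90],
})

# The 2015 panel has 105120 five-minute steps, i.e. 525600 minutes.
toy_summary = summarize_downtime(toy_events, total_minutes=105120 * 5)
print(toy_summary)
```

Because the table only contains runs of at least `MIN_LEN_DET` steps, `downtime_frac` excludes isolated single-step gaps and will be slightly lower than a detector's raw missing fraction.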