# 01 — Load and Prepare the Dataset

In this notebook we turn the raw Seattle loop CSV dump into a clean,
analysis-ready **time × detector** speed panel.

The steps are:

- **Load the raw CSV** (`loop_20150101_20151231.csv`), assign column names, and
  parse the `time` field.
- **Pivot to a wide panel** where:
  - Rows = 5-minute timestamps over 2015.
  - Columns = detector IDs.
  - Entries = speed readings (mph).
- **Enforce a strict 5-minute grid**:
  - Build a complete `date_range` from the first to last timestamp.
  - Reindex the panel so that every 5-minute slot exists; missing rows become `NaN`.
- **Clean unreliable raw values**:
  - Treat `speed == 0` as missing and convert it to `NaN`.
  - Detect “frozen” sensors that report the *exact* same speed for ≥60 minutes
    (12 or more consecutive 5-minute readings) and mark those flat segments as
    missing as well.

Finally, we save the cleaned panel to disk in two formats:

- `data/seattle_loop_clean.parquet`
- `data/seattle_loop_clean.pkl`

These files are the **canonical input** for all later notebooks (missingness EDA,
blackout detection, and model training).

In [8]:
import pandas as pd
import numpy as np
from pathlib import Path

In [9]:
# -------------------------------
# 1. Load raw CSV
# -------------------------------

# Read the CSV file containing loop detector data.
# The file has no header row, so we provide column names explicitly.
df = pd.read_csv(
    "loop_20150101_20151231.csv",
    header=None,
    names=["time", "detector", "direction", "speed", "volume", "occupancy"]
)

# Convert 'time' to datetime and sort chronologically
df["time"] = pd.to_datetime(df["time"])
df = df.sort_values("time")

print("Raw dataframe shape:", df.shape)
print("Time span:", df["time"].min(), "→", df["time"].max())

Raw dataframe shape: (22211424, 6)
Time span: 2015-01-01 00:00:00 → 2015-12-31 23:55:00
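Before trusting the time span printed above, it can help to confirm that every timestamp actually parsed. The following is a minimal sketch on a three-row toy CSV, not the notebook's own loader: the helper name `load_loop_csv`, the detector IDs, and the timestamp format in the toy rows are all assumptions made for illustration.

```python
import io

import pandas as pd

COLUMNS = ["time", "detector", "direction", "speed", "volume", "occupancy"]


def load_loop_csv(source):
    """Hypothetical helper: load a header-less loop CSV and report bad timestamps."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    # errors="coerce" turns unparseable timestamps into NaT instead of raising
    df["time"] = pd.to_datetime(df["time"], errors="coerce")
    n_bad = int(df["time"].isna().sum())
    # Drop the unparseable rows, then sort chronologically
    df = df.dropna(subset=["time"]).sort_values("time").reset_index(drop=True)
    return df, n_bad


# Toy input (made-up values): two valid rows out of order, one broken timestamp
toy_csv = io.StringIO(
    "2015-01-01 00:05:00,d1,N,55.0,10,3.2\n"
    "2015-01-01 00:00:00,d1,N,60.0,12,2.9\n"
    "not-a-time,d2,S,58.0,9,3.0\n"
)
toy_df, toy_bad = load_loop_csv(toy_csv)
print("Rows kept:", len(toy_df), "| unparseable timestamps:", toy_bad)
```

A non-zero count on the real file would suggest inspecting those rows before pivoting, since they would otherwise disappear silently.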


### Pivot Data to Wide Format (Time × Detector) and Clean

In [10]:
# ----------------------------------
# 2. Pivot to time × detector panel
# ----------------------------------

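# Reshape the DataFrame so that each row is a timestamp
# and each column is a detector's speed reading.
# pivot_table aggregates duplicate (time, detector) pairs; its default
# aggregation is the mean, which we make explicit here.
wide = df.pivot_table(
    index="time",
    columns="detector",
    values="speed",
    aggfunc="mean"
).sort_index()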
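Because `pivot_table` averages any repeated (time, detector) pairs, it is worth knowing whether such repeats exist. The raw frame has 22,211,424 rows while the final panel has only 105,120 × 147 cells, which suggests that they do. The toy example below, with made-up detector IDs and speeds, sketches how duplicates could be counted and what the pivot does with them.

```python
import pandas as pd

# Toy long-format data: detector "d1" reports twice at 00:00
toy = pd.DataFrame({
    "time": pd.to_datetime(
        ["2015-01-01 00:00"] * 3 + ["2015-01-01 00:05"] * 2
    ),
    "detector": ["d1", "d1", "d2", "d1", "d2"],
    "speed": [60.0, 50.0, 58.0, 57.0, 59.0],
})

# Count rows that repeat an earlier (time, detector) pair
dup_count = int(toy.duplicated(subset=["time", "detector"]).sum())

# The pivot collapses each duplicate group to its mean
toy_wide = toy.pivot_table(
    index="time", columns="detector", values="speed", aggfunc="mean"
)

print("Duplicate (time, detector) rows:", dup_count)
print(toy_wide)
```

Here the two `d1` readings at 00:00 (60 and 50) end up as a single cell with value 55. Whether averaging is the right choice for the real data depends on what the repeated rows represent (for example, the `direction` column), which this notebook does not examine.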

In [11]:
# ---------------------------------
# 3. Enforce strict 5-minute grid 
# ---------------------------------

# Build a full 5-minute index from first to last timestamp
full_index = pd.date_range(
    start=wide.index.min(),
    end=wide.index.max(),
    freq="5min"
)

# Reindex the panel so every 5-minute slot exists;
# any missing timestamps become rows full of NaN
wide = wide.reindex(full_index)
wide.index.name = "time"  # keep index name clean

print("Wide panel shape AFTER reindex:", wide.shape)
print("Expected number of 5-min steps:", (full_index[-1] - full_index[0]) / pd.Timedelta("5min") + 1)


Wide panel shape AFTER reindex: (105120, 147)
Expected number of 5-min steps: 105120.0
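One caveat with `reindex`: any timestamp that does not fall exactly on the 5-minute grid is dropped without warning. The sketch below shows one way to look for such stray timestamps before reindexing. The helper name `off_grid_timestamps` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def off_grid_timestamps(index, freq="5min"):
    """Hypothetical helper: return timestamps that are not multiples of `freq`."""
    index = pd.DatetimeIndex(index)
    return index[index != index.floor(freq)]


# Toy panel with one reading at 00:12, which is off the 5-minute grid
toy_index = pd.to_datetime([
    "2015-01-01 00:00", "2015-01-01 00:05",
    "2015-01-01 00:12", "2015-01-01 00:20",
])
toy_panel = pd.DataFrame({"d1": [60.0, 58.0, 57.0, 55.0]}, index=toy_index)

stray = off_grid_timestamps(toy_panel.index)

# Reindexing keeps only on-grid rows and fills the missing slots with NaN
grid = pd.date_range(toy_index.min(), toy_index.max(), freq="5min")
toy_regular = toy_panel.reindex(grid)

print("Off-grid timestamps:", list(stray))
print(toy_regular)
```

In the toy case the 00:12 reading is lost and the 00:10 and 00:15 slots become `NaN`. In this notebook the reindexed row count matches the expected 105,120 steps, but that match alone does not rule out off-grid rows in the raw data.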


In [12]:
# -------------------------------
# 4. Clean speed values (zeros → NaN)
# -------------------------------
# Ensure we are working in float so NaNs are handled cleanly
wide = wide.astype(float)

# Replace speed = 0 (sensor failure / no data) with NaN
wide = wide.replace(0, np.nan)
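It can be informative to record how many zeros each detector reports before they are masked, since that separates "sensor reported zero" from "row was absent". A small sketch on made-up values follows; it uses `DataFrame.mask` as an equivalent alternative to `replace(0, np.nan)` for a float panel.

```python
import pandas as pd

# Toy float panel: zeros stand for failed readings
toy = pd.DataFrame({
    "d1": [60.0, 0.0, 58.0],
    "d2": [0.0, 0.0, 55.0],
})

# Share of zero readings per detector, computed before masking
zero_share = (toy == 0).mean()

# mask() sets every cell where the condition holds to NaN
toy_clean = toy.mask(toy == 0)

print(zero_share)
print(toy_clean)
```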

In [13]:
# -------------------------------------------------
# 5. Identify and mask frozen readings (≥ 1 hour)
# -------------------------------------------------
values = wide.to_numpy().astype(float)
T, D = values.shape

# Initialize a boolean mask of the same shape to mark frozen values
frozen = np.zeros_like(values, dtype=bool)
MAX_RUN = 12  # Minimum run length that counts as frozen (despite the name):
              # 12 or more identical values (12 × 5min = 60 minutes of unchanged readings)


# Loop through each detector column
for d in range(D):
    col = values[:, d]  # Extract the column of speed values
    run_val = col[0]    # Start by tracking the first value
    run_start = 0       # Index where the current run begins

    for t in range(1, T):
        # Check if current value is the same as the previous one
        # Also treats two NaNs in a row as "equal"
        same = (col[t] == run_val) or (np.isnan(col[t]) and np.isnan(run_val))
        if same:
            continue  # Still in the same constant run — keep going

        # value changed -> close current run
        run_len = t - run_start
        if run_len >= MAX_RUN and not np.isnan(run_val):
            # If run was too long AND value wasn't already NaN → mark it as frozen
            frozen[run_start:t, d] = True

        # Start tracking the new value
        run_val = col[t]
        run_start = t

    # Handle case where a run continues to the last time step
    run_len = T - run_start
    if run_len >= MAX_RUN and not np.isnan(run_val):
        frozen[run_start:T, d] = True

# After building the frozen mask, replace those entries with NaN
# This treats long flatlines as missing data (likely sensor malfunction)
values[frozen] = np.nan

# Convert back to DataFrame with the same index/columns
wide = pd.DataFrame(values, index=wide.index, columns=wide.columns)

# Print final fraction of missing data (after frozen value masking)
print("Fraction missing overall (after freezing):",
      np.mean(np.isnan(wide.to_numpy())))

Fraction missing overall (after freezing): 0.051772771513476014
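The nested Python loop above visits every cell of the panel one at a time. A run-length formulation in NumPy can produce the same mask for one column without an inner loop. This is a sketch of an alternative, not the code that generated the result above; `frozen_mask_1d` and `min_run` are hypothetical names, with `min_run` playing the role of `MAX_RUN`.

```python
import numpy as np


def frozen_mask_1d(col, min_run=12):
    """Flag samples that belong to a run of >= min_run identical non-NaN values."""
    col = np.asarray(col, dtype=float)
    n = col.size
    if n == 0:
        return np.zeros(0, dtype=bool)

    # A new run starts wherever a value differs from its predecessor.
    # NaN != NaN is True, so each NaN becomes its own length-1 run.
    change = np.empty(n, dtype=bool)
    change[0] = True
    change[1:] = col[1:] != col[:-1]

    starts = np.flatnonzero(change)            # first index of each run
    lengths = np.diff(np.append(starts, n))    # length of each run

    # A run is frozen if it is long enough and its value is not NaN
    is_frozen = (lengths >= min_run) & ~np.isnan(col[starts])

    # Expand the per-run flag back to one flag per sample
    return np.repeat(is_frozen, lengths)


# Toy column: 12 identical readings, two changing ones, a NaN gap, then 11 identical
toy_col = np.array([50.0] * 12 + [51.0, 52.0] + [np.nan] * 13 + [60.0] * 11)
toy_mask = frozen_mask_1d(toy_col)
print("Frozen samples:", int(toy_mask.sum()))
```

In the toy column only the first 12 samples are flagged: the NaN gap is never marked, and the final run of 11 is one reading short of the threshold. Applying the function to each column of `values` would still need a loop over the 147 detectors, but not over the 105,120 time steps.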


- Rows: 105,120, exactly 365 days × 24 hours × 12 five-minute slots per hour.
- Columns: 147 detectors.
- Fraction missing overall: 0.0518, so about 5.2% of all (time, detector) cells are `NaN`.
- With 105,120 × 147 = 15,452,640 cells in total, 5.2% corresponds to roughly 800,000 missing data points.
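The figures in this list can be reproduced with a few lines of arithmetic. The snippet only restates numbers already printed in this notebook; the variable names are chosen for illustration.

```python
# Panel dimensions taken from the printed shape (105120, 147)
n_rows = 365 * 24 * 12          # 5-minute slots in a non-leap year
n_cols = 147                    # detectors
n_cells = n_rows * n_cols       # total (time, detector) cells

# Missing fraction as printed after frozen-run masking
frac_missing = 0.051772771513476014
n_missing = round(n_cells * frac_missing)

print("Rows:", n_rows)
print("Cells:", n_cells)
print("Approx. missing cells:", n_missing)
```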

In [14]:
# -------------------------------
# 6. Save clean panel            
# -------------------------------

# Make sure a data/ folder exists 
Path("data").mkdir(exist_ok=True)

out_parquet = Path("data") / "seattle_loop_clean.parquet"
out_pickle  = Path("data") / "seattle_loop_clean.pkl"

# Overwrite any existing file
wide.to_parquet(out_parquet, engine="pyarrow")
wide.to_pickle(out_pickle)

print("Saved clean panel to:")
print("  -", out_parquet.resolve())
print("  -", out_pickle.resolve())

Saved clean panel to:
  - C:\Users\asust\Desktop\Modeling Information Blackouts\data\seattle_loop_clean.parquet
  - C:\Users\asust\Desktop\Modeling Information Blackouts\data\seattle_loop_clean.pkl
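Later notebooks need to read one of these files back. A loader along the following lines could also verify that the 5-minute grid survived the round trip. This is a sketch under assumptions: `load_clean_panel` is a hypothetical helper, and the demonstration uses a tiny made-up panel written to a temporary pickle rather than the real output files.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def load_clean_panel(path):
    """Hypothetical helper: load a saved panel and check its 5-minute grid."""
    path = Path(path)
    if path.suffix == ".parquet":
        panel = pd.read_parquet(path)
    elif path.suffix == ".pkl":
        panel = pd.read_pickle(path)
    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")

    if not isinstance(panel.index, pd.DatetimeIndex):
        raise TypeError("Panel index is not a DatetimeIndex")

    # Every consecutive pair of timestamps must be exactly 5 minutes apart
    steps = panel.index.to_series().diff().dropna()
    if not (steps == pd.Timedelta("5min")).all():
        raise ValueError("Panel index is not a regular 5-minute grid")
    return panel


# Round-trip a toy panel through a temporary pickle file
toy_index = pd.date_range("2015-01-01", periods=4, freq="5min")
toy_panel = pd.DataFrame({"d1": [60.0, np.nan, 58.0, 57.0]}, index=toy_index)

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_panel.pkl"
    toy_panel.to_pickle(toy_path)
    reloaded = load_clean_panel(toy_path)

print("Round trip preserved the panel:", reloaded.equals(toy_panel))
```

Pickle files should only be loaded from trusted sources, which is one reason to prefer the Parquet copy when sharing the data.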


### Summary

This notebook converts the raw 2015 Seattle loop CSV file into a clean,
regular time series panel suitable for modeling.

- The final panel has **105,120 rows** (one full year of 5-minute intervals)
  and **147 detectors**.
- We standardize timestamps, pivot to a **time × detector** matrix, and enforce
  a strict 5-minute grid.
- We treat `0` speeds and long frozen runs (≥60 minutes of identical readings)
  as missing and convert them to `NaN`.
- The resulting dataset has a small but meaningful amount of missingness
  (≈5% of all entries), with clear outages that we will analyze in the next
  notebook.

Outputs:

- `data/seattle_loop_clean.parquet`
- `data/seattle_loop_clean.pkl`

These files are the starting point for **02 — Exploratory Data Analysis on Missingness**.