# 04 — Build $x_t$ and $m_t$ Matrices

This notebook converts the cleaned Seattle loop panel into:

- Observation matrix $X \in \mathbb{R}^{T \times D}$ with rows $x_t$ (speeds on a 5-minute grid),
- Missingness indicator matrix $M \in \{0,1\}^{T \times D}$ with entries $m_{t,d}$,
- Per-time observed index sets $O_t = \{d : m_{t,d} = 0\}$.
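
As a quick illustration of how the three objects relate, here is a minimal sketch on a made-up $3 \times 4$ panel. The names `X_toy`, `M_toy`, and `O_toy` and all values are hypothetical and are not part of the Seattle data.

```python
import numpy as np

# Toy panel: T = 3 time steps, D = 4 detectors; NaN marks a missing reading.
X_toy = np.array(
    [
        [61.0, np.nan, 58.5, 60.2],
        [np.nan, np.nan, 57.9, 59.8],
        [62.3, 55.1, 58.0, 60.0],
    ],
    dtype=np.float32,
)

# m_{t,d} = 1 where the reading is missing, 0 where it is observed.
M_toy = np.isnan(X_toy).astype(np.int8)

# O_t = indices of the detectors observed at each time step.
O_toy = [np.flatnonzero(row == 0) for row in M_toy]

print(M_toy)
print([o.tolist() for o in O_toy])
```

The mask and the index sets carry the same information; the mask is convenient for vectorized statistics, while the index sets are convenient for slicing one time step at a time.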

In [29]:
import numpy as np
import pandas as pd
from pathlib import Path
import pickle

In [30]:
# -----------------------------------------
# 1. Load cleaned time × detector panel
# -----------------------------------------

try:
    wide = pd.read_parquet("data/seattle_loop_clean.parquet")
    print("Loaded clean panel from parquet.")
except Exception as e:
    print("Parquet load failed, falling back to pickle. Error was:")
    print(" ", e)
    wide = pd.read_pickle("data/seattle_loop_clean.pkl")
    print("Loaded clean panel from pickle.")

print("Wide panel shape:", wide.shape)
print("Time span:", wide.index.min(), "→", wide.index.max())

T, D = wide.shape
timestamps = wide.index.to_numpy()
detector_ids = wide.columns.to_numpy()

print(f"T (time steps): {T},  D (detectors): {D}")

Loaded clean panel from parquet.
Wide panel shape: (105120, 147)
Time span: 2015-01-01 00:00:00 → 2015-12-31 23:55:00
T (time steps): 105120,  D (detectors): 147


## 2. Build observation matrix $x_t$

We keep the detector speeds on the 5-minute grid as a $T \times D$ matrix:

- $x_{t,d}$ is the speed (in mph) for detector $d$ at time step $t$.
- Missing readings remain as `NaN`; the EKF code uses the separate mask
  $m_t$ (Section 3) to decide which entries to include.

In [31]:
# -----------------------------------------
# 2. Observation matrix x_t
# -----------------------------------------

# T × D float32 array with NaNs where data are missing
X_nan = wide.to_numpy(dtype=np.float32)

print("X_nan shape:", X_nan.shape)
print("Fraction NaN in X_nan:", np.isnan(X_nan).mean())

# Make sure data directory exists
Path("data").mkdir(exist_ok=True)

np.save("data/x_t_nan.npy", X_nan)

# Also save detector ID ordering so Allan knows what column index d means
np.save("data/detector_ids.npy", detector_ids)

print("Saved:")
print("  data/x_t_nan.npy")
print("  data/detector_ids.npy")


X_nan shape: (105120, 147)
Fraction NaN in X_nan: 0.051772771513476014
Saved:
  data/x_t_nan.npy
  data/detector_ids.npy


## 3. Build missingness matrix $m_{t,d}$

By definition,

- $m_{t,d} = 1$ if detector $d$ is **missing** at time $t$,
- $m_{t,d} = 0$ if detector $d$ is **observed** at time $t$.

So $m_t$ has the same shape as $x_t$ and can be used both to:

- drive the EKF measurement update (which detectors to include at each step),
- define per-detector and network-level blackout events.
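
The notebook does not define blackout events itself, so the following is only a sketch of one way the mask could be used for that purpose. The helper names `missing_runs` and `network_blackout_mask`, and the `min_fraction` threshold, are assumptions made for illustration.

```python
import numpy as np


def missing_runs(m_col):
    """Return (start, length) for each run of consecutive 1s in a 0/1 vector."""
    m_col = np.asarray(m_col, dtype=np.int8)
    # Pad with zeros so runs touching either end are still delimited.
    padded = np.concatenate(([0], m_col, [0]))
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)   # first index of each run
    ends = np.flatnonzero(diff == -1)    # one past the last index of each run
    return list(zip(starts.tolist(), (ends - starts).tolist()))


def network_blackout_mask(M, min_fraction=0.5):
    """Flag time steps where at least `min_fraction` of detectors are missing.

    The threshold is a hypothetical choice, not a value from the notebook.
    """
    frac_missing = np.asarray(M).mean(axis=1)
    return frac_missing >= min_fraction


# Per-detector example: one column of the mask.
runs = missing_runs([1, 1, 0, 0, 1, 1, 1, 0, 1])
print(runs)

# Network-level example: three time steps, four detectors.
M_demo = np.array([[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]], dtype=np.int8)
blackout = network_blackout_mask(M_demo, min_fraction=0.5)
print(blackout)
```

With a 5-minute grid, a run of length $k$ corresponds to $5k$ minutes of missing data for that detector, so a minimum run length can be applied to the `(start, length)` pairs to keep only sustained outages.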

In [32]:
# -----------------------------------------
# 3. Missingness matrix m_t
# -----------------------------------------

# Boolean matrix: True where missing
missing_bool = np.isnan(X_nan)

# Binary matrix: 1 = missing, 0 = observed
M = missing_bool.astype(np.int8)

print("M shape:", M.shape)
print("Fraction missing (M == 1):", M.mean())

np.save("data/m_t.npy", M)
print("Saved: data/m_t.npy")

M shape: (105120, 147)
Fraction missing (M == 1): 0.051772771513476014
Saved: data/m_t.npy


## 4. Observed detector sets $O_t$

For each time step $t$, we define

$$
O_t = \{ d \in \{0,\dots,D-1\} : m_{t,d} = 0 \},
$$

i.e., the set of detector indices that are *observed* at time $t$.

Storing $O_t$ as a Python list of NumPy arrays makes it straightforward for
the EKF code to iterate over time, pull out the observed detectors, and build
the corresponding measurement sub-matrices on the fly.
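
To make the "measurement sub-matrices" concrete, here is a minimal sketch of how one time step could be reduced to its observed detectors. It assumes a full observation matrix `C_full` of shape $(D, n)$ and a full noise covariance `R_full` of shape $(D, D)$; these names, the function `reduce_measurement`, and the toy values are hypothetical and may not match the actual EKF code.

```python
import numpy as np


def reduce_measurement(x_row, obs_idx, C_full, R_full):
    """Restrict one time step to its observed detectors.

    x_row   : length-D vector of speeds (NaN where missing)
    obs_idx : integer indices of observed detectors, i.e. O_t
    C_full  : (D, n) observation matrix (assumed form)
    R_full  : (D, D) observation noise covariance (assumed form)
    """
    y = x_row[obs_idx]                     # observed speeds only
    C = C_full[obs_idx, :]                 # rows for observed detectors
    R = R_full[np.ix_(obs_idx, obs_idx)]   # matching rows and columns
    return y, C, R


# Toy setup: D = 4 detectors, n = 2 state dimensions.
D, n_state = 4, 2
C_full = np.arange(D * n_state, dtype=np.float64).reshape(D, n_state)
R_full = np.diag([1.0, 2.0, 3.0, 4.0])
x_row = np.array([61.0, np.nan, 58.5, 60.2])
obs_idx = np.flatnonzero(~np.isnan(x_row))

y, C, R = reduce_measurement(x_row, obs_idx, C_full, R_full)
print(y.shape, C.shape, R.shape)
```

If `obs_idx` is empty (every detector missing at that step), the reduced arrays have zero rows, and the filter would typically skip the speed update for that step.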

In [33]:
# -----------------------------------------
# 4. Observed index sets O_t
# -----------------------------------------

O_t = []

for t in range(T):
    # Detectors that are observed at time t (m_{t,d} == 0)
    observed_indices = np.where(M[t] == 0)[0].astype(np.int32)
    O_t.append(observed_indices)

# Quick sanity check on a few time steps
for t_check in [0, 1000, 5000]:
    if t_check < T:
        print(f"t={t_check}: |O_t| = {len(O_t[t_check])}")

with open("data/O_t.pkl", "wb") as f:
    pickle.dump(O_t, f, protocol=pickle.HIGHEST_PROTOCOL)

print("Saved: data/O_t.pkl")

t=0: |O_t| = 136
t=1000: |O_t| = 137
t=5000: |O_t| = 135
Saved: data/O_t.pkl


## 5. Summary

This notebook exports the core inputs required by the state-space / EKF model:

- `x_t_nan.npy`: $T \times D$ matrix of speeds with `NaN` for missing entries.
- `m_t.npy`: $T \times D$ binary mask with $m_{t,d} = 1$ when detector $d$ is missing.
- `O_t.pkl`: list of observed index sets $O_t$, one NumPy array per time step.
- `detector_ids.npy`: mapping from column index $d$ to the external detector ID.

## For Allan

To use these files in the EKF code:

1. Read `x_t_nan.npy` for the raw measurements.
2. Use `m_t.npy` and/or `O_t.pkl` to select the observed detectors at each time step.
3. Align internal state indices with real detector IDs via `detector_ids.npy`.
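
The steps above can be sketched as a small loader that also checks that the four files agree with each other. The function name `load_ekf_inputs` is hypothetical, and the demo writes made-up arrays to a temporary directory so that it runs without the real data. Passing `allow_pickle=True` for the detector IDs is an assumption: it is only needed if the IDs were saved as an object array (for example, Python strings), and pickled files should only be loaded from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_ekf_inputs(data_dir="data"):
    """Load X, M, O_t and detector IDs, and check they are mutually consistent."""
    data_dir = Path(data_dir)
    X = np.load(data_dir / "x_t_nan.npy")
    M = np.load(data_dir / "m_t.npy")
    # Assumption: IDs may have been stored as an object array.
    ids = np.load(data_dir / "detector_ids.npy", allow_pickle=True)
    with open(data_dir / "O_t.pkl", "rb") as f:
        O_t = pickle.load(f)

    assert X.shape == M.shape, "x_t and m_t must have the same shape"
    assert ids.shape[0] == X.shape[1], "one detector ID per column"
    assert len(O_t) == X.shape[0], "one index set per time step"
    assert np.array_equal(M.astype(bool), np.isnan(X)), "mask must match NaNs"
    # Spot-check the first and last index sets against the mask.
    for t in (0, len(O_t) - 1):
        assert np.array_equal(O_t[t], np.flatnonzero(M[t] == 0))
    return X, M, O_t, ids


# Demo on toy files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    X_toy = np.array([[61.0, np.nan], [np.nan, 57.9]], dtype=np.float32)
    M_toy = np.isnan(X_toy).astype(np.int8)
    O_toy = [np.flatnonzero(row == 0).astype(np.int32) for row in M_toy]
    np.save(tmp / "x_t_nan.npy", X_toy)
    np.save(tmp / "m_t.npy", M_toy)
    np.save(tmp / "detector_ids.npy", np.array(["d1", "d2"]))
    with open(tmp / "O_t.pkl", "wb") as f:
        pickle.dump(O_toy, f)

    X, M, O_t, ids = load_ekf_inputs(tmp)

print(X.shape, [o.tolist() for o in O_t], ids.tolist())
```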

- **`x_t`**: Used in **Step 3 (Speed observation block)** as the observed values $y^{(x)}_t = x^{\text{obs}}_t$, and for constructing $C^{(x)}_t$ and $R^{(x)}_t$, based on which detectors are present at time $t$.

- **`m_t`**: Used in **Step 4 (Missingness block)** as the binary observation $y^{(m)}_t = m_t \in \{0,1\}^D$, forming the pseudo-observation for modeling missingness via a logistic function.

- **`O_t`**: Used in **Step 3 (Speed observation block)** to define the set of observed detectors $O_t = \{ d : m_{t,d} = 0 \}$, which determines how $x_t$, $C^{(x)}_t$, and $R^{(x)}_t$ are masked and reduced.
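
For the missingness block, the notebook only states that $m_t$ is modeled through a logistic function. As a rough sketch of what that could look like, the snippet below evaluates a Bernoulli log-likelihood with $P(m_{t,d} = 1) = \sigma(\eta_{t,d})$. The linear predictor `eta_t` and the function name `missingness_loglik` are hypothetical; how $\eta_{t,d}$ depends on the latent state is not specified here and is left to the EKF code.

```python
import numpy as np


def missingness_loglik(m_t, eta_t):
    """Bernoulli log-likelihood of m_t in {0,1}^D with P(m = 1) = sigmoid(eta).

    eta_t is an assumed per-detector linear predictor (hypothetical).
    """
    m_t = np.asarray(m_t, dtype=np.float64)
    eta_t = np.asarray(eta_t, dtype=np.float64)
    # Numerically stable log-probabilities:
    #   log sigmoid(eta)       = -log(1 + exp(-eta))
    #   log (1 - sigmoid(eta)) = -log(1 + exp(eta))
    log_p1 = -np.logaddexp(0.0, -eta_t)
    log_p0 = -np.logaddexp(0.0, eta_t)
    return float(np.sum(m_t * log_p1 + (1.0 - m_t) * log_p0))


# With eta = 0 every detector is missing with probability 0.5,
# so the log-likelihood is D * log(0.5) regardless of m_t.
ll_flat = missingness_loglik([0, 1, 0], [0.0, 0.0, 0.0])
print(ll_flat)
```

Using `np.logaddexp` avoids overflow for large $|\eta|$, which matters when a detector is almost surely observed or almost surely missing.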
