# 05 — Feature Engineering for Missingness Model

This notebook constructs feature vectors for the missingness process $p(m_t \mid z_t)$. We focus on covariates that are plausibly related to sensor dropouts:

- **Time-based features** (hour of day, weekend, night vs. day, network-wide outage).
- **Detector-level features** (average speed, speed variability, chronic missing rate).

We package everything into a single design tensor:

$$
\Phi[t, d, :] \in \mathbb{R}^K
$$

Inside the EKF pipeline, this tensor can be plugged directly into a
logistic model (or any other parametric dropout model) for
$P(m_{t,d} = 1 \mid \phi_{t,d})$, where $\phi_{t,d} = \Phi[t, d, :]$.
The resulting arrays are saved under `data/` for use by the model code
via `data_interface.py`.
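
As a rough illustration of how a tensor of this shape could feed a logistic dropout model, the sketch below evaluates $\sigma(\phi_{t,d}^\top \beta)$ on a small random stand-in for $\Phi$. The coefficient vector `beta`, the helper `dropout_prob`, and the toy sizes are assumptions made for illustration only; they are not part of the pipeline or of `data_interface.py`.

```python
import numpy as np

def dropout_prob(phi, beta):
    """Logistic dropout probability sigma(phi @ beta) for every (t, d) pair."""
    logits = phi @ beta                      # (T, D, K) @ (K,) -> (T, D)
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
T_toy, D_toy, K_toy = 4, 3, 9                # toy sizes, not the real panel
phi_toy = rng.normal(size=(T_toy, D_toy, K_toy)).astype(np.float32)
phi_toy[:, :, 0] = 1.0                       # channel 0 plays the role of an intercept

beta = np.zeros(K_toy, dtype=np.float32)     # hypothetical coefficients
beta[0] = -3.0                               # intercept only: baseline log-odds of dropout

p_missing = dropout_prob(phi_toy, beta)
print(p_missing.shape)                       # one probability per (t, d) pair
```

With only the intercept non-zero, every entry equals $\sigma(-3)$, which is a convenient sanity check before fitting real coefficients.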

In [10]:
import numpy as np
import pandas as pd
from pathlib import Path

In [11]:
# ------------------------------------------------
# 1. Load cleaned time × detector speed panel
# ------------------------------------------------

try:
    wide = pd.read_parquet("data/seattle_loop_clean.parquet")
    print("Loaded clean panel from parquet.")
except Exception as e:
    print("Parquet load failed, falling back to pickle. Error was:")
    print(" ", e)
    wide = pd.read_pickle("data/seattle_loop_clean.pkl")
    print("Loaded clean panel from pickle.")

print("Wide panel shape:", wide.shape)
print("Time span:", wide.index.min(), "→", wide.index.max())

T, D = wide.shape
timestamps = wide.index.to_numpy()
detector_ids = wide.columns.to_numpy()

print(f"T (time steps): {T},  D (detectors): {D}")

Loaded clean panel from parquet.
Wide panel shape: (105120, 147)
Time span: 2015-01-01 00:00:00 → 2015-12-31 23:55:00
T (time steps): 105120,  D (detectors): 147


## 2. Build missingness matrix $m_{t,d}$

We re-derive the missingness indicator:

- $x_{t,d}$ is missing $\iff$ the entry in `wide` is `NaN`.
- $m_{t,d} = 1$ if missing, $0$ if observed.

This matches the definition used in earlier notebooks.
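
One way to verify that a re-derived mask agrees with a previously saved one is sketched below on a toy panel. The helper `mask_matches_saved` and the temporary file are hypothetical stand-ins; we assume the earlier notebook stored its mask as an `int8` array with the same time × detector layout.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

def mask_matches_saved(wide_df, saved_path):
    """Return True if the NaN mask of `wide_df` equals the mask stored at `saved_path`."""
    m_new = wide_df.isna().to_numpy().astype(np.int8)
    m_old = np.load(saved_path)
    return m_old.shape == m_new.shape and bool(np.array_equal(m_old, m_new))

# Toy panel: 3 time steps x 2 detectors, with two missing readings
toy = pd.DataFrame(
    [[60.0, np.nan], [np.nan, 55.0], [58.0, 57.0]],
    index=pd.date_range("2015-01-01", periods=3, freq="5min"),
    columns=["det_a", "det_b"],
)

with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "m_t.npy"
    # Stand-in for the mask written by an earlier notebook
    np.save(saved, toy.isna().to_numpy().astype(np.int8))
    same = mask_matches_saved(toy, saved)

print(same)
```

Running such a check before overwriting `data/m_t.npy` would catch a silent change in the panel or in the mask convention.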

In [12]:
# ------------------------------------------------------
# 2. Missingness matrix m_t  (1 = missing, 0 = observed)
# ------------------------------------------------------

X = wide.to_numpy(dtype=np.float32)
missing_bool = np.isnan(X)
M = missing_bool.astype(np.int8)

print("M shape:", M.shape)
print("Fraction missing (M == 1):", M.mean())

# (Optional) save again here for convenience if needed elsewhere
Path("data").mkdir(exist_ok=True)
np.save("data/m_t.npy", M)
print("Saved (again for convenience): data/m_t.npy")


M shape: (105120, 147)
Fraction missing (M == 1): 0.051772771513476014
Saved (again for convenience): data/m_t.npy


## 3. Time-based covariates

We create time covariates that might affect the probability of dropout:

- **Hour-of-day (cyclical)**: $\sin(2\pi \, \text{hour} / 24)$, $\cos(2\pi \, \text{hour} / 24)$
- **Weekend indicator**: 1 for Saturday/Sunday, 0 otherwise.
- **Night-time indicator**: 1 for late night / early morning (here: 22:00–05:59), 0 otherwise.
- **Network-level blackout indicator**: 1 when at least 10% of detectors are missing, 0 otherwise.

These are defined per time step $t$ and will be broadcast to all detectors.
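
To see why the hour is encoded as a sine/cosine pair rather than as a raw integer, the toy example below compares the encoded distance between 23:00 and 00:00 with the distance between 00:00 and 01:00. The four timestamps are made up for illustration; the weekend and night rules mirror the definitions listed above.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex([
    "2015-01-03 23:00",   # Saturday, late night
    "2015-01-04 00:00",   # Sunday, midnight
    "2015-01-04 01:00",   # Sunday, early morning
    "2015-01-05 12:00",   # Monday, midday
])
hours_toy = idx.hour.to_numpy()

# Cyclical encoding: each hour becomes a point on the unit circle
angle = 2.0 * np.pi * hours_toy / 24.0
enc = np.stack([np.sin(angle), np.cos(angle)], axis=1)

# Adjacent hours are equally far apart, even across midnight
gap_23_to_0 = np.linalg.norm(enc[0] - enc[1])
gap_0_to_1 = np.linalg.norm(enc[1] - enc[2])

weekend_toy = (idx.dayofweek.to_numpy() >= 5).astype(np.float32)
night_toy = ((hours_toy >= 22) | (hours_toy < 6)).astype(np.float32)

print(gap_23_to_0, gap_0_to_1)
print(weekend_toy, night_toy)
```

With a raw integer hour, 23 and 0 would sit 23 units apart even though they are neighbours on the clock; on the unit circle the two gaps coincide.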

In [13]:
# ------------------------------------------------
# 3. Time-based features (shape: T × F_time)
# ------------------------------------------------

# Extract hour and weekday from the datetime index
hours = wide.index.hour.to_numpy()
weekdays = wide.index.dayofweek.to_numpy()  # 0=Mon, ..., 6=Sun

# Cyclical encoding of hour-of-day
hour_angle = 2.0 * np.pi * hours / 24.0
hour_sin = np.sin(hour_angle).astype(np.float32)
hour_cos = np.cos(hour_angle).astype(np.float32)

# Weekend indicator (Saturday=5, Sunday=6)
is_weekend = (weekdays >= 5).astype(np.float32)

# Night-time indicator (example: 22:00–05:59)
is_night = ((hours >= 22) | (hours < 6)).astype(np.float32)

# Network-wide blackout indicator: fraction of detectors missing ≥ 0.10
missing_frac_time = M.mean(axis=1)  # shape (T,)
NET_THRESH = 0.10
is_net_blackout = (missing_frac_time >= NET_THRESH).astype(np.float32)

# Stack into a single time-feature matrix: [sin_hour, cos_hour, weekend, night, net_blackout]
time_features = np.stack(
    [hour_sin, hour_cos, is_weekend, is_night, is_net_blackout],
    axis=1
).astype(np.float32)

print("time_features shape:", time_features.shape)  # (T, 5)

np.save("data/time_features.npy", time_features)
print("Saved: data/time_features.npy")

time_features shape: (105120, 5)
Saved: data/time_features.npy


## 4. Detector-based covariates

We compute static features per detector $d$:

- **Average speed** over the year (ignoring missing values).
- **Speed standard deviation** (traffic variability).
- **Missing fraction**: overall proportion of time steps where detector $d$ is missing.

These encode how “healthy” and “stable” a detector is; chronically broken
detectors and high-variance detectors may have different dropout behaviour.
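
One edge case worth keeping in mind: `np.nanmean` and `np.nanstd` return `NaN` (with a `RuntimeWarning`) for a detector that never reports, and a `NaN` feature would propagate into any downstream model. The sketch below shows one possible guard on a toy panel. The helper `detector_stats` and its fallback rule (mean of the other detectors, zero standard deviation) are assumptions for illustration, not something this notebook applies.

```python
import warnings

import numpy as np

def detector_stats(X):
    """Per-detector [mean, std, missing fraction]; all-NaN columns get fallback values."""
    with warnings.catch_warnings():
        # All-NaN columns trigger RuntimeWarnings; we handle them explicitly below
        warnings.simplefilter("ignore", category=RuntimeWarning)
        mean = np.nanmean(X, axis=0)
        std = np.nanstd(X, axis=0)
    miss = np.isnan(X).mean(axis=0)

    dead = miss == 1.0                        # detectors with no observations at all
    if dead.any() and not dead.all():
        mean[dead] = mean[~dead].mean()       # assumed fallback: average of the other detectors
        std[dead] = 0.0                       # assumed fallback: no observed variability
    return np.stack([mean, std, miss], axis=1).astype(np.float32)

# Toy panel: 3 time steps x 3 detectors; the middle detector never reports
X_toy = np.array(
    [[60.0, np.nan, 50.0],
     [62.0, np.nan, np.nan],
     [58.0, np.nan, 54.0]],
    dtype=np.float32,
)
feats = detector_stats(X_toy)
print(feats)
```

Whether such a fallback is appropriate depends on the panel; dropping dead detectors upstream would be an equally reasonable choice.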

In [14]:
# ------------------------------------------------
# 4. Detector-based features (shape: D × F_det)
# ------------------------------------------------

# Compute per-detector stats ignoring NaNs
speed_mean_d = np.nanmean(X, axis=0).astype(np.float32)
speed_std_d  = np.nanstd(X, axis=0).astype(np.float32)
missing_frac_d = M.mean(axis=0).astype(np.float32)  # fraction missing for each detector

detector_features = np.stack(
    [speed_mean_d, speed_std_d, missing_frac_d],
    axis=1
).astype(np.float32)

print("detector_features shape:", detector_features.shape)  # (D, 3)

np.save("data/detector_features.npy", detector_features)
np.save("data/detector_ids.npy", detector_ids)  # to keep alignment explicit
print("Saved: data/detector_features.npy")
print("Saved (alignment): data/detector_ids.npy")

detector_features shape: (147, 3)
Saved: data/detector_features.npy
Saved (alignment): data/detector_ids.npy


## 5. Combine into $\Phi[t, d, :]$ feature tensor

We now combine the time-based and detector-based covariates into a single feature tensor:

$$
\Phi[t, d, :] = \bigl[\,
    1,\;
    \sin(2\pi\,\text{hour}_t/24),\;
    \cos(2\pi\,\text{hour}_t/24),\;
    \text{is\_weekend}_t,\;
    \text{is\_night}_t,\;
    \text{is\_net\_blackout}_t,\;
    \overline{v}_d,\;
    \sigma(v)_d,\;
    \text{miss\_frac}_d
\,\bigr]
$$

This yields $K = 9$ features per $(t, d)$ pair, listed here in order (item $i$ is stored in channel $i - 1$ of the last axis):

1. Intercept (always 1.0)  
2. Hour-of-day sine  
3. Hour-of-day cosine  
4. Weekend indicator  
5. Night-time indicator  
6. Network-level blackout indicator  
7. Detector’s mean speed  
8. Detector’s speed standard deviation  
9. Detector’s overall missing fraction

This representation is suitable for a logistic model of $m_{t,d}$; further covariates can be appended later as additional channels along the last axis.

In [15]:
# ------------------------------------------------
# 5. Build Phi[t, d, k] design tensor
# ------------------------------------------------

F_time = time_features.shape[1]   # 5
F_det  = detector_features.shape[1]  # 3

K = 1 + F_time + F_det  # 1 intercept + 5 time + 3 detector = 9

Phi = np.zeros((T, D, K), dtype=np.float32)

# 0: intercept
Phi[:, :, 0] = 1.0

# 1–5: time-based features (broadcast along detectors)
# time_features: (T, 5) → (T, 1, 5) → broadcast to (T, D, 5)
Phi[:, :, 1:1+F_time] = time_features[:, None, :]

# 6–8: detector-based features (broadcast along time)
# detector_features: (D, 3) → (1, D, 3) → broadcast to (T, D, 3)
Phi[:, :, 1+F_time:1+F_time+F_det] = detector_features[None, :, :]

print("Phi shape:", Phi.shape)  # (T, D, 9)

np.save("data/phi_features.npy", Phi)
print("Saved: data/phi_features.npy")


Phi shape: (105120, 147, 9)
Saved: data/phi_features.npy
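
At this shape the dense tensor takes $105{,}120 \times 147 \times 9 \times 4$ bytes, roughly 0.56 GB in `float32`, even though it is fully determined by the two small feature matrices. If memory becomes a concern, a block of time steps can be rebuilt on demand instead. The helper `phi_block` below is a hypothetical sketch of that idea, checked against a dense reference on toy data; it is not part of the saved pipeline.

```python
import numpy as np

def phi_block(time_features, detector_features, t0, t1):
    """Build Phi[t0:t1] on demand instead of materialising the full (T, D, K) tensor."""
    n_t = t1 - t0
    n_det = detector_features.shape[0]
    f_time = time_features.shape[1]
    f_det = detector_features.shape[1]

    block = np.empty((n_t, n_det, 1 + f_time + f_det), dtype=np.float32)
    block[:, :, 0] = 1.0                                         # intercept
    block[:, :, 1:1 + f_time] = time_features[t0:t1, None, :]    # broadcast over detectors
    block[:, :, 1 + f_time:] = detector_features[None, :, :]     # broadcast over time
    return block

rng = np.random.default_rng(0)
tf_toy = rng.normal(size=(6, 5)).astype(np.float32)   # toy (T, 5) time features
df_toy = rng.normal(size=(4, 3)).astype(np.float32)   # toy (D, 3) detector features

# Dense reference built the same way as the full tensor
full = np.concatenate(
    [
        np.ones((6, 4, 1), dtype=np.float32),
        np.broadcast_to(tf_toy[:, None, :], (6, 4, 5)),
        np.broadcast_to(df_toy[None, :, :], (6, 4, 3)),
    ],
    axis=2,
)
block = phi_block(tf_toy, df_toy, 2, 5)

full_bytes = 105120 * 147 * 9 * 4   # size of the real float32 tensor in bytes
print(block.shape, full_bytes)
```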


## 6. Feature documentation

The file `data/phi_features.npy` is a NumPy array of shape $(T, D, K)$ with:

- $T$ = number of 5-minute time steps in 2015 ($105{,}120$)  
- $D$ = number of detectors ($147$)  
- $K = 9$ features  

The feature channels are ordered as:

0. **Intercept**: constant 1.0  
1. **hour_sin**: $\sin(2\pi \cdot \text{hour} / 24)$, with integer hour = 0,…,23 (minutes within the hour are ignored)  
2. **hour_cos**: $\cos(2\pi \cdot \text{hour} / 24)$  
3. **is\_weekend**: 1 if Saturday or Sunday, else 0  
4. **is\_night**: 1 if hour $\in \{22, 23, 0, 1, 2, 3, 4, 5\}$, else 0  
5. **is\_net\_blackout**: 1 if at least 10% of detectors are missing at time $t$, 0 otherwise  
6. **mean\_speed\_d**: average speed of detector $d$ (ignoring NaNs)  
7. **std\_speed\_d**: standard deviation of speed for detector $d$ (ignoring NaNs)  
8. **missing\_frac\_d**: fraction of time steps where detector $d$ is missing  

Additional helper files:

- `data/time_features.npy`: $(T, 5)$ matrix with the time-based features  
- `data/detector_features.npy`: $(D, 3)$ matrix with the detector-based features  
- `data/detector_ids.npy`: $(D,)$ array giving the original detector ID for each column index $d$  
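
Downstream code may find it convenient to address channels by name and to memory-map the tensor rather than load it fully. The sketch below illustrates both on a small temporary file. The `FEATURE_NAMES` list and `CHANNEL` lookup are hypothetical helpers that simply mirror the channel order documented above; `mmap_mode="r"` is a standard option of `np.load`.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical name list mirroring the documented channel order
FEATURE_NAMES = [
    "intercept", "hour_sin", "hour_cos", "is_weekend", "is_night",
    "is_net_blackout", "mean_speed_d", "std_speed_d", "missing_frac_d",
]
CHANNEL = {name: k for k, name in enumerate(FEATURE_NAMES)}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "phi_features.npy"

    # Toy stand-in for the real tensor: 4 time steps, 3 detectors, 9 channels
    toy = np.zeros((4, 3, len(FEATURE_NAMES)), dtype=np.float32)
    toy[:, :, CHANNEL["intercept"]] = 1.0
    toy[:, :, CHANNEL["is_night"]] = np.array([1, 1, 0, 0], dtype=np.float32)[:, None]
    np.save(path, toy)

    # Memory-map the file so only the slices we touch are read from disk
    phi = np.load(path, mmap_mode="r")
    night = np.array(phi[:, 0, CHANNEL["is_night"]])   # copy the slice into memory
    del phi                                            # release the map before cleanup

print(night)
```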