# 02 – Data QA, Imputation, and Feature Engineering (Daily)

**Goal:** Prepare model-ready daily features for GP (inflow & demand) and the BN module, aligned with your proposal.

**Inputs:**
- `master_kaligandaki_daily_withrain.csv` (2019–2023) with columns:
  - `date`, `discharge_cms`, `gauge_m`, `rainfall_mm`,
  - `peak_load_mw`, `avg_load_mw`, `energy_mwh`, `year`.

**Outputs:**
- `features_daily.csv` – cleaned & engineered features
- Train/validation split markers (2019–2022 train, 2023 validation)
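
Before any processing, it can help to confirm that the master file really contains the columns listed under Inputs. The following is a minimal sketch; `EXPECTED_COLUMNS` and `check_columns` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Columns listed under Inputs above.
EXPECTED_COLUMNS = [
    'date', 'discharge_cms', 'gauge_m', 'rainfall_mm',
    'peak_load_mw', 'avg_load_mw', 'energy_mwh', 'year',
]

def check_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Hypothetical helper: return the expected columns that `frame` lacks."""
    return [c for c in expected if c not in frame.columns]

# Toy frame with one column deliberately left out.
toy = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != 'gauge_m'])
missing_cols = check_columns(toy)
print('Missing columns:', missing_cols)
```

Running the same check on the loaded master frame and stopping when the list is non-empty would surface a renamed or dropped column before it turns into a `KeyError` further down.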


In [None]:
# ==== 0. Imports & Config ====
import pandas as pd
import numpy as np
from pathlib import Path

MASTER_PATH = 'master_kaligandaki_daily_withrain.csv'  # change if needed
OUT_FEATURES = 'features_daily.csv'


In [None]:
# ==== 1. Load master data ====
df = pd.read_csv(MASTER_PATH, parse_dates=['date']).sort_values('date').reset_index(drop=True)
print(df.head())
print('\nDate range:', df['date'].min().date(), '→', df['date'].max().date(), '| rows=', len(df))


In [None]:
# ==== 2. Quick QA – Missing counts and basic sanity ====
missing = df.isna().sum().sort_values(ascending=False)
print('Missing values per column:\n', missing)
print('\nBasic stats (selected):')
print(df[['discharge_cms','gauge_m','rainfall_mm','peak_load_mw','avg_load_mw','energy_mwh']].describe())


### Imputation policy
- Keep the full daily calendar (one row per day).
- For **short gaps (≤3 days)** in `discharge_cms`, `gauge_m`, `peak_load_mw`, `avg_load_mw`, `energy_mwh`: use linear interpolation.
- Leave longer gaps as NaN and add `<column>_imputed` **flag columns** marking interpolated values (so models/BN can account for the added uncertainty).
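
Keeping the full daily calendar matters because the lag and rolling features later on count rows, not days. If the master file simply omits a day rather than storing it as NaN, a reindex is needed first. The following is a sketch under that assumption; `to_full_daily_calendar` is a hypothetical helper, and keeping the first row for a duplicated date is an arbitrary choice made for illustration.

```python
import pandas as pd

def to_full_daily_calendar(frame: pd.DataFrame, date_col: str = 'date') -> pd.DataFrame:
    """Hypothetical helper: one row per calendar day; absent days become NaN rows."""
    # Assumption: if a date appears more than once, keep its first row.
    frame = frame.sort_values(date_col).drop_duplicates(subset=date_col, keep='first')
    full_index = pd.date_range(frame[date_col].min(), frame[date_col].max(), freq='D')
    out = frame.set_index(date_col).reindex(full_index)
    out.index.name = date_col
    return out.reset_index()

# Toy data with 2019-01-03 and 2019-01-04 absent.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-05']),
    'discharge_cms': [100.0, 110.0, 140.0],
})
full = to_full_daily_calendar(toy)
print(full)
```

Columns derived from the date, such as `year`, would be NaN on the newly created rows and would need to be recomputed from `date`.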

In [None]:
# ==== 3. Gap filling (linear) for short gaps only ====
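def linear_impute_short_gaps(s: pd.Series, max_gap: int = 3):
    """Linearly interpolate NaN runs of length <= max_gap.

    Returns (filled_series, imputed_flag). Longer runs are left as NaN, as are
    runs touching the start or end of the series (no neighbor on one side).
    """
    s = s.copy()
    isna = s.isna()
    if not isna.any():
        return s, pd.Series(False, index=s.index)
    # Label each run of consecutive equal values in `isna`, then measure NaN-run lengths.
    run_id = isna.astype(int).diff().ne(0).cumsum()
    run_len = isna.astype(int).groupby(run_id).transform('sum')
    short = isna & (run_len <= max_gap)
    # limit_area='inside' interpolates only between valid values (no extrapolation).
    filled = s.interpolate(method='linear', limit_area='inside')
    imputed_flag = short & filled.notna()
    # Take interpolated values only inside short gaps; everything else is unchanged.
    s = s.where(~imputed_flag, filled)
    return s, imputed_flag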

cols_to_impute = ['discharge_cms','gauge_m','peak_load_mw','avg_load_mw','energy_mwh']
for c in cols_to_impute:
    df[c], df[f'{c}_imputed'] = linear_impute_short_gaps(df[c], max_gap=3)

print('After imputation (short gaps):')
print(df[cols_to_impute].isna().sum())


### Outlier flagging (robust rule)
- Compute a centered 7-day rolling median and the rolling MAD (median absolute deviation) around it.
- Flag points with |x − median| > 5 × MAD as potential outliers.
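
To see how the rule behaves, the sketch below applies it to two toy series. `rolling_mad_flags` is a hypothetical stand-alone copy of the rule written for this illustration, and the toy values are invented. The second series shows a caveat: when most values in a window are identical (for example dry days with zero rainfall), the MAD is zero and any departure from the median is flagged.

```python
import pandas as pd

def rolling_mad_flags(x: pd.Series, window: int = 7, k: float = 5.0) -> pd.Series:
    """Hypothetical helper: True where |x - rolling median| > k * rolling MAD."""
    med = x.rolling(window, min_periods=1, center=True).median()
    abs_dev = (x - med).abs()
    mad = abs_dev.rolling(window, min_periods=1, center=True).median()
    # Small epsilon so that points equal to the median are not flagged when MAD is 0.
    return abs_dev > k * (mad + 1e-9)

# Invented load-like series with one spike at position 5.
load = pd.Series([10, 11, 10, 12, 11, 60, 10, 11, 12, 10, 11], dtype=float)
load_flags = rolling_mad_flags(load)
print('Flagged load positions:', list(load_flags[load_flags].index))

# Invented rainfall-like series: dry days with a single 5 mm day.
rain = pd.Series([0, 0, 0, 0, 5, 0, 0, 0, 0], dtype=float)
rain_flags = rolling_mad_flags(rain)
print('Flagged rain positions:', list(rain_flags[rain_flags].index))
```

For zero-inflated variables such as `rainfall_mm`, the resulting flags are therefore better read as "differs from the surrounding days" than as "suspect measurement".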

In [None]:
# ==== 4. Outlier flags (robust) ====
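def mad_based_outlier_flags(x: pd.Series, window=7, k=5.0):
    """Flag points where |x - rolling median| > k * rolling MAD (1 = flagged)."""
    # center=True means each window also uses later days, so treat these as QA flags.
    med = x.rolling(window, min_periods=1, center=True).median()
    abs_dev = (x - med).abs()
    mad = abs_dev.rolling(window, min_periods=1, center=True).median()
    # Epsilon avoids flagging exact ties when MAD is zero; NaN inputs are never flagged.
    return (abs_dev > k * (mad + 1e-9)).astype(int)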

for c in ['discharge_cms','gauge_m','rainfall_mm','peak_load_mw','avg_load_mw','energy_mwh']:
    df[f'{c}_outlier'] = mad_based_outlier_flags(df[c])

df.head()


### Feature engineering (daily)
- **Calendar:** day-of-week, month, monsoon flag (Jun–Sep)
- **Lags:** 1, 2, 3, 7 days for hydrology & load
- **Rolling:** 3, 7, 30-day means (and rainfall sums for 3 & 7)
- **Targets:** next-day peak load and inflow
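
The alignment of lags, rolling means, and next-day targets is easiest to check on a tiny example. The sketch below uses invented values for `peak_load_mw` and the same `shift` and `rolling` calls as the feature list describes.

```python
import pandas as pd

# Invented five-day series, one row per day.
toy = pd.DataFrame({
    'date': pd.date_range('2019-01-01', periods=5, freq='D'),
    'peak_load_mw': [100.0, 110.0, 120.0, 130.0, 140.0],
})

# Lag: value observed one row (one day) earlier.
toy['peak_load_mw_lag1'] = toy['peak_load_mw'].shift(1)
# Rolling mean over the current day and the two days before it.
toy['peak_load_mw_roll3'] = toy['peak_load_mw'].rolling(3, min_periods=1).mean()
# Target: value observed on the following day.
toy['y_demand_peak_next'] = toy['peak_load_mw'].shift(-1)

print(toy)
```

Each rolling window ends on the current day, so the features use only information available when the next-day target is predicted. The last row has no target, and the first rows have missing lags; `shift` counts rows, so the lags equal calendar days only when the frame has one row per day.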

In [None]:
# ==== 5. Feature engineering ====
df['dow'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_monsoon'] = df['month'].isin([6,7,8,9]).astype(int)

lag_cols = ['discharge_cms','gauge_m','rainfall_mm','peak_load_mw','avg_load_mw','energy_mwh']
for c in lag_cols:
    for L in [1,2,3,7]:
        df[f'{c}_lag{L}'] = df[c].shift(L)

for c in ['discharge_cms','gauge_m','peak_load_mw','avg_load_mw']:
    for W in [3,7,30]:
        df[f'{c}_roll{W}'] = df[c].rolling(W, min_periods=1).mean()

for W in [3,7]:
    df[f'rainfall_sum{W}'] = df['rainfall_mm'].rolling(W, min_periods=1).sum()
    df[f'rainfall_roll{W}'] = df['rainfall_mm'].rolling(W, min_periods=1).mean()

df['y_demand_peak_next'] = df['peak_load_mw'].shift(-1)
df['y_inflow_next'] = df['discharge_cms'].shift(-1)

df.head()


### Train/Validation split
- Train: 2019–2022
- Validation: 2023
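
Because the targets are next-day values, the rows at the edges of the split deserve a look: the last training row's target is observed on the first validation day, and the final row has no target at all. The sketch below, on invented values, builds stricter masks for both cases. `strict_train` and `usable_valid` are hypothetical names, and whether the boundary row should be dropped depends on how the models are evaluated.

```python
import numpy as np
import pandas as pd

split_date = pd.Timestamp('2023-01-01')

# Invented rows straddling the train/validation boundary.
toy = pd.DataFrame({
    'date': pd.date_range('2022-12-30', periods=4, freq='D'),
    'discharge_cms': [90.0, 95.0, 100.0, 105.0],
})
toy['y_inflow_next'] = toy['discharge_cms'].shift(-1)
toy['set'] = np.where(toy['date'] < split_date, 'train', 'valid')

# Date on which each row's next-day target is observed.
target_date = toy['date'] + pd.Timedelta(days=1)

# Assumption: a strict training mask excludes rows whose target lies in the validation period.
strict_train = (toy['set'] == 'train') & (target_date < split_date)
# Validation rows are usable only when their target exists.
usable_valid = (toy['set'] == 'valid') & toy['y_inflow_next'].notna()

print(toy.assign(strict_train=strict_train, usable_valid=usable_valid))
```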

In [None]:
# ==== 6. Train/Validation mask ====
df['set'] = np.where(df['date'] < '2023-01-01', 'train', 'valid')
print(df['set'].value_counts())


In [None]:
# ==== 7. Save features ====
df.to_csv(OUT_FEATURES, index=False)
print('Saved features to:', OUT_FEATURES, '| rows=', len(df))
