# 02 — Imputation Strategies: Baseline versus Model

**Objective.** This notebook documents two imputation approaches: a baseline median-based strategy and a model-based strategy. We compare distributions and (if predictions are available) predicted versus actual values.

**Tables.**
- Staging (before imputation): `PROJECT_ID.DATASET.stg_usage`
- Curated (baseline applied): `PROJECT_ID.DATASET.curated_usage`
- Optional model predictions: `PROJECT_ID.DATASET.imputed_model`

In [None]:
# Configuration
PROJECT_ID = "data-engineering-ecometricx"
DATASET = "energy_analytics"
LOCATION = "EU"

# !pip -q install --upgrade pandas pandas-gbq google-cloud-bigquery

import pandas as pd
from pandas_gbq import read_gbq
import matplotlib.pyplot as plt

# Fully qualified, backtick-quoted table names for use inside SQL strings.
# The f-string already substitutes the values, so no extra .format() call is needed.
STG = f"`{PROJECT_ID}.{DATASET}.stg_usage`"
CUR = f"`{PROJECT_ID}.{DATASET}.curated_usage`"
IMPUTED_MODEL = f"`{PROJECT_ID}.{DATASET}.imputed_model`"

## 1. Distributional comparison (pre versus curated/baseline)

In [None]:
# LIMIT without ORDER BY returns an arbitrary subset of rows, not a random sample
pre = read_gbq(f"SELECT lusage FROM {STG} LIMIT 200000", project_id=PROJECT_ID, location=LOCATION)
post = read_gbq(f"SELECT lusage, was_imputed, impute_strategy FROM {CUR} LIMIT 200000", project_id=PROJECT_ID, location=LOCATION)

plt.figure()
pre['lusage'].dropna().hist(bins=50)
plt.title('Pre (staging): lusage distribution')
plt.xlabel('lusage'); plt.ylabel('count')
plt.show()

plt.figure()
post['lusage'].dropna().hist(bins=50)
plt.title('Post (curated/baseline): lusage distribution')
plt.xlabel('lusage'); plt.ylabel('count')
plt.show()
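
Two separate histograms can hide small shifts, so a side-by-side quantile table may help quantify the comparison. The following is a minimal sketch on synthetic data, not on the project tables: `compare_quantiles`, `pre_demo`, and `post_demo` are hypothetical names, and the median fill only stands in for whatever the baseline pipeline actually does.

```python
import numpy as np
import pandas as pd

def compare_quantiles(before, after, probs=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Return selected quantiles of two series side by side, plus their difference."""
    probs = list(probs)
    table = pd.DataFrame({
        "pre": before.dropna().quantile(probs),
        "post": after.dropna().quantile(probs),
    })
    table["shift"] = table["post"] - table["pre"]
    return table

# Synthetic stand-in for the staging column (assumption: illustrative values only)
rng = np.random.default_rng(0)
pre_demo = pd.Series(rng.normal(loc=5.0, scale=1.0, size=1000))
pre_demo.iloc[::10] = np.nan                      # make 10% of the values missing
post_demo = pre_demo.fillna(pre_demo.median())    # simple global-median fill

quantile_table = compare_quantiles(pre_demo, post_demo)
print(quantile_table)
```

With the real data, the same function could be called as `compare_quantiles(pre['lusage'], post['lusage'])`. A median fill is expected to leave the 0.5 row nearly unchanged while pulling the outer quantiles toward the center.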

## 2. Imputed share by month (baseline path)

In [None]:
imp_q = (
    "SELECT month, COUNT(*) AS n, COUNTIF(was_imputed) AS imputed_n "
    f"FROM {CUR} "
    "GROUP BY month ORDER BY month"
)
imp = read_gbq(imp_q, project_id=PROJECT_ID, location=LOCATION)
plt.figure()
plt.bar(imp['month'], imp['imputed_n'])
plt.title('Imputed rows by month (baseline)')
plt.xlabel('month'); plt.ylabel('count imputed')
plt.show()
imp
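
The chart above shows imputed counts, which depend on how many rows each month has. To get the share named in the section heading, the counts can be divided by the monthly totals. This is a sketch with made-up numbers: `add_imputed_share` and `share_demo` are hypothetical names, and only the column names `n` and `imputed_n` come from the query above.

```python
import pandas as pd

def add_imputed_share(monthly):
    """Add imputed_share = imputed_n / n; months with n == 0 get NaN instead of an error."""
    out = monthly.copy()
    out["imputed_share"] = out["imputed_n"].div(out["n"].where(out["n"] > 0))
    return out

# Illustrative numbers only, not real query results
share_demo = pd.DataFrame({
    "month": [1, 2, 3],
    "n": [1000, 800, 0],
    "imputed_n": [50, 8, 0],
})
share_result = add_imputed_share(share_demo)
print(share_result)
```

Applied to the query result, `add_imputed_share(imp)` would yield a column that can be passed to `plt.bar` in place of `imp['imputed_n']`.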

## 3. Predicted versus actual (model-based, if available)

In [None]:
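try:
    mdl = read_gbq(f"SELECT lusage, lusage_pred_model FROM {IMPUTED_MODEL} LIMIT 200000",
                   project_id=PROJECT_ID, location=LOCATION)
    # Pearson correlation; pandas ignores rows where either value is missing
    corr = mdl[['lusage', 'lusage_pred_model']].corr().iloc[0, 1]
    # Drop incomplete rows first, then cap the sample size at the number of
    # remaining rows so that sample() cannot request more rows than exist
    complete = mdl.dropna()
    sample = complete.sample(n=min(20000, len(complete)), random_state=1)

    plt.figure()
    plt.scatter(sample['lusage_pred_model'], sample['lusage'], s=3, alpha=0.3)
    plt.title(f'Predicted versus actual (corr={corr:.3f})')
    plt.xlabel('predicted lusage'); plt.ylabel('actual lusage')
    plt.show()

    # A bare expression inside try is not auto-displayed, so print the summary
    print(mdl[['lusage', 'lusage_pred_model']].describe())
except Exception as e:
    print("Model predictions not available or query failed:", e)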
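
Correlation measures how well predictions track actual values but says nothing about systematic offset or error size, so it may be worth reporting bias, MAE, and RMSE alongside it. The sketch below uses a tiny made-up frame: `prediction_errors` and `pred_demo` are hypothetical names, and only the column names `lusage` and `lusage_pred_model` come from the query above.

```python
import numpy as np
import pandas as pd

def prediction_errors(df, actual="lusage", pred="lusage_pred_model"):
    """Summarize prediction error over rows where both columns are present."""
    pairs = df[[actual, pred]].dropna()
    resid = pairs[pred] - pairs[actual]          # positive means over-prediction
    return {
        "n": int(len(pairs)),
        "bias": float(resid.mean()),
        "mae": float(resid.abs().mean()),
        "rmse": float(np.sqrt((resid ** 2).mean())),
    }

# Illustrative values only, not real model output
pred_demo = pd.DataFrame({
    "lusage": [1.0, 2.0, 3.0, None],
    "lusage_pred_model": [1.5, 2.0, 2.5, 4.0],
})
errors = prediction_errors(pred_demo)
print(errors)
```

On the real data this would be called as `prediction_errors(mdl)` inside the `try` block, once the query has succeeded.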

## 4. Discussion

- The baseline method leaves the central tendency essentially unchanged, but because imputed values cluster at the median it narrows the spread of the distribution; when missingness is rare, this distortion is negligible.
- The model-based method can capture cross-sectional heterogeneity and temporal persistence, but its accuracy depends on feature quality and the choice of regularization, so it should be evaluated carefully before it replaces the baseline.
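
One way to make the evaluation above concrete is a masking test: hide a random subset of observed values, impute them, and measure the error against the values that were hidden. The sketch below applies this to a median fill on synthetic data. `masked_mae`, `median_impute`, and `demo_usage` are hypothetical names, the series is assumed to have a unique index, and a model-based imputer could be passed in the same way as long as it accepts and returns a series.

```python
import numpy as np
import pandas as pd

def masked_mae(values, impute_fn, frac=0.2, seed=0):
    """Hide a random fraction of observed values, impute them, and return the MAE."""
    observed = values.dropna()
    rng = np.random.default_rng(seed)
    n_hide = max(1, int(len(observed) * frac))
    hidden_idx = rng.choice(observed.index.to_numpy(), size=n_hide, replace=False)

    masked = values.copy()
    masked.loc[hidden_idx] = np.nan              # pretend these values are missing
    filled = impute_fn(masked)

    # Compare imputed values with the true values that were hidden
    return float((filled.loc[hidden_idx] - values.loc[hidden_idx]).abs().mean())

def median_impute(s):
    """Fill missing entries with the median of the observed entries."""
    return s.fillna(s.median())

# Synthetic stand-in for lusage (assumption: illustrative values only)
rng = np.random.default_rng(1)
demo_usage = pd.Series(rng.normal(5.0, 1.0, size=500))
baseline_mae = masked_mae(demo_usage, median_impute)
print(f"median baseline MAE on hidden values: {baseline_mae:.3f}")
```

Running both strategies through the same function with the same `seed` keeps the hidden rows identical, which makes the two error figures directly comparable.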