### Imputation Rationale

**Do not impute inconsistent/partial variables by default.** Only consider imputation if the variable is conceptually indispensable and FMI suggests the information can be credibly recovered (e.g., plausible MAR with auxiliary predictors).

It’s not reasonable to impute inconsistent/partial variables without first considering FMI and context. Imputation is not a neutral operation; it encodes assumptions about the missingness mechanism, temporal comparability, and the meaning of the variable. If a variable is inconsistent across months/years, imputing it can fabricate continuity that wasn’t in the data, undermining factor analysis and comparability across regions and time.

**Tier 1 — Consistent variables:**

- Action: Eligible for imputation.
- Rule: Use FMI to determine imputation intensity (light/cautious/advanced).
- Justification: Stable measurement; imputation supports matrix completion for EFA.

**Tier 2 — Partial variables (intermittent presence or minor coding drift):**

- Action: Conditional imputation.
- Rule: Impute only if, despite moderate/high FMI, MAR is plausible given auxiliary predictors and coding has been harmonized; otherwise flag for sensitivity analysis.
- Justification: Limited comparability; treat as supporting evidence, not core FA inputs.

**Tier 3 — Inconsistent variables (structural changes, major coding breaks):**

- Action: Do not impute for FA.
- Rule: Document and retain for diagnostics; consider future harmonization projects or use in qualitative context.
- Justification: Imputation would manufacture comparability and can distort factor structure.

**Override - Conceptual indispensability:**

- Action: If a variable is central to sensitivity/resilience/exposure and lacks a close proxy, allow imputation even if partial, but only with:
  - an explicit MAR argument using auxiliary variables,
  - complete coding evidence, and
  - sensitivity analyses comparing results with the variable included vs. excluded.
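
One way to keep the override from becoming a loophole is to encode its conditions as an explicit gate. The sketch below is illustrative only: `OverrideEvidence` and `override_allows_imputation` are hypothetical names that do not exist in the pipeline, and each boolean is assumed to be set by hand after the evidence has been reviewed.

```python
from dataclasses import dataclass


@dataclass
class OverrideEvidence:
    """Hypothetical record of the evidence reviewed for one variable."""
    conceptually_central: bool      # central to sensitivity/resilience/exposure
    has_close_proxy: bool           # a close substitute variable exists
    mar_argument: bool              # explicit MAR argument using auxiliary variables
    coding_evidence_complete: bool  # coding evidence is complete
    sensitivity_planned: bool       # included-vs-excluded comparison will be run


def override_allows_imputation(tag: str, ev: OverrideEvidence) -> bool:
    """Return True only when every override condition holds for a partial variable."""
    if tag != "partial":
        # Consistent variables do not need the override; inconsistent ones stay excluded.
        return False
    return (
        ev.conceptually_central
        and not ev.has_close_proxy
        and ev.mar_argument
        and ev.coding_evidence_complete
        and ev.sensitivity_planned
    )


ok = OverrideEvidence(True, False, True, True, True)
print(override_allows_imputation("partial", ok))       # True
print(override_allows_imputation("inconsistent", ok))  # False
```

Requiring all conditions at once mirrors the "but only with" wording: a single missing piece of evidence sends the variable back to its tier's default action.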

**Why is imputing inconsistent variables without FMI review not defensible?**

**Measurement instability:**

Inconsistent variables often arise because the survey question changed, coding shifted, or the variable wasn’t asked in some rounds. Imputing them blindly assumes the missingness is random noise, when in fact it reflects structural differences. That creates false comparability across years.

**Factor analysis assumptions:**

FA assumes each variable measures the same construct across all observations. If a variable is inconsistent, imputing values fabricates continuity that wasn’t there. This risks producing spurious factors that look “interpretable” but are actually artifacts of imputation.

**Auditability and thesis defense:**

The approved pipeline methodology emphasizes transparency and conceptual justification. If the team imputes inconsistent variables without FMI, reviewers can easily challenge: “Why did you treat structurally missing data as if it were random?”
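
The difference between structural and item-level missingness can be checked directly before any imputation decision is made. The following is a minimal sketch, assuming a stacked DataFrame with one row per respondent and a column identifying the survey round; `missingness_by_round` and the `round` column name are hypothetical and the toy data is made up.

```python
import pandas as pd


def missingness_by_round(df: pd.DataFrame, var: str, round_col: str = "round") -> pd.DataFrame:
    """Share of missing values per round, flagging rounds where the variable is entirely absent."""
    rate = df[var].isna().groupby(df[round_col]).mean()
    return pd.DataFrame({"missing_rate": rate, "structural": rate == 1.0})


# Toy data: "x" is partly missing in one round and never recorded in the other.
demo = pd.DataFrame({
    "round": ["2019-01", "2019-01", "2020-01", "2020-01"],
    "x": [1.0, None, None, None],
})
profile = missingness_by_round(demo, "x")
print(profile)
```

A round with a missing rate of exactly 1.0 suggests the question was not asked or not coded, which is the case the argument above says should not be treated as random missingness.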

### Documentation and Audit Trail

Action matrix: For each variable, store:

- Tag: consistent/partial/inconsistent.
- FMI bucket: Low/Moderate/High/Critical.
- Dimension role: sensitivity/resilience/exposure.
- Decision: keep, impute (light/cautious/advanced), sensitivity-only, exclude from FA.
- Rationale: conceptual indispensability, MAR plausibility, harmonization status, auxiliary predictors.
- Sensitivity analysis flags: Flag variables where inclusion materially changes factor loadings or KMO/Bartlett results, so the team can revisit.
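
The sensitivity flag could be derived by comparing factor loadings from two runs, one with and one without the candidate variable. This is a rough sketch under several assumptions: the loadings come from whichever factor analysis tool the team uses, factors are already aligned in the same column order across runs, sign flips are ignored by comparing absolute values, and the 0.10 threshold is an arbitrary placeholder. `flag_loading_shift` is a hypothetical helper and the numbers are made up.

```python
import pandas as pd


def flag_loading_shift(with_var: pd.DataFrame, without_var: pd.DataFrame,
                       threshold: float = 0.10) -> pd.Series:
    """Flag variables whose absolute loading on any factor moves by more than `threshold`."""
    shared = with_var.index.intersection(without_var.index)
    shift = (with_var.loc[shared].abs() - without_var.loc[shared].abs()).abs()
    return shift.max(axis=1) > threshold


# Toy loadings from two runs: with and without candidate variable "c".
with_c = pd.DataFrame({"F1": [0.70, 0.20, 0.55], "F2": [0.10, 0.60, 0.30]},
                      index=["a", "b", "c"])
without_c = pd.DataFrame({"F1": [0.72, 0.50], "F2": [0.08, 0.30]},
                         index=["a", "b"])

flags = flag_loading_shift(with_c, without_c)
print(flags[flags].index.tolist())  # ['b']
```

Here including "c" changes how "b" loads, so "b" (and the decision to include "c") would be marked for the team to revisit.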

In [None]:
# 09_Imputation Notebook — Decision Matrix Builder
# ------------------------------------------------

import json
from pathlib import Path
import os
import pandas as pd
import numpy as np
from datetime import datetime

# --- Load config ---
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# --- Load inventory (optional, for parity) ---
with open(Path(INTERIM_DIR) / "inventory.json") as f:
    inventory = json.load(f)

# --- Paths ---
RENAMED_ROOT = BASE_PATH / "NEW Renamed Fully Decoded Surveys"
CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
FMI_ROOT = BASE_PATH / "NEW FMI Reports"
DECISION_ROOT = BASE_PATH / "Decision Matrix for Imputation"
os.makedirs(DECISION_ROOT, exist_ok=True)

# --- Load inputs ---
consistency_df = pd.read_csv(CONSISTENCY_ROOT / "consistency_profile.csv")
fmi_df = pd.read_csv(FMI_ROOT / "fmi_profile.csv")

# --- Merge consistency + FMI ---
decision_df = fmi_df.merge(
    consistency_df[["Variable", "ConsistencyTag"]],
    on="Variable",
    how="left"
)

# --- Handle duplicate ConsistencyTag columns if present ---
if "ConsistencyTag_x" in decision_df.columns and "ConsistencyTag_y" in decision_df.columns:
    decision_df["ConsistencyTag"] = decision_df["ConsistencyTag_x"].combine_first(decision_df["ConsistencyTag_y"])
    decision_df.drop(columns=["ConsistencyTag_x", "ConsistencyTag_y"], inplace=True)

# --- Manual factor formation dictionary (customizable) ---
dimension_map = {
    # Sensitivity
    "Available for Work": "Sensitivity",
    "C13-Major Occupation Group": "Sensitivity",
    "C14-Primary Occupation": "Sensitivity",
    "C15-Major Industry Group": "Sensitivity",
    "C16-Kind of Business (Primary Occupation)": "Sensitivity",
    "C24-Basis of Payment (Primary Occupation)": "Sensitivity",
    "C25-Basic Pay per Day (Primary Occupation)": "Sensitivity",
    "Class of Worker (Primary Occupation)": "Sensitivity",
    "Nature of Employment (Primary Occupation)": "Sensitivity",
    "Total Hours Worked for all Jobs": "Sensitivity",
    "Work Arrangement": "Sensitivity",
    "Work Indicator": "Sensitivity",
    # Resilience
    "C03-Relationship to Household Head": "Resilience",
    "C04-Sex": "Resilience",
    "C05-Age as of Last Birthday": "Resilience",
    "C06-Marital Status": "Resilience",
    "C07-Highest Grade Completed": "Resilience",
    "C08-Currently Attending School": "Resilience",
    "C09-Graduate of technical/vocational course": "Resilience",
    "C09a - Currently Attending Non-formal Training for Skills Development": "Resilience",
    "Household Size": "Resilience",
    # Exposure
    "Province": "Exposure",
    "Province Recode": "Exposure",
    "Region": "Exposure",
    "Urban-RuralFIES": "Exposure",
    "Location of Work (Province, Municipality)": "Exposure",
    "Survey Month": "Exposure",
    "Survey Year": "Exposure",
}

# --- Dimension assignment function ---
def assign_dimension(var):
    if var in dimension_map:
        return dimension_map[var]
    v = var.lower()
    if any(k in v for k in ["occupation", "work", "employment", "job", "hours", "basis", "industry"]):
        return "Sensitivity"
    elif any(k in v for k in ["grade", "school", "household", "age", "marital", "ethnicity", "training"]):
        return "Resilience"
    elif any(k in v for k in ["region", "province", "urban", "survey", "weight", "psu", "replicate"]):
        return "Exposure"
    else:
        return "Unclassified"

decision_df["Dimension"] = decision_df["Variable"].apply(assign_dimension)

# --- SuggestedAction logic ---
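def suggest_action(row):
    fmi = row["OverallFMI"]
    tag = row["ConsistencyTag"]

    # A missing FMI or a missing tag (variable absent from the consistency
    # profile after the left merge) cannot be decided automatically.
    if pd.isna(fmi) or pd.isna(tag):
        return "review"
    if tag == "consistent":
        if fmi < 0.05:
            return "keep"
        elif fmi < 0.20:
            return "impute_light"
        elif fmi < 0.40:
            return "impute_cautious"
        else:
            return "consider_drop_or_advanced"
    elif tag == "partial":
        if fmi < 0.20:
            return "sensitivity_only"
        else:
            return "exclude_from_FA"
    elif tag == "inconsistent":
        return "exclude_from_FA"
    else:
        # Unexpected tag value: do not silently treat it as inconsistent
        return "review"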

decision_df["Action"] = decision_df.apply(suggest_action, axis=1)

# --- Reorder columns for clarity ---
decision_df = decision_df[[
    "Variable", "ConsistencyTag", "OverallFMI", "Flag",
    "Dimension", "Action", 
]]

# --- Save template ---
out_file = DECISION_ROOT / "Decision_Matrix.csv"
decision_df.to_csv(out_file, index=False)
print(f"[OK] Decision matrix template saved to {out_file}")


In [None]:
decision_df.head(10)

#### CRUCIAL NOTES (README)

- The difference between `Work Indicator` and `Work Indicator.1` is still unclear (a `.1` suffix is what pandas typically appends when a CSV contains a duplicated column header). Kindly see the Decision_Matrix sheets for granular details.
- Also check `Province` and `Province Recode` for missing values. It is not yet clear which kind of imputation applies here. Manual imputation is assumed for now, since lists of provinces can be acquired online and can serve as a guide for encoding; once such a list is available as a strict lookup dictionary, the step can be automated. Any imputation applied to these variables at this test stage is improper (provisional) and should not be treated as final.

### Decision Matrix for Imputation - Defense

This matrix is the bridge between FMI diagnostics and factor analysis.  
It ensures that **every variable** is evaluated not only by its missingness (FMI) and consistency, but also by its **conceptual role** in financial vulnerability.

- **Sensitivity**: Variables tied to employment stability, income regularity, and sectoral risk.  
- **Resilience**: Variables reflecting household capacity, education, skills, and adaptability.  
- **Exposure**: Variables representing structural or locational factors (region, province, urban/rural).

#### Why automate?
Manual factor formation was encoded into a reproducible dictionary and keyword rules.  
This ensures consistency across runs, while still allowing customization:
- The `dimension_map` dictionary can be edited to refine assignments.  
- Keyword rules act as a fallback for variables not explicitly mapped.  
- Any variable left as `"Unclassified"` is flagged for manual review.
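
One caveat of substring-based keyword rules is over-matching: a short keyword such as "age" also fires inside "wage". The sketch below shows one possible tightening, matching keywords only at the start of a word. It is an illustration, not the notebook's rule: `KEYWORD_RULES` and `keyword_dimension` are hypothetical names, and "Daily Wage" is a made-up variable used only to show the difference.

```python
import re

# Same keywords as the fallback rules; dict order sets the priority.
KEYWORD_RULES = {
    "Sensitivity": ["occupation", "work", "employment", "job", "hours", "basis", "industry"],
    "Resilience": ["grade", "school", "household", "age", "marital", "ethnicity", "training"],
    "Exposure": ["region", "province", "urban", "survey", "weight", "psu", "replicate"],
}


def keyword_dimension(name: str) -> str:
    """Match keywords only at the start of a word, so 'age' does not fire inside 'wage'."""
    text = str(name).lower()
    for dimension, keywords in KEYWORD_RULES.items():
        if any(re.search(rf"\b{re.escape(k)}", text) for k in keywords):
            return dimension
    return "Unclassified"


for name in ["Class of Worker", "Age Group", "Daily Wage"]:
    print(name, "->", keyword_dimension(name))
```

With plain substring matching, "Daily Wage" would be assigned to Resilience through "age"; with the word-start rule it falls through to `"Unclassified"` and is left for manual review.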

#### Why is this defensible?
- **Theory-guided**: Dimensions are based on the approved thesis framework.  
- **Transparent**: Every variable is listed, no silent exclusions.  
- **Customizable**: Teammates can refine the dictionary or rationale column later.  
- **Audit-ready**: The matrix documents not just FMI and consistency, but also conceptual relevance.

This way, imputation decisions are **informed from the start**, but remain flexible for recalibration.


### Imputation Proper

At this stage, basic imputation is applied to missing values following the criteria described above. The notebook can be adapted as further analysis rules are introduced. For more context, kindly read the CRUCIAL NOTES (README) section of this notebook.

In [None]:
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np
from pathlib import Path
from difflib import get_close_matches

# --- Paths ---
INPUT_ROOT = BASE_PATH / "NEW Renamed Fully Decoded Surveys"
CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
FMI_ROOT = BASE_PATH / "NEW FMI Reports"
METADATA_ROOT = BASE_PATH / "NEW Metadata Sheet 2 CSVs"
OUTPUT_ROOT = BASE_PATH / "Imputed Data for Analysis"
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)

# --- Load consistency + FMI profiles ---
consistency_df = pd.read_csv(CONSISTENCY_ROOT / "consistency_profile.csv")
fmi_df = pd.read_csv(FMI_ROOT / "fmi_profile.csv")

decision_df = fmi_df.merge(
    consistency_df[["Variable", "ConsistencyTag"]],
    on="Variable", how="left"
)

# Deduplicate merge artifacts
if "ConsistencyTag_x" in decision_df.columns and "ConsistencyTag_y" in decision_df.columns:
    decision_df["ConsistencyTag"] = decision_df["ConsistencyTag_x"].combine_first(decision_df["ConsistencyTag_y"])
    decision_df.drop(columns=["ConsistencyTag_x", "ConsistencyTag_y"], inplace=True)

# --- Load metadata value sets ---
metadata_dict = {}
for file in Path(METADATA_ROOT).glob("*.csv"):
    meta_df = pd.read_csv(file)
    if "Variable" in meta_df.columns and "AllowedValues" in meta_df.columns:
        for _, row in meta_df.iterrows():
            var = str(row["Variable"]).strip()
            values = str(row["AllowedValues"]).split(";")
            metadata_dict[var] = [v.strip().lower() for v in values if v.strip()]

# --- Normalize names ---
def normalize_name(name: str) -> str:
    return (
        str(name)
        .strip()
        .lower()
        .replace("\xa0", " ")
        .replace("-", " ")
        .replace("_", " ")
    )

decision_df["Variable_norm"] = decision_df["Variable"].apply(normalize_name)

# --- Flexible finder with fuzzy matching ---
def find_column(df, var):
    cols_norm = {normalize_name(c): c for c in df.columns}
    var_norm = normalize_name(var)

    if var_norm in cols_norm:
        return cols_norm[var_norm]

    matches = get_close_matches(var_norm, list(cols_norm.keys()), n=1, cutoff=0.8)
    if matches:
        return cols_norm[matches[0]]

    return None

# --- Helpers ---
def robust_mode(series: pd.Series):
    m = series.mode(dropna=True)
    return None if m.empty else m.iloc[0]

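def clean_age_column(col: pd.Series) -> pd.Series:
    """Coerce ages to numeric; date-like strings and other non-numeric entries become NaN."""
    s = col.astype(str)
    # Date-like entries (e.g. "2019-03-01") are data-entry errors, not ages
    s = s.where(~s.str.contains(r"\d{4}-\d{2}-\d{2}", regex=True), "UnknownAge")
    numeric_coerced = pd.to_numeric(s, errors="coerce")
    if numeric_coerced.notna().sum() >= (0.5 * len(s)):
        # Keep NaN rather than a -1 sentinel, so missing ages reach the median
        # step and do not distort the median as if they were observed values
        return numeric_coerced
    else:
        s = s.replace({"nan": "UnknownAge"})
        return s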

# --- Metadata-guided flexible imputation ---
def apply_imputation(df: pd.DataFrame, var: str, audit_rows: list):
    col_name = find_column(df, var)
    if col_name is None:
        audit_rows.append({
            "Variable": var,
            "MethodApplied": "not_matched",
            "AllowedValues": None,
            "BeforeMissing": None,
            "AfterMissing": None,
            "Note": "Variable not matched to any column (check naming)."
        })
        return

    # Normalize blanks to NaN
    df[col_name] = df[col_name].replace(r'^\s*$', np.nan, regex=True)

    # Clean the age column first so the missing count reflects the cleaned values
    if normalize_name(var) == normalize_name("C05-Age as of Last Birthday"):
        df[col_name] = clean_age_column(df[col_name])

    before_missing = int(df[col_name].isna().sum())
    dtype_numeric = pd.api.types.is_numeric_dtype(df[col_name])

    allowed = metadata_dict.get(var, None)
    method, note = "none", "No imputation required."
    after_missing = before_missing

    # Filled columns are assigned back explicitly: fillna(inplace=True) on a
    # column selection may act on a copy and leave df unchanged.
    if dtype_numeric:
        if before_missing > 0:
            med = df[col_name].median()
            df[col_name] = df[col_name].fillna(med)
            method = "median"
            note = f"Numeric imputation with median={med:.4f}."
            after_missing = int(df[col_name].isna().sum())
    else:
        if before_missing > 0:
            mode_val = robust_mode(df[col_name])
            if allowed:
                # Accept the mode only if it is an allowed metadata value
                if mode_val is not None and str(mode_val).lower() in allowed:
                    df[col_name] = df[col_name].fillna(mode_val)
                    method = "metadata_mode"
                    note = f"Categorical imputation with mode='{mode_val}' (validated against metadata)."
                else:
                    df[col_name] = df[col_name].fillna("Unknown")
                    method = "metadata_unknown"
                    note = "No valid mode within metadata; filled with 'Unknown'."
            else:
                # Fallback if no metadata is available for this variable
                if mode_val is not None:
                    df[col_name] = df[col_name].fillna(mode_val)
                    method = "categorical_mode"
                    note = f"Categorical imputation with mode='{mode_val}'."
                else:
                    df[col_name] = df[col_name].fillna("Unknown")
                    method = "unknown_fallback"
                    note = "No valid mode; filled with 'Unknown'."
            after_missing = int(df[col_name].isna().sum())

    audit_rows.append({
        "Variable": var,
        "MethodApplied": method,
        "AllowedValues": allowed,
        "BeforeMissing": before_missing,
        "AfterMissing": after_missing,
        "Note": note
    })

# --- Year-by-year execution ---
consistent_vars = consistency_df[consistency_df["ConsistencyTag"] == "consistent"]["Variable"].tolist()

for year_folder in INPUT_ROOT.iterdir():
    if not year_folder.is_dir():
        continue

    year_out_dir = OUTPUT_ROOT / year_folder.name
    year_out_dir.mkdir(parents=True, exist_ok=True)

    for file in year_folder.glob("*.csv"):
        print(f"Processing {file.name} from {year_folder.name}")
        df = pd.read_csv(file)

        # Normalize df columns
        df.columns = [normalize_name(c) for c in df.columns]

        # Audit log
        audit_rows = []
        for var in consistent_vars:
            apply_imputation(df, var, audit_rows)

        # Save imputed dataset
        out_file = year_out_dir / f"imputed_{file.stem}.csv"
        df.to_csv(out_file, index=False)

        # Save audit log
        audit_df = pd.DataFrame(audit_rows)
        audit_file = year_out_dir / f"imputation_log_{file.stem}.csv"
        audit_df.to_csv(audit_file, index=False)

        print(f"[OK] Saved {out_file} | Audit log: {audit_file}")

### Preprocessing and Imputation Pipeline

**Column normalization**

- All column names are standardized: lowercase, stripped of leading/trailing spaces, and harmonized by replacing dashes/underscores with spaces.

- Fuzzy matching ensures Decision Matrix variables align with survey file headers, reducing mismatches across survey waves.

**Missing value normalization**

- Blanks and whitespace‑only entries are converted to NaN inline before imputation.
- This guarantees that missingness is consistently recognized and that audit logs accurately reflect true counts.
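
As a small illustration of why this step matters, the snippet below applies the same regex replacement to a toy Series (the values are made up). Without it, empty strings would be counted as observed values.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["employed", "", "   ", None, "unemployed"])

# Empty and whitespace-only strings become NaN; real values are untouched.
clean = raw.replace(r"^\s*$", np.nan, regex=True)

print(int(raw.isna().sum()), int(clean.isna().sum()))  # 1 3
```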

**Metadata‑guided imputation logic**

- Consistent variables are always considered for imputation, even if flagged as consider_drop_or_advanced.
- Allowed value sets are retrieved dynamically from NEW Metadata Sheet 2 CSVs to validate imputation choices.

**Rules applied:**

- Numeric variables: imputed with median; clipped to metadata‑defined ranges if available.

- Binary categorical (≤3 allowed values): imputed with majority class (mode) validated against metadata.

- General categorical: imputed with mode restricted to metadata values; fallback to "Unknown" if no valid mode exists.

- Identifiers/time variables (e.g., PSU number, Survey Year): left unchanged to preserve structural integrity.

- This design ensures imputations respect official metadata and avoid arbitrary category inflation.
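
The numeric and general categorical rules can be sketched as two small helpers. This is one possible reading of the rules, not the notebook's implementation: `impute_numeric` and `impute_categorical` are hypothetical names, the clipping is applied to the fill value only (observed data is left as is), the mode is taken among allowed values only, and `allowed` is assumed to be a list of lowercase strings. Identifier and time variables would simply be skipped before calling either helper.

```python
import pandas as pd


def impute_numeric(series: pd.Series, valid_range=None) -> pd.Series:
    """Fill missing values with the median, clipping the fill value to valid_range if given."""
    fill = series.median()
    if valid_range is not None:
        low, high = valid_range
        fill = min(max(fill, low), high)
    return series.fillna(fill)


def impute_categorical(series: pd.Series, allowed: list) -> pd.Series:
    """Fill missing values with the most frequent allowed value, or 'Unknown' if none is observed."""
    observed_allowed = series[series.astype(str).str.lower().isin(allowed)]
    mode = observed_allowed.mode(dropna=True)
    fill = mode.iloc[0] if not mode.empty else "Unknown"
    return series.fillna(fill)


hours = pd.Series([10.0, 20.0, None, 90.0])
print(impute_numeric(hours, valid_range=(0, 15)).tolist())  # [10.0, 20.0, 15.0, 90.0]

status = pd.Series(["yes", "no", "yes", None, "maybe"])
print(impute_categorical(status, ["yes", "no"]).tolist())   # ['yes', 'no', 'yes', 'yes', 'maybe']
```

Taking the mode among allowed values only means an out-of-metadata category can never become the fill value, which is the "avoid arbitrary category inflation" goal stated above.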

**Audit logging**

- Each variable logs: Variable, MethodApplied, AllowedValues, BeforeMissing, AfterMissing, and an explanatory Note.

-  Overrides are explicitly marked when imputation is applied to variables flagged as consider_drop_or_advanced.

- Logs provide transparency across survey years and support reproducibility for thesis defense and team review.

### Evaluation of Imputation (By Completeness)

In [None]:
import pandas as pd
from pathlib import Path

OUTPUT_ROOT = BASE_PATH / "Imputed Data for Analysis"

summary_rows = []

for year_folder in OUTPUT_ROOT.iterdir():
    if not year_folder.is_dir():
        continue

    for file in year_folder.glob("imputed_*.csv"):
        df = pd.read_csv(file, low_memory=False)
        null_counts = df.isnull().sum()
        total_missing = int(null_counts.sum())

        summary_rows.append({
            "Year": year_folder.name,
            "File": file.name,
            "TotalMissing": total_missing,
            "Completeness": "PASS" if total_missing == 0 else "FAIL",
            **null_counts.to_dict()  # expand variable-level missing counts
        })

# Build DataFrame
summary_df = pd.DataFrame(summary_rows)

# Preview file-level completeness
print(summary_df[["Year","File","TotalMissing","Completeness"]])

# Optional: Year-level summary
year_summary = summary_df.groupby("Year")["Completeness"].value_counts().unstack(fill_value=0)
print("\nYear-level completeness summary:")
print(year_summary)


### Evaluation of Imputation (By Metadata Accuracy)

In [None]:
import pandas as pd

metadata_results = []

for year_folder in OUTPUT_ROOT.iterdir():
    if not year_folder.is_dir():
        continue

    for file in year_folder.glob("imputed_*.csv"):
        df = pd.read_csv(file, low_memory=False)

        for var, allowed in metadata_dict.items():
            col_name = find_column(df, var)

            # Case 1: Variable not found in file
            if not col_name:
                metadata_results.append({
                    "Year": year_folder.name,
                    "File": file.name,
                    "Variable": var,
                    "AllowedValues": allowed,
                    "InvalidValues": [],
                    "Status": "NOT_FOUND"
                })
                continue

            # Case 2: No metadata available
            if not allowed:
                metadata_results.append({
                    "Year": year_folder.name,
                    "File": file.name,
                    "Variable": var,
                    "AllowedValues": None,
                    "InvalidValues": [],
                    "Status": "NO_METADATA"
                })
                continue

            # Case 3: Validate against metadata
            # Strip and lowercase so values compare like the cleaned metadata entries
            unique_vals = set(df[col_name].dropna().astype(str).str.strip().str.lower())
            valid_set = set([v.lower() for v in allowed] + ["unknown"])
            invalid_vals = unique_vals - valid_set

            metadata_results.append({
                "Year": year_folder.name,
                "File": file.name,
                "Variable": var,
                "AllowedValues": allowed,
                "InvalidValues": list(invalid_vals),
                "Status": "OK" if not invalid_vals else "INVALID"
            })

# Build DataFrame
metadata_df = pd.DataFrame(metadata_results)

# Preview first 20 rows
print(metadata_df.head(20))

# Summary counts
print("\nSummary by Status:")
print(metadata_df["Status"].value_counts())

# Optional: Year-level summary
year_summary = metadata_df.groupby("Year")["Status"].value_counts().unstack(fill_value=0)
print("\nYear-level metadata accuracy summary:")
print(year_summary)
