### Imputation Rationale

**Do not impute inconsistent/partial variables by default.** Only consider imputation if the variable is conceptually indispensable and FMI suggests the information can be credibly recovered (e.g., plausible MAR with auxiliary predictors).

It’s not reasonable to impute inconsistent/partial variables without first considering FMI and context. Imputation is not a neutral operation; it encodes assumptions about the missingness mechanism, temporal comparability, and the meaning of the variable. If a variable is inconsistent across months/years, imputing it can fabricate continuity that wasn’t in the data, undermining factor analysis and comparability across regions and time.

**Tier 1 — Consistent variables:**

- Action: Eligible for imputation.
- Rule: Use FMI to determine imputation intensity (light/cautious/advanced).
- Justification: Stable measurement; imputation supports matrix completion for EFA.

**Tier 2 — Partial variables (intermittent presence or minor coding drift):**

- Action: Conditional imputation.
- Rule: Impute only if all three conditions hold: FMI is moderate/high, MAR is plausible given auxiliary predictors, and coding has been harmonized. Otherwise, flag for sensitivity analysis.
- Justification: Limited comparability; treat as supporting evidence, not core FA inputs.

**Tier 3 — Inconsistent variables (structural changes, major coding breaks):**

- Action: Do not impute for FA.
- Rule: Document and retain for diagnostics; consider future harmonization projects or use in qualitative context.
- Justification: Imputation would manufacture comparability and can distort factor structure.

**Override - Conceptual indispensability:**

- Action: If a variable is central to sensitivity/resilience/exposure and lacks a close proxy, allow imputation even if partial, but only with:
  - an explicit MAR argument using auxiliary variables,
  - complete coding evidence, and
  - sensitivity analyses comparing results with the variable included vs. excluded.
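
The tier rules and the override can be read as one decision function. The sketch below is one possible encoding, not the notebook's implementation. The boolean inputs (`indispensable`, `mar_plausible`, `coding_harmonized`) and the label `impute_with_sensitivity` are hypothetical names introduced for illustration; the FMI cut-offs (0.05, 0.20, 0.40) are the ones used by `suggest_action` in the Decision Matrix Builder cell.

```python
def tier_action(tag, fmi, indispensable=False,
                mar_plausible=False, coding_harmonized=False):
    """Map a consistency tag and an FMI value to a suggested action.

    The three boolean arguments are hypothetical analyst judgments;
    nothing in this notebook computes them automatically.
    """
    if tag == "consistent":
        # Tier 1: eligible for imputation; FMI sets the intensity.
        if fmi < 0.05:
            return "keep"
        if fmi < 0.20:
            return "impute_light"
        if fmi < 0.40:
            return "impute_cautious"
        return "consider_drop_or_advanced"
    if tag == "partial":
        # Tier 2 plus override: impute only when the variable is
        # indispensable AND the MAR argument and coding evidence exist.
        if indispensable and mar_plausible and coding_harmonized:
            return "impute_with_sensitivity"
        return "sensitivity_only"
    if tag == "inconsistent":
        # Tier 3: never imputed for FA, whatever the other inputs say.
        return "exclude_from_FA"
    raise ValueError(f"unknown consistency tag: {tag!r}")


print(tier_action("partial", 0.25))
print(tier_action("partial", 0.25, indispensable=True,
                  mar_plausible=True, coding_harmonized=True))
```

Keeping the override as explicit arguments means a partial variable can only reach an imputation action when someone has recorded all three judgments, which matches the intent that imputation is never the default.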

**Why is imputing inconsistent variables without FMI review not defensible?**

**Measurement instability:**

Inconsistent variables often arise because the survey question changed, coding shifted, or the variable wasn’t asked in some rounds. Imputing them blindly assumes the missingness is random noise, when in fact it reflects structural differences. That creates false comparability across years.

**Factor analysis assumptions:**

FA assumes each variable measures the same construct across all observations. If a variable is inconsistent, imputing values fabricates continuity that wasn’t there. This risks producing spurious factors that look “interpretable” but are actually artifacts of imputation.
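
A toy simulation can make this concrete. The sketch below uses synthetic data and plain mean-filling as the simplest possible stand-in for imputation; it is not the pipeline's imputation method. It pretends a variable was simply not asked in half of the rounds and shows that filling the gap changes the correlation that factor analysis would be built on.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
# y is constructed to correlate strongly with x (synthetic, not survey data).
y = [0.8 * xi + 0.6 * rng.gauss(0, 1) for xi in x]

# Structural gap: y is treated as "not asked" in the second half of the rows.
observed = y[: n // 2]
fill = statistics.fmean(observed)
y_filled = observed + [fill] * (n - n // 2)

r_true = pearson(x, y)
r_filled = pearson(x, y_filled)
print(f"correlation with complete data:  {r_true:.2f}")
print(f"correlation after mean-filling:  {r_filled:.2f}")
```

The filled column looks complete, but its relationship to `x` is weaker than in the complete data, so any factor extracted from it partly reflects the fill rule rather than the respondents.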

**Auditability and thesis defense:**

The approved pipeline methodology emphasizes transparency and conceptual justification. If the team imputes inconsistent variables without an FMI review, reviewers can easily challenge the decision: “Why did you treat structurally missing data as if it were random?”

### Documentation and Audit Trail

Action matrix: For each variable, store:

- Tag: consistent/partial/inconsistent.
- FMI bucket: Low/Moderate/High/Critical.
- Dimension role: sensitivity/resilience/exposure.
- Decision: keep, impute (light/cautious/advanced), sensitivity-only, exclude from FA.
- Rationale: conceptual indispensability, MAR plausibility, harmonization status, auxiliary predictors.
- Sensitivity analysis flags: Flag variables where inclusion materially changes factor loadings or KMO/Bartlett results, so the team can revisit.
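
The fields listed above could be captured in a small record type. This is a minimal sketch under stated assumptions: the class name, field names, the `loadings_shift` helper and its 0.10 threshold are all hypothetical, and the loadings passed in at the end are made-up numbers for two placeholder variables `A` and `B`.

```python
from dataclasses import dataclass, asdict


@dataclass
class VariableDecision:
    """One row of the action matrix (field names are illustrative)."""
    variable: str
    tag: str              # consistent / partial / inconsistent
    fmi_bucket: str       # Low / Moderate / High / Critical
    dimension: str        # Sensitivity / Resilience / Exposure
    decision: str         # keep, impute_*, sensitivity_only, exclude_from_FA
    rationale: str = ""
    sensitivity_flag: bool = False


def loadings_shift(with_var, without_var, threshold=0.10):
    """Return True if any shared loading moves by more than `threshold`.

    Both arguments map variable name -> loading on one factor, taken from
    two FA runs (candidate variable included vs. excluded). The default
    threshold is a placeholder, not a value from the thesis.
    """
    shared = with_var.keys() & without_var.keys()
    return any(abs(with_var[k] - without_var[k]) > threshold for k in shared)


row = VariableDecision(
    variable="C06-Marital Status", tag="consistent", fmi_bucket="Moderate",
    dimension="Resilience", decision="impute_light",
    rationale="example rationale text",
)
# Made-up loadings: "A" moves by 0.17, so the row gets flagged.
row.sensitivity_flag = loadings_shift({"A": 0.62, "B": 0.40},
                                      {"A": 0.45, "B": 0.41})
print(asdict(row))
```

A list of such records converts directly to a DataFrame with `pd.DataFrame([asdict(r) for r in rows])` if the team wants the rationale and flag stored alongside the existing matrix columns.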

In [3]:
# 09_Imputation Notebook — Decision Matrix Builder
# ------------------------------------------------

import json
from pathlib import Path
import os
import pandas as pd
import numpy as np
from datetime import datetime

# --- Load config ---
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# --- Load inventory (optional, for parity) ---
with open(Path(INTERIM_DIR) / "inventory.json") as f:
    inventory = json.load(f)

# --- Paths ---
RENAMED_ROOT = BASE_PATH / "NEW Renamed Fully Decoded Surveys"
CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
FMI_ROOT = BASE_PATH / "NEW FMI Reports"
DECISION_ROOT = BASE_PATH / "Decision Matrix for Imputation"
os.makedirs(DECISION_ROOT, exist_ok=True)

# --- Load inputs ---
consistency_df = pd.read_csv(CONSISTENCY_ROOT / "consistency_profile.csv")
fmi_df = pd.read_csv(FMI_ROOT / "fmi_profile.csv")

# --- Merge consistency + FMI ---
decision_df = fmi_df.merge(
    consistency_df[["Variable", "ConsistencyTag"]],
    on="Variable",
    how="left"
)

# --- Handle duplicate ConsistencyTag columns if present ---
# If fmi_profile.csv already has a ConsistencyTag column, the merge suffixes the
# two copies (_x from fmi_df, _y from consistency_df). Coalesce them, preferring _x.
if "ConsistencyTag_x" in decision_df.columns and "ConsistencyTag_y" in decision_df.columns:
    decision_df["ConsistencyTag"] = decision_df["ConsistencyTag_x"].combine_first(decision_df["ConsistencyTag_y"])
    decision_df = decision_df.drop(columns=["ConsistencyTag_x", "ConsistencyTag_y"])

# --- Manual factor formation dictionary (customizable) ---
dimension_map = {
    # Sensitivity
    "Available for Work": "Sensitivity",
    "C13-Major Occupation Group": "Sensitivity",
    "C14-Primary Occupation": "Sensitivity",
    "C15-Major Industry Group": "Sensitivity",
    "C16-Kind of Business (Primary Occupation)": "Sensitivity",
    "C24-Basis of Payment (Primary Occupation)": "Sensitivity",
    "C25-Basic Pay per Day (Primary Occupation)": "Sensitivity",
    "Class of Worker (Primary Occupation)": "Sensitivity",
    "Nature of Employment (Primary Occupation)": "Sensitivity",
    "Total Hours Worked for all Jobs": "Sensitivity",
    "Work Arrangement": "Sensitivity",
    "Work Indicator": "Sensitivity",
    # Resilience
    "C03-Relationship to Household Head": "Resilience",
    "C04-Sex": "Resilience",
    "C05-Age as of Last Birthday": "Resilience",
    "C06-Marital Status": "Resilience",
    "C07-Highest Grade Completed": "Resilience",
    "C08-Currently Attending School": "Resilience",
    "C09-Graduate of technical/vocational course": "Resilience",
    "C09a - Currently Attending Non-formal Training for Skills Development": "Resilience",
    "Household Size": "Resilience",
    # Exposure
    "Province": "Exposure",
    "Province Recode": "Exposure",
    "Region": "Exposure",
    "Urban-RuralFIES": "Exposure",
    "Location of Work (Province, Municipality)": "Exposure",
    "Survey Month": "Exposure",
    "Survey Year": "Exposure",
}

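# --- Dimension assignment function ---
def assign_dimension(var):
    """Return the vulnerability dimension for a variable name.

    Explicit entries in dimension_map win; keyword rules are only a fallback.
    """
    if var in dimension_map:
        return dimension_map[var]
    # Fallback: plain substring matching on the lowercased name. This also
    # matches inside longer words (e.g. "age" in "wage"), so review every
    # keyword-assigned variable and pin it in dimension_map if it is wrong.
    v = var.lower()
    if any(k in v for k in ["occupation", "work", "employment", "job", "hours", "basis", "industry"]):
        return "Sensitivity"
    if any(k in v for k in ["grade", "school", "household", "age", "marital", "ethnicity", "training"]):
        return "Resilience"
    if any(k in v for k in ["region", "province", "urban", "survey", "weight", "psu", "replicate"]):
        return "Exposure"
    return "Unclassified"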

decision_df["Dimension"] = decision_df["Variable"].apply(assign_dimension)

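# --- SuggestedAction logic ---
def suggest_action(row):
    """Suggest an action from the FMI value and the consistency tag."""
    fmi = row["OverallFMI"]
    tag = row["ConsistencyTag"]

    # A missing FMI, or a missing/unrecognized tag (e.g. no match in the
    # consistency profile after the left merge), needs a manual look.
    if pd.isna(fmi) or tag not in ("consistent", "partial", "inconsistent"):
        return "review"
    if tag == "consistent":
        if fmi < 0.05:
            return "keep"
        if fmi < 0.20:
            return "impute_light"
        if fmi < 0.40:
            return "impute_cautious"
        return "consider_drop_or_advanced"
    if tag == "partial":
        # Partial variables are never imputed automatically (Tier 2).
        if fmi < 0.20:
            return "sensitivity_only"
        return "exclude_from_FA"
    # inconsistent: never imputed for FA (Tier 3)
    return "exclude_from_FA"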

decision_df["Action"] = decision_df.apply(suggest_action, axis=1)

# --- Reorder columns for clarity ---
decision_df = decision_df[[
    "Variable", "ConsistencyTag", "OverallFMI", "Flag",
    "Dimension", "Action", 
]]

# --- Save template ---
out_file = DECISION_ROOT / "Decision_Matrix.csv"
decision_df.to_csv(out_file, index=False)
print(f"[OK] Decision matrix template saved to {out_file}")


[OK] Decision matrix template saved to G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Decision Matrix for Imputation\Decision_Matrix.csv


In [5]:
decision_df.head(10)

| | Variable | ConsistencyTag | OverallFMI | Flag | Dimension | Action |
|---|---|---|---|---|---|---|
| 0 | Available for Work | consistent | 0.96537 | Critical | Sensitivity | consider_drop_or_advanced |
| 1 | C03-Relationship to Household Head | consistent | 0.0 | Low | Resilience | keep |
| 2 | C04-Sex | consistent | 0.0 | Low | Resilience | keep |
| 3 | C05-Age as of Last Birthday | consistent | 0.016952 | Low | Resilience | keep |
| 4 | C05B - Ethnicity | inconsistent | 0.0 | Low | Resilience | exclude_from_FA |
| 5 | C06-Marital Status | consistent | 0.073508 | Moderate | Resilience | impute_light |
| 6 | C07-Highest Grade Completed | consistent | 0.074142 | Moderate | Resilience | impute_light |
| 7 | C08-Currently Attending School | inconsistent | 0.5543 | Critical | Resilience | exclude_from_FA |
| 8 | C09-Graduate of technical/vocational course | inconsistent | 0.282487 | High | Resilience | exclude_from_FA |
| 9 | C09a - Currently Attending Non-formal Training... | inconsistent | 0.279905 | High | Resilience | exclude_from_FA |


### Decision Matrix for Imputation - Defense

This matrix is the bridge between FMI diagnostics and factor analysis.  
It ensures that **every variable** is evaluated not only by its missingness (FMI) and consistency, but also by its **conceptual role** in financial vulnerability.

- **Sensitivity**: Variables tied to employment stability, income regularity, and sectoral risk.  
- **Resilience**: Variables reflecting household capacity, education, skills, and adaptability.  
- **Exposure**: Variables representing structural or locational factors (region, province, urban/rural).

#### Why automate?
Manual factor formation was encoded into a reproducible dictionary and keyword rules.  
This ensures consistency across runs, while still allowing customization:
- The `dimension_map` dictionary can be edited to refine assignments.  
- Keyword rules act as a fallback for variables not explicitly mapped.  
- Any variable left as `"Unclassified"` is flagged for manual review.
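
To make the fallback auditable, it can help to record how each assignment was reached. The sketch below is an illustrative variant of the notebook's assignment logic, not a replacement for it: `classify_with_source` and `keyword_rules` are hypothetical names, the mapping is a trimmed-down copy, and `Daily Wage` and `Interview Status` are made-up variable names used only to show a keyword collision and an unclassified case.

```python
def classify_with_source(var, dimension_map, keyword_rules):
    """Return (dimension, source); source is 'dictionary', 'keyword' or 'none'."""
    if var in dimension_map:
        return dimension_map[var], "dictionary"
    name = var.lower()
    # Rules are checked in insertion order, mirroring an if/elif chain.
    for dimension, keywords in keyword_rules.items():
        if any(k in name for k in keywords):
            return dimension, "keyword"
    return "Unclassified", "none"


# Trimmed-down copy of the explicit mapping, plus the keyword lists.
dimension_map = {"C04-Sex": "Resilience", "Region": "Exposure"}
keyword_rules = {
    "Sensitivity": ["occupation", "work", "employment", "job", "hours", "basis", "industry"],
    "Resilience": ["grade", "school", "household", "age", "marital", "ethnicity", "training"],
    "Exposure": ["region", "province", "urban", "survey", "weight", "psu", "replicate"],
}

for var in ["C04-Sex", "C05B - Ethnicity", "Daily Wage", "Interview Status"]:
    print(var, "->", classify_with_source(var, dimension_map, keyword_rules))
```

Here `Daily Wage` lands in Resilience only because `"age"` is a substring of `"wage"`, which is the kind of assignment a reviewer would want to see labeled as keyword-derived before accepting it.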

#### Why is this defensible?
- **Theory-guided**: Dimensions are based on the approved thesis framework.  
- **Transparent**: Every variable is listed, no silent exclusions.  
- **Customizable**: Teammates can refine the dictionary or rationale column later.  
- **Audit-ready**: The matrix documents not just FMI and consistency, but also conceptual relevance.
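
The "no silent exclusions" claim can be checked mechanically at merge time. The following is a sketch on toy stand-ins for `fmi_profile.csv` and `consistency_profile.csv`; `Hypothetical Var` and its FMI value are invented for the example. It relies on the `validate` and `indicator` options of pandas `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the two input profiles (not the real files).
fmi_df = pd.DataFrame({
    "Variable": ["C04-Sex", "C06-Marital Status", "Hypothetical Var"],
    "OverallFMI": [0.0, 0.073508, 0.30],
})
consistency_df = pd.DataFrame({
    "Variable": ["C04-Sex", "C06-Marital Status"],
    "ConsistencyTag": ["consistent", "consistent"],
})

merged = fmi_df.merge(
    consistency_df,
    on="Variable",
    how="left",
    validate="one_to_one",  # raises MergeError if a variable is duplicated
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Every FMI row must survive the merge.
assert len(merged) == len(fmi_df)

# Variables with no consistency tag are surfaced rather than dropped.
untagged = merged.loc[merged["_merge"] == "left_only", "Variable"].tolist()
print("Variables missing a consistency tag:", untagged)
```

Untagged variables then show up as an explicit list for manual review instead of quietly receiving a default action.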

This way, imputation decisions are **informed from the start**, but remain flexible for recalibration.
