### Imputation Rationale

**Do not impute inconsistent/partial variables by default.** Only consider imputation if the variable is conceptually indispensable and FMI suggests the information can be credibly recovered (e.g., plausible MAR with auxiliary predictors).

It’s not reasonable to impute inconsistent/partial variables without first considering FMI and context. Imputation is not a neutral operation; it encodes assumptions about the missingness mechanism, temporal comparability, and the meaning of the variable. If a variable is inconsistent across months/years, imputing it can fabricate continuity that wasn’t in the data, undermining factor analysis and comparability across regions and time.

**Tier 1 — Consistent variables:**

- Action: Eligible for imputation.
- Rule: Use FMI to determine imputation intensity (light/cautious/advanced).
- Justification: Stable measurement; imputation supports matrix completion for EFA.

**Tier 2 — Partial variables (intermittent presence or minor coding drift):**

- Action: Conditional imputation.
- Rule: Impute only if coding is harmonized and MAR is plausible given auxiliary predictors, even when FMI is moderate/high; otherwise flag for sensitivity analysis.
- Justification: Limited comparability; treat as supporting evidence, not core FA inputs.

**Tier 3 — Inconsistent variables (structural changes, major coding breaks):**

- Action: Do not impute for FA.
- Rule: Document and retain for diagnostics; consider future harmonization projects or use in qualitative context.
- Justification: Imputation would manufacture comparability and can distort factor structure.

**Override - Conceptual indispensability:**

- Action: If a variable is central to sensitivity/resilience/exposure and lacks a close proxy, allow imputation even if partial, but only with:
  - an explicit MAR argument using auxiliary variables,
  - complete coding evidence, and
  - sensitivity analyses comparing included vs. excluded.
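
The tier rules and the override can be read as a single decision function. The sketch below is one possible encoding, not the notebook's implementation: the FMI cut-offs (0.05, 0.20, 0.40) are taken from the `suggest_action` function used in this notebook, while the function name `decide_action`, the three boolean evidence arguments, and the `impute_with_sensitivity` label are assumptions made for illustration.

```python
import math

def decide_action(tag, fmi, indispensable=False, mar_argued=False,
                  coding_evidence=False):
    """Return a suggested action for one variable (illustrative only)."""
    if fmi is None or math.isnan(fmi):
        return "review"  # FMI unknown: a person has to look at it
    # The override needs all three pieces of evidence at once
    override_ok = indispensable and mar_argued and coding_evidence
    if tag == "consistent":
        if fmi < 0.05:
            return "keep"
        if fmi < 0.20:
            return "impute_light"
        if fmi < 0.40:
            return "impute_cautious"
        return "consider_drop_or_advanced"
    if tag == "partial":
        if override_ok:
            # Hypothetical label: impute, but always paired with a sensitivity analysis
            return "impute_with_sensitivity"
        return "sensitivity_only" if fmi < 0.20 else "exclude_from_FA"
    # Inconsistent (or unrecognized) tags are never imputed for FA
    return "exclude_from_FA"
```

In this reading the override applies only to partial variables; an inconsistent variable stays excluded no matter how central it is.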

**Why is imputing inconsistent variables without FMI review not defensible?**

**Measurement instability:**

Inconsistent variables often arise because the survey question changed, coding shifted, or the variable wasn’t asked in some rounds. Imputing them blindly assumes the missingness is random noise, when in fact it reflects structural differences. That creates false comparability across years.

**Factor analysis assumptions:**

FA assumes each variable measures the same construct across all observations. If a variable is inconsistent, imputing values fabricates continuity that wasn’t there. This risks producing spurious factors that look “interpretable” but are actually artifacts of imputation.

**Auditability and thesis defense:**

The approved pipeline methodology emphasizes transparency and conceptual justification. If the team imputes inconsistent variables without FMI, reviewers can easily challenge: “Why did you treat structurally missing data as if it were random?”

### Documentation and audit trail

Action matrix: For each variable, store:

- Tag: consistent/partial/inconsistent.
- FMI bucket: Low/Moderate/High/Critical.
- Dimension role: sensitivity/resilience/exposure.
- Decision: keep, impute (light/cautious/advanced), sensitivity-only, exclude from FA.
- Rationale: conceptual indispensability, MAR plausibility, harmonization status, auxiliary predictors.
- Sensitivity analysis flags: Flag variables where inclusion materially changes factor loadings or KMO/Bartlett results, so the team can revisit.
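
One way to populate the sensitivity-analysis flag is to recompute the overall KMO statistic with and without a candidate variable and flag large shifts. The sketch below computes KMO directly from the correlation matrix and its inverse using NumPy. The function names `kmo_overall` and `flag_kmo_shift`, the 0.05 threshold for a "material" change, and the synthetic data are assumptions for illustration; comparing factor loadings would need a separate check.

```python
import numpy as np

def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure for a complete (n_obs, n_vars) array."""
    corr = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations come from the rescaled inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off = ~np.eye(corr.shape[0], dtype=bool)  # mask selecting off-diagonal cells
    r2 = np.sum(corr[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

def flag_kmo_shift(X, col, threshold=0.05):
    """Flag column `col` if dropping it moves overall KMO by more than `threshold`."""
    with_var = kmo_overall(X)
    without_var = kmo_overall(np.delete(X, col, axis=1))
    return abs(with_var - without_var) > threshold, with_var, without_var

# Synthetic example (assumed data): four indicators driven by one latent factor
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
X = factor + 0.5 * rng.normal(size=(500, 4))
flagged, kmo_with, kmo_without = flag_kmo_shift(X, col=3)
print(f"KMO with: {kmo_with:.3f}, without: {kmo_without:.3f}, flagged: {flagged}")
```

The input must be complete (no missing values), so in practice this would run on the imputed matrix, once with the variable included and once without.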

In [1]:
# 09_Imputation Notebook — Decision Matrix Builder
# ------------------------------------------------

import json
from pathlib import Path
import os
import pandas as pd
import numpy as np
from datetime import datetime

# --- Load config ---
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# --- Load inventory (optional, for parity) ---
with open(Path(INTERIM_DIR) / "inventory.json") as f:
    inventory = json.load(f)

# --- Paths ---
RENAMED_ROOT = BASE_PATH / "NEW Renamed Fully Decoded Surveys"
CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
FMI_ROOT = BASE_PATH / "NEW FMI Reports"
DECISION_ROOT = BASE_PATH / "Decision Matrix for Imputation"
os.makedirs(DECISION_ROOT, exist_ok=True)

# --- Load inputs ---
consistency_df = pd.read_csv(CONSISTENCY_ROOT / "consistency_profile.csv")
fmi_df = pd.read_csv(FMI_ROOT / "fmi_profile.csv")

# --- Merge consistency + FMI ---
decision_df = fmi_df.merge(
    consistency_df[["Variable", "ConsistencyTag"]],
    on="Variable",
    how="left"
)

# --- Handle duplicate ConsistencyTag columns if present ---
if "ConsistencyTag_x" in decision_df.columns and "ConsistencyTag_y" in decision_df.columns:
    decision_df["ConsistencyTag"] = decision_df["ConsistencyTag_x"].combine_first(decision_df["ConsistencyTag_y"])
    decision_df.drop(columns=["ConsistencyTag_x", "ConsistencyTag_y"], inplace=True)

# --- Manual factor formation dictionary (customizable) ---
dimension_map = {
    # Sensitivity
    "Available for Work": "Sensitivity",
    "C13-Major Occupation Group": "Sensitivity",
    "C14-Primary Occupation": "Sensitivity",
    "C15-Major Industry Group": "Sensitivity",
    "C16-Kind of Business (Primary Occupation)": "Sensitivity",
    "C24-Basis of Payment (Primary Occupation)": "Sensitivity",
    "C25-Basic Pay per Day (Primary Occupation)": "Sensitivity",
    "Class of Worker (Primary Occupation)": "Sensitivity",
    "Nature of Employment (Primary Occupation)": "Sensitivity",
    "Total Hours Worked for all Jobs": "Sensitivity",
    "Work Arrangement": "Sensitivity",
    "Work Indicator": "Sensitivity",
    # Resilience
    "C03-Relationship to Household Head": "Resilience",
    "C04-Sex": "Resilience",
    "C05-Age as of Last Birthday": "Resilience",
    "C06-Marital Status": "Resilience",
    "C07-Highest Grade Completed": "Resilience",
    "C08-Currently Attending School": "Resilience",
    "C09-Graduate of technical/vocational course": "Resilience",
    "C09a - Currently Attending Non-formal Training for Skills Development": "Resilience",
    "Household Size": "Resilience",
    # Exposure
    "Province": "Exposure",
    "Province Recode": "Exposure",
    "Region": "Exposure",
    "Urban-RuralFIES": "Exposure",
    "Location of Work (Province, Municipality)": "Exposure",
    "Survey Month": "Exposure",
    "Survey Year": "Exposure",
}

# --- Dimension assignment function ---
def assign_dimension(var):
    if var in dimension_map:
        return dimension_map[var]
    v = var.lower()
    if any(k in v for k in ["occupation", "work", "employment", "job", "hours", "basis", "industry"]):
        return "Sensitivity"
    elif any(k in v for k in ["grade", "school", "household", "age", "marital", "ethnicity", "training"]):
        return "Resilience"
    elif any(k in v for k in ["region", "province", "urban", "survey", "weight", "psu", "replicate"]):
        return "Exposure"
    else:
        return "Unclassified"

decision_df["Dimension"] = decision_df["Variable"].apply(assign_dimension)

# --- SuggestedAction logic ---
def suggest_action(row):
    fmi = row["OverallFMI"]
    tag = row["ConsistencyTag"]

    if pd.isna(fmi):
        return "review"
    if tag == "consistent":
        if fmi < 0.05: return "keep"
        elif fmi < 0.20: return "impute_light"
        elif fmi < 0.40: return "impute_cautious"
        else: return "consider_drop_or_advanced"
    elif tag == "partial":
        if fmi < 0.20: return "sensitivity_only"
        else: return "exclude_from_FA"
    else:  # inconsistent
        return "exclude_from_FA"

decision_df["Action"] = decision_df.apply(suggest_action, axis=1)

# --- Reorder columns for clarity ---
decision_df = decision_df[[
    "Variable", "ConsistencyTag", "OverallFMI", "Flag",
    "Dimension", "Action", 
]]

# --- Save template ---
out_file = DECISION_ROOT / "Decision_Matrix.csv"
decision_df.to_csv(out_file, index=False)
print(f"[OK] Decision matrix template saved to {out_file}")


[OK] Decision matrix template saved to G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Decision Matrix for Imputation\Decision_Matrix.csv


In [2]:
decision_df.head(10)

| | Variable | ConsistencyTag | OverallFMI | Flag | Dimension | Action |
|---|---|---|---|---|---|---|
| 0 | Available for Work | consistent | 0.96537 | Critical | Sensitivity | consider_drop_or_advanced |
| 1 | C03-Relationship to Household Head | consistent | 0.0 | Low | Resilience | keep |
| 2 | C04-Sex | consistent | 0.0 | Low | Resilience | keep |
| 3 | C05-Age as of Last Birthday | consistent | 0.016952 | Low | Resilience | keep |
| 4 | C05B - Ethnicity | inconsistent | 0.0 | Low | Resilience | exclude_from_FA |
| 5 | C06-Marital Status | consistent | 0.073508 | Moderate | Resilience | impute_light |
| 6 | C07-Highest Grade Completed | consistent | 0.074142 | Moderate | Resilience | impute_light |
| 7 | C08-Currently Attending School | inconsistent | 0.5543 | Critical | Resilience | exclude_from_FA |
| 8 | C09-Graduate of technical/vocational course | inconsistent | 0.282487 | High | Resilience | exclude_from_FA |
| 9 | C09a - Currently Attending Non-formal Training... | inconsistent | 0.279905 | High | Resilience | exclude_from_FA |


#### CRUCIAL NOTES (README)

- The difference between `work indicator` and `work indicator.1` is unclear. The `.1` suffix is what pandas appends when a file contains two columns with the same header, so the source data likely holds a duplicated column. See the Decision_Matrix sheets for granular details.
- Also check `Province` and `Province Recode` for missing values. It is not yet clear which kind of imputation applies here. The working assumption is manual encoding, since official province lists can be obtained online and used as a guide. This can still be automated once a strict lookup dictionary has been built from such a list. Until then, only provisional (improper) imputation will be done at this test stage.
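
Once an authoritative province list is available, the dictionary-based encoding mentioned above could look roughly like the sketch below. It uses `difflib.get_close_matches` from the standard library. The helper name `standardize_province`, the 0.85 cutoff, and the four-province list (a tiny stand-in for the real reference list) are assumptions for illustration. Values that cannot be matched are returned as `None` rather than guessed.

```python
from difflib import get_close_matches

# Stand-in for the official list (assumption); replace with the full reference once acquired
OFFICIAL_PROVINCES = ["Abra", "Albay", "Cebu", "Davao del Sur"]

# Lower-cased key -> official spelling
_LOOKUP = {name.lower(): name for name in OFFICIAL_PROVINCES}

def standardize_province(raw, cutoff=0.85):
    """Map a raw province string to its official spelling, or None if no match."""
    if raw is None:
        return None
    key = str(raw).strip().lower()
    if key in _LOOKUP:
        return _LOOKUP[key]
    # Fuzzy match tolerates small typos; anything below the cutoff stays unmatched
    matches = get_close_matches(key, list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[matches[0]] if matches else None

print(standardize_province("  CEBU "))
print(standardize_province("Davao del Sr"))
print(standardize_province("Atlantis"))
```

Leaving unmatched values as `None` keeps them visible in the audit log for manual review.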

### Decision Matrix for Imputation - Defense

This matrix is the bridge between FMI diagnostics and factor analysis.  
It ensures that **every variable** is evaluated not only by its missingness (FMI) and consistency, but also by its **conceptual role** in financial vulnerability.

- **Sensitivity**: Variables tied to employment stability, income regularity, and sectoral risk.  
- **Resilience**: Variables reflecting household capacity, education, skills, and adaptability.  
- **Exposure**: Variables representing structural or locational factors (region, province, urban/rural).

#### Why automate?
Manual factor formation was encoded into a reproducible dictionary and keyword rules.  
This ensures consistency across runs, while still allowing customization:
- The `dimension_map` dictionary can be edited to refine assignments.  
- Keyword rules act as a fallback for variables not explicitly mapped.  
- Any variable left as `"Unclassified"` is flagged for manual review.
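
Because the fallback uses substring checks (`k in v`), a short keyword such as `age` also matches inside longer words like `wage`. A whole-word variant is sketched below as one possible refinement. The function name `assign_dimension_by_token`, the abbreviated keyword lists, and the example variable `Monthly Wage` are hypothetical. The trade-off is that inflected forms (for example `worked`) no longer match `work` unless listed explicitly.

```python
import re

# Abbreviated keyword lists (assumption), checked in this order
KEYWORDS = {
    "Sensitivity": {"occupation", "work", "employment", "job", "hours"},
    "Resilience": {"grade", "school", "household", "age", "marital"},
    "Exposure": {"region", "province", "urban", "survey"},
}

def assign_dimension_by_token(var):
    """Assign a dimension when any whole word of `var` is a listed keyword."""
    # Split the name into lower-case alphabetic words
    tokens = set(re.findall(r"[a-z]+", var.lower()))
    for dimension, words in KEYWORDS.items():
        if tokens & words:
            return dimension
    return "Unclassified"

print(assign_dimension_by_token("C05-Age as of Last Birthday"))  # Resilience
print(assign_dimension_by_token("Monthly Wage"))                 # Unclassified
```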

#### Why is this defensible?
- **Theory-guided**: Dimensions are based on the approved thesis framework.  
- **Transparent**: Every variable is listed, with no silent exclusions.  
- **Customizable**: Teammates can refine the dictionary or rationale column later.  
- **Audit-ready**: The matrix documents not just FMI and consistency, but also conceptual relevance.

This way, imputation decisions are **informed from the start**, but remain flexible for recalibration.


### Imputation Proper

At this stage, basic imputation is applied to missing values following the criteria above. The notebook can be adjusted as further rules are added to the analysis. For context, see the CRUCIAL NOTES (README) section of this notebook.

In [3]:
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np
from pathlib import Path
from difflib import get_close_matches

# --- Paths ---
INPUT_ROOT = BASE_PATH / "NEW Renamed Fully Decoded Surveys"
CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
FMI_ROOT = BASE_PATH / "NEW FMI Reports"
OUTPUT_ROOT = BASE_PATH / "Imputed Data for Analysis"
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)

# --- Load consistency + FMI profiles ---
consistency_df = pd.read_csv(CONSISTENCY_ROOT / "consistency_profile.csv")
fmi_df = pd.read_csv(FMI_ROOT / "fmi_profile.csv")

decision_df = fmi_df.merge(
    consistency_df[["Variable", "ConsistencyTag"]],
    on="Variable", how="left"
)

# Deduplicate merge artifacts
if "ConsistencyTag_x" in decision_df.columns and "ConsistencyTag_y" in decision_df.columns:
    decision_df["ConsistencyTag"] = decision_df["ConsistencyTag_x"].combine_first(decision_df["ConsistencyTag_y"])
    decision_df.drop(columns=["ConsistencyTag_x", "ConsistencyTag_y"], inplace=True)

# --- SuggestedAction logic ---
def suggest_action(row):
    fmi = row["OverallFMI"]
    tag = row["ConsistencyTag"]
    if pd.isna(fmi): return "review"
    if tag == "consistent":
        if fmi < 0.05: return "keep"
        elif fmi < 0.20: return "impute_light"
        elif fmi < 0.40: return "impute_cautious"
        else: return "consider_drop_or_advanced"
    elif tag == "partial":
        if fmi < 0.20: return "sensitivity_only"
        else: return "exclude_from_FA"
    else:
        return "exclude_from_FA"

decision_df["Action"] = decision_df.apply(suggest_action, axis=1)

# --- Normalize names ---
def normalize_name(name: str) -> str:
    return (
        str(name)
        .strip()
        .lower()
        .replace("\xa0", " ")
        .replace("-", " ")
        .replace("_", " ")
    )

decision_df["Variable_norm"] = decision_df["Variable"].apply(normalize_name)

# --- Flexible finder with fuzzy matching ---
def find_column(df, var):
    cols_norm = {normalize_name(c): c for c in df.columns}
    var_norm = normalize_name(var)

    # exact match
    if var_norm in cols_norm:
        return cols_norm[var_norm]

    # fuzzy match
    matches = get_close_matches(var_norm, list(cols_norm.keys()), n=1, cutoff=0.8)
    if matches:
        return cols_norm[matches[0]]

    return None

# --- Helpers ---
def robust_mode(series: pd.Series):
    m = series.mode(dropna=True)
    return None if m.empty else m.iloc[0]

def clean_age_column(col: pd.Series) -> pd.Series:
    s = col.astype(str)
    s = s.where(~s.str.contains(r"\d{4}-\d{2}-\d{2}", regex=True), "UnknownAge")
    numeric_coerced = pd.to_numeric(s, errors="coerce")
    if numeric_coerced.notna().sum() >= (0.5 * len(s)):
        return numeric_coerced.fillna(-1).astype(int)
    else:
        s = s.replace({"nan": "UnknownAge"})
        return s

def apply_imputation(df: pd.DataFrame, var: str, action: str, audit_rows: list):
    col_name = find_column(df, var)
    if col_name is None:
        audit_rows.append({
            "Variable": var,
            "Action": action,
            "MethodApplied": "not_matched",
            "BeforeMissing": None,
            "AfterMissing": None,
            "Note": "Variable not matched to any column (check naming)."
        })
        return

    before_missing = int(df[col_name].isna().sum())
    dtype_numeric = pd.api.types.is_numeric_dtype(df[col_name])

    if normalize_name(var) == normalize_name("C05-Age as of Last Birthday"):
        df[col_name] = clean_age_column(df[col_name])
        dtype_numeric = pd.api.types.is_numeric_dtype(df[col_name])

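    # Defaults cover actions that apply no imputation in this function
    # (e.g. "review", "consider_drop_or_advanced", "exclude_from_FA").
    method, note = "none", f"No imputation applied for action '{action}'."
    after_missing = before_missing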

    if action == "keep":
        method = "none"
        note = "Left as-is per Decision Matrix."
    elif action == "impute_light":
        if dtype_numeric:
            med = df[col_name].median()
            # Assign back instead of chained inplace fillna, which may not update df
            df[col_name] = df[col_name].fillna(med)
            method = "median"
            note = f"Numeric light imputation with median={med:.4f}."
        else:
            mode_val = robust_mode(df[col_name])
            if mode_val is not None:
                df[col_name] = df[col_name].fillna(mode_val)
                method = "mode"
                note = f"Categorical light imputation with mode='{mode_val}'."
            else:
                df[col_name] = df[col_name].fillna("Unknown")
                method = "unknown_fallback"
                note = "No valid mode; filled with 'Unknown'."
        after_missing = int(df[col_name].isna().sum())
    elif action == "impute_cautious":
        if dtype_numeric:
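            # Skip all-NaN numeric columns: KNNImputer drops them by default,
            # which would break the column alignment below.
            numeric_cols = [
                c for c in df.select_dtypes(include=[np.number]).columns
                if df[c].notna().any()
            ]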
            if len(numeric_cols) >= 2:
                imputer = KNNImputer(n_neighbors=5)
                imputed_numeric = pd.DataFrame(
                    imputer.fit_transform(df[numeric_cols]),
                    columns=numeric_cols, index=df.index
                )
                df[col_name] = imputed_numeric[col_name]
                method = "knn_k5"
                note = "Numeric cautious imputation (KNN, k=5)."
            else:
                med = df[col_name].median()
                df[col_name] = df[col_name].fillna(med)
                method = "median_fallback"
                note = "Insufficient predictors; median fallback."
        else:
            mode_val = robust_mode(df[col_name])
            if mode_val is not None:
                df[col_name] = df[col_name].fillna(mode_val)
                method = "mode_cautious"
                note = f"Categorical cautious imputation with mode='{mode_val}'."
            else:
                df[col_name] = df[col_name].fillna("Unknown")
                method = "unknown_fallback"
                note = "No valid mode; filled with 'Unknown'."
        after_missing = int(df[col_name].isna().sum())

    audit_rows.append({
        "Variable": var,
        "Action": action,
        "MethodApplied": method,
        "BeforeMissing": before_missing,
        "AfterMissing": after_missing,
        "Note": note
    })

# --- Year-by-year execution ---
for year_folder in INPUT_ROOT.iterdir():
    if not year_folder.is_dir():
        continue

    year_out_dir = OUTPUT_ROOT / year_folder.name
    year_out_dir.mkdir(parents=True, exist_ok=True)

    for file in year_folder.glob("*.csv"):
        print(f"Processing {file.name} from {year_folder.name}")
        df = pd.read_csv(file)

        # Normalize df columns
        df.columns = [normalize_name(c) for c in df.columns]

        # Audit log
        audit_rows = []
        for _, r in decision_df[decision_df["ConsistencyTag"] == "consistent"].iterrows():
            apply_imputation(df, r["Variable"], r["Action"], audit_rows)

        # Save imputed dataset
        out_file = year_out_dir / f"imputed_{file.stem}.csv"
        df.to_csv(out_file, index=False)

        # Save audit log
        audit_df = pd.DataFrame(audit_rows)
        audit_file = year_out_dir / f"imputation_log_{file.stem}.csv"
        audit_df.to_csv(audit_file, index=False)

        print(f"[OK] Saved {out_file} | Audit log: {audit_file}")


Processing APRIL_2018.CSV from 2018
[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2018\imputed_APRIL_2018.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2018\imputation_log_APRIL_2018.csv
Processing JULY_2018.CSV from 2018
[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2018\imputed_JULY_2018.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2018\imputation_log_JULY_2018.csv
Processing JANUARY_2018.CSV from 2018
[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2018\imputed_JANUARY_2018.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2018\imputation_log_JANUARY_2


[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputed_JULY_2022.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputation_log_JULY_2022.csv
Processing AUGUST_2022.CSV from 2022
[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputed_AUGUST_2022.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputation_log_AUGUST_2022.csv
Processing DECEMBER_2022.CSV from 2022
[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputed_DECEMBER_2022.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputation_log_DECEMBER_2022.csv
Processing NOVEMBER_2


[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputed_NOVEMBER_2022.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputation_log_NOVEMBER_2022.csv
Processing OCTOBER_2022.CSV from 2022
[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputed_OCTOBER_2022.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputation_log_OCTOBER_2022.csv
Processing SEPTEMBER_2022.CSV from 2022
[OK] Saved G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputed_SEPTEMBER_2022.csv | Audit log: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Imputed Data for Analysis\2022\imputation_log_SEPTEMBER_2022.csv
Process

# Imputation Pipeline Documentation

### Overview
This pipeline processes survey data year-by-year from *NEW Renamed Fully Decoded Surveys* and produces imputed datasets in *Imputed Data for Analysis*. Each year has its own subfolder with:
- `imputed_<monthyear>.csv`: the cleaned dataset
- `imputation_log_<monthyear>.csv`: detailed audit of imputation actions

### Scope
- Only variables tagged **consistent** in the Decision Matrix are imputed, **for now**.
- Variables tagged **inconsistent**, and consistent variables whose action is `consider_drop_or_advanced`, are left unimputed.
- Province and Province Recode are excluded pending dictionary-based encoding.
- Partial variables are excluded unless explicitly toggled.

### Imputation Rules
- **Numeric, FMI 0.05–0.20 (`impute_light`)** → Median imputation
- **Numeric, FMI 0.20–0.40 (`impute_cautious`)** → KNN imputation (k=5), median fallback if insufficient predictors
- **Categorical, FMI 0.05–0.20 (`impute_light`)** → Mode imputation, fallback "Unknown"