### Fraction of Missing Information (FMI) Analysis

This notebook computes the **Fraction of Missing Information (FMI)** for all renamed and validated survey files. Here, FMI is the share of missing cells in a variable (missing values divided by total rows), computed per variable and survey month. The results guide imputation and variable selection.

The workflow builds on the outputs of previous preprocessing stages, particularly the variable consistency and harmonization audit, and serves as the final diagnostic step before imputation.

---

#### Overview of the Workflow

The FMI process is structured into four main stages:

1. **Configuration and Data Loading**  
   Paths and parameters are dynamically loaded from `config.json`, ensuring reproducibility and eliminating hardcoded directories. Survey files are accessed using the inventory and consistency outputs from earlier notebooks.

2. **Missingness Detection and FMI Computation**  
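   A unified missingness detector flags a cell as missing when it is:
   - A standard null or an empty (or whitespace-only) string  
   - A text-based missing marker (e.g., "NA", "N/A", "-", ".")  

   Numeric codes such as 9 or 99 are treated as valid data, not as missing-value sentinels.

   For each variable and survey month, FMI is computed and classified into four levels, each with an analytical recommendation: Low (below 5%), Moderate (5% to below 20%), High (20% to below 40%), and Critical (40% or more).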

3. **Consolidation and Cross-Month Aggregation**  
   Monthly FMI reports are combined into a consolidated profile that summarizes missingness patterns across all survey waves. Variables are aligned with their consistency tags to contextualize missingness within structural stability.

4. **Integration with Stability and Harmonization Decisions**  
   FMI results are merged with the structural stability report from Notebook 07. Only variables that passed the consistency and harmonization audit are retained in the final FMI profile, ensuring that missingness analysis is performed on conceptually stable and comparable variables.

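The FMI computation and classification in stage 2 can be illustrated on a toy column. This is a minimal sketch, not the notebook's implementation: `classify_fmi` and the sample values are hypothetical, while the thresholds (5%, 20%, 40%) mirror those used in the code cells of this notebook. In the toy column, 3 of 10 cells are missing, so the share is 0.3, which falls in the High band.

```python
import pandas as pd

def classify_fmi(fmi: float) -> str:
    """Map a missingness share to one of the four levels (hypothetical helper)."""
    if fmi < 0.05:
        return "Low"
    if fmi < 0.20:
        return "Moderate"
    if fmi < 0.40:
        return "High"
    return "Critical"

# Hypothetical column: 10 cells, 3 missing (one null, one blank, one "N/A").
toy = pd.Series(["1", "2", None, "4", "", "6", "N/A", "8", "9", "10"], dtype="object")

# A cell is missing if it is null, blank after stripping, or a text marker.
missing_mask = toy.isna() | toy.astype(str).str.strip().isin({"", "N/A"})

# FMI here is simply missing cells divided by total cells.
fmi = missing_mask.sum() / len(toy)

print(fmi, classify_fmi(fmi))
```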
---

#### Inputs

- `config.json` – centralized path and parameter configuration  
- Renamed survey files (`NEW Renamed Fully Decoded Surveys`)  
- `consistency_profile.csv` – initial variable presence/consistency tags  
- `structural_stability_report.csv` – final retain/drop decisions and harmonizations from Notebook 07  

---

#### Outputs

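- **Monthly FMI reports**  
  One `FMI_<Month>_<Year>.csv` file per survey month, written to a per-year subfolder of `NEW FMI Reports`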

- **Consolidated FMI profiles**  
  - `fmi_profile_all.csv` – FMI statistics for all variables  
  - `fmi_profile_consistent.csv` – FMI statistics for variables marked `Retain` in the structural stability report (harmonized, and consistent in both presence over time and values)

- **Log file**  
  Detailed execution log stored in `LOG_DIR`

---

#### Key Contributions of This Notebook

- Automatically computes FMI without using fixed (hardcoded) file paths.  
- Detects missing values using nulls, blanks, and text-based markers, while treating numeric codes as valid data.  
- Aligns variables so they are comparable across all survey months.  
- Combines the results of variable consistency and harmonization checks with the FMI analysis.  
- Distinguishes between initial FMI exploration and the final set of variables used for analysis.

This notebook ensures that missingness is analyzed only on variables that are consistent and properly harmonized across survey waves, making the results reliable for imputation and further analysis.

In [1]:
import json
from pathlib import Path
import os
import pandas as pd
import numpy as np
import shutil
from datetime import datetime

# --- Load config ---
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# --- Paths ---
RENAMED_ROOT = BASE_PATH / "NEW Renamed Fully Decoded Surveys"
CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
FMI_MONTHLY_ROOT = BASE_PATH / "NEW FMI Reports"

# --- IDEMPOTENT CLEANUP LOGIC ---
if FMI_MONTHLY_ROOT.exists():
    print(f" Cleaning existing FMI folder to prevent duplicates: {FMI_MONTHLY_ROOT}")
    shutil.rmtree(FMI_MONTHLY_ROOT)

os.makedirs(FMI_MONTHLY_ROOT, exist_ok=True)
print(f" Fresh folder created: {FMI_MONTHLY_ROOT}")

# --- Load consistency profile ---
consistency_df = pd.read_csv(CONSISTENCY_ROOT / "consistency_profile.csv")
consistency_tags = dict(zip(consistency_df["Variable"], consistency_df["ConsistencyTag"]))

# --- Missingness detector (REFINED: TEXT & BLANKS ONLY) ---
# Numeric sentinels (9, 99) are removed to treat all digits as valid data.
TEXT_MISSING = {"", " ", "NA", "N/A", "NaN", "nan", ".", "-", "_"}

def build_missing_mask(series: pd.Series) -> pd.Series:
    """
    Identifies missingness based ONLY on blanks and explicit text markers.
    Trusts all numeric values as actual data.
    """
    s = series.astype(str).str.strip()
    # A cell is missing if it is a true NaN, an empty string, or in our text list
    mask = series.isna() | (s == "") | s.isin(TEXT_MISSING)
    return mask

 Cleaning existing FMI folder to prevent duplicates: H:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW FMI Reports
 Fresh folder created: H:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW FMI Reports
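
A quick way to sanity-check the detector is to run it on a small hand-built column. The sketch below re-declares the marker set and mask function from the cell above so that it runs on its own; the sample values are hypothetical. It illustrates two properties of the current rules: numeric codes such as 9 and 99 count as valid data, and matching is case-sensitive, so a lowercase `na` is not flagged by this function.

```python
import numpy as np
import pandas as pd

# Same marker set and logic as the notebook cell, repeated for a standalone check.
TEXT_MISSING = {"", " ", "NA", "N/A", "NaN", "nan", ".", "-", "_"}

def build_missing_mask(series: pd.Series) -> pd.Series:
    s = series.astype(str).str.strip()
    return series.isna() | (s == "") | s.isin(TEXT_MISSING)

# Hypothetical values covering nulls, blanks, markers, numeric codes, and real text.
sample = pd.Series([np.nan, "  ", "N/A", "-", "na", 9, 99, "Employed"], dtype="object")
mask = build_missing_mask(sample)

# Show each value next to its missing flag.
print(pd.DataFrame({"value": sample, "missing": mask}))
```

If lowercase markers occur in the survey files, one option would be to compare against an upper-cased copy of the column; whether that is appropriate depends on the data and is not something this notebook does.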


In [2]:
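def fmi_scan_csv(file_path: Path, year: str, month: str) -> pd.DataFrame:
    """Return one row per column of a survey CSV with its missing count, FMI, flag, and recommendation."""
    df = pd.read_csv(file_path, low_memory=False)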
    rows = []
    for col in df.columns:
        # Call the simplified mask builder
        miss_mask = build_missing_mask(df[col])
        
        missing = int(miss_mask.sum())
        total = int(len(df[col]))
        fmi = (missing / total) if total > 0 else 0.0

        if fmi < 0.05:
            flag, rec = "Low", "Keep"
        elif fmi < 0.20:
            flag, rec = "Moderate", "Consider imputation"
        elif fmi < 0.40:
            flag, rec = "High", "Strongly consider imputation"
        else:
            flag, rec = "Critical", "Candidate to drop (validate with business logic)"

        rows.append({
            "Year": year,
            "Month": month,
            "Variable": col.strip(),
            "Missing": missing,
            "Total": total,
            "FMI": round(fmi, 6),
            "Flag": flag,
            "Recommendation": rec,
            "ConsistencyTag": consistency_tags.get(col.strip(), "unknown"),
        })
    return pd.DataFrame(rows)

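# --- Batch runner ---
LOG_DIR.mkdir(parents=True, exist_ok=True)  # ensure the log folder exists before opening the log file
log_file = LOG_DIR / f"fmi_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"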
success_count, error_count = 0, 0
all_reports = []

with open(log_file, "w", encoding="utf-8") as log:
    log.write("STARTING FMI REPORT\n")
    log.write(f"Source: {RENAMED_ROOT}\n")
    log.write(f"Dest:   {FMI_MONTHLY_ROOT}\n")
    log.write("===============================================\n\n")

    for year in sorted(os.listdir(RENAMED_ROOT)):
        year_folder = RENAMED_ROOT / year
        if not year_folder.is_dir():
            continue

        year_out = FMI_MONTHLY_ROOT / year
        os.makedirs(year_out, exist_ok=True)

        for file in os.listdir(year_folder):
            if not file.lower().endswith(".csv"):
                continue

            month = file.split("_")[0].capitalize()
            file_path = year_folder / file

            try:
                report = fmi_scan_csv(file_path, year, month)
                out_file = year_out / f"FMI_{month}_{year}.csv"
                report.to_csv(out_file, index=False)
                all_reports.append(report)

                success_count += 1
                msg = f"[OK] {file} → {len(report)} variables"
                print(msg); log.write(msg + "\n")
            except Exception as e:
                error_count += 1
                msg = f"[ERROR] {file} → {e}"
                print(msg); log.write(msg + "\n")

    summary_msg = f"COMPLETED. Success: {success_count} | Errors: {error_count}"
    print(summary_msg); log.write("\n" + summary_msg + "\n")

[OK] APRIL_2018.CSV → 50 variables
[OK] JANUARY_2018.CSV → 50 variables
[OK] JULY_2018.CSV → 51 variables
[OK] OCTOBER_2018.CSV → 51 variables
[OK] JULY_2019.CSV → 49 variables
[OK] APRIL_2019.CSV → 49 variables
[OK] JANUARY_2019.CSV → 49 variables
[OK] OCTOBER_2019.CSV → 49 variables
[OK] NOVEMBER_2022.CSV → 42 variables
[OK] FEBRUARY_2022.csv → 41 variables
[OK] MARCH_2022.csv → 41 variables
[OK] MAY_2022.csv → 42 variables
[OK] JANUARY_2022.csv → 52 variables
[OK] APRIL_2022.csv → 52 variables
[OK] OCTOBER_2022.CSV → 52 variables
[OK] AUGUST_2022.CSV → 42 variables
[OK] DECEMBER_2022.CSV → 42 variables
[OK] SEPTEMBER_2022.CSV → 42 variables
[OK] JUNE_2022.csv → 42 variables
[OK] JULY_2022.CSV → 52 variables
[OK] JUNE_2023.CSV → 42 variables
[OK] NOVEMBER_2023.CSV → 41 variables
[OK] MARCH_2023.CSV → 42 variables
[OK] AUGUST_2023.CSV → 41 variables
[OK] FEBRUARY_2023.CSV → 42 variables
[OK] DECEMBER_2023.CSV → 41 variables
[OK] JULY_2023.CSV → 52 variables
[OK] JANUARY_2023.CSV → 52 variables
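
The files above are processed in whatever order `os.listdir` returns them, which is why the months appear out of calendar order. If calendar order matters for the log, the `MONTH_ORDER` list loaded from the config could serve as a sort key. The sketch below is an illustration only: it assumes `MONTH_ORDER` holds capitalized month names (the actual config contents are not shown in this notebook), and `month_sort_key` is a hypothetical helper.

```python
# Assumed shape of MONTH_ORDER; the real list comes from config.json.
MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

def month_sort_key(filename: str) -> int:
    """Return the calendar position of the month encoded in the file name."""
    month = filename.split("_")[0].capitalize()
    # Unknown prefixes sort last instead of raising an error.
    return MONTH_ORDER.index(month) if month in MONTH_ORDER else len(MONTH_ORDER)

# File names taken from the 2018 folder in the log above.
files = ["APRIL_2018.CSV", "JANUARY_2018.CSV", "JULY_2018.CSV", "OCTOBER_2018.CSV"]
ordered = sorted(files, key=month_sort_key)
print(ordered)
```

The order of processing does not affect the FMI values themselves; sorting only makes the log and the concatenated report easier to read.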

In [3]:
# --- Consolidate in-memory reports ---
if all_reports:
    combined = pd.concat(all_reports, ignore_index=True)

    FMI_summary = (
        combined.groupby(["Variable", "ConsistencyTag"])
        .agg(
            TotalMissing=("Missing", "sum"),
            TotalRows=("Total", "sum"),
            AvgFMI=("FMI", "mean"),
            MonthsObserved=("Year", "count"),
        )
        .reset_index()
    )

    FMI_summary["OverallFMI"] = FMI_summary["TotalMissing"] / FMI_summary["TotalRows"]

    def flag_and_rec(fmi):
        """Return the (Flag, Recommendation) pair for a pooled FMI value."""
        if fmi < 0.05:
            return "Low", "Keep"
        if fmi < 0.20:
            return "Moderate", "Consider imputation"
        if fmi < 0.40:
            return "High", "Strongly consider imputation"
        return "Critical", "Candidate to drop (validate with business logic)"

    FMI_summary[["Flag", "Recommendation"]] = FMI_summary["OverallFMI"].apply(
        lambda x: pd.Series(flag_and_rec(x))
    )

    all_out_primary = FMI_MONTHLY_ROOT / "fmi_profile_all.csv"
    FMI_summary.to_csv(all_out_primary, index=False)

    print(f"[OK] FMI Audit (All Variables) saved to: {all_out_primary}")
    print(f"Total variables audited: {len(FMI_summary)}")
else:
    print("[ERROR] No reports found.")

[OK] FMI Audit (All Variables) saved to: H:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW FMI Reports\fmi_profile_all.csv
Total variables audited: 76
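
`AvgFMI` and `OverallFMI` answer slightly different questions: the first is the unweighted mean of the monthly rates, while the second pools all rows before dividing. The two diverge when survey months differ in size. The worked example below uses made-up numbers (the variable name and counts are hypothetical) to show the gap: the unweighted mean is 0.20, whereas the pooled rate is 310 / 1100, roughly 0.28.

```python
import pandas as pd

# Two hypothetical months of very different size for a single variable.
monthly = pd.DataFrame({
    "Variable": ["occupation", "occupation"],
    "Month": ["January", "April"],
    "Missing": [10, 300],
    "Total": [100, 1000],
})
monthly["FMI"] = monthly["Missing"] / monthly["Total"]  # 0.10 and 0.30

summary = monthly.groupby("Variable").agg(
    TotalMissing=("Missing", "sum"),
    TotalRows=("Total", "sum"),
    AvgFMI=("FMI", "mean"),        # each month counts equally
)
# Pooled rate: larger months carry more weight.
summary["OverallFMI"] = summary["TotalMissing"] / summary["TotalRows"]
print(summary)
```

The notebook bases the consolidated `Flag` on `OverallFMI`, so large survey months influence the final classification more than small ones.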


In [4]:
# --- Create Consistent (Retained) Profile ---
fmi_all = pd.read_csv(FMI_MONTHLY_ROOT / "fmi_profile_all.csv")
stability_df = pd.read_csv(CONSISTENCY_ROOT / "structural_stability_report.csv")

fmi_merged = pd.merge(
    fmi_all,
    stability_df[['Variable', 'Decision']],
    on="Variable",
    how="left"
)

fmi_consistent = fmi_merged[fmi_merged["Decision"] == "Retain"].copy()
fmi_consistent = fmi_consistent.drop(columns=["Decision"])

consistent_out = FMI_MONTHLY_ROOT / "fmi_profile_consistent.csv"
fmi_consistent.to_csv(consistent_out, index=False)

print(f"[OK] Consistent/Retained Set generated: {consistent_out.name}")
print(f"--- Analytical variable count (Ready for Imputation): {len(fmi_consistent)} ---")

[OK] Consistent/Retained Set generated: fmi_profile_consistent.csv
--- Analytical variable count (Ready for Imputation): 20 ---
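
Because the merge is a left join, any variable with no row in the stability report ends up with a null `Decision` and is silently excluded by the `== "Retain"` filter. One way to make that visible is the `indicator=True` option of pandas' `merge`, sketched here on hypothetical frames (the variable names, FMI values, and decisions are invented for illustration).

```python
import pandas as pd

# Hypothetical stand-ins for fmi_profile_all.csv and structural_stability_report.csv.
fmi_all_demo = pd.DataFrame({"Variable": ["age", "sex", "region"],
                             "OverallFMI": [0.01, 0.00, 0.12]})
stability_demo = pd.DataFrame({"Variable": ["age", "sex"],
                               "Decision": ["Retain", "Drop"]})

# indicator=True adds a "_merge" column recording where each row was found.
merged = fmi_all_demo.merge(stability_demo, on="Variable", how="left", indicator=True)

# Variables present in the FMI profile but absent from the stability report.
unmatched = merged.loc[merged["_merge"] == "left_only", "Variable"].tolist()
retained = merged.loc[merged["Decision"] == "Retain", "Variable"].tolist()

print("No stability decision:", unmatched)
print("Retained:", retained)
```

Listing the unmatched variables helps distinguish those that were deliberately dropped from those that were never reviewed.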
