### Fraction of Missing Information (FMI) Analysis

This notebook computes the **Fraction of Missing Information (FMI)** for all renamed and validated survey files. It quantifies missingness at the variable level across survey months and provides guidance for imputation and variable selection.

The workflow builds on the outputs of previous preprocessing stages, particularly the variable consistency and harmonization audit, and serves as the final diagnostic step before imputation.

---

#### Overview of the Workflow

The FMI process is structured into four main stages:

1. **Configuration and Data Loading**  
   Paths and parameters are dynamically loaded from `config.json`, ensuring reproducibility and eliminating hardcoded directories. Survey files are accessed using the inventory and consistency outputs from earlier notebooks.

2. **Missingness Detection and FMI Computation**  
   A unified missingness detector identifies absent values using:
   - Standard nulls and empty strings  
   - Text-based missing markers (e.g., "NA", "-", "N/A")   

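   For each variable and survey month, FMI is computed as the share of rows flagged as missing (missing / total) and classified into interpretive levels: Low (below 5%), Moderate (5% to below 20%), High (20% to below 40%), and Critical (40% and above), each with a corresponding analytical recommendation.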

3. **Consolidation and Cross-Month Aggregation**  
   Monthly FMI reports are combined into a consolidated profile that summarizes missingness patterns across all survey waves. Variables are aligned with their consistency tags to contextualize missingness within structural stability.

4. **Integration with Stability and Harmonization Decisions**  
   FMI results are merged with the structural stability report from Notebook 07. Only variables that passed the consistency and harmonization audit are retained in the final FMI profile, ensuring that missingness analysis is performed on conceptually stable and comparable variables.
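
The classification in stage 2 can also be sketched in vectorized form. This is a minimal illustration, not the notebook's own implementation: it assumes the cut-offs used by the scanning code in this notebook (0.05, 0.20, 0.40), and the helper name `classify_fmi` is hypothetical.

```python
import numpy as np
import pandas as pd

# Cut-offs taken from the scanning cell of this notebook; upper edges are exclusive.
FMI_BINS = [0.0, 0.05, 0.20, 0.40, np.inf]
FMI_LABELS = ["Low", "Moderate", "High", "Critical"]

def classify_fmi(fmi: pd.Series) -> pd.Series:
    """Map FMI values in [0, 1] to the four interpretive levels (hypothetical helper)."""
    # right=False makes each bin closed on the left: [0, 0.05), [0.05, 0.20), ...
    return pd.cut(fmi, bins=FMI_BINS, labels=FMI_LABELS, right=False).astype(str)

example = pd.Series([0.0, 0.05, 0.1999, 0.20, 0.40, 1.0])
levels = classify_fmi(example)
print(levels.tolist())
```

A value that sits exactly on a cut-off (for example 0.20) falls into the higher level, which matches the strict `<` comparisons in the scanning code.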

---

#### Inputs

- `config.json` – centralized path and parameter configuration  
- Renamed and harmonized survey files (`FINAL Consistent Surveys`)  
- `consistency_profile.csv` – initial variable presence/consistency tags  
- `structural_stability_report.csv` – final retain/drop decisions and harmonizations from Notebook 07  
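
A minimal sketch of loading `config.json` defensively is shown here. The three keys are the ones the loading cell reads; the helper `load_config`, the demo values, and the early-failure check are assumptions added for illustration and are not part of the notebook.

```python
import json
import tempfile
from pathlib import Path

# Keys read by the loading cell of this notebook
REQUIRED_KEYS = ("BASE_PATH", "LOG_DIR", "MONTH_ORDER")

def load_config(path: Path) -> dict:
    """Load config.json and fail early if a required key is absent (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return cfg

# Illustrative values only; the real file holds project-specific paths.
demo = {"BASE_PATH": "data", "LOG_DIR": "logs", "MONTH_ORDER": ["January", "April"]}
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "config.json"
    cfg_path.write_text(json.dumps(demo), encoding="utf-8")
    cfg = load_config(cfg_path)

print(Path(cfg["BASE_PATH"]), cfg["MONTH_ORDER"])
```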

---

#### Outputs

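- **Monthly FMI reports**  
  - `FMI_<Month>_<Year>.csv` – per-variable FMI table for one survey month, saved under `NEW FMI Reports/<Year>/`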

- **Consolidated FMI profiles**  
  - `fmi_profile_all.csv` – FMI statistics for all variables  
  - `fmi_profile_consistent.csv` – FMI statistics for the variables retained after the consistency and harmonization audit (consistent both over time and in their values)

- **Log file**  
  Detailed execution log stored in `LOG_DIR`

---

#### Key Contributions of This Notebook

- Automatically computes FMI without using fixed (hardcoded) file paths.  
- Detects missing values using both standard nulls and text-based missing markers.  
- Aligns variables so they are comparable across all survey months.  
- Combines the results of variable consistency and harmonization checks with the FMI analysis.  
- Distinguishes between initial FMI exploration and the final set of variables used for analysis.

This notebook ensures that missingness is analyzed only on variables that are consistent and properly harmonized across survey waves, making the results reliable for imputation and further analysis.

In [1]:
import json
from pathlib import Path
import os
import pandas as pd
import numpy as np
import shutil
from datetime import datetime

# --- Load config ---
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# --- Paths ---
# Input: the consistent surveys produced by Notebook 07
CONSISTENT_ROOT = BASE_PATH / "FINAL Consistent Surveys"
CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
FMI_MONTHLY_ROOT = BASE_PATH / "NEW FMI Reports"

# --- IDEMPOTENT CLEANUP LOGIC ---
if FMI_MONTHLY_ROOT.exists():
    print(f" Cleaning existing FMI folder: {FMI_MONTHLY_ROOT}")
    shutil.rmtree(FMI_MONTHLY_ROOT)

os.makedirs(FMI_MONTHLY_ROOT, exist_ok=True)
print(f" Fresh folder created: {FMI_MONTHLY_ROOT}")

# --- Load consistency profile ---
consistency_profile_path = CONSISTENCY_ROOT / "consistency_profile.csv"
if consistency_profile_path.exists():
    consistency_df = pd.read_csv(consistency_profile_path)
    consistency_tags = dict(zip(consistency_df["Variable"], consistency_df["ConsistencyTag"]))
else:
    print(" Warning: consistency_profile.csv not found. Tags will be 'unknown'.")
    consistency_tags = {}

# --- Missingness detector (REFINED) ---
TEXT_MISSING = {"", " ", "NA", "N/A", "NaN", "nan", ".", "-", "_"}

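def build_missing_mask(series: pd.Series) -> pd.Series:
    """Identifies missingness based on nulls, blanks, and explicit text markers."""
    # Compare on stripped strings so padded markers such as " NA " also match.
    s = series.astype(str).str.strip()
    # "" is in TEXT_MISSING, so whitespace-only cells are covered by isin().
    mask = series.isna() | s.isin(TEXT_MISSING)
    return mask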

 Cleaning existing FMI folder: H:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW FMI Reports
 Fresh folder created: H:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW FMI Reports
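
To make the detector's behaviour concrete, the sketch below restates its logic so that it runs on its own and applies it to a small hand-made series. The sample values are illustrative and do not come from the survey files.

```python
import numpy as np
import pandas as pd

TEXT_MISSING = {"", " ", "NA", "N/A", "NaN", "nan", ".", "-", "_"}

def build_missing_mask(series: pd.Series) -> pd.Series:
    """Flag nulls, blanks, and explicit text markers as missing."""
    s = series.astype(str).str.strip()
    return series.isna() | (s == "") | s.isin(TEXT_MISSING)

# Illustrative values only
sample = pd.Series(["Employed", None, "  NA ", "-", "", "0", "n/a", np.nan])
mask = build_missing_mask(sample)

for value, is_missing in zip(sample.tolist(), mask.tolist()):
    print(f"{value!r:<12} -> {'missing' if is_missing else 'observed'}")
```

Two properties are worth noting: the string `"0"` counts as observed, and matching is case-sensitive, so a lower-case `"n/a"` is not flagged unless it is added to `TEXT_MISSING`.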


In [2]:
def fmi_scan_csv(file_path: Path, year: str, month: str) -> pd.DataFrame:
    df = pd.read_csv(file_path, low_memory=False)
    rows = []
    for col in df.columns:
        miss_mask = build_missing_mask(df[col])
        missing = int(miss_mask.sum())
        total = int(len(df[col]))
        fmi = (missing / total) if total > 0 else 0.0

        if fmi < 0.05:
            flag, rec = "Low", "Keep"
        elif fmi < 0.20:
            flag, rec = "Moderate", "Consider imputation"
        elif fmi < 0.40: 
            flag, rec = "High", "Validate for Structural Logic"
        else:
            flag, rec = "Critical", "Assess Logic or Drop"

        rows.append({
            "Year": year,
            "Month": month,
            "Variable": col.strip(),
            "Missing": missing,
            "Total": total,
            "FMI": round(fmi, 6),
            "Flag": flag,
            "Recommendation": rec,
            "ConsistencyTag": consistency_tags.get(col.strip(), "unknown"),
        })
    return pd.DataFrame(rows)

# --- Batch runner ---
LOG_DIR.mkdir(parents=True, exist_ok=True)  # open() below fails if the log folder is missing
log_file = LOG_DIR / f"fmi_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
success_count, error_count = 0, 0
all_reports = []

with open(log_file, "w", encoding="utf-8") as log:
    log.write("STARTING FMI REPORT ON CONSISTENT SURVEYS\n")
    log.write(f"Source: {CONSISTENT_ROOT}\n")
    log.write("===============================================\n\n")

    # Use CONSISTENT_ROOT for the loop
    for year in sorted([d for d in os.listdir(CONSISTENT_ROOT) if (CONSISTENT_ROOT / d).is_dir()]):
        year_folder = CONSISTENT_ROOT / year
        year_out = FMI_MONTHLY_ROOT / year
        os.makedirs(year_out, exist_ok=True)

        for file in os.listdir(year_folder):
            if not file.lower().endswith(".csv"):
                continue

            # Handles both "January_2018.csv" and "January 2018.csv"
            month = file.split("_")[0].split(" ")[0].capitalize()
            file_path = year_folder / file

            try:
                report = fmi_scan_csv(file_path, year, month)
                out_file = year_out / f"FMI_{month}_{year}.csv"
                report.to_csv(out_file, index=False)
                all_reports.append(report)

                success_count += 1
                msg = f"[OK] {file} scanned."
                print(msg)
                log.write(msg + "\n")
            except Exception as e:
                error_count += 1
                msg = f"[ERROR] {file} -> {e}"
                print(msg)
                log.write(msg + "\n")

    summary_msg = f"COMPLETED. Success: {success_count} | Errors: {error_count}"
    print("\n" + summary_msg)
    log.write("\n" + summary_msg + "\n")

[OK] APRIL_2018.CSV scanned.
[OK] JANUARY_2018.CSV scanned.
[OK] JULY_2018.CSV scanned.
[OK] OCTOBER_2018.CSV scanned.
[OK] JULY_2019.CSV scanned.
[OK] APRIL_2019.CSV scanned.
[OK] JANUARY_2019.CSV scanned.
[OK] OCTOBER_2019.CSV scanned.
[OK] NOVEMBER_2022.CSV scanned.
[OK] FEBRUARY_2022.csv scanned.
[OK] MARCH_2022.csv scanned.
[OK] MAY_2022.csv scanned.
[OK] JANUARY_2022.csv scanned.
[OK] APRIL_2022.csv scanned.
[OK] OCTOBER_2022.CSV scanned.
[OK] AUGUST_2022.CSV scanned.
[OK] DECEMBER_2022.CSV scanned.
[OK] SEPTEMBER_2022.CSV scanned.
[OK] JUNE_2022.csv scanned.
[OK] JULY_2022.CSV scanned.
[OK] JUNE_2023.CSV scanned.
[OK] NOVEMBER_2023.CSV scanned.
[OK] MARCH_2023.CSV scanned.
[OK] AUGUST_2023.CSV scanned.
[OK] FEBRUARY_2023.CSV scanned.
[OK] DECEMBER_2023.CSV scanned.
[OK] JULY_2023.CSV scanned.
[OK] JANUARY_2023.CSV scanned.
[OK] OCTOBER_2023.CSV scanned.
[OK] MAY_2023.CSV scanned.
[OK] SEPTEMBER_2023.CSV scanned.
[OK] APRIL_2023.CSV scanned.
[OK] JANUARY_2024.CSV scanned.
[OK] AP
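
The files above are scanned in the order `os.listdir` returns them, which is not chronological. The sketch below shows one way the month token could be parsed and used for ordering. The parsing expression is the one used in the batch runner; the `MONTH_ORDER` list here is a hypothetical stand-in, since the actual value is read from `config.json` and is not shown in this notebook.

```python
def parse_month(filename: str) -> str:
    """Extract the month token from names like 'JANUARY_2018.CSV' or 'January 2018.csv'."""
    return filename.split("_")[0].split(" ")[0].capitalize()

# Hypothetical ordering list; the notebook reads MONTH_ORDER from config.json.
MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

files = ["OCTOBER_2018.CSV", "April 2018.csv", "JANUARY_2018.CSV", "july_2018.csv"]
ordered = sorted(files, key=lambda name: MONTH_ORDER.index(parse_month(name)))
print(ordered)
```

A file name without a recognizable month token would raise a `ValueError` in `MONTH_ORDER.index`, which may be preferable to silently mislabeling a report.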

In [3]:
# --- Consolidate in-memory reports ---
if all_reports:
    combined = pd.concat(all_reports, ignore_index=True)

    FMI_summary = (
        combined.groupby(["Variable", "ConsistencyTag"])
        .agg(
            TotalMissing=("Missing", "sum"),
            TotalRows=("Total", "sum"),
            AvgFMI=("FMI", "mean"),
            MonthsObserved=("Year", "count"),
        )
        .reset_index()
    )

    FMI_summary["OverallFMI"] = FMI_summary["TotalMissing"] / FMI_summary["TotalRows"]

    def flag_and_rec_summary(fmi):
        if fmi < 0.05:
            return "Low", "Keep"
        elif fmi < 0.20:
            return "Moderate", "Impute/Keep"
        elif fmi < 0.40:
            return "High", "Structural Skip - Logical Imputation"
        else:
            return "Critical", "Validate/Drop"

    FMI_summary[["Flag", "Recommendation"]] = FMI_summary["OverallFMI"].apply(
        lambda x: pd.Series(flag_and_rec_summary(x))
    )

    all_out_primary = FMI_MONTHLY_ROOT / "fmi_profile_all.csv"
    FMI_summary.to_csv(all_out_primary, index=False)

    # Report where the consolidated profile was written
    print(f"[OK] FMI Audit (All Variables) saved to: {all_out_primary}")
    print(f"Total variables audited: {len(FMI_summary)}")
    
else:
    print("[ERROR] No reports were generated.")

[OK] FMI Audit (All Variables) saved to: H:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW FMI Reports\fmi_profile_all.csv
Total variables audited: 76
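
The consolidated profile holds two cross-month figures: `AvgFMI` (the unweighted mean of the monthly FMI values) and `OverallFMI` (total missing divided by total rows). They can diverge when months differ in size. The toy example below, with made-up numbers for a single variable, illustrates the difference.

```python
import pandas as pd

# Toy monthly reports for one variable; the numbers are illustrative, not survey results.
monthly = pd.DataFrame({
    "Variable": ["X", "X"],
    "Missing": [0, 270],
    "Total": [100, 900],
})
monthly["FMI"] = monthly["Missing"] / monthly["Total"]

summary = monthly.groupby("Variable").agg(
    TotalMissing=("Missing", "sum"),
    TotalRows=("Total", "sum"),
    AvgFMI=("FMI", "mean"),
)
# Pooled rate: larger months contribute proportionally more rows
summary["OverallFMI"] = summary["TotalMissing"] / summary["TotalRows"]
print(summary[["AvgFMI", "OverallFMI"]].round(4))
```

Here the unweighted mean (0.15) would be classified as Moderate, while the pooled rate (0.27) is High. The consolidation cell classifies on `OverallFMI`, so the pooled figure drives the flag.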


In [4]:
# --- Create Consistent (Retained) Profile ---
fmi_all = pd.read_csv(FMI_MONTHLY_ROOT / "fmi_profile_all.csv")
stability_report_path = CONSISTENCY_ROOT / "structural_stability_report.csv"

if stability_report_path.exists():
    stability_df = pd.read_csv(stability_report_path)

    fmi_merged = pd.merge(
        fmi_all,
        stability_df[['Variable', 'Decision']],
        on="Variable",
        how="left"
    )

    # Filter only for variables intended for the final analysis
    fmi_consistent = fmi_merged[fmi_merged["Decision"] == "Retain"].copy()
    fmi_consistent = fmi_consistent.drop(columns=["Decision"])

    consistent_out = FMI_MONTHLY_ROOT / "fmi_profile_consistent.csv"
    fmi_consistent.to_csv(consistent_out, index=False)

    print(f" Consistent Variable Set Generated: {consistent_out.name}")
    print(f"Total variables ready for Notebook 09: {len(fmi_consistent)}")
else:
    print(" Error: structural_stability_report.csv not found in Consistency Root.")

 Consistent Variable Set Generated: fmi_profile_consistent.csv
Total variables ready for Notebook 09: 20
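
Because the merge above is a left join followed by a filter on `Decision == "Retain"`, any variable that is absent from the stability report is dropped without notice. The sketch below, on toy data, shows how pandas' `indicator` and `validate` options could surface such cases. It is an optional check under those assumptions, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy inputs; variable names and decisions are illustrative only.
fmi_all = pd.DataFrame({"Variable": ["A", "B", "C"], "OverallFMI": [0.01, 0.30, 0.55]})
stability = pd.DataFrame({"Variable": ["A", "B"], "Decision": ["Retain", "Drop"]})

merged = fmi_all.merge(
    stability[["Variable", "Decision"]],
    on="Variable",
    how="left",
    validate="many_to_one",  # raises MergeError if a variable has several decisions
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Variables absent from the stability report have no decision at all.
unmatched = merged.loc[merged["_merge"] == "left_only", "Variable"].tolist()
retained = merged.loc[merged["Decision"] == "Retain", "Variable"].tolist()
print("No decision:", unmatched)
print("Retained:", retained)
```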


In [None]:
import pandas as pd
from pathlib import Path

# --- Configuration ---
RENAMED_ROOT = BASE_PATH / "NEW Renamed Fully Decoded Surveys"
CONSISTENT_ROOT = BASE_PATH / "FINAL Consistent Surveys"
FMI_CONSISTENT_PATH = FMI_MONTHLY_ROOT / "fmi_profile_consistent.csv"

def run_harmonization_comparison():
    if not FMI_CONSISTENT_PATH.exists():
        print("Error: Please run Code Block 4 first to generate the consistent variable list.")
        return

    # 1. Get the list of consistent variables (20 in this run)
    fmi_cons = pd.read_csv(FMI_CONSISTENT_PATH)
    target_vars = fmi_cons["Variable"].tolist()
    
    print("\n" + "="*100)
    print(f" HARMONIZATION VALIDATION: MISSINGNESS COMPARISON ({len(target_vars)} Variables)")
    print("="*100)
    
    results = []

    def get_folder_missingness(root_path, vars_list):
        """Calculates total missingness for specific variables across a directory."""
        counts = {var: {"missing": 0, "total": 0} for var in vars_list}
        for file_path in root_path.rglob("*.csv"):
            try:
                # Only load columns that actually exist in the file to prevent errors
                available = pd.read_csv(file_path, nrows=0).columns.tolist()
                valid_cols = [v for v in vars_list if v in available]
                
                if valid_cols:
                    df = pd.read_csv(file_path, usecols=valid_cols, dtype=str)
                    for var in valid_cols:
                        mask = build_missing_mask(df[var])
                        counts[var]["missing"] += int(mask.sum())
                        counts[var]["total"] += int(len(df))
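            except Exception as e:
                # Report unreadable files instead of skipping them silently
                print(f" [WARN] Skipped {file_path.name}: {e}")
                continue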
        return counts

    # 2. Run scan for both directories
    print(" Scanning 'NEW Renamed Fully Decoded Surveys' folder (Previous Folder)...")
    before_stats = get_folder_missingness(RENAMED_ROOT, target_vars)
    
    print(" Scanning 'FINAL Consistent Surveys' folder (New Folder)...")
    after_stats = get_folder_missingness(CONSISTENT_ROOT, target_vars)

    # 3. Format Comparison Table
    print(f"\n{'VARIABLE':<40} | {'Previous Folder (%)':<15} | {'New Folder (%)':<15} | {'STATUS'}")
    print("-" * 100)

    for var in sorted(target_vars):
        b_pct = (before_stats[var]['missing'] / before_stats[var]['total'] * 100) if before_stats[var]['total'] > 0 else 100.0
        a_pct = (after_stats[var]['missing'] / after_stats[var]['total'] * 100) if after_stats[var]['total'] > 0 else 100.0
        
        # Status check: After should be <= Before (unless we intentionally preserved NaNs)
        diff = a_pct - b_pct
        if abs(diff) < 0.001:
            status = "Stable"
        elif diff < 0:
            status = "Improved"
        else:
            status = "Increased Gap"

        print(f"{var:<40} | {b_pct:>9.2f}% | {a_pct:>9.2f}% | {status}")

    print("\n" + "="*100)
    print(" SUMMARY: 'Stable' indicates the harmonization preserved the original data structure.")
    print("="*100)

# Run the comparison
run_harmonization_comparison()


 HARMONIZATION VALIDATION: MISSINGNESS COMPARISON (20 Variables)
 Scanning 'NEW Renamed Fully Decoded Surveys' folder (Previous Folder)...
 Scanning 'FINAL Consistent Surveys' folder (New Folder)...

VARIABLE                                 | Previous Folder (%) | New Folder (%)  | STATUS
----------------------------------------------------------------------------------------------------
Available for Work                       |     96.23% |     96.23% | Stable
C03-Relationship to Household Head       |      0.00% |      0.00% | Stable
C04-Sex                                  |      0.00% |      0.00% | Stable
C05-Age as of Last Birthday              |      0.00% |      0.00% | Stable
C06-Marital Status                       |      7.23% |      7.23% | Stable
C101-Line Number                         |      0.00% |      0.00% | Stable
Household Size                           |      0.00% |      0.00% | Stable
Look for Additional Work                 |     58.44% |     58.44% | Stable
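
The percentage columns in the table above do not line up under their headers because the header labels are longer than the widths given to them. A minimal sketch of one way to keep the columns aligned is shown below; it sizes each numeric column to its header label. The two sample rows reuse values from the output above, and the helper `format_row` is hypothetical.

```python
HEADERS = ("VARIABLE", "Previous Folder (%)", "New Folder (%)", "STATUS")
# Numeric columns are as wide as their header labels so rows line up beneath them.
W_VAR, W_PREV, W_NEW = 40, len(HEADERS[1]), len(HEADERS[2])

def format_row(var: str, before_pct: float, after_pct: float, status: str) -> str:
    """Format one comparison row with the same column widths as the header."""
    return f"{var:<{W_VAR}} | {before_pct:>{W_PREV}.2f} | {after_pct:>{W_NEW}.2f} | {status}"

header = f"{HEADERS[0]:<{W_VAR}} | {HEADERS[1]:<{W_PREV}} | {HEADERS[2]:<{W_NEW}} | {HEADERS[3]}"

rows = [
    ("C04-Sex", 0.0, 0.0, "Stable"),
    ("C06-Marital Status", 7.23, 7.23, "Stable"),
]
lines = [header, "-" * len(header)] + [format_row(*row) for row in rows]
print("\n".join(lines))
```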
