#### Recalling consistent variables throughout the datasets

In [4]:
import os
import pandas as pd

# Base path where decoded surveys are stored
base_path = r"G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\Fully Decoded Surveys"

# Years to inspect
years = [str(y) for y in range(2018, 2025)]

# Dictionary to store variables per year
vars_per_year = {}

for year in years:
    year_folder = os.path.join(base_path, year)
    if not os.path.isdir(year_folder):
        continue
    
    cols_this_year = set()
    for file in os.listdir(year_folder):
        if file.endswith(".CSV"):
            df = pd.read_csv(os.path.join(year_folder, file), nrows=10)  # read only first 10 rows for speed
            cols_this_year.update(df.columns.tolist())
    
    vars_per_year[year] = cols_this_year
    print(f"[OK] {year}: {len(cols_this_year)} variables detected.")

# Find variables consistent across all years
consistent_vars = set.intersection(*vars_per_year.values())

print("\n===============================================")
print("CONSISTENT VARIABLES ACROSS 2018–2024")
print("===============================================\n")
for var in sorted(consistent_vars):
    print(var)

print(f"\nTotal consistent variables: {len(consistent_vars)}")


[OK] 2018: 54 variables detected.
[OK] 2019: 50 variables detected.
[OK] 2022: 78 variables detected.
[OK] 2023: 81 variables detected.
[OK] 2024: 79 variables detected.

CONSISTENT VARIABLES ACROSS 2018–2024

C03-Relationship to Household Head
C04-Sex
C05-Age as of Last Birthday
C06-Marital Status
C07-Highest Grade Completed
C08-Currently Attending School
C09-Graduate of technical/vocational course
C09a - Currently Attending Non-formal Training for Skills Development
C10-Overseas Filipino Indicator
C101-Line Number
C11-Work Indicator
C12-Job Indicator
C14-Primary Occupation
C16-Kind of Business (Primary Occupation)
C17-Nature of Employment (Primary Occupation)
C18-Normal Working Hours per Day
C19-Total Number of Hours Worked during the past week
C20-Want More Hours of Work
C21-Look for Additional Work
C22-First Time to Work
C23-Class of Worker (Primary Occupation)
C24-Basis of Payment (Primary Occupation)
C25-Basic Pay per Day (Primary Occupation)
C26-Other Job Indicator
C27-Number of

#### Recall FMI Summary Results

In [5]:
import os
import pandas as pd

# Define your base path again in this notebook
base_path = r"G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey"

# Point to the *new* FMI Reports folder
reports_root = os.path.join(base_path, "New FMI Reports")

# Path to the saved overall summary file
summary_path = os.path.join(reports_root, "FMI_Summary_2018_2024.csv")

# Load the summary
FMI_summary = pd.read_csv(summary_path)

# Inspect results
FMI_summary.head()


Unnamed: 0,Column,TotalMissing,TotalRows,AvgFMI,MonthsObserved,OverallFMI,Flag,Recommendation
0,C03-Relationship to Household Head,0,4881364,0.0,34,0.0,Low,Keep
1,C04-Sex,0,4881364,0.0,34,0.0,Low,Keep
2,C05-Age as of Last Birthday,82751,4881364,0.019262,34,0.016952,Low,Keep
3,C05B - Ethnicity,0,707981,0.0,1,0.0,Low,Keep
4,C06-Marital Status,358820,4881364,0.069921,34,0.073508,Moderate,Consider imputation


Variable names shall be harmonized after FMI aggregation to ensure consistent matching across survey years and reports. This cleaning step did not alter FMI values, only standardized labels for comparability since inconsistent naming conventions can be observed.

#### Crucial Decision: Variable Scope for FMI Analysis

During processing we observed a discrepancy between the **FMI summary** and the **consistent variable intersection**:

- **FMI Summary (78 variables)**  
  - Contains *all variables ever observed* across 2018–2024 monthly reports.  
  - Includes one‑off or year‑specific variables (e.g., `C05B - Ethnicity`, `C19B - Time of work`, `household_seq_number`).  
  - Some variables appear multiple times under slightly different names (e.g., `C18-Total Number of Hours Worked…` vs `Normal Working Hours per Day`).

- **Consistent Intersection (49 variables)**  
  - Contains only variables present in *every year* from 2018-2024.  
  - Ensures comparability across time and reduces imputation burden.  
  - Excludes one‑off variables and naming duplicates.

#### Decision Rationale (for team approval)
For **longitudinal labor force analysis**, we will:
- Use the **49 consistent variables** as the *core analysis set* to guarantee comparability and avoid heavy imputation for variables absent in some years.  
- Retain the **78‑variable FMI summary** for *documentation and transparency*, showing the full scope of variables encountered across survey waves.  
- Clearly flag the 29 extra variables as *year‑specific or inconsistent*, excluded from imputation but acknowledged in the appendix.

**Summary:**  
- *Analysis set*: 49 consistent variables (stable across 2018–2024).  
- *Documentation set*: 78 FMI summary variables (complete record, including one‑offs).  
- This two‑tier approach balances methodological rigor with transparency.


In [6]:
import re

reports_root = os.path.join(base_path, "New FMI Reports")
summary_path = os.path.join(reports_root, "FMI_Summary_2018_2024.csv")

# --- Load FMI summary ---
FMI_summary = pd.read_csv(summary_path)

# --- Canonicalization helper ---
def clean_name(name: str) -> str:
    if pd.isna(name):
        return name
    s = str(name).strip()
    s = s.lower()
    s = re.sub(r"\s+", " ", s)
    s = s.replace("–", "-").replace("—", "-")
    s = s.replace("’", "'").replace("“", '"').replace("”", '"')
    s = s.replace("...", "")
    # drop leading codes like "c18-" if present
    if s.startswith("c") and "-" in s:
        parts = s.split("-", 1)
        if parts[0][1:].isdigit():
            s = parts[1].strip()
    return s

# --- Apply cleaning ---
FMI_summary["Column_clean"] = FMI_summary["Column"].apply(clean_name)

# Assume you already have consistent_vars as a Python set
consistent_df = pd.DataFrame({"Column": list(consistent_vars)})
consistent_df["Column_clean"] = consistent_df["Column"].apply(clean_name)

# --- Merge on cleaned names ---
merged = pd.merge(
    consistent_df,
    FMI_summary,
    on="Column_clean",
    how="left",
    suffixes=("_consistent", "_fmi")
)

# Prefer FMI label if available
merged["Column"] = merged["Column_fmi"].fillna(merged["Column_consistent"])

# Flag missing merges explicitly
missing_mask = merged["OverallFMI"].isna()
merged.loc[missing_mask, ["Flag", "Recommendation"]] = ["No data", "Exclude (no observations)"]

# --- Final clean output ---
final_summary = merged.drop(columns=["Column_clean", "Column_consistent", "Column_fmi"])
print("\n===============================================")
print("FINAL CONSISTENT FMI SUMMARY (cleaned names)")
print("===============================================\n")
final_summary.head(50)



FINAL CONSISTENT FMI SUMMARY (cleaned names)



Unnamed: 0,TotalMissing,TotalRows,AvgFMI,MonthsObserved,OverallFMI,Flag,Recommendation,Column
0,358820,4881364,0.069921,34,0.073508,Moderate,Consider imputation,C06-Marital Status
1,2843116,4881364,0.568276,34,0.582443,Critical,Candidate to drop (validate with business logic),Look for Additional Work
2,1444345,4881364,0.29263,34,0.29589,High,Strongly consider imputation,"New Employment Criteria (jul 05, 2005)"
3,2840513,4881364,0.567852,34,0.58191,Critical,Candidate to drop (validate with business logic),Other Job Indicator
4,2896714,4881364,0.579249,34,0.593423,Critical,Candidate to drop (validate with business logic),Normal Working Hours per Day
5,570516,764619,0.746114,17,0.746144,Critical,Candidate to drop (validate with business logic),C26-Reason for not Looking for Work
6,2963387,4116745,0.724277,17,0.719837,Critical,Candidate to drop (validate with business logic),C34-Reason for not Looking for Work
7,3969850,4116745,0.964375,17,0.964318,Critical,Candidate to drop (validate with business logic),C27-Number of Jobs during the past week
8,0,4881364,0.0,34,0.0,Low,Keep,C03-Relationship to Household Head
9,20366,4881364,0.004591,34,0.004172,Low,Keep,Psu Number


In [15]:
import pandas as pd

# --- Apply imputation decision framework ---
def imputation_action(flag):
    if flag == "Low":
        return "Keep as is"
    elif flag == "Moderate":
        return "Simple imputation (mean/mode/forward fill)"
    elif flag == "High":
        return "Advanced imputation (regression/ML)"
    elif flag == "Critical":
        return "Candidate to drop (validate with business logic)"
    else:
        return "Review"

# Add decision column
final_summary["ImputationDecision"] = final_summary["Flag"].apply(imputation_action)

# --- Separate views for clarity ---
alerts_df = final_summary[final_summary["Flag"].isin(["High", "Critical"])][
    ["Column", "Flag", "Recommendation", "ImputationDecision"]
].reset_index(drop=True)

low_moderate_df = final_summary[final_summary["Flag"].isin(["Low", "Moderate"])][
    ["Column", "Flag", "Recommendation", "ImputationDecision"]
].reset_index(drop=True)

# --- Display settings to show all rows/columns ---
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# --- Notebook outputs (clean DataFrames) ---
alerts_df      # shows ALL High & Critical variables with decisions
low_moderate_df  # shows ALL Low & Moderate variables with decisions


Unnamed: 0,Column,Flag,Recommendation,ImputationDecision
0,C06-Marital Status,Moderate,Consider imputation,Simple imputation (mean/mode/forward fill)
1,C03-Relationship to Household Head,Low,Keep,Keep as is
2,Psu Number,Low,Keep,Keep as is
3,C05-Age as of Last Birthday,Low,Keep,Keep as is
4,Survey Month,Low,Keep,Keep as is
5,Survey Year,Low,Keep,Keep as is
6,C09-Work Indicator,Moderate,Consider imputation,Simple imputation (mean/mode/forward fill)
7,C11-Work Indicator,Moderate,Consider imputation,Simple imputation (mean/mode/forward fill)
8,Replicate,Low,Keep,Keep as is
9,C07-Highest Grade Completed,Moderate,Consider imputation,Simple imputation (mean/mode/forward fill)
