This notebook performs dimension-specific exploratory factor analysis (FA) on imputed Labor Force Survey data, following the approved theoretical framework of three dimensions: Sensitivity, Resilience, and Exposure.

Note that **these dimensions are not discovered by FA.** They are theory-driven constructs defined in a basis paper on financial vulnerability, which argues that vulnerability cannot be captured by a single latent dimension.


FA is used here to:

- test internal coherence of variables within each dimension
- identify possible subfactors inside Sensitivity, Resilience, or Exposure
- validate whether assigned variables behave as expected statistically

FA is not used to invent or redefine Sensitivity, Resilience, or Exposure.

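### Pipeline Data Dependencies

**NEW Variable Consistency Check**
- defines which variables are eligible for FA and which dimension they belong to

**Decision Matrix for Imputation**
- holds `Decision_Matrix.csv` (built in 09_Imputation.ipynb), which this notebook reads to select variables per dimension

**Imputed Data for Analysis**
- contains the actual numeric values to be analyzed

**FA_Results**
- purely derived outputs; the folder on the shared drive is always safe to delete and regenerate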
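
A missing input folder only surfaces when a later cell tries to read from it, so a quick pre-flight check may help. The sketch below is illustrative only: `check_inputs` is a hypothetical helper that is not part of this notebook, and the folder names are the ones this notebook reads from.

```python
from pathlib import Path

# Input folders this notebook reads from (names as used in the config cell).
REQUIRED_INPUTS = [
    "NEW Variable Consistency Check",
    "Imputed Data for Analysis",
    "Decision Matrix for Imputation",
]

def check_inputs(base_path: Path) -> list:
    """Hypothetical helper: return the required folders missing under base_path."""
    return [name for name in REQUIRED_INPUTS if not (base_path / name).is_dir()]
```

Calling `check_inputs(BASE_PATH)` right after the config is loaded would return an empty list when every input folder is present.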

In [1]:
!pip install factor-analyzer



In [2]:
import json
from pathlib import Path
import pandas as pd
import numpy as np

from factor_analyzer import (
    FactorAnalyzer,
    calculate_kmo,
    calculate_bartlett_sphericity
)

import matplotlib.pyplot as plt
import sklearn.utils.validation as suv

### Config loader

In [3]:

with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])

CONSISTENCY_ROOT = BASE_PATH / "NEW Variable Consistency Check"
IMPUTED_ROOT     = BASE_PATH / "Imputed Data for Analysis"
DECISION_ROOT    = BASE_PATH / "Decision Matrix for Imputation"

OUTPUT_RESULTS = BASE_PATH / "FA_Results"

# Always recreate the results directory from scratch (outputs are derived and disposable)
import shutil

if OUTPUT_RESULTS.exists():
    shutil.rmtree(OUTPUT_RESULTS)  # also handles nested subfolders
OUTPUT_RESULTS.mkdir(parents=True, exist_ok=True)


In [4]:
decision_df = pd.read_csv(DECISION_ROOT / "Decision_Matrix.csv")
decision_df.head()

|   | Variable | ConsistencyTag | OverallFMI | Flag | Dimension | Action |
|---|---|---|---|---|---|---|
| 0 | Available for Work | consistent | 0.96537 | Critical | Sensitivity | consider_drop_or_advanced |
| 1 | C03-Relationship to Household Head | consistent | 0.0 | Low | Resilience | keep |
| 2 | C04-Sex | consistent | 0.0 | Low | Resilience | keep |
| 3 | C05-Age as of Last Birthday | consistent | 0.016952 | Low | Resilience | keep |
| 4 | C05B - Ethnicity | inconsistent | 0.0 | Low | Resilience | exclude_from_FA |


The Decision Matrix (built in 09_Imputation.ipynb) already encodes:

- variable consistency
- missingness severity (FMI)
- theoretical dimension (S / R / E)

This notebook does not override those decisions (subject to change).

Imputation (09) was performed to preserve dataset completeness and comparability across survey waves. **However, factor analysis is restricted to variables with acceptable FMI and conceptual reliability, as documented in our Decision Matrix.** Variables with higher FMI are either excluded or treated as supporting diagnostics, not as core drivers of latent structure.

### Defining FA Eligibility Rules

In [5]:
FA_ELIGIBLE_ACTIONS = [
    "keep",
    "impute_light",
    "impute_cautious"
]

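**Interpretation:**

- `exclude_from_FA`: never used for factor extraction
- `sensitivity_only`: diagnostic or robustness checks only

Any action not listed in `FA_ELIGIBLE_ACTIONS` (for example `consider_drop_or_advanced`) is also left out, so high-risk variables are intentionally excluded at this stage.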
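
To see how the eligibility rule partitions variables, the sketch below tabulates eligible and excluded variables per dimension. It runs on a made-up stand-in for the Decision Matrix: the variable names and rows are invented for illustration, and only the column names and action labels come from this notebook.

```python
import pandas as pd

# Toy stand-in for the Decision Matrix; these rows are invented for illustration.
toy_decisions = pd.DataFrame({
    "Variable":  ["var_a", "var_b", "var_c", "var_d"],
    "Dimension": ["Sensitivity", "Resilience", "Resilience", "Exposure"],
    "Action":    ["keep", "exclude_from_FA", "impute_light", "sensitivity_only"],
})

FA_ELIGIBLE_ACTIONS = ["keep", "impute_light", "impute_cautious"]

# Label each variable, then count eligible vs. excluded per dimension.
toy_decisions["Status"] = (
    toy_decisions["Action"]
    .isin(FA_ELIGIBLE_ACTIONS)
    .map({True: "eligible", False: "excluded"})
)
eligibility_counts = pd.crosstab(toy_decisions["Dimension"], toy_decisions["Status"])
print(eligibility_counts)
```

Running the same tabulation on `decision_df` would show how many variables each dimension loses to the rule before any factor is extracted.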

In [6]:
def get_fa_variables(dimension: str):
    return decision_df[
        (decision_df["Dimension"] == dimension) &
        (decision_df["Action"].isin(FA_ELIGIBLE_ACTIONS))
    ]["Variable"].tolist()

SENSITIVITY_VARS = get_fa_variables("Sensitivity")
RESILIENCE_VARS  = get_fa_variables("Resilience")
EXPOSURE_VARS    = get_fa_variables("Exposure")

print("Sensitivity:", len(SENSITIVITY_VARS))
print("Resilience :", len(RESILIENCE_VARS))
print("Exposure   :", len(EXPOSURE_VARS))


Sensitivity: 2
Resilience : 7
Exposure   : 4


This code block enforces the rule that FA is run separately for each dimension:

- FA(Sensitivity only)
- FA(Resilience only)
- FA(Exposure only)
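
The split only holds if no variable is assigned to more than one dimension. A minimal sketch of such a check follows; `shared_variables` is a hypothetical helper and the toy mapping is invented for illustration.

```python
from itertools import combinations

def shared_variables(dimension_map: dict) -> dict:
    """Hypothetical helper: map each overlapping dimension pair to its shared variables."""
    overlaps = {}
    for (dim_a, vars_a), (dim_b, vars_b) in combinations(dimension_map.items(), 2):
        common = sorted(set(vars_a) & set(vars_b))
        if common:
            overlaps[(dim_a, dim_b)] = common
    return overlaps

# Toy mapping (invented): var_b is deliberately listed under two dimensions.
toy_map = {
    "Sensitivity": ["var_a", "var_b"],
    "Resilience": ["var_c"],
    "Exposure": ["var_b", "var_d"],
}
print(shared_variables(toy_map))  # {('Sensitivity', 'Exposure'): ['var_b']}
```

An empty dictionary means the dimension lists are disjoint.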

In [7]:
def normalize_name(name: str) -> str:
    return (
        str(name)
        .strip()
        .lower()
        .replace("\xa0", " ")
        .replace("-", " ")
        .replace("_", " ")
    )

This code block normalizes column names to avoid mismatches between spellings that mean the same thing (e.g., “Available for Work” vs “available_for_work”). It keeps variable matching consistent across years and pipelines. **If you tweak this, keep the transformations symmetric with imputation (see 09_Imputation; the same rules are applied in 10_Factor_Analysis). Changing normalization on one side only can break matching.**
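
A quick way to catch a broken match is to list the expected variables that are absent from a dataset's columns once both sides are normalized. The sketch below copies the rules of `normalize_name` so it runs on its own; `missing_after_normalization` is a hypothetical helper, and the sample column names are illustrative.

```python
def normalize_name(name: str) -> str:
    # Same rules as the notebook's normalize_name.
    return (
        str(name).strip().lower()
        .replace("\xa0", " ").replace("-", " ").replace("_", " ")
    )

def missing_after_normalization(expected, columns):
    """Hypothetical helper: normalized expected names not found among the columns."""
    available = {normalize_name(c) for c in columns}
    return [normalize_name(v) for v in expected if normalize_name(v) not in available]

# Different spellings of the same variables still match after normalization.
print(missing_after_normalization(
    ["Available for Work", "C04-Sex"],
    ["available_for_work", "C04 Sex\xa0"],
))  # []

# Runs of spaces are not collapsed: " - " becomes three spaces.
print(missing_after_normalization(["C05B - Ethnicity"], ["C05B Ethnicity"]))
```

The second call reports a miss because the rules replace the hyphen with a space without collapsing the surrounding spaces. Matching still works as long as both notebooks apply the same rules to the same raw names.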

In [8]:
all_dfs = []

for year_dir in IMPUTED_ROOT.iterdir():
    if not year_dir.is_dir():
        continue

    for file in year_dir.glob("imputed_*.csv"):
        df = pd.read_csv(file)
        df.columns = [normalize_name(c) for c in df.columns]
        all_dfs.append(df)

full_df = pd.concat(all_dfs, ignore_index=True)
print("Final shape:", full_df.shape)


Final shape: (4881364, 63)


In [9]:
SENSITIVITY_VARS = [normalize_name(v) for v in SENSITIVITY_VARS]
RESILIENCE_VARS  = [normalize_name(v) for v in RESILIENCE_VARS]
EXPOSURE_VARS    = [normalize_name(v) for v in EXPOSURE_VARS]

DIMENSION_MAP = {
    "Sensitivity": SENSITIVITY_VARS,
    "Resilience": RESILIENCE_VARS,
    "Exposure": EXPOSURE_VARS
}


In [10]:
def run_fa(df: pd.DataFrame, variables: list, dimension: str):
    X = df[variables].copy()

    # Convert non-numeric columns to integer category codes
    # (codes are arbitrary for nominal variables; missing values become -1)
    for col in X.columns:
        if not pd.api.types.is_numeric_dtype(X[col]):
            X[col] = pd.Categorical(X[col]).codes

    # Clean
    X = X.replace([np.inf, -np.inf], np.nan).dropna()
    X = X.loc[:, X.nunique() > 1]

    # Diagnostics
    kmo_all, kmo_model = calculate_kmo(X)
    chi2, bartlett_p = calculate_bartlett_sphericity(X)

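    # Eigenvalues of the correlation matrix; eigvalsh suits symmetric matrices
    # and returns real values, reversed here into descending order
    corr_matrix = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr_matrix)[::-1]

    # Kaiser criterion: keep factors with eigenvalue > 1 (at least one)
    n_factors = max(1, int(np.sum(eigenvalues > 1)))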

    # Final FA
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax")
    fa.fit(X)

    loadings = pd.DataFrame(
        fa.loadings_,
        index=X.columns,
        columns=[f"Factor_{i+1}" for i in range(n_factors)]
    )

    return {
        "dimension": dimension,
        "n_obs": X.shape[0],
        "n_vars": X.shape[1],
        "kmo": kmo_model,
        "bartlett_p": bartlett_p,
        "n_factors": n_factors,
        "eigenvalues": eigenvalues,
        "loadings": loadings
    }
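
The factor count in `run_fa` comes from the eigenvalues of the correlation matrix: every eigenvalue above 1 counts as one factor (the Kaiser criterion), with a floor of one. The sketch below illustrates that rule on synthetic data built from two independent latent factors; the data, the seed, and the noise level are assumptions chosen for illustration, not values from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two independent latent factors, each driving two observed variables.
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
])

corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # descending order

# Kaiser criterion with a floor of one factor, as in run_fa.
n_factors = max(1, int((eigenvalues > 1).sum()))
print(eigenvalues.round(3), n_factors)
```

For a dimension with only two variables, the correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so this rule can never return more than one factor there.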


In [11]:
summary_rows = []

for dim, vars_ in DIMENSION_MAP.items():
    print(f"\nRunning FA for {dim}")

    result = run_fa(full_df, vars_, dim)

    dim_dir = OUTPUT_RESULTS / dim
    dim_dir.mkdir(exist_ok=True)

    # Save loadings
    result["loadings"].to_csv(dim_dir / "factor_loadings.csv")

    # Save eigenvalues
    pd.DataFrame({
        "Eigenvalue": result["eigenvalues"]
    }).to_csv(dim_dir / "eigenvalues.csv", index=False)

    # Collect summary
    summary_rows.append({
        "Dimension": dim,
        "Observations": result["n_obs"],
        "Variables": result["n_vars"],
        "KMO": result["kmo"],
        "Bartlett_p": result["bartlett_p"],
        "Factors_Extracted": result["n_factors"]
    })

summary_df = pd.DataFrame(summary_rows)
summary_df.head()



Running FA for Sensitivity

Running FA for Resilience

Running FA for Exposure




|   | Dimension | Observations | Variables | KMO | Bartlett_p | Factors_Extracted |
|---|---|---|---|---|---|---|
| 0 | Sensitivity | 4881364 | 2 | 0.5 | 0.0 | 1 |
| 1 | Resilience | 4881364 | 7 | 0.520984 | 0.0 | 3 |
| 2 | Exposure | 4881364 | 4 | 0.430576 | 0.0 | 2 |


In [12]:
summary_df.to_csv(OUTPUT_RESULTS / "FA_Summary.csv", index=False)
