### 11. RFVI Construction: From Households to Regions

**Objective:**
To transform raw survey data into a single, comparable number: the **Regional Financial Vulnerability Index (RFVI)**.

**The Logic (Plain English):**
1.  **Scoring Households:** Imagine a household with 8 kids and no job. We use the weights from *Factor Analysis* (Notebook 10) to give this household a "Vulnerability Score."
2.  **Aggregating to Regions:** We cannot analyze 4 million households individually. We group them by Region (e.g., "all households in Bicol") and calculate the **average** score. This creates a "Representative Household Profile" for that region.
3.  **The Index:** We combine the three dimensions (Sensitivity, Resilience, Exposure) into one final index using the formula:
    $$RFVI = \frac{\text{Sensitivity} + (1 - \text{Resilience}) + \text{Exposure}}{3}$$
    *(Note: We invert Resilience because high resilience is good, meaning it lowers vulnerability.)*
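As a quick arithmetic check of the formula, here is a minimal sketch. The function name `rfvi` and the input values are hypothetical, and it assumes all three components have already been scaled to the 0-1 range.

```python
def rfvi(sensitivity: float, resilience: float, exposure: float) -> float:
    """Average the three components after inverting resilience.

    Assumes every input is already scaled to [0, 1].
    """
    components = {"sensitivity": sensitivity, "resilience": resilience, "exposure": exposure}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (sensitivity + (1.0 - resilience) + exposure) / 3.0


# Hypothetical region: moderately sensitive, highly resilient, lightly exposed.
example = rfvi(sensitivity=0.6, resilience=0.9, exposure=0.3)
print(round(example, 4))  # (0.6 + 0.1 + 0.3) / 3
```

High resilience (0.9) contributes only 0.1 to the total, which is why the example lands near the lower third of the scale.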

In [6]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# --- Load Central Config ---
# We use the shared config to ensure paths are identical to previous steps
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
IMPUTED_ROOT = BASE_PATH / "Imputed Data for Analysis"
FA_RESULTS   = BASE_PATH / "FA_Results"
OUTPUT_INDEX = BASE_PATH / "RFVI_Results"

OUTPUT_INDEX.mkdir(parents=True, exist_ok=True)
print(f"Reading FA Weights from: {FA_RESULTS}")
print(f"Saving Index to: {OUTPUT_INDEX}")

Reading FA Weights from: G:\My Drive\Labor Force Survey\FA_Results
Saving Index to: G:\My Drive\Labor Force Survey\RFVI_Results


### Step 1: Loading Validated Variables
**Why this matters:**
We do not manually pick variables here. Instead, we "ask" the Factor Analysis results (Notebook 10) which variables were statistically valid.
* **Safety Check:** If a variable was dropped in the previous step (e.g., due to low KMO), it will automatically be excluded here. This prevents bias and ensures we only use reliable data.
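The filtering idea can be illustrated with a small, self-contained sketch. The CSV contents, variable names, and loadings below are made up for illustration; the real `factor_loadings.csv` files from Notebook 10 may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical loadings file: first column is the variable name, second is Factor_1.
csv_text = """variable,Factor_1
household size,0.71
dependency ratio,0.64
"""
loadings = pd.read_csv(io.StringIO(csv_text), index_col=0)
weights = loadings.iloc[:, 0].to_dict()

# Hypothetical household table with one column that the FA step did not keep.
households = pd.DataFrame({
    "household size": [3, 8],
    "dependency ratio": [0.2, 0.9],
    "dropped variable": [1, 0],
})

# Only columns that appear in the loadings survive into the score.
used = [c for c in households.columns if c in weights]
print(used)
```

Because the variable list comes from the loadings file, a column such as `dropped variable` is ignored without any manual edit to this notebook.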

In [8]:
def load_dimension_metadata(dim_name):
    """
    Reads the factor loadings from Notebook 10 to identify surviving variables.
    Returns: A dictionary of {Variable: Weight}
    """
    path = FA_RESULTS / dim_name / "factor_loadings.csv"
    if not path.exists():
        print(f"[WARNING] No FA results found for {dim_name}. Assuming weights=0.")
        return {}
    
    # Read the CSV (Index = Variable Name, Col 0 = Factor Weight)
    df = pd.read_csv(path, index_col=0)
    
    # We use the primary factor (Factor_1) as the weight
    return df.iloc[:, 0].to_dict()

# --- Load Weights ---
sensitivity_weights = load_dimension_metadata("Sensitivity")
resilience_weights  = load_dimension_metadata("Resilience")
exposure_weights    = load_dimension_metadata("Exposure")

print(f"Loaded {len(sensitivity_weights)} Sensitivity Variables.")
print(f"Loaded {len(resilience_weights)} Resilience Variables.")
print(f"Loaded {len(exposure_weights)} Exposure Variables.")

Loaded 2 Sensitivity Variables.
Loaded 7 Resilience Variables.
Loaded 4 Exposure Variables.


### Step 2: Standardization (Z-Scores)
**Why we do this:**
Our data has different units. *Income* is in thousands, while *Household Size* is in single digits (1-10). If we simply added them, Income would overwhelm the calculation.
* **The Fix:** We convert everything to **Z-Scores**. This puts all variables on the same scale (mean 0, standard deviation 1). A score of +1.0 means "one standard deviation above average," and -1.0 means "one standard deviation below average."
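A minimal sketch of the transformation on toy numbers follows. The data and the two weights are hypothetical, and the Z-scores are computed by hand with NumPy using the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy data: column 0 is income, column 1 is household size (values are made up).
X = np.array([
    [10_000.0, 2.0],
    [20_000.0, 4.0],
    [30_000.0, 6.0],
])

# Z-score: subtract the column mean, divide by the column standard deviation (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical factor loadings used as weights, one per column.
w = np.array([-0.5, 0.8])
scores = Z @ w

print(Z.round(3))
print(scores.round(3))
```

After standardization both columns carry identical values here, so the weighted score depends on the loadings rather than on the original units.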

In [9]:
def calculate_score(df, weights_dict):
    """
    Standardizes data and calculates the weighted sum based on FA loadings.
    """
    # 1. Filter to valid variables only
    valid_vars = [v for v in weights_dict.keys() if v in df.columns]
    if not valid_vars:
        return np.zeros(len(df))

    # 2. Extract Data
    X = df[valid_vars].copy()
    
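    # 3. Coerce to numeric: non-numeric entries become NaN and are then set to a raw 0
    #    (note this is 0 in the original units, not the column mean)
    X = X.apply(pd.to_numeric, errors='coerce').fillna(0)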
    
    # 4. Standardize (Z-Score transformation)
    # This ensures Age (0-100) and Income (0-1M) are comparable
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # 5. Calculate Score: Data * Weight
    w = np.array([weights_dict[v] for v in valid_vars])
    return np.dot(X_scaled, w)

### Step 3: Regional Aggregation
**What happens here:**
We calculated a score for every single household. Now, we group them by **Region** and take the **Mean (Average)**.
* **Interpretation:** The resulting number represents the "Average Vulnerability Level" of that region for that specific year. This condenses millions of rows into a clean table of ~17 regions.
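The aggregation step can be sketched on a toy table. The scores below are hypothetical, and the mean is unweighted, matching the plain average described above.

```python
import pandas as pd

# Four hypothetical households in two regions.
households = pd.DataFrame({
    "region": ["Bicol", "Bicol", "CAR", "CAR"],
    "S_Score": [1.0, 0.0, -1.0, -0.5],
})

# One row per region, holding the average household score.
regional = households.groupby("region")[["S_Score"]].mean().reset_index()
print(regional)
```

Each region collapses to a single row, so the size of the output depends on the number of distinct region labels rather than on the number of households.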

In [10]:
regional_aggregates = []

# Loop through all imputed years
for year_dir in IMPUTED_ROOT.iterdir():
    if not year_dir.is_dir(): continue
    
    print(f"Processing Year: {year_dir.name}...")
    
    # Load all CSVs for that year
    year_dfs = []
    for f in year_dir.glob("imputed_*.csv"):
        d = pd.read_csv(f)
        # Normalize columns to match the weights dictionary
        d.columns = [str(c).strip().lower().replace("\xa0", " ").replace("-", " ").replace("_", " ") for c in d.columns]
        year_dfs.append(d)
        
    if not year_dfs: continue
    full_df = pd.concat(year_dfs, ignore_index=True)
    
    # --- CALCULATE HOUSEHOLD SCORES ---
    full_df['S_Score'] = calculate_score(full_df, sensitivity_weights)
    full_df['R_Score'] = calculate_score(full_df, resilience_weights)
    full_df['E_Score'] = calculate_score(full_df, exposure_weights)
    
    # --- AGGREGATE TO REGION ---
    # Group by Region -> Take Mean
    grouped = full_df.groupby('region')[['S_Score', 'R_Score', 'E_Score']].mean().reset_index()
    grouped['Year'] = year_dir.name
    
    regional_aggregates.append(grouped)

# Combine all years into one final dataset
final_df = pd.concat(regional_aggregates, ignore_index=True)
print(f"Aggregation Complete. Final shape: {final_df.shape}")

Processing Year: 2024...




Processing Year: 2023...
Processing Year: 2022...
Processing Year: 2019...
Processing Year: 2018...
Aggregation Complete. Final shape: (118, 5)


### Step 4: Normalization & Index Calculation
**Why normalize?**
Factor scores are unbounded values such as -1.5 or +2.3. To make the index intuitive, we rescale each component to the 0-1 range using **Min-Max Normalization**.
* **0.0** = Lowest value in the dataset (best for Sensitivity and Exposure, worst for Resilience)
* **1.0** = Highest value in the dataset (worst for Sensitivity and Exposure, best for Resilience)

**The RFVI Formula:**
`RFVI = (Sensitivity + (1 - Resilience) + Exposure) / 3`
* We use **(1 - Resilience)** because Resilience is good. If a region has High Resilience (1.0), it should contribute **0** to vulnerability.
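The scaling and inversion can be checked by hand with a short sketch. The helper `min_max` is a hypothetical hand-rolled stand-in for column-wise min-max scaling, and the resilience scores are made up.

```python
import numpy as np


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values identical: no spread to rescale, so return zeros.
        return np.zeros_like(values)
    return (values - values.min()) / span


# Hypothetical regional resilience scores (means of weighted Z-scores).
r_score = [2.0, 0.0, -2.0]
resilience_norm = min_max(r_score)
inv_resilience = 1.0 - resilience_norm

print(resilience_norm)
print(inv_resilience)
```

The most resilient region ends up contributing 0 to vulnerability and the least resilient contributes 1. Because the minimum and maximum come from the data, the normalized values are relative to the regions and years included in the dataset.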

In [11]:
# 1. Normalize Components (0 to 1 scale)
scaler = MinMaxScaler()
cols = ['S_Score', 'R_Score', 'E_Score']
norm_cols = ['Sensitivity_Norm', 'Resilience_Norm', 'Exposure_Norm']

final_df[norm_cols] = scaler.fit_transform(final_df[cols])

# 2. Invert Resilience (High Resilience = Low Vulnerability)
final_df['Inv_Resilience'] = 1 - final_df['Resilience_Norm']

# 3. Calculate Final RFVI (Simple Average)
final_df['RFVI'] = (
    final_df['Sensitivity_Norm'] + 
    final_df['Inv_Resilience'] + 
    final_df['Exposure_Norm']
) / 3

# 4. Save
out_path = OUTPUT_INDEX / "Regional_Financial_Vulnerability_Index.csv"
final_df.to_csv(out_path, index=False)

print("Index Construction Complete.")
print("Sample Results:")
print(final_df[['Year', 'region', 'RFVI']].head())

Index Construction Complete.
Sample Results:
   Year                                        region      RFVI
0  2024  Autonomous Region in Muslim Mindanao  (ARMM)  0.100328
1  2024   Autonomous Region in Muslim Mindanao (ARMM)  0.332981
2  2024       Cordillera Administrative Region  (CAR)  0.111880
3  2024        Cordillera Administrative Region (CAR)  0.354247
4  2024                               MIMAROPA Region  0.115327
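In the sample above, ARMM and CAR each appear twice with labels that differ only in spacing, which suggests the `region` column contains near-duplicate strings that `groupby` treats as separate regions. One possible fix, sketched here as an assumption rather than as part of the original pipeline, is to collapse repeated whitespace before grouping:

```python
import pandas as pd

# Labels copied from the sample output; the pairs differ by one extra space.
labels = pd.Series([
    "Autonomous Region in Muslim Mindanao  (ARMM)",
    "Autonomous Region in Muslim Mindanao (ARMM)",
    "Cordillera Administrative Region  (CAR)",
    "Cordillera Administrative Region (CAR)",
])

# Collapse any run of whitespace into a single space and trim the ends.
cleaned = labels.str.replace(r"\s+", " ", regex=True).str.strip()

print(labels.nunique(), "->", cleaned.nunique())
```

Applying the same cleanup to `full_df['region']` before the `groupby` would merge such pairs, which would change the row count and the RFVI values reported above.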


### Step 5: Interpretation of Results

The table above represents the finalized **Regional Financial Vulnerability Index (RFVI)**. Here is how to interpret the specific columns generated by the pipeline:

#### **A. Raw Factor Scores (`S_Score`, `R_Score`, `E_Score`)**
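These are regional means of the household-level scores, where each household score is a weighted sum of Z-scores computed from the microdata.
* **Scale:** Centered around 0, because variables are standardized across all households within each year.
* **Interpretation:**
    * **Positive values (+):** The region is *above the national average* for this dimension in that year.
    * **Negative values (-):** The region is *below the national average* for this dimension in that year.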
    * *Note:* A negative `S_Score` (Sensitivity) is good (less sensitive), while a negative `R_Score` (Resilience) is bad (less resilient).

#### **B. Normalized Scores (`_Norm` columns)**
To make the index comparable, we scaled the raw scores from 0 to 1 based on the minimum and maximum values found across the entire dataset.
* **0.0:** The lowest value observed in the dataset.
* **1.0:** The highest value observed in the dataset.

#### **C. The "Inverse Resilience" Logic (`Inv_Resilience`)**
This is a critical transformation step in our methodology.
* **Concept:** Resilience acts as a *shield*. High resilience reduces vulnerability.
* **The Math:** `Inv_Resilience = 1 - Resilience_Norm`
* **Result:**
    * If a region has **High Resilience** (e.g., 0.9), its vulnerability contribution becomes **Low** (0.1).
    * If a region has **Low Resilience** (e.g., 0.2), its vulnerability contribution becomes **High** (0.8).

#### **D. The Final Index (`RFVI`)**
This is the composite score used for ranking and clustering.
* **Formula:** Average of `Sensitivity_Norm`, `Exposure_Norm`, and `Inv_Resilience`.
* **Scale:** 0 to 1.
    * **Closer to 0:** **Low Vulnerability** (The region is stable, resilient, and safe).
    * **Closer to 1:** **High Vulnerability** (The region is highly sensitive, exposed, and lacks resilience).
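As an illustration of how the index might be used for ranking, the sketch below ranks regions within each year. The region names, RFVI values, and the `Rank` column are hypothetical and not part of the saved output.

```python
import pandas as pd

# Hypothetical index values for two regions over two years.
index_df = pd.DataFrame({
    "Year": ["2023", "2023", "2024", "2024"],
    "region": ["Region A", "Region B", "Region A", "Region B"],
    "RFVI": [0.40, 0.25, 0.10, 0.35],
})

# Rank 1 = most vulnerable region in that year.
index_df["Rank"] = (
    index_df.groupby("Year")["RFVI"]
    .rank(ascending=False, method="min")
    .astype(int)
)
print(index_df)
```

Ranking within a year keeps comparisons between regions separate from changes over time.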