# 00_repo_audit_and_config_2024.ipynb

## Part A — What we are doing

We **discover and standardize** the real (dataset-specific) column names for core fields used throughout the California VAT rebate analysis. Because PolicyEngine builds can rename variables, this notebook writes a canonical mapping to `config/columns.yaml` so every later step can use consistent names.

**Core fields we must resolve (household level, 2024):**
- AGI (e.g., `adjusted_gross_income`)
- Wages (e.g., `employment_income`)
- Household size (e.g., `household_size`)
- Household weight (e.g., `household_weight`)
- Federal income tax (e.g., `income_tax`)
- California income tax (e.g., `ca_income_tax`)
- State code (e.g., `state_code`)

**Outputs**
- `config/columns.yaml` — mapping from generic → actual names.
- A short audit printout (counts, missing checks).

**Why this matters**
- Locks column names so downstream notebooks don’t break when variable labels shift.
- Prevents subtle errors (e.g., summing the wrong weight field).

---

## Part B — How we do it

1. **Probe candidates**  
   For each generic field, we try an ordered list of candidate variable names (e.g., for weights: `household_weight`, `hh_weight`, `weight`) and keep the first one that PolicyEngine can calculate at the household level for 2024. Candidates that raise an error are logged and skipped.

2. **Sanity checks**
   - Verify that **California households exist** by comparing the decoded `state_code` values against `CA` (case-insensitive).
   - Check that AGI and wages are **not entirely missing** for the 2024 period.
   - Confirm **positive total weight** overall and **within CA**.

3. **Write the map**
   - Save the resolved mapping to `config/columns.yaml` (YAML, UTF-8).
   - Echo the map in the cell output for quick review.
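The sanity checks above can be sketched without PolicyEngine. This is a minimal illustration on toy data, not the notebook's own implementation: `is_california` and `weight_totals` are hypothetical helpers, and accepting FIPS `6` alongside `CA` is an assumption about how a more tolerant state match might look.

```python
def is_california(value):
    """Return True for 'CA' (any case, padded) or FIPS code 6 given as int, float, or string."""
    text = str(value).strip().upper()
    if text == "CA":
        return True
    try:
        return float(text) == 6.0
    except ValueError:
        # Any other non-numeric label (e.g. 'NY') is not California.
        return False


def weight_totals(states, weights):
    """Return (total weight, total weight of the California rows)."""
    total = sum(weights)
    ca_total = sum(w for s, w in zip(states, weights) if is_california(s))
    return total, ca_total


# Hypothetical toy data, not PolicyEngine output.
states = ["CA", "ca", 6, "06", "NY", "TX"]
weights = [100.0, 50.0, 25.0, 25.0, 300.0, 200.0]

total, ca_total = weight_totals(states, weights)
assert total > 0, "Total household weight must be positive."
assert ca_total > 0, "California household weight must be positive."
print(total, ca_total)
```

With real data, the same two assertions would run on the resolved weight column and the California mask.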

---

## Part C — Dependencies & connections

- **Inputs**: PolicyEngine household arrays for 2024 (no files required).
- **Downstream**: All later notebooks (`01`–`06`) load `config/columns.yaml`.  
  If any later notebook can’t find a column, re-run this one.

---

## Part D — Deliverables & acceptance checks

**File written**
- `config/columns.yaml`

**Acceptance checks**
- File exists and includes the keys written by the code cell: `agi`, `wages`, `hh_size`, `weight`, `fed_tax`, `state_tax`, `filing_status`. (`state_code` is checked directly and is not stored in the map.)
- Printed counts show **non-zero CA households**.
- No missing or obviously broken values in resolved columns.
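The key check can be automated with a few lines. The sketch below is an illustration under stated assumptions: `check_column_map` is a hypothetical helper, the key names are the ones in the `col_map` dictionary written by this notebook's code cell, and the example values are the names resolved in the run shown in this notebook.

```python
# Keys written by the code cell in this notebook.
REQUIRED_KEYS = ("agi", "wages", "hh_size", "weight", "fed_tax", "state_tax", "filing_status")


def check_column_map(col_map, required=REQUIRED_KEYS):
    """Return a list of problems; an empty list means the map passes."""
    problems = []
    for key in required:
        if key not in col_map:
            problems.append(f"missing key: {key}")
        elif not isinstance(col_map[key], str) or not col_map[key].strip():
            problems.append(f"empty or non-string value for key: {key}")
    return problems


# Example map using the names resolved in the run shown in this notebook.
example = {
    "agi": "adjusted_gross_income",
    "wages": "employment_income",
    "hh_size": "household_size",
    "weight": "household_weight",
    "fed_tax": "income_tax",
    "state_tax": "ca_income_tax",
    "filing_status": "filing_status",
}

print(check_column_map(example))       # complete map
print(check_column_map({"agi": ""}))   # one empty value, the other keys missing
```

In practice the dictionary would come from `yaml.safe_load` on the written file rather than from a literal.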

---

## Part E — Troubleshooting

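- **Different working directory**: The code cell writes to `../config/columns.yaml`, a path relative to the directory the notebook runs in (one level below the repo root). If a later notebook cannot find the file, run it from the same directory or use an absolute path.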
- **Field not found**: Add more candidate names for the missing variable and rerun.
- **Weights zero/negative**: Confirm you’re using **household** weights (not person-level) and that household aggregation was requested from PolicyEngine.
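One way to make the path lookup independent of the working directory is to search upward from the current directory. This is a sketch, not part of the existing notebooks: `find_config` is a hypothetical helper, and the temporary directory layout in the demo is invented for illustration.

```python
import tempfile
from pathlib import Path


def find_config(relative="config/columns.yaml", start=None):
    """Walk up from `start` (default: current directory) until `relative` exists."""
    start = (Path(start) if start is not None else Path.cwd()).resolve()
    for directory in (start, *start.parents):
        candidate = directory / relative
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{relative} not found in {start} or any parent directory")


# Demo in a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "config").mkdir()
    (root / "config" / "columns.yaml").write_text("agi: adjusted_gross_income\n", encoding="utf-8")
    notebooks = root / "notebooks"
    notebooks.mkdir()
    found = find_config(start=notebooks)
    print(found.relative_to(root))
```

A later notebook could call `find_config()` with no arguments and pass the result to `open`, so it works from the repo root or from any subdirectory.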


In [3]:
# 00 — Repo audit & config (2024 only)
import os
import sys

import numpy as np
import pandas as pd
import yaml
from policyengine_us import Microsimulation

print("Step 00 start.")

sim = Microsimulation()
YEAR = 2024

def try_household(var, decode=True):
    """Try household-mapped; return (ok, series_or_error_str)."""
    try:
        s = sim.calculate(var, map_to="household", period=YEAR, decode_enums=decode)
        return True, pd.Series(s)
    except Exception as e:
        return False, str(e)

def pick_first(candidates, *, decode=True, required=True):
    for v in candidates:
        ok, s = try_household(v, decode=decode)
        if ok:
            print(f"  ✓ {v} (household) len={len(s)}")
            return v, s
        else:
            print(f"  · {v} unavailable ({s})")
    if required:
        raise KeyError(f"None available: {candidates}")
    return None, None

print("\nDetecting columns for household entity…")
agi_var,  agi_s   = pick_first(["adjusted_gross_income","household_agi","agi_household","agi_hh","agi"], decode=False)
wage_var, wage_s  = pick_first(["employment_income","wages","wage_income","labor_income"], decode=False)
size_var, size_s  = pick_first(["household_size","hh_size","household_members","family_size"], decode=False)
wt_var,   wt_s    = pick_first(["household_weight","hh_weight","weight","marsupwt","asec_weight"], decode=False)
fed_var,  fed_s   = pick_first(["income_tax"], decode=False)
st_var,   st_s    = pick_first(["ca_income_tax"], decode=False)

# Filing status is often not household-mapped. We'll still record the name if it exists anywhere,
# but we WON'T depend on it later.
fs_candidates = ["filing_status","tax_unit_filing_status","filingstatus"]
fs_avail = []
for v in fs_candidates:
    ok, s = try_household(v, decode=True)
    if ok:
        fs_avail.append(v)
if not fs_avail:
    # Just store the canonical key so downstream code has a column name to write to.
    fs_avail = ["filing_status"]
print("\nFiling status candidates (record only):", fs_avail)

# Basic CA sample check
ok_state, state_s = try_household("state_code", decode=True)
if not ok_state:
    raise RuntimeError(f"state_code not available at household level: {state_s}")
mask_ca = state_s.astype(str).str.upper().eq("CA")
print("CA households (raw, 2024):", int(mask_ca.sum()))

# Build lightweight frame to sanity check
df0 = pd.DataFrame({
    agi_var:  agi_s,
    wage_var: wage_s,
    size_var: size_s,
    wt_var:   wt_s,
    fed_var:  fed_s,
    st_var:   st_s
})
print("\nSample rows (any state):")
print(df0.head(3))

# Write config/columns.yaml
os.makedirs("../config", exist_ok=True)
col_map = {
    "agi": agi_var,
    "wages": wage_var,
    "hh_size": size_var,
    "weight": wt_var,
    "fed_tax": fed_var,
    "state_tax": st_var,
    "filing_status": fs_avail[0],  # will be overwritten in Step 01 with derived statuses
}
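# Part B specifies UTF-8, so set the encoding explicitly instead of relying on the platform default.
with open("../config/columns.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(col_map, f, sort_keys=False)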

print("\nWrote ../config/columns.yaml:")
print(col_map)

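# Quick checks
assert mask_ca.any(), "No CA households found with state_code=='CA'."
assert df0[agi_var].notna().any(), "AGI appears all missing."
assert df0[wage_var].notna().any(), "Wages appear all missing."
# Weight checks described in Part B: positive total weight overall and within CA.
assert df0[wt_var].sum() > 0, "Total household weight is not positive."
assert df0.loc[mask_ca.to_numpy(), wt_var].sum() > 0, "CA household weight is not positive."
print("\n✅ Step 00 complete.")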


Step 00 start.

Detecting columns for household entity…
  ✓ adjusted_gross_income (household) len=21251
  ✓ employment_income (household) len=21251
  ✓ household_size (household) len=21251
  ✓ household_weight (household) len=21251
  ✓ income_tax (household) len=21251
  ✓ ca_income_tax (household) len=21251

Filing status candidates (record only): ['filing_status']
CA households (raw, 2024): 1777

Sample rows (any state):
   adjusted_gross_income  employment_income  household_size  household_weight  \
0          107805.242188        4022.857178               2      24047.990234   
1           85387.771484       92190.474609               3      13475.582031   
2           23692.609901           0.000000               2        186.740341   

    income_tax  ca_income_tax  
0  8968.628906            0.0  
1  4438.604889            0.0  
2   188.025589            0.0  

Wrote ../config/columns.yaml:
{'agi': 'adjusted_gross_income', 'wages': 'employment_income', 'hh_size': 'household_size'