# 01_data_prep_ca_2024.ipynb

## Part A — What we are doing

We construct the **California analysis panel (2024)**, the master dataset used by all subsequent notebooks. This includes:
- Filtering to California households.
- Excluding **negative AGI** households.
- Deriving analysis status (**Single vs Married Households**) from **spouse presence in the household**.
- Creating household-size buckets (1 through 7, where 7 means seven or more).
- **Calibrating weights** to the official 2023 household distribution by status × size.
- **Scaling weights** to the 2024 benchmark (14.8 million households).
- Computing the **consumption allowance** schedule.
- Applying the **AGI-based phase-out**.
- Writing the panel to `intermediate/ca_panel_2024.parquet` (or `.csv`).

**Outputs (core)**
- `intermediate/ca_panel_2024.parquet` or `intermediate/ca_panel_2024.csv`
- `intermediate/weight_scaling_2024.json` (metadata on calibration & scaling)
- A weighted **household size × status** diagnostic table.

**Why this matters**  
All totals, distributional analyses, and marginal tax rate (MTR) calculations rely on this standardized, reproducible panel. Anchoring weights to ACS-based benchmarks keeps statewide totals and household composition consistent with published counts.

---

## Part B — Inputs & prior steps

- **Reads:** `config/columns.yaml` produced by `00_repo_audit_and_config_2024.ipynb`.
- **Draws:** PolicyEngine household arrays for 2024, mapped to the column names resolved in `config/columns.yaml`.
- **Benchmarks:** 2023 ACS/DOF household counts by status × size bucket.

---

## Part C — Identify California households

We build a **robust CA filter** that recognizes:
- String codes: `"CA"`, `"California"`
- FIPS codes: `6` (integer) and `"06"` (zero-padded string)

We keep only households whose `state_code` matches one of these forms.

**Result:** A boolean mask `MASK_CA` used to slice all arrays and fields.
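
A minimal sketch of such a filter, assuming the state column may mix strings and integers. The helper name `is_california`, the `CA_TOKENS` set, and the sample values are illustrative and not taken from the notebook:

```python
import pandas as pd

# Hypothetical token set: every form of "California" the filter should accept.
CA_TOKENS = {"CA", "CALIFORNIA", "6", "06"}

def is_california(state_col: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where the value denotes California."""
    # Normalize to trimmed, upper-case strings so 6, "6", and " ca " compare alike.
    # Floats such as 6.0 become "6.0" and would need extra handling.
    normalized = state_col.astype(str).str.strip().str.upper()
    return normalized.isin(CA_TOKENS)

sample = pd.Series(["CA", "california ", 6, "06", "NY", 36])
MASK_CA = is_california(sample)
print(MASK_CA.tolist())
```

Normalizing to strings first keeps the comparison in one place, so adding another accepted form only means extending the token set.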

---

## Part D — Select weights

We map the chosen **household weight** to `household_weight` in the panel. We verify:
- Total weight overall and within CA are **positive**.
- The weight is at the household level (not person-level).

**Why this matters**  
All totals and shares depend on this weight. Using the wrong weight distorts the entire analysis.
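
The two checks can be sketched as follows. This is an illustration under stated assumptions: `check_household_weights` is a hypothetical helper, and comparing lengths is only a weak proxy for "household-level" (a person-level vector would normally be longer than the household mask).

```python
import pandas as pd

def check_household_weights(weights: pd.Series, mask_ca: pd.Series) -> dict:
    """Validate a household-level weight vector and return its totals."""
    # One weight per household: lengths must agree with the household mask.
    if len(weights) != len(mask_ca):
        raise ValueError("weights and mask_ca must have one entry per household")
    total_all = float(weights.sum())
    total_ca = float(weights[mask_ca].sum())
    # Both the overall total and the CA total must be strictly positive.
    if total_all <= 0 or total_ca <= 0:
        raise ValueError(f"non-positive weight total: all={total_all}, CA={total_ca}")
    return {"total_all": total_all, "total_ca": total_ca}

# Toy values, not real survey weights.
weights = pd.Series([1200.0, 800.0, 500.0])
mask_ca = pd.Series([True, False, True])
totals = check_household_weights(weights, mask_ca)
print(totals)
```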

---

## Part E — Derive analysis “filing status” (Single vs Married Households)

We do **not** trust tax-unit filing status for household grouping because spouses can file jointly while not co-residing. Instead:

- **Married Households** (= MFJ proxy):  
  spouse present (e.g., `spouse_present == True` or `head_spouse_count ≥ 2`) **AND** not Head-of-Household **AND** `household_size ≥ 2`.
- **Single Households**: everyone else (including HOH and MFS).

**Invariant**  
There should be **no Married Households with size 1**.  
If such rows appear, we reclassify them to Single but preserve their weights.  

**Singles can have size ≥ 2** (single parent + kids, multigenerational without a spouse).

We store:
- `is_married_couple` ∈ {0,1}
- `filing_status` ∈ {`"single"`, `"mfj"`}
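
The classification rule above can be sketched as a single boolean expression. The function `derive_status` and the toy rows are assumptions for illustration; the notebook's own cell derives the spouse signal from whichever PolicyEngine variable is available.

```python
import numpy as np
import pandas as pd

def derive_status(spouse_present: pd.Series, hoh: pd.Series,
                  household_size: pd.Series) -> pd.DataFrame:
    """Married = spouse present AND not head-of-household AND size >= 2."""
    married = spouse_present & ~hoh & (household_size >= 2)
    return pd.DataFrame({
        "is_married_couple": married.astype(int),
        "filing_status": np.where(married, "mfj", "single"),
    })

# Toy households: a couple, an HOH with a spouse flag, a single parent,
# and an inconsistent "spouse present but size 1" row.
demo = pd.DataFrame({
    "spouse_present": [True, True, False, True],
    "hoh": [False, True, False, False],
    "household_size": [2, 3, 4, 1],
})
status = derive_status(demo["spouse_present"], demo["hoh"], demo["household_size"])
print(status)
```

Folding the size condition into the rule means the "no Married Households with size 1" invariant holds by construction, and weights are untouched because only the label changes.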

---

## Part F — Calibrate weights to ACS/DOF 2023 benchmarks

- The raw CPS-based household weights overstate the California household count (they sum to ~16M households).
- We recalibrate to match the **ACS/DOF 2023 household distribution** by status × size bucket:

  - Singles: 1-person = 3.379M, 2-person = 1.255M, … 7+ = 0.069M  
  - Married: 2-person = 2.472M, 3-person = 1.359M, … 7+ = 0.241M  

- The calibration step computes one scale factor per bucket: target count ÷ weighted CPS count in that bucket.
- Each household’s weight is multiplied by the factor for its status × size bucket.
- This preserves relative weights within each bucket while fixing the statewide composition.
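
A minimal sketch of the bucket calibration, assuming the panel already carries `filing_status` and `size_bucket` columns. The helper `calibrate_to_targets`, the `calibration_factor` column, and the toy weights are hypothetical; the two target values are the 1-person single and 2-person married figures quoted above.

```python
import pandas as pd

def calibrate_to_targets(df: pd.DataFrame, targets: dict,
                         weight_col: str = "household_weight") -> pd.DataFrame:
    """Rescale weights so each status x size bucket sums to its target."""
    keys = ["filing_status", "size_bucket"]
    # Weighted CPS total of the bucket each row belongs to.
    current = df.groupby(keys)[weight_col].transform("sum")
    # Target for each row's bucket; a bucket missing from `targets` raises KeyError.
    target = pd.Series(
        [targets[(s, b)] for s, b in zip(df["filing_status"], df["size_bucket"])],
        index=df.index, dtype=float,
    )
    out = df.copy()
    out["calibration_factor"] = target / current
    out[weight_col] = out[weight_col] * out["calibration_factor"]
    return out

# Toy panel: weights are made up; targets come from the benchmark list above.
demo = pd.DataFrame({
    "filing_status": ["single", "single", "mfj"],
    "size_bucket": [1, 1, 2],
    "household_weight": [2_000_000.0, 2_000_000.0, 3_000_000.0],
})
targets = {("single", 1): 3_379_000.0, ("mfj", 2): 2_472_000.0}
calibrated = calibrate_to_targets(demo, targets)
print(calibrated)
```

Because every row in a bucket is multiplied by the same factor, the ratio between any two households in that bucket is unchanged.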

---

## Part G — Scale weights to 2024 total

- After bucket calibration, the 2023 total is ≈13.6M households.  
- We apply a **uniform scale factor** so the final weighted total equals **14.8M households** (Kyle’s 2024 benchmark).  
- Metadata on both calibration and scaling is saved to `intermediate/weight_scaling_2024.json`.
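
As a worked example of the uniform step, using the two totals quoted above (≈13.6M after calibration, 14.8M benchmark). The function name and the metadata keys are illustrative assumptions, not the schema the notebook writes:

```python
import json

def uniform_scale_factor(calibrated_total: float, benchmark_total: float) -> float:
    """Single multiplier that moves the calibrated total onto the benchmark."""
    if calibrated_total <= 0:
        raise ValueError("calibrated_total must be positive")
    return benchmark_total / calibrated_total

calibrated_total_2023 = 13_600_000.0   # approximate total after bucket calibration
benchmark_total_2024 = 14_800_000.0    # 2024 household benchmark
factor = uniform_scale_factor(calibrated_total_2023, benchmark_total_2024)

# Hypothetical metadata layout for illustration only.
meta = {
    "year": 2024,
    "calibrated_total_2023": calibrated_total_2023,
    "benchmark_total_2024": benchmark_total_2024,
    "uniform_scale_factor": factor,
}
print(json.dumps(meta, indent=2))
```

With these inputs the factor is about 1.088, so every household weight rises by roughly 8.8% and the status × size shares set in Part F are unchanged.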

---

## Part H — Compute consumption allowance & phase-out

- `size_bucket = min(household_size, 7)` (integer 1..7).
- Consumption-allowance schedules (poverty-guideline based) are applied by status and size.
- **Guard:** married allowance is never applied to size 1.
- Phase-out thresholds and ranges (after update):
  - `THRESHOLDS  = {"single": 50_000, "mfj": 100_000}`
  - `PHASE_RANGE = {"single": 50_000, "mfj": 100_000}`  
  → Singles phase out 50k–100k, MFJ phase out 100k–200k.

We store:
- `consumption_allowance`
- `rebate_after_phaseout`
- `allowance_no_phaseout`, `allowance_phaseout` (compatibility aliases)
- `excess_over_threshold`
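
To illustrate how the thresholds and ranges interact, here is a sketch that **assumes a linear phase-out**. The functional form is an assumption; the actual logic lives in `vat_rebate.apply_phaseout`, which is not shown here. The function name `linear_phaseout` and the 10,000 allowance are placeholders.

```python
THRESHOLDS = {"single": 50_000, "mfj": 100_000}
PHASE_RANGE = {"single": 50_000, "mfj": 100_000}

def linear_phaseout(allowance: float, agi: float, status: str) -> float:
    """Reduce the allowance linearly from the threshold to threshold + range.

    Assumption: the reduction is linear in AGI over the phase-out range.
    """
    excess = max(agi - THRESHOLDS[status], 0.0)          # excess_over_threshold
    share_lost = min(excess / PHASE_RANGE[status], 1.0)  # capped at full phase-out
    return allowance * (1.0 - share_lost)

# Placeholder allowance of 10,000 to show the shape of the schedule.
print(linear_phaseout(10_000.0, 40_000.0, "single"))   # below threshold: full allowance
print(linear_phaseout(10_000.0, 75_000.0, "single"))   # halfway through the range
print(linear_phaseout(10_000.0, 250_000.0, "mfj"))     # past 200k: fully phased out
```

Under this assumption the rebate can never exceed the allowance, which is the row-wise condition the acceptance checks rely on.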

---

## Part I — Diagnostics, deliverables & acceptance checks

**Diagnostics (printed)**
- Weighted household **size × status** table (thousands).
- Counts of **excluded negative-AGI** households.
- Pre- vs post-calibration totals, scale factors.

**Files written**
- `intermediate/ca_panel_2024.parquet` or `intermediate/ca_panel_2024.csv`
- `intermediate/weight_scaling_2024.json`

**Acceptance checks**
- Panel exists in `intermediate/`.
- No “Married Households, size 1”.
- Household totals ≈14.8M after scaling.
- `consumption_allowance ≥ rebate_after_phaseout` row-wise.
- All required columns present:
  - `household_size`, `household_weight`, `household_agi`, `employment_income`
  - `filing_status`, `is_married_couple`, `size_bucket`
  - `consumption_allowance`, `rebate_after_phaseout`, `excess_over_threshold`
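
The in-memory checks above could be scripted roughly as follows. This is a sketch: `acceptance_failures` is a hypothetical helper, the 1% tolerance is an assumed reading of "≈14.8M", the file-existence check is left out, and the demo rows are toy values.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "household_size", "household_weight", "household_agi", "employment_income",
    "filing_status", "is_married_couple", "size_bucket",
    "consumption_allowance", "rebate_after_phaseout", "excess_over_threshold",
]

def acceptance_failures(df: pd.DataFrame, expected_total: float = 14_800_000.0,
                        rel_tol: float = 0.01) -> list:
    """Return a list of failed checks; an empty list means the panel passes."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    failures = []
    if ((df["is_married_couple"] == 1) & (df["household_size"] < 2)).any():
        failures.append("married household with size 1")
    total = float(df["household_weight"].sum())
    if abs(total - expected_total) > rel_tol * expected_total:
        failures.append(f"weighted total {total:,.0f} is not close to {expected_total:,.0f}")
    if (df["consumption_allowance"] < df["rebate_after_phaseout"]).any():
        failures.append("rebate exceeds allowance")
    return failures

# Toy two-row panel whose weights sum to the benchmark.
demo = pd.DataFrame({
    "household_size": [1, 2],
    "household_weight": [7_400_000.0, 7_400_000.0],
    "household_agi": [40_000.0, 150_000.0],
    "employment_income": [40_000.0, 140_000.0],
    "filing_status": ["single", "mfj"],
    "is_married_couple": [0, 1],
    "size_bucket": [1, 2],
    "consumption_allowance": [10_000.0, 20_000.0],
    "rebate_after_phaseout": [10_000.0, 10_000.0],
    "excess_over_threshold": [0.0, 50_000.0],
})
print(acceptance_failures(demo))
```

Returning a list rather than raising on the first problem lets one run report every failed check at once.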

---

## Part J — Troubleshooting

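- **Totals not ≈14.8M**: Check that bucket calibration factors were applied before final scaling. A uniform 0.89 deflator on its own, as in the run recorded in this notebook, lands at ≈14.4M instead.  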
- **Married size 1 appears**: Ensure reclassification guard runs before calibration.  
- **Parquet engine missing**: Notebook falls back to `.csv`. Downstream supports both.  
- **Downstream totals don’t match Step 02**: Make sure you’re using the scaled `household_weight`, not `household_weight_raw`.

---

## Part K — How to rerun

1. Re-run `00` to refresh column mappings if inputs changed.  
2. Re-run this notebook to rebuild the calibrated 2024 CA panel.  
3. Verify diagnostics (size × status table, totals ≈14.8M).  
4. Proceed to Step 02 for rebate costs.  


In [13]:
# 01 — Data prep CA (2024; household-level; MFJ via spouse/HOH; exclude AGI<0; apply 11% weight deflator)
import importlib.util
import json
import os

import numpy as np
import pandas as pd
import yaml
from policyengine_us import Microsimulation

print("Step 01 start.")

# Load vat_rebate helpers
vat_path = os.path.abspath("../policy/vat_rebate.py")
spec = importlib.util.spec_from_file_location("vat_rebate", vat_path)
vr = importlib.util.module_from_spec(spec)
spec.loader.exec_module(vr)
print("Loaded:", vr.__file__)

# Load column mapping
with open("../config/columns.yaml") as f:
    col_map = yaml.safe_load(f)
print("col_map:", col_map)

os.makedirs("../intermediate", exist_ok=True)
sim = Microsimulation()
YEAR = 2024

def hcalc(var, decode_enums=True):
    return pd.Series(sim.calculate(var, map_to="household", period=YEAR, decode_enums=decode_enums))

# 1) Pull household-level arrays 
state_code       = hcalc("state_code", decode_enums=True).astype(str).str.strip().str.upper()
household_size   = hcalc(col_map["hh_size"], decode_enums=False)
household_weight = hcalc(col_map["weight"],  decode_enums=False)
agi              = hcalc(col_map["agi"],     decode_enums=False)
wages            = hcalc(col_map["wages"],   decode_enums=False)
fed_tax          = hcalc(col_map["fed_tax"], decode_enums=False)
state_tax        = hcalc(col_map["state_tax"], decode_enums=False)

# Household-level spouse/HOH signals
def try_household(var, decode=False):
    try:
        return pd.Series(sim.calculate(var, map_to="household", period=YEAR, decode_enums=decode))
    except Exception:
        return None

has_spouse   = try_household("has_spouse", decode=False)
spouse_pres  = try_household("spouse_present", decode=False)
spouse_count = try_household("head_spouse_count", decode=False)
hoh_elig     = try_household("head_of_household_eligible", decode=False)

# 2) Build CA DataFrame
df = pd.DataFrame({
    "state_code": state_code,
    "household_size": pd.to_numeric(household_size, errors="coerce"),
    "household_weight": pd.to_numeric(household_weight, errors="coerce"),
    "household_agi": pd.to_numeric(agi, errors="coerce"),
    "employment_income": pd.to_numeric(wages, errors="coerce"),
    "fed_income_tax": pd.to_numeric(fed_tax, errors="coerce"),
    "ca_income_tax": pd.to_numeric(state_tax, errors="coerce"),
})
mask_ca = df["state_code"].eq("CA")
df = df.loc[mask_ca].reset_index(drop=True)
print("CA households (raw rows):", len(df))

# 3) Align spouse/HOH to df
def align_to_df(s):
    if s is None: 
        return None
    s = pd.to_numeric(pd.Series(s), errors="coerce")
    return s.loc[mask_ca].reset_index(drop=True)

has_spouse   = align_to_df(has_spouse)
spouse_pres  = align_to_df(spouse_pres)
spouse_count = align_to_df(spouse_count)
hoh_elig     = align_to_df(hoh_elig)

# 4) Derive filing_status: HOH ⇒ single; else spouse ⇒ mfj; else single.
if has_spouse is not None:
    spouse_any = has_spouse.fillna(0).astype(bool)
    source_used = "has_spouse"
elif spouse_pres is not None:
    spouse_any = spouse_pres.fillna(0).astype(bool)
    source_used = "spouse_present"
elif spouse_count is not None:
    uniq = np.sort(spouse_count.dropna().unique())
    if len(uniq) and uniq.max() >= 2:
        spouse_any = (spouse_count.fillna(0) >= 2)
        source_used = "head_spouse_count>=2"
    else:
        spouse_any = (spouse_count.fillna(0) > 0)
        source_used = "head_spouse_count>0"
else:
    spouse_any = pd.Series(False, index=df.index)
    source_used = "no_spouse_signal"

hoh_any = (hoh_elig.fillna(0) > 0) if hoh_elig is not None else pd.Series(False, index=df.index)

filing_status = np.where(hoh_any, "single", np.where(spouse_any, "mfj", "single"))
df["filing_status"] = filing_status.astype(str)
df["is_married_couple"] = (df["filing_status"].str.lower() == "mfj").astype(int)

print(f"[info] spouse signal used: {source_used}")
print("filing_status counts:", df["filing_status"].value_counts().to_dict())

# Guard: reclassify impossible "married & size<2" to single
df["household_size"] = df["household_size"].fillna(1).round().astype(int)
bad_m1 = (df["is_married_couple"] == 1) & (df["household_size"] < 2)
if bad_m1.any():
    n_bad = int(bad_m1.sum())
    w_bad = float(df.loc[bad_m1, "household_weight"].sum())
    print(f"[fix] Reclassifying {n_bad:,} rows (weighted {w_bad:,.0f}) from married->single because size<2.")
    df.loc[bad_m1, "is_married_couple"] = 0
    df.loc[bad_m1, "filing_status"] = "single"

# 5) Exclude negative AGI, set size bucket
# Note: rows with missing AGI also fail the >= 0 test, so they are dropped and counted here too.
before = len(df)
df = df.loc[df["household_agi"] >= 0].reset_index(drop=True)
print("Excluded negative-AGI households:", before - len(df))
df["size_bucket"] = np.clip(df["household_size"], 1, 7).astype(int)

# 6) Apply simple 11% deflator (reduce weights by 11%)
wt_pre = float(df["household_weight"].sum())
df["household_weight"] = df["household_weight"] * 0.89
wt_post = float(df["household_weight"].sum())
print(f"[fix] Applied 11% deflator to weights: total {wt_pre:,.0f} → {wt_post:,.0f}")

# 7) Compute allowance + phaseout
df = vr.compute_allowance(df)
df = vr.apply_phaseout(df)

df["allowance_no_phaseout"] = df["consumption_allowance"]
df["allowance_phaseout"]    = df["rebate_after_phaseout"]

# 8) Save intermediate
os.makedirs("../intermediate", exist_ok=True)
parq = "../intermediate/ca_panel_2024.parquet"
csv  = "../intermediate/ca_panel_2024.csv"
try:
    df.to_parquet(parq, index=False)
    print("saved", parq, "rows:", len(df))
except Exception as e:
    print("parquet save failed; writing CSV:", e)
    df.to_csv(csv, index=False)
    print("saved", csv, "rows:", len(df))

# Metadata file
meta = {
    "year": 2024,
    "pre_deflator_total": wt_pre,
    "post_deflator_total": wt_post,
    "deflator_applied": 0.89,
    "notes": "Applied uniform 11% downward adjustment to household weights."
}
with open("../intermediate/weight_scaling_2024.json", "w") as f:
    json.dump(meta, f, indent=2)
print("wrote ../intermediate/weight_scaling_2024.json")

# 9) Sanity print
w = df["household_weight"].fillna(0.0)
tab = (w.groupby([df["size_bucket"], np.where(df["is_married_couple"]==1,"Married","Single")])
         .sum().unstack(1).fillna(0.0)/1_000).round(1)
tab.index.name = "size_bucket"
print("\nWeighted CA households (thousands) by size × status (final, after deflator):")
print(tab.to_string())

print("\n✅ Step 01 complete.")

Step 01 start.
Loaded: c:\Users\Ali.Melad\Dropbox\Ali Work\Kyle\California VAT\policy_engile_cali_v2\policy\vat_rebate.py
col_map: {'agi': 'adjusted_gross_income', 'wages': 'employment_income', 'hh_size': 'household_size', 'weight': 'household_weight', 'fed_tax': 'income_tax', 'state_tax': 'ca_income_tax', 'filing_status': 'filing_status'}
CA households (raw rows): 1777
[info] spouse signal used: head_spouse_count>=2
filing_status counts: {'mfj': 1131, 'single': 646}
Excluded negative-AGI households: 30
[fix] Applied 11% deflator to weights: total 16,215,270 → 14,431,592
parquet save failed; writing CSV: A type extension with name pandas.period already defined
saved ../intermediate/ca_panel_2024.csv rows: 1747
wrote ../intermediate/weight_scaling_2024.json

Weighted CA households (thousands) by size × status (final, after deflator):
                 Married       Single
size_bucket                          
1               0.000000  4224.299805
2            3397.500000  1033.199951
3  