# 04_distribution_baseline_vs_noiit_rebate_2024.ipynb

## Part A — What we are doing

We produce a **distributional analysis** comparing:
- **Baseline burden** = federal income tax + CA income tax.
- **Reform burden** = −rebate (income taxes are removed in the reform scenario, so the only remaining term is the rebate, entered as a negative burden).

We group households by **equivalized income deciles** (household AGI ÷ household size) and add **Top 5%** and **Top 1%** groups, all defined with **weighted percentiles**.

**Output**
- `outputs/vat/distribution_2024.csv`: one row per group with mean baseline burden, mean reform burden, mean change, total change, population share, and share of total change.

---

## Part B — Inputs & dependencies

- **Reads:** `intermediate/ca_panel_2024.(parquet|csv)` from `01`.
- Fields needed:
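  - `fed_income_tax`, `ca_income_tax`, `consumption_allowance`, `rebate_after_phaseout`
  - `household_weight` (also accepted as `weight` or `hh_weight`), `household_agi`, `household_size`
- If `consumption_allowance` or `rebate_after_phaseout` is missing, the code recomputes it with the helpers in `policy/vat_rebate.py`, which requires `size_bucket` and `is_married_couple`.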

---

## Part C — Group construction

1. **Equivalized income**: `equiv_income = household_agi / max(household_size, 1)` (size is floored at 1 to avoid division by zero).
2. **Deciles**: weighted deciles D1–D10 using `household_weight`.
3. **Top 5% / Top 1%**: find weighted percentile cutoffs **within the full distribution** and tag households accordingly. These groups can overlap with D10; typically we present them as **additional rows**.
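
The three steps above can be sketched on toy data. This is a minimal illustration, not the implementation of `vr.add_weighted_deciles` (which lives in the helper module and is not shown here). The function names `weighted_cutoff` and `weighted_deciles` are hypothetical, and the decile rule (label by the weight share ranked strictly below each household) is an assumption.

```python
import numpy as np

def weighted_cutoff(values, weights, q):
    """Smallest value at which the cumulative weight reaches share q of the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values, kind="stable")
    cum = np.cumsum(weights[order])
    # First sorted position whose cumulative weight is at least q * total
    pos = int(np.argmax(cum >= q * cum[-1]))
    return float(values[order][pos])

def weighted_deciles(values, weights):
    """Label each household 1..10 by the weight share ranked strictly below it."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values, kind="stable")
    w_sorted = weights[order]
    cum = np.cumsum(w_sorted)
    below = cum - w_sorted  # weight ranked below each household
    labels_sorted = np.minimum(np.floor(10.0 * below / cum[-1]).astype(int) + 1, 10)
    labels = np.empty(len(values), dtype=int)
    labels[order] = labels_sorted  # map labels back to the original row order
    return labels

# Toy data: ten households with equal weights (illustrative values only)
income = np.arange(10, 101, 10)
weight = np.full(10, 2.0)

deciles = weighted_deciles(income, weight)
p95 = weighted_cutoff(income, weight, 0.95)
top_5pct = income >= p95  # tagged as an additional group, overlapping D10
```

Because households are never split across groups, a coarse or lumpy sample can put more than 5% of the weight in the "top 5%" group (here one household carrying 10% of the weight) and can leave deciles with unequal weight. Reporting each group's population share makes that visible.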

---

## Part D — Measures & checks

For each group:
- `baseline_burden = fed_income_tax + ca_income_tax`
- `reform_burden = −rebate_after_phaseout` (income taxes are removed under the reform)
- `change = reform_burden − baseline_burden = −rebate_after_phaseout − baseline_burden`
- Compute **means** using `household_weight`.
- Compute **group population share** (weight share).
- Compute **share of total change** = group’s total change / statewide total change.
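
As a worked illustration of these measures, the sketch below applies them to three made-up households in two groups. It follows the definitions used in this notebook's code cell (reform burden is the negative rebate because income taxes are removed). The group labels and all numbers are invented for the example.

```python
import pandas as pd

# Toy household data (illustrative values only, not taken from the panel)
toy = pd.DataFrame({
    "group": ["low", "low", "high"],
    "weight": [100.0, 300.0, 100.0],
    "fed_income_tax": [0.0, 500.0, 9000.0],
    "ca_income_tax": [0.0, 100.0, 1000.0],
    "rebate_after_phaseout": [2000.0, 1000.0, 0.0],
})

toy["baseline"] = toy["fed_income_tax"] + toy["ca_income_tax"]
toy["reform"] = -toy["rebate_after_phaseout"]
toy["change"] = toy["reform"] - toy["baseline"]

# Weight each measure, then sum within groups
for col in ["baseline", "reform", "change"]:
    toy[f"w_{col}"] = toy[col] * toy["weight"]
sums = toy.groupby("group")[["weight", "w_baseline", "w_reform", "w_change"]].sum()

summary = pd.DataFrame({
    "mean_baseline": sums["w_baseline"] / sums["weight"],
    "mean_reform": sums["w_reform"] / sums["weight"],
    "mean_change": sums["w_change"] / sums["weight"],
    "total_change": sums["w_change"],
    "pop_share": 100.0 * sums["weight"] / sums["weight"].sum(),
})
summary["share_of_total_change"] = (
    100.0 * summary["total_change"] / summary["total_change"].sum()
)
print(summary)
```

For the `low` group the weighted mean baseline is (0 × 100 + 600 × 300) / 400 = 450 and the weighted mean reform burden is (−2000 × 100 − 1000 × 300) / 400 = −1250, so the mean change is −1700.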

**Consistency checks**
- Population shares (deciles) sum to ~**100%** (within rounding).
- Sum of **decile total changes** equals the statewide total change (matches totals from `02`). The Top 5% and Top 1% rows are excluded from this sum because they overlap with the deciles.

---

## Part E — Deliverables & acceptance checks

**File written**
- `outputs/vat/distribution_2024.csv`

**Acceptance checks**
- No missing values in burdens or weights.
- Population shares ≈ 100% across deciles.
- Aggregated totals align with `02_rebate_costs_2024.csv` (with-phaseout totals).
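
The first two acceptance checks can be run against the output table with a small helper like the one below. This is a sketch: `check_distribution` is a hypothetical name, the column names are the ones written by this notebook's code cell, and the 0.2-point tolerance mirrors the assertion in that cell. The comparison with `02_rebate_costs_2024.csv` is left out because that file's layout is not shown here. Weights live in the panel rather than in the output, so they would need a separate check.

```python
import numpy as np
import pandas as pd

def check_distribution(dist, statewide_change=None, pop_tol=0.2):
    """Return a list of failed-check messages; an empty list means all passed."""
    problems = []
    value_cols = ["mean_tax_baseline", "mean_tax_reform", "total_change", "pop_share"]
    if dist[value_cols].isna().any().any():
        problems.append("missing values in burden or share columns")

    # Only decile rows partition the population; top-group rows overlap them
    dec = dist[dist["group"].str.startswith("decile_")]
    if not np.isclose(dec["pop_share"].sum(), 100.0, atol=pop_tol):
        problems.append("decile population shares do not sum to ~100%")

    if statewide_change is not None and not np.isclose(
        dec["total_change"].sum(), statewide_change, rtol=1e-8, atol=1.0
    ):
        problems.append("decile total_change does not match statewide change")
    return problems

# Toy table with two 'deciles' and one overlapping top group (illustrative only)
example = pd.DataFrame({
    "group": ["decile_1", "decile_2", "top_5pct"],
    "mean_tax_baseline": [100.0, 900.0, 2000.0],
    "mean_tax_reform": [-500.0, -200.0, 0.0],
    "total_change": [-600.0, -1100.0, -400.0],
    "pop_share": [50.0, 50.0, 5.0],
})
failed = check_distribution(example, statewide_change=-1700.0)
print(failed or "all checks passed")
```

Returning messages instead of raising lets all failures be reported in a single pass, which can be convenient when several checks break at once.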

---

## Part F — Troubleshooting

- **Shares don’t sum to ~100%**: verify that the weighted decile construction matches the one in `02` and that every household is included (no dropped rows).
- **Top 1%/5% rows look wrong**: confirm the weighted percentile calculation and that these groups are added as extra rows rather than replacing decile rows.


In [1]:
# 04 — Distribution (2024): Baseline vs No income tax + VAT rebate (phase-out)
import importlib.util
import os

import numpy as np
import pandas as pd

# Load helpers
vat_path = os.path.abspath("../policy/vat_rebate.py")
spec = importlib.util.spec_from_file_location("vat_rebate", vat_path)
vr = importlib.util.module_from_spec(spec)
spec.loader.exec_module(vr)

os.makedirs("../outputs/vat", exist_ok=True)

# Load Step 01 panel
parq = "../intermediate/ca_panel_2024.parquet"
csv  = "../intermediate/ca_panel_2024.csv"
panel_path = parq if os.path.exists(parq) else (csv if os.path.exists(csv) else None)
if panel_path is None:
    raise FileNotFoundError("Missing panel; run Step 01.")
df = pd.read_parquet(panel_path) if panel_path.endswith(".parquet") else pd.read_csv(panel_path)
print("Panel shape:", df.shape)

# Normalize weight
if "weight" not in df.columns:
    wcol = [c for c in df.columns if c.lower() in ("household_weight","weight","hh_weight")]
    if not wcol:
        raise KeyError("No weight column found.")
    df["weight"] = pd.to_numeric(df[wcol[0]], errors="coerce").fillna(0.0)
else:
    df["weight"] = pd.to_numeric(df["weight"], errors="coerce").fillna(0.0)

print(f"[diag] Weighted CA households (after Step 01 deflator): {df['weight'].sum():,.0f}")

# Ensure required columns; (re)compute allowance/phaseout if missing
need_cols = {
    "household_agi","household_size","fed_income_tax","ca_income_tax",
    "consumption_allowance","rebate_after_phaseout"
}
missing = [c for c in need_cols if c not in df.columns]
if missing:
    recompute = set(["consumption_allowance","rebate_after_phaseout"]).intersection(missing)
    if recompute:
        if {"size_bucket","is_married_couple"}.issubset(df.columns):
            if "consumption_allowance" not in df: df = vr.compute_allowance(df)
            if "rebate_after_phaseout" not in df: df = vr.apply_phaseout(df)
        else:
            raise KeyError(f"Need size_bucket + is_married_couple to compute allowance/phaseout; missing: {missing}")
    # check again
    missing2 = [c for c in need_cols if c not in df.columns]
    if missing2:
        raise KeyError(f"Still missing required columns: {missing2}")

# Baseline vs reform burdens (household-level)
baseline_tax = (
    df["fed_income_tax"].astype(float).fillna(0.0)
  + df["ca_income_tax"].astype(float).fillna(0.0)
)
# Reform removes income taxes and adds the rebate as a negative burden
reform_tax = - df["rebate_after_phaseout"].astype(float).fillna(0.0)

# Equivalized income and weighted deciles (AGI / size)
df["equiv_income"] = (
    df["household_agi"].astype(float) / np.maximum(df["household_size"].astype(float), 1.0)
)
df = vr.add_weighted_deciles(df, income_col="equiv_income", weight_col="weight", label="decile")

# Weighted percentiles for Top 5% and Top 1%
s = df[["equiv_income","weight"]].sort_values("equiv_income").reset_index(drop=True)
cw = s["weight"].cumsum()
tot = s["weight"].sum()
p95 = s.loc[cw >= 0.95*tot, "equiv_income"].iloc[0]
p99 = s.loc[cw >= 0.99*tot, "equiv_income"].iloc[0]
df["top_5pct"] = (df["equiv_income"] >= p95).astype(int)
df["top_1pct"] = (df["equiv_income"] >= p99).astype(int)

def wmean(x, w):
    """Weighted mean of x with weights w; NaN if the weights sum to zero."""
    x = x.astype(float)
    w = w.astype(float)
    total = w.sum()
    return float((x * w).sum() / total) if total > 0 else np.nan

rows = []

# Deciles 1..10 (sorted numerically)
deciles_sorted = sorted(map(int, df["decile"].dropna().astype(int).unique()))
for d in deciles_sorted:
    g = df[df["decile"].astype(int) == d]
    w = g["weight"]
    mb = wmean(baseline_tax.loc[g.index], w)
    mr = wmean(reform_tax.loc[g.index], w)
    dlt = (reform_tax.loc[g.index] - baseline_tax.loc[g.index]) * w
    rows.append({
        "year": 2024,
        "group": f"decile_{d}",
        "mean_tax_baseline": mb,
        "mean_tax_reform":   mr,
        "mean_change":       mr - mb,
        "total_change":      float(dlt.sum()),
        "pop_share":         float(100.0 * w.sum()/df["weight"].sum()),
    })

# Top 5% and Top 1%
for label, mask in [("top_5pct", df["top_5pct"].eq(1)), ("top_1pct", df["top_1pct"].eq(1))]:
    g = df[mask]
    w = g["weight"]
    mb = wmean(baseline_tax.loc[g.index], w)
    mr = wmean(reform_tax.loc[g.index], w)
    dlt = (reform_tax.loc[g.index] - baseline_tax.loc[g.index]) * w
    rows.append({
        "year": 2024,
        "group": label,
        "mean_tax_baseline": mb,
        "mean_tax_reform":   mr,
        "mean_change":       mr - mb,
        "total_change":      float(dlt.sum()),
        "pop_share":         float(100.0 * w.sum()/df["weight"].sum()),
    })

dist = pd.DataFrame(rows)

# Share of total change (deciles only; should sum ~100 across deciles)
dec_mask = dist["group"].str.startswith("decile_")
total_delta_deciles = dist.loc[dec_mask, "total_change"].sum()
dist["share_of_total_change"] = 100.0 * dist["total_change"] / total_delta_deciles

# Integrity checks: decile pop shares ~100; statewide change consistency
assert np.isclose(dist.loc[dec_mask, "pop_share"].sum(), 100.0, atol=0.2), "Decile pop shares should sum to ~100%"

statewide_change = float(((reform_tax - baseline_tax) * df["weight"]).sum())
decile_sum_change = float(dist.loc[dec_mask, "total_change"].sum())
assert np.isclose(decile_sum_change, statewide_change, rtol=1e-8, atol=1.0), \
    "Decile total_change does not match statewide change."

# Save
out = "../outputs/vat/distribution_2024.csv"
dist.to_csv(out, index=False)
print("✅ wrote", out)
print(dist.head(12).to_string(index=False))


Panel shape: (1747, 15)
[diag] Weighted CA households (after Step 01 deflator): 14,431,591
✅ wrote ../outputs/vat/distribution_2024.csv
 year     group  mean_tax_baseline  mean_tax_reform    mean_change  total_change  pop_share  share_of_total_change
 2024  decile_1       -2872.822215    -25898.613460  -23025.791244 -3.873445e+10  11.656511               5.310214
 2024  decile_2       -3991.946629    -33841.191272  -29849.244643 -4.167621e+10   9.674770               5.713509
 2024  decile_3       -2409.696341    -25423.767794  -23014.071454 -3.051337e+10   9.187187               4.183164
 2024  decile_4        1268.987251    -31509.576082  -32778.563333 -4.700118e+10   9.935841               6.443524
 2024  decile_5        3402.169128    -23703.028007  -27105.197135 -3.743732e+10   9.570574               5.132388
 2024  decile_6        7719.960272    -20086.725324  -27806.685595 -4.283080e+10  10.673153               5.871795
 2024  decile_7        7858.732347    -26360.531810  -34219