# 05_sales_tax_offmodel_inputs_2024.ipynb

## Part A — What we are doing

We prepare a **compact decile dataset** for external VAT/sales-tax incidence modeling. It aligns exactly with our decile definitions from earlier steps and carries the key aggregates needed by off-model tools.

**Output**
- `outputs/vat/sales_tax_inputs_2024.csv`

---

## Part B — Inputs & fields

- **Reads:** `intermediate/ca_panel_2024.(parquet|csv)` from `01`.
- Reuses the **same equivalized decile assignment** as `02` (AGI ÷ size, weighted by household weights).

**Aggregates per decile**
- `households_weighted` — sum of `household_weight`
- `agi_sum` — weighted sum of `household_agi`
- `wages_sum` — weighted sum of `employment_income`
- `consumption_allowance_sum` — weighted sum of `consumption_allowance`
- `rebate_after_phaseout_sum` — weighted sum of `rebate_after_phaseout`

---

## Part C — Deliverables & acceptance checks

**File written**
- `outputs/vat/sales_tax_inputs_2024.csv`

**Acceptance checks**
- No missing values in weights, AGI, or wages.
- Decile weights match those used in `02` and `04` (consistency in population totals).
- Sums across deciles equal statewide totals.

---

## Part D — Troubleshooting

- **Mismatch with other decile files**: confirm identical decile construction (same weights, same period, no dropped rows).
- **NaNs**: ensure all five fields exist on the panel and are numeric.


In [1]:
# 05 — Sales-tax off-model inputs (2024)
# Builds decile-level inputs for external VAT/sales tax incidence modeling.
# Assumes Step 01 already applied the 11% deflator to household_weight.

import os, numpy as np, pandas as pd, importlib.util

os.makedirs("../outputs/vat", exist_ok=True)

# Load vat_rebate helpers (for deciles + ensure allowance/phaseout)
vat_path = os.path.abspath("../policy/vat_rebate.py")
spec = importlib.util.spec_from_file_location("vat_rebate", vat_path)
vr = importlib.util.module_from_spec(spec); spec.loader.exec_module(vr)
print("Loaded:", vr.__file__)

# Load panel from Step 01
parq = "../intermediate/ca_panel_2024.parquet"
csv  = "../intermediate/ca_panel_2024.csv"
panel_path = parq if os.path.exists(parq) else (csv if os.path.exists(csv) else None)
if panel_path is None:
    raise FileNotFoundError("Missing panel; run Step 01.")
df = pd.read_parquet(panel_path) if panel_path.endswith(".parquet") else pd.read_csv(panel_path)
print("Panel shape:", df.shape)

# Normalize weight column → df['weight']
if "weight" not in df.columns:
    wcol = next((c for c in df.columns if c.lower() in ("household_weight","weight","hh_weight")), None)
    if wcol is None:
        raise KeyError("No weight column found (looked for household_weight/weight/hh_weight).")
    df["weight"] = pd.to_numeric(df[wcol], errors="coerce").fillna(0.0)
else:
    df["weight"] = pd.to_numeric(df["weight"], errors="coerce").fillna(0.0)

print(f"[diag] Weighted CA households (after Step 01 deflator): {df['weight'].sum():,.0f}")

# Ensure allowance & phaseout present (recompute if needed)
if "consumption_allowance" not in df.columns:
    must = {"size_bucket","is_married_couple"}
    missing = [m for m in must if m not in df.columns]
    if missing:
        raise KeyError(f"Missing {missing} required to compute allowance.")
    df = vr.compute_allowance(df)

if "rebate_after_phaseout" not in df.columns:
    if "household_agi" not in df.columns:
        raise KeyError("household_agi missing; cannot compute phaseout.")
    df = vr.apply_phaseout(df)

# Equivalized income and weighted deciles consistent with Steps 02/04
if ("household_agi" not in df.columns) or ("household_size" not in df.columns):
    raise KeyError("Need household_agi and household_size for deciles.")
df["equiv_income"] = df["household_agi"].astype(float) / np.maximum(df["household_size"].astype(float), 1.0)
df = vr.add_weighted_deciles(df, income_col="equiv_income", weight_col="weight", label="decile")

# Build decile inputs
by_dec = (
    df.groupby("decile", as_index=False)
      .apply(lambda g: pd.Series({
          "households_weighted": float(g["weight"].sum()),
          "agi_sum": float(g["household_agi"].sum()),
          "wages_sum": float(g["employment_income"].clip(lower=0).sum()),
          # handy proxies for external modeling:
          "consumption_allowance_sum": float(g["consumption_allowance"].sum()),
          "rebate_after_phaseout_sum": float(g["rebate_after_phaseout"].sum()),
      }))
      .reset_index(drop=True)
)

# Save
out = "../outputs/vat/sales_tax_inputs_2024.csv"
by_dec.to_csv(out, index=False)

# Checks
assert by_dec["households_weighted"].sum() > 0, "Zero weighted households?"
print("✅ wrote", out)
print(by_dec.head().to_string(index=False))


Loaded: c:\Users\Ali.Melad\Dropbox\Ali Work\Kyle\California VAT\policy_engile_cali_v2\policy\vat_rebate.py
Panel shape: (1747, 15)
[diag] Weighted CA households (after Step 01 deflator): 14,431,591
✅ wrote ../outputs/vat/sales_tax_inputs_2024.csv
decile  households_weighted      agi_sum    wages_sum  consumption_allowance_sum  rebate_after_phaseout_sum
     1         1.682220e+06 6.264963e+05 5.208819e+05                  4640780.0               4.640780e+06
     2         1.396223e+06 1.659327e+06 1.706502e+06                  2192280.0               2.178867e+06
     3         1.325857e+06 7.642576e+06 7.431737e+06                  6429660.0               6.237590e+06
     4         1.433900e+06 7.349458e+06 6.539761e+06                  4059700.0               3.464091e+06
     5         1.381186e+06 2.177773e+07 1.853373e+07                  8678680.0               6.627383e+06


  df.groupby("decile", as_index=False)
  .apply(lambda g: pd.Series({
