## Overview — What we are doing

We are building California-only tax distribution tables for 2024 and 2025 under baseline policy.  
Each table shows how taxes are distributed across income groups and includes a **Top 1 percent** row.

**What we output**
- A State income tax table and a Federal income tax table for each year  
- Columns that match how we usually read distribution tables:  
  - Percent of households  
  - Percent of taxpayers  
  - Mean tax in dollars (average over all households in the band)  
  - Share of total tax, in percent  
- Separate CSV files for state and federal distributions, plus revenue totals for each  

**Why we anchor to 2024**
- In this build, state identifiers and positive household weights are available at 2024  
- We use 2024 to identify California households and to weight the sample, then calculate tax and income at each target year  

---

## Part A — Identify California households

The microdata contains households from all states. We only want California.  

This step tries several candidate variables (`state_code`, `state`, `residence_state`), looking first for string-coded identifiers equal to `"CA"`.  
If none of those yields a match, it falls back to numeric candidates (`state_fips`, `statefips`, `fips_state`, `state_code`) and looks for the FIPS code `6`, which is California.  

The result is a **boolean mask** (`MASK_CA`) that marks which households are in California.  
This mask is the backbone of the entire build, because every subsequent step is filtered to these households.
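
The two matching rules can be illustrated on toy arrays. This is a minimal sketch, not the notebook's data path: `california_mask`, `toy_strings`, and `toy_fips` are hypothetical names, and the arrays stand in for whatever the simulation returns.

```python
import numpy as np

def california_mask(codes):
    """Return a boolean mask that is True where a state code means California."""
    arr = np.asarray(codes)
    if arr.dtype.kind in ("U", "O"):
        # String-coded identifiers: compare against the postal abbreviation.
        return np.array([str(x) == "CA" for x in arr], dtype=bool)
    # Numeric identifiers: round to integers and compare with FIPS code 6.
    return np.rint(arr.astype(float)).astype(int) == 6

toy_strings = np.array(["CA", "NY", "CA", "TX"])  # made-up household states
toy_fips = np.array([6.0, 36.0, 6.0, 48.0])       # the same households as FIPS codes

mask_from_strings = california_mask(toy_strings)
mask_from_fips = california_mask(toy_fips)
print(mask_from_strings, mask_from_fips)
```

Both encodings should select the same households, which is a quick sanity check worth running whenever more than one state variable is available.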

---

## Part B — Select weights

To make our sample representative of California’s population, we need a household weight.  

This step tests multiple candidate variables (`household_weight`, `weight`, `person_weight`, `sample_weight`, `cps_weight`, `asec_weight`) and picks the first one with **positive totals both overall and within California**.  

We store:
- The full set of weights (`W_ALL`)  
- The California slice (`W_CA_ALL`), which we reuse everywhere  

**Why this matters**  
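Without correct weights, totals and shares would not scale to the California population.  
Because the same 2024 weights are applied in both years, differences between the 2024 and 2025 results reflect changes in calculated tax and income, not changes in the weighted population.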
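
The selection rule (first candidate with positive totals overall and inside the California mask) can be sketched on toy data. The candidate names and values below are made up for illustration, and `pick_weight` is a hypothetical helper, not part of the build.

```python
import numpy as np

def pick_weight(candidates, mask):
    """Return (name, weights) for the first candidate whose total is positive
    both overall and inside the masked subset."""
    for name, values in candidates.items():
        w = np.asarray(values, dtype=float)
        if np.nansum(w) > 0 and np.nansum(w[mask]) > 0:
            return name, w
    raise RuntimeError("No candidate has positive totals overall and in the subset.")

mask = np.array([True, False, True, False])  # toy stand-in for the California mask
candidates = {
    "weight_all_zero": [0.0, 0.0, 0.0, 0.0],        # rejected: overall total is zero
    "weight_zero_in_subset": [0.0, 5.0, 0.0, 7.0],  # rejected: zero inside the mask
    "weight_ok": [100.0, 250.0, 300.0, 50.0],       # accepted
}

chosen_name, chosen_w = pick_weight(candidates, mask)
subset_total = float(chosen_w[mask].sum())
print(chosen_name, subset_total)
```

Checking the masked total as well as the overall total guards against a weight that is populated nationally but empty for the households of interest.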

---

In [None]:
# Part A — imports and constants
import numpy as np
import pandas as pd
from policyengine_us import Microsimulation

ENTITY        = "household"
MASK_PERIOD   = 2024   # where state codes exist
WEIGHT_PERIOD = 2024   # where weights are positive
YEARS         = [2024, 2025]
TOP_P         = 0.01   # Top 1%

# Part B — CA mask + weights (from 2024)
sim = Microsimulation()

def _calc(var, year, map_to=ENTITY):
    return np.asarray(sim.calculate(var, map_to=map_to, period=year))

def get_ca_mask_from_2024():
    # string-coded states first
    for var in ("state_code","state","residence_state"):
        try:
            arr = _calc(var, MASK_PERIOD)
            if arr.dtype.kind in ("U","S","O"):
                m = np.array([str(x) == "CA" for x in arr])
                if m.any():
                    print(f"[STATE] using '{var}' @ {MASK_PERIOD} | CA={int(m.sum())}/{m.size}")
                    return m
        except Exception:
            pass
    # numeric FIPS fallback
    for var in ("state_fips","statefips","fips_state","state_code"):
        try:
            arr = _calc(var, MASK_PERIOD)
            if arr.dtype.kind in ("i","u","f"):
                vals = np.rint(arr.astype(float)).astype(int)
                m = (vals == 6)
                if m.any():
                    print(f"[STATE] using numeric '{var}' @ {MASK_PERIOD} | CA={int(m.sum())}/{m.size}")
                    return m
        except Exception:
            pass
    raise RuntimeError("Cannot construct a California mask from 2024 state codes.")

MASK_CA = get_ca_mask_from_2024()

def get_weights_from_2024():
    for var in ("household_weight","weight","person_weight","sample_weight","cps_weight","asec_weight"):
        try:
            w = _calc(var, WEIGHT_PERIOD).astype(float)
            tot_all = float(np.nansum(w))
            tot_ca  = float(np.nansum(w[MASK_CA]))
            print(f"[WEIGHT] try '{var}' @ {WEIGHT_PERIOD}: total_all={tot_all:,.2f} | total_CA={tot_ca:,.2f}")
            if tot_all > 0 and tot_ca > 0:
                print(f"[WEIGHT] using '{var}' @ {WEIGHT_PERIOD}")
                return w, var
        except Exception:
            continue
    raise RuntimeError("No usable weights with positive totals at 2024.")

W_ALL, WEIGHT_VAR = get_weights_from_2024()
W_CA_ALL = W_ALL[MASK_CA].astype(float)




## Part C — Choose ranking income

To group households into deciles and the Top 1 percent, we need an income measure.  

This step looks through a sequence of candidates:  
- `equiv_household_net_income`  
- `household_net_income`  
- `household_market_income`  
- `adjusted_gross_income`  

We pick the first variable that has at least one finite value on the California sample and a nonzero standard deviation.  
This becomes the **ranking income** for that year.
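
Once a ranking income is chosen, bands are cut on weighted quantiles rather than on row counts. The sketch below shows the difference on made-up numbers; `weighted_quantiles`, `income`, and `weight` are hypothetical names, and interpolating over cumulative weight is one common convention among several.

```python
import numpy as np

def weighted_quantiles(values, weights, qs):
    """Interpolate values against cumulative weight to get weighted quantiles."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cw = np.cumsum(w)                     # running weighted population
    return np.interp(qs * cw[-1], cw, v)  # value at each target cumulative weight

income = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
weight = np.array([1.0, 1.0, 1.0, 1.0, 4.0])  # the richest row represents 4 households

median_unweighted = float(np.median(income))
median_weighted = float(weighted_quantiles(income, weight, np.array([0.5]))[0])

# Split into two bands at the weighted median: 1 = lower half, 2 = upper half.
labels = np.digitize(income, [median_weighted], right=True) + 1
band_weights = [float(weight[labels == b].sum()) for b in (1, 2)]
print(median_unweighted, median_weighted, labels, band_weights)
```

The two bands hold equal weight even though one contains four rows and the other a single row, which is the behavior a population-representative decile table needs.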

---


In [3]:
# Part C — helpers: pick variables and compute weighted bands
def get_income_for_year(year):
    for var in ("equiv_household_net_income","household_net_income",
                "household_market_income","adjusted_gross_income"):
        try:
            arr = _calc(var, year).astype(float)[MASK_CA]
            if np.isfinite(arr).any() and np.nanstd(arr) > 0:
                print(f"[INCOME] {year}: '{var}'")
                return arr, var
        except Exception:
            continue
    raise RuntimeError(f"No usable ranking income at {year}.")

def get_tax_for_year(year, level="state"):
    cands = ("ca_income_tax","state_income_tax") if level=="state" else ("federal_income_tax","income_tax","irs_income_tax")
    for var in cands:
        try:
            arr = _calc(var, year).astype(float)[MASK_CA]
            print(f"[{level.upper()}_TAX] {year}: '{var}'")
            return arr, var
        except Exception:
            continue
    raise RuntimeError(f"No {level} tax variable found at {year}.")

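def wquantiles(values, weights, qs):
    """Weighted quantiles: interpolate values against cumulative weight."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cw = np.cumsum(w)   # running weighted population, sorted by value
    cut = qs * cw[-1]   # target cumulative weight for each quantile
    return np.interp(cut, cw, v)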

def assign_weighted_deciles(values, weights, deciles=10):
    v = np.asarray(values, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    qs = np.linspace(0.0, 1.0, deciles + 1)
    cuts = wquantiles(v, w, qs)
    cuts[-1] = np.nextafter(cuts[-1], cuts[-1] + 1)
    labels = np.digitize(v, cuts[1:-1], right=True) + 1  # 1..10
    return labels.astype(np.int16), cuts


## Part D — Build distribution tables

For each year and level (state and federal):

- Keep only rows with finite income, finite tax, and positive weight.  
- Define **taxpayers** as units with `tax > 0`.  
- Aggregate results for each decile and the Top 1 percent.  
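
The Top 1 percent row is defined by a weighted 99th-percentile income threshold, so the weight it captures depends on how finely the sample resolves the top of the distribution. A minimal sketch on made-up data (50 equally weighted households; all names here are hypothetical):

```python
import numpy as np

TOP_P = 0.01  # top 1 percent

income = np.arange(1.0, 51.0)  # 50 toy households with incomes 1..50
weight = np.ones_like(income)  # each row represents one household

# Weighted 99th percentile via interpolation over cumulative weight.
order = np.argsort(income)
cw = np.cumsum(weight[order])
p99 = float(np.interp((1.0 - TOP_P) * cw[-1], cw, income[order]))

in_top1 = income >= p99
top_share = 100.0 * float(weight[in_top1].sum()) / float(weight.sum())
print(p99, int(in_top1.sum()), top_share)
```

With only 50 equally weighted rows the smallest possible group is 2 percent of the weight, so the row cannot hold exactly 1 percent. The same effect, in milder form, explains why band shares in a weighted table can drift away from their nominal size.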

**Each row reports:**
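- Percent of households (the band's share of weighted California households)  
- Percent of taxpayers (the band's share of weighted households with `tax > 0`)  
- Mean tax dollars per household (averaged over all households in the band, including zeros and negatives)  
- Share of total tax (the band's weighted tax as a percent of total weighted tax; negative when the band's net tax is negative)  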

**Why include both households and taxpayers**  
- Percent of households shows population composition  
- Percent of taxpayers highlights who actually owes positive tax  
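
The four row metrics can be worked through by hand on a toy sample. The numbers and the `band` assignment below are invented for illustration; the formulas follow the definitions above.

```python
import numpy as np

tax = np.array([-500.0, 0.0, 1000.0, 4000.0])     # made-up household tax
weight = np.array([100.0, 100.0, 100.0, 100.0])   # made-up household weights
band = np.array([1, 1, 2, 2])                     # two income bands for brevity

payers = tax > 0                                  # taxpayers = positive liability
total_w = float(weight.sum())
total_w_payers = float(weight[payers].sum())
total_tax = float((tax * weight).sum())

rows = {}
for b in (1, 2):
    m = band == b
    w_bin = float(weight[m].sum())
    tax_bin = float((tax[m] * weight[m]).sum())
    rows[b] = {
        "pct_households": 100.0 * w_bin / total_w,
        "pct_taxpayers": 100.0 * float(weight[m & payers].sum()) / total_w_payers,
        "mean_tax": tax_bin / w_bin,
        "share_of_tax": 100.0 * tax_bin / total_tax,
    }

for b, r in rows.items():
    print(b, r)
```

The lower band has a negative mean tax and a negative share of total tax because its refundable amounts outweigh its liabilities, while the shares across bands still sum to 100 percent.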

---

In [4]:
# Part D — one function to build the “pretty” table (baseline only)
def pretty_distribution(year, level="state"):
    # inputs
    inc, inc_var = get_income_for_year(year)
    tax, tax_var = get_tax_for_year(year, level)
    w            = W_CA_ALL.copy()

    # keep usable rows
    keep = np.isfinite(inc) & np.isfinite(tax) & np.isfinite(w) & (w > 0)
    inc, tax, w = inc[keep], tax[keep], w[keep]

    # deciles and top1 cut
    labels, cuts = assign_weighted_deciles(inc, w, deciles=10)
    p99 = wquantiles(inc, w, np.array([1.0 - TOP_P]))[0]
    in_top1 = inc >= p99

    # taxpayers = positive liability
    payers = tax > 0
    total_w_households = float(np.sum(w))
    total_w_payers     = float(np.sum(w[payers])) if np.any(payers) else np.nan
    total_tax          = float(np.sum(tax * w))

    rows = []

    # 10 decile bands
    for d in range(1, 11):
        m = (labels == d)
        w_bin = float(np.sum(w[m]))
        w_payers_bin = float(np.sum(w[m & payers]))
        tax_bin = float(np.sum((tax[m]) * (w[m])))

        band_label = f"{(d-1)*10}-{d*10}%"
        lo, hi = cuts[d-1], cuts[d]

        rows.append({
            "year": year,
            "level": level,
            "band": band_label,
            "income_min": lo,
            "income_max": hi,
            "% of households": 100.0 * w_bin / total_w_households if total_w_households else np.nan,
            "% of taxpayers": 100.0 * w_payers_bin / total_w_payers if total_w_payers else np.nan,
            "mean tax $ (household avg)": (tax_bin / w_bin) if w_bin else np.nan,
            "share of total tax %": 100.0 * tax_bin / total_tax if total_tax else np.nan
        })

    # Top 1%
    w_top1 = float(np.sum(w[in_top1]))
    w_top1_payers = float(np.sum(w[in_top1 & payers]))
    tax_top1 = float(np.sum((tax[in_top1]) * (w[in_top1])))
    rows.append({
        "year": year,
        "level": level,
        "band": "Top 1%",
        "income_min": p99,
        "income_max": float(np.max(inc)) if inc.size else np.nan,
        "% of households": 100.0 * w_top1 / total_w_households if total_w_households else np.nan,
        "% of taxpayers": 100.0 * w_top1_payers / total_w_payers if total_w_payers else np.nan,
        "mean tax $ (household avg)": (tax_top1 / w_top1) if w_top1 else np.nan,
        "share of total tax %": 100.0 * tax_top1 / total_tax if total_tax else np.nan
    })

    out = pd.DataFrame(rows)
    order = [f"{i*10}-{(i+1)*10}%" for i in range(10)] + ["Top 1%"]
    out["band"] = pd.Categorical(out["band"], categories=order, ordered=True)
    out = out.sort_values(["year","level","band"]).reset_index(drop=True)

    print(f"\nBuilt {level.upper()} distribution for {year}")
    return out


## Part F — Compute revenue totals

Beyond distributions, we want total tax collections.  

For each year and level:
- Multiply each household’s tax by its weight and sum the products to get **baseline revenue**  
- Also compute weighted counts of:
  - All households  
  - Taxpayers with `tax > 0`  

This produces a compact revenue table that ties the distribution back to a single bottom-line number.
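
A quick way to see how the revenue total ties back to the distribution is to check that band-level tax totals add up to it. The sketch below uses invented numbers and hypothetical variable names. When applying the same check to a decile table, the Top 1 percent row would be left out because its households are already counted in the decile rows.

```python
import numpy as np

tax = np.array([-200.0, 0.0, 1500.0, 6000.0])   # made-up household tax
weight = np.array([50.0, 150.0, 120.0, 80.0])   # made-up household weights
band = np.array([1, 1, 2, 2])                   # toy band labels

# Bottom-line totals: weighted tax and weighted counts.
revenue = float(np.sum(tax * weight))
weighted_households = float(weight.sum())
weighted_taxpayers = float(weight[tax > 0].sum())

# Weighted tax per band; these should add up to the revenue total.
band_totals = {
    int(b): float(np.sum(tax[band == b] * weight[band == b]))
    for b in np.unique(band)
}

print(revenue, weighted_households, weighted_taxpayers, band_totals)
```

If the band totals and the revenue figure disagree, the usual cause is that the two calculations filtered different rows.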

---

In [5]:
# Part E — run for 2024 and 2025 (STATE and FEDERAL), baseline only
tables = []
for y in YEARS:
    tables.append(pretty_distribution(y, level="state"))
    tables.append(pretty_distribution(y, level="federal"))

dist_all = pd.concat(tables, ignore_index=True)


[INCOME] 2024: 'equiv_household_net_income'
[STATE_TAX] 2024: 'ca_income_tax'

Built STATE distribution for 2024
[INCOME] 2024: 'equiv_household_net_income'
[FEDERAL_TAX] 2024: 'income_tax'

Built FEDERAL distribution for 2024
[INCOME] 2025: 'equiv_household_net_income'
[STATE_TAX] 2025: 'ca_income_tax'

Built STATE distribution for 2025
[INCOME] 2025: 'equiv_household_net_income'
[FEDERAL_TAX] 2025: 'income_tax'

Built FEDERAL distribution for 2025


In [6]:
# Part F — build baseline revenue tables for 2024 & 2025 (CA residents; state & federal)

def revenue_table(year, level="state"):
    # Grab taxes for the year/level and the CA weights you built in Part B
    tax, tax_var = get_tax_for_year(year, level)
    w            = W_CA_ALL.copy()

    # Keep finite rows with positive weights
    keep = np.isfinite(tax) & np.isfinite(w) & (w > 0)
    tax, w = tax[keep], w[keep]

    total_revenue = float(np.sum(tax * w))
    weighted_taxpayers = float(np.sum(w[tax > 0])) if np.any(tax > 0) else 0.0
    weighted_households = float(np.sum(w))

    return pd.DataFrame([{
        "year": year,
        "level": level,
        "revenue_baseline_$": total_revenue,
        "weighted_households": weighted_households,
        "weighted_taxpayers": weighted_taxpayers
    }])

# Build both tables for both years
rev_rows = []
for y in YEARS:
    rev_rows.append(revenue_table(y, level="state"))
    rev_rows.append(revenue_table(y, level="federal"))

rev_all = pd.concat(rev_rows, ignore_index=True).sort_values(["year","level"]).reset_index(drop=True)

# Split by level (just for convenience when saving)
rev_state  = rev_all[rev_all["level"]=="state"].drop(columns=["level"]).reset_index(drop=True)
rev_fed    = rev_all[rev_all["level"]=="federal"].drop(columns=["level"]).reset_index(drop=True)


[STATE_TAX] 2024: 'ca_income_tax'
[FEDERAL_TAX] 2024: 'income_tax'
[STATE_TAX] 2025: 'ca_income_tax'
[FEDERAL_TAX] 2025: 'income_tax'


In [7]:
# Part G — save CSVs for both distribution and revenue (baseline only)

# Split the combined distribution table by level
dist_state = dist_all[dist_all["level"]=="state"].drop(columns=["level"]).reset_index(drop=True)
dist_fed   = dist_all[dist_all["level"]=="federal"].drop(columns=["level"]).reset_index(drop=True)

# Save
dist_state.to_csv("CA_state_distribution_2024_2025_baseline.csv", index=False)
dist_fed.to_csv("CA_federal_distribution_2024_2025_baseline.csv", index=False)
rev_state.to_csv("CA_state_revenue_2024_2025_baseline.csv", index=False)
rev_fed.to_csv("CA_federal_revenue_2024_2025_baseline.csv", index=False)

print("Saved:")
print("  CA_state_distribution_2024_2025_baseline.csv")
print("  CA_federal_distribution_2024_2025_baseline.csv")
print("  CA_state_revenue_2024_2025_baseline.csv")
print("  CA_federal_revenue_2024_2025_baseline.csv")


Saved:
  CA_state_distribution_2024_2025_baseline.csv
  CA_federal_distribution_2024_2025_baseline.csv
  CA_state_revenue_2024_2025_baseline.csv
  CA_federal_revenue_2024_2025_baseline.csv


In [8]:
# Part H — preview summaries

print("\nSTATE — distribution (first 12 rows)")
display(dist_state.head(12))

print("\nSTATE — revenue totals")
display(rev_state)

print("\nFEDERAL — distribution (first 12 rows)")
display(dist_fed.head(12))

print("\nFEDERAL — revenue totals")
display(rev_fed)



STATE — distribution (first 12 rows)


|  | year | band | income_min | income_max | % of households | % of taxpayers | mean tax $ (household avg) | share of total tax % |
|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 0-10% | -137380.90625 | 14911.34 | 9.706951 | 4.194656 | 129.377763 | 0.156291 |
| 1 | 2024 | 10-20% | 14911.335099 | 24641.46 | 9.386637 | 2.276735 | -458.848843 | -0.536008 |
| 2 | 2024 | 20-30% | 24641.462352 | 31616.5 | 8.927419 | 5.401206 | -127.389373 | -0.141531 |
| 3 | 2024 | 30-40% | 31616.500197 | 36169.25 | 9.528479 | 9.432212 | -511.029206 | -0.605984 |
| 4 | 2024 | 40-50% | 36169.24572 | 53586.87 | 12.277292 | 10.37719 | 945.259405 | 1.444259 |
| 5 | 2024 | 50-60% | 53586.874273 | 56979.95 | 9.829036 | 13.942015 | 1298.397562 | 1.588218 |
| 6 | 2024 | 60-70% | 56979.945111 | 78196.01 | 10.250772 | 11.654886 | 2711.158229 | 3.458621 |
| 7 | 2024 | 70-80% | 78196.008275 | 132291.4 | 9.496844 | 13.444647 | 1642.810438 | 1.941594 |
| 8 | 2024 | 80-90% | 132291.387524 | 173813.1 | 9.297193 | 13.215251 | 11254.174305 | 13.021383 |
| 9 | 2024 | 90-100% | 173813.069448 | 2265947.0 | 11.299377 | 16.061202 | 56658.609172 | 79.673156 |



STATE — revenue totals


|  | year | revenue_baseline_$ | weighted_households | weighted_taxpayers |
|---|---|---|---|---|
| 0 | 2024 | 133822100000.0 | 16654030.0 | 11716450.0 |
| 1 | 2025 | 138903900000.0 | 16654030.0 | 11147200.0 |



FEDERAL — distribution (first 12 rows)


|  | year | band | income_min | income_max | % of households | % of taxpayers | mean tax $ (household avg) | share of total tax % |
|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 0-10% | -137380.90625 | 14911.34 | 9.706951 | 4.606349 | 551.835664 | 0.243174 |
| 1 | 2024 | 10-20% | 14911.335099 | 24641.46 | 9.386637 | 3.270611 | -3263.947573 | -1.39084 |
| 2 | 2024 | 20-30% | 24641.462352 | 31616.5 | 8.927419 | 7.932659 | -296.765594 | -0.120272 |
| 3 | 2024 | 30-40% | 31616.500197 | 36169.25 | 9.528479 | 5.75963 | 593.741342 | 0.256829 |
| 4 | 2024 | 40-50% | 36169.24572 | 53586.87 | 12.277292 | 4.417575 | 713.137936 | 0.397466 |
| 5 | 2024 | 50-60% | 53586.874273 | 56979.95 | 9.829036 | 15.094037 | 5674.280366 | 2.531894 |
| 6 | 2024 | 60-70% | 56979.945111 | 78196.01 | 10.250772 | 13.162419 | 8678.741141 | 4.038659 |
| 7 | 2024 | 70-80% | 78196.008275 | 132291.4 | 9.496844 | 13.94307 | 6725.569198 | 2.899561 |
| 8 | 2024 | 80-90% | 132291.387524 | 173813.1 | 9.297193 | 14.360529 | 30455.101578 | 12.853925 |
| 9 | 2024 | 90-100% | 173813.069448 | 2265947.0 | 11.299377 | 17.453121 | 152625.01884 | 78.289603 |



FEDERAL — revenue totals


|  | year | revenue_baseline_$ | weighted_households | weighted_taxpayers |
|---|---|---|---|---|
| 0 | 2024 | 366855900000.0 | 16654030.0 | 10782040.0 |
| 1 | 2025 | 363189400000.0 | 16654030.0 | 10190770.0 |
