# 02 — Lasso feature discovery (Omitted Variable Bias)

Identify high-impact omitted variables from 1,200+ raw NHGIS columns using **Lasso (L1) regularization**. All screening logic lives in `scripts/advanced_metrics.run_lasso_feature_selection`; this notebook only runs it and displays results.

- **Raw data**: read-only from `data/raw/nhgis/` (no source CSVs modified).
- **Output**: `output/lasso_feature_shortlist.csv` with NHGIS codes, standardized coefficients, and codebook mapping (B25014 Overcrowding, B25070 Rent squeeze prioritized).

In [12]:
import os
import sys
import pandas as pd

REPO_ROOT = os.path.dirname(os.getcwd()) if os.path.basename(os.getcwd()) == "notebooks" else os.getcwd()
sys.path.insert(0, os.path.join(REPO_ROOT, "scripts"))
DATA_DIR = os.path.join(REPO_ROOT, "data")
OUTPUT_DIR = os.path.join(REPO_ROOT, "output")
os.makedirs(OUTPUT_DIR, exist_ok=True)

from advanced_metrics import run_lasso_feature_selection, NHGIS_CODEBOOK

## Load and examine codebook

The codebook is built from the first two rows of each raw NHGIS CSV: row 1 holds the column codes and row 2 holds the variable descriptions. Columns whose description starts with "Margins of error" are excluded from Lasso.

In [14]:
import os
import glob
import pandas as pd

REPO_ROOT = os.path.dirname(os.getcwd()) if os.path.basename(os.getcwd()) == "notebooks" else os.getcwd()
DATA_DIR = os.path.join(REPO_ROOT, "data")
raw_nhgis = os.path.join(DATA_DIR, "raw", "nhgis")
if not os.path.isdir(raw_nhgis):
    raw_nhgis = DATA_DIR
nhgis_pattern = os.path.join(raw_nhgis, "nhgis*.csv")
nhgis_files = sorted(glob.glob(nhgis_pattern))

# Build the codebook from the first two rows of each CSV:
# index 0 holds the codes, index 1 holds the descriptions.
# Same logic as scripts/ingest_nhgis.load_nhgis_codebook.
codebook = {}
exclude_set = set()
for filepath in nhgis_files:
    head = pd.read_csv(filepath, header=None, nrows=2, low_memory=False)
    if head.shape[0] < 2:
        continue
    names = head.iloc[0].astype(str).tolist()
    descs = head.iloc[1].tolist()
    n = min(len(names), len(descs))
    for i in range(n):
        col = names[i]
        desc = descs[i] if pd.notna(descs[i]) else ""
        desc_str = str(desc).strip()
        if col not in codebook:
            codebook[col] = desc_str
        if desc_str.lower().startswith("margins of error"):
            exclude_set.add(col)

# As a DataFrame for easier inspection
codebook_df = pd.DataFrame([
    {"nhgis_code": k, "description": v, "is_margin_of_error": k in exclude_set}
    for k, v in codebook.items()
])
codebook_df = codebook_df.sort_values("nhgis_code").reset_index(drop=True)

# Examine
display(codebook_df)

| | nhgis_code | description | is_margin_of_error |
|---|---|---|---|
| 0 | AIANHHA | American Indian Area/Alaska Native Area/Hawaii... | False |
| 1 | AIHHTLI | American Indian/Hawaiian Home Land Trust Land ... | False |
| 2 | AITSA | Tribal Subdivision/Remainder Code | False |
| 3 | ANRCA | Alaska Native Regional Corporation Code | False |
| 4 | AU08E001 | Estimates: Total | False |
| ... | ... | ... | ... |
| 1637 | TRACTA | Census Tract Code | False |
| 1638 | TRUSTA | American Indian Area (Off-Reservation Trust La... | False |
| 1639 | UAA | Urban Area Code | False |
| 1640 | YEAR | Data File Year | False |


## Run Lasso feature selection

`run_lasso_feature_selection` loads the raw NHGIS data from `data/raw/nhgis/`, computes the household-level **Multigen_Rate** (AU46E002 / AU46E001 × 100), keeps the 100 columns most correlated with the target, fits **LassoCV** on standardized features, and writes the top 30 non-zero coefficients to `output/lasso_feature_shortlist.csv`.
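The target formula above divides by a household total, which may be zero or missing for some geographic units. The following is a minimal sketch of one way to compute the rate defensively. The helper name `multigen_rate` and the choice to leave zero-denominator rows as NaN are assumptions for illustration, not necessarily what `run_lasso_feature_selection` does internally.

```python
import pandas as pd


def multigen_rate(df, numerator="AU46E002", denominator="AU46E001"):
    """Return numerator / denominator * 100, with NaN where the denominator is not positive."""
    num = pd.to_numeric(df[numerator], errors="coerce")
    den = pd.to_numeric(df[denominator], errors="coerce")
    # Mask non-positive totals so the division yields NaN instead of inf.
    den = den.where(den > 0)
    return num / den * 100


# Toy rows (made-up counts, not NHGIS data).
demo = pd.DataFrame({"AU46E001": [200, 0, 50], "AU46E002": [10, 0, 5]})
demo["Multigen_Rate"] = multigen_rate(demo)
print(demo)
```

Rows with a NaN target would then need to be dropped before fitting any model.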

In [16]:
results = run_lasso_feature_selection(
    data_dir=DATA_DIR,
    output_path=os.path.join(OUTPUT_DIR, "lasso_feature_shortlist.csv"),
    target_col="Multigen_Rate",
    top_corr_n=100,
    top_nonzero_n=30,
    n_alphas=100,
    cv=5,
)

print(f"Optimal Lasso alpha: {results['optimal_alpha']:.6f}")
print(f"Shortlist written to: {results['output_path']}")
print(f"Top 30 non-zero features: {len(results['shortlist_codes'])}")


Optimal Lasso alpha: 0.007599
Shortlist written to: /Users/elyas/vscode/capstone_multigen_housing_econometric_analysis/output/lasso_feature_shortlist.csv
Top 30 non-zero features: 30
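
For readers without access to `scripts/advanced_metrics`, the steps described above (correlation screen, standardization, LassoCV, top non-zero coefficients) can be sketched roughly as follows. This is an illustration on synthetic data, not the project's implementation: the function name `lasso_shortlist`, the use of absolute Pearson correlation for the screen, and the explicit alpha grid are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def lasso_shortlist(X, y, top_corr_n=100, top_nonzero_n=30, cv=5):
    """Screen columns by |correlation| with y, fit LassoCV, return the largest non-zero coefficients."""
    # 1. Correlation screen (assumed here to use absolute Pearson correlation).
    corr = X.corrwith(y).abs().dropna().sort_values(ascending=False)
    keep = list(corr.index[:top_corr_n])
    # 2. Standardize so coefficient magnitudes are comparable across columns.
    Xs = StandardScaler().fit_transform(X[keep])
    # 3. Cross-validated Lasso over an explicit (assumed) alpha grid.
    model = LassoCV(alphas=np.logspace(-3, 0, 30), cv=cv, max_iter=10000).fit(Xs, y)
    coefs = pd.Series(model.coef_, index=keep)
    nonzero = coefs[coefs != 0]
    # 4. Rank surviving features by absolute standardized coefficient.
    order = list(nonzero.abs().sort_values(ascending=False).index[:top_nonzero_n])
    out = pd.DataFrame({
        "nhgis_code": order,
        "standardized_coef": nonzero[order].to_numpy(),
    })
    out["abs_coef"] = out["standardized_coef"].abs()
    return out, model.alpha_


# Synthetic data: only x0 and x1 actually drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 20)), columns=[f"x{i}" for i in range(20)])
y_demo = 3 * X_demo["x0"] - 2 * X_demo["x1"] + rng.normal(scale=0.5, size=300)

shortlist_demo, alpha_demo = lasso_shortlist(X_demo, y_demo, top_corr_n=10, top_nonzero_n=5)
print(shortlist_demo)
```

Because the features are standardized before fitting, the coefficients are in target units per standard deviation of each feature, which is what makes ranking by `abs_coef` meaningful.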


## Shortlist: NHGIS codes and standardized coefficients

Exact NHGIS codes with standardized coefficient values. **Codebook mapping** prioritizes Table **B25014** (Occupants per room / Overcrowding) and **B25070** (Gross rent as % of income / Economic squeeze).

In [17]:
shortlist = results["shortlist"]
display(shortlist)

| | nhgis_code | standardized_coef | abs_coef | codebook_table |
|---|---|---|---|---|
| 0 | AU46E026 | -7.612892 | 7.612892 | B11017 — Household type (incl. multigenerational) |
| 1 | AVA1E001 | 1.972548 | 1.972548 | B19083 — Gini index |
| 2 | AUOVE007 | 1.560446 | 1.560446 | B01001 — Sex by age |
| 3 | AU46E002 | 1.171221 | 1.171221 | B11017 — Household type (incl. multigenerational) |
| 4 | AUOVM007 | -0.981149 | 0.981149 | — (see NHGIS codebook) |
| 5 | AUVGE001 | 0.866782 | 0.866782 | — (see NHGIS codebook) |
| 6 | AURNE008 | -0.770895 | 0.770895 | — (see NHGIS codebook) |
| 7 | AUPWM019 | -0.721026 | 0.721026 | — (see NHGIS codebook) |
| 8 | AUPWE019 | 0.683256 | 0.683256 | B08301 — Commute |
| 9 | AUOVE031 | 0.65738 | 0.65738 | B01001 — Sex by age |


In [18]:
# Highlight rows that map to B25014 (Overcrowding) or B25070 (Rent squeeze)
b25014 = shortlist["codebook_table"].str.contains("B25014", na=False)
b25070 = shortlist["codebook_table"].str.contains("B25070", na=False)
print("Rows mapping to B25014 (Overcrowding):")
print(shortlist.loc[b25014].to_string())
print()
print("Rows mapping to B25070 (Rent squeeze):")
print(shortlist.loc[b25070].to_string())

Rows mapping to B25014 (Overcrowding):
Empty DataFrame
Columns: [nhgis_code, standardized_coef, abs_coef, codebook_table]
Index: []

Rows mapping to B25070 (Rent squeeze):
Empty DataFrame
Columns: [nhgis_code, standardized_coef, abs_coef, codebook_table]
Index: []
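
The displayed shortlist includes AU46E002, which is the numerator of Multigen_Rate, and codes such as AUOVM007 whose `M` infix suggests a margin-of-error column. A small sanity check along the following lines could flag such rows before they are carried forward. This is only a sketch: the helper `flag_shortlist` is hypothetical, and the margin-of-error set in the demo is assumed from the code pattern. In the notebook, the `exclude_set` built in the codebook cell would be the authoritative source.

```python
import pandas as pd


def flag_shortlist(shortlist, moe_codes, target_inputs):
    """Add boolean columns marking margin-of-error codes and columns used to build the target."""
    out = shortlist.copy()
    out["is_margin_of_error"] = out["nhgis_code"].isin(moe_codes)
    out["is_target_input"] = out["nhgis_code"].isin(target_inputs)
    return out


# First five rows of the shortlist shown earlier in this notebook.
shortlist_sample = pd.DataFrame({
    "nhgis_code": ["AU46E026", "AVA1E001", "AUOVE007", "AU46E002", "AUOVM007"],
    "standardized_coef": [-7.612892, 1.972548, 1.560446, 1.171221, -0.981149],
})

moe_codes = {"AUOVM007"}                  # assumption: inferred from the "M" in the code
target_inputs = {"AU46E001", "AU46E002"}  # columns in the Multigen_Rate formula

flagged = flag_shortlist(shortlist_sample, moe_codes, target_inputs)
print(flagged[flagged["is_margin_of_error"] | flagged["is_target_input"]])
```

A column that enters the target definition will correlate with the target by construction, so a flagged `is_target_input` row is a candidate for exclusion rather than evidence of an omitted variable.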


## Top 3 for schema integration

Copy the **nhgis_code** values below into `scripts/core_metrics.py`: add them to `ANALYSIS_READY_SCHEMA["feature_cols"]` and add human-readable labels to `FEATURE_LABELS`. Raw NHGIS columns are preserved in the wide merge, so these codes will be available in analysis-ready data when included in the schema.

In [19]:
top3 = shortlist.head(3)
print("Top 3 Lasso-selected variables (add to core_metrics.py):")
for _, row in top3.iterrows():
    print(f"  {row['nhgis_code']}: {row['codebook_table']}  (std coef = {row['standardized_coef']:.4f})")

Top 3 Lasso-selected variables (add to core_metrics.py):
  AU46E026: B11017 — Household type (incl. multigenerational)  (std coef = -7.6129)
  AVA1E001: B19083 — Gini index  (std coef = 1.9725)
  AUOVE007: B01001 — Sex by age  (std coef = 1.5604)
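
The schema update described in this section might look like the sketch below. The shapes of `ANALYSIS_READY_SCHEMA` and `FEATURE_LABELS` (a dict holding a `feature_cols` list, and a code-to-label dict) are assumptions, since `scripts/core_metrics.py` is not shown here. The helper `register_features` and the label wording are hypothetical.

```python
# Hypothetical stand-ins for the objects defined in scripts/core_metrics.py.
ANALYSIS_READY_SCHEMA = {"feature_cols": []}
FEATURE_LABELS = {}

# Codes taken from the printed top 3; labels are illustrative wording only.
top3_labels = {
    "AU46E026": "B11017: Household type (incl. multigenerational)",
    "AVA1E001": "B19083: Gini index",
    "AUOVE007": "B01001: Sex by age",
}


def register_features(schema, labels, new_labels):
    """Append new codes to schema['feature_cols'] and record labels, skipping duplicates."""
    for code, label in new_labels.items():
        if code not in schema["feature_cols"]:
            schema["feature_cols"].append(code)
        # Keep any label that already exists rather than overwriting it.
        labels.setdefault(code, label)


register_features(ANALYSIS_READY_SCHEMA, FEATURE_LABELS, top3_labels)
register_features(ANALYSIS_READY_SCHEMA, FEATURE_LABELS, top3_labels)  # second call is a no-op
print(ANALYSIS_READY_SCHEMA["feature_cols"])
```

Making the update idempotent means the cell can be re-run without duplicating entries in `feature_cols`.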
