# Disparities in Air Pollution Exposure in Texas

This notebook reproduces the analysis for the project **Disparities in Air Pollution Exposure**.
The goal is to examine how PM2.5 concentrations relate to traffic intensity and
socioeconomic characteristics across Texas ZIP Code Tabulation Areas (ZCTAs).

The workflow follows an end-to-end data science pipeline:
data ingestion → preprocessing → modeling → evaluation.


In [8]:
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
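# Project modules.
# Assumption: this notebook sits one level below the project root, so the
# parent directory is put on sys.path to make the `src` package importable.
import sys
sys.path.insert(0, str(Path("..").resolve()))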
from src.load_data import (
    load_pm25_data,
    load_traffic_data,
    load_acs_data,
    load_zcta_state_crosswalk
)

from src.preprocess import (
    clean_pm25,
    clean_traffic,
    clean_acs,
    merge_all,
    filter_texas
)

from src.models import (
    linear_regression,
    lasso_regression,
    random_forest,
    gradient_boosting,
    xgboost_model
)

from src.pipeline import (
    build_rf_pipeline,
    tune_random_forest
)



In [None]:
PROJECT_ROOT = Path("..")
DATA_DIR = PROJECT_ROOT / "data" / "raw"


In [None]:
pm25_raw = load_pm25_data(DATA_DIR)
traffic_raw = load_traffic_data(DATA_DIR)
acs_raw = load_acs_data(DATA_DIR)
crosswalk = load_zcta_state_crosswalk(DATA_DIR)



In [None]:
pm25 = clean_pm25(pm25_raw)
traffic = clean_traffic(traffic_raw)
acs = clean_acs(acs_raw)

df = merge_all(pm25, traffic, acs, crosswalk)
df_tx = filter_texas(df)

df_tx.shape

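The merge and filter helpers live in `src.preprocess` and are not shown here. As a rough illustration only, the step could look like the sketch below. The join key `zcta` and the `STATE` column come from the column names used later in this notebook; the toy values, the `median_income` column, the `"TX"` state code, and the `*_sketch` function names are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables; every value here is made up.
pm25_demo = pd.DataFrame({"zcta": ["75001", "73301", "10001"], "pm25": [9.1, 8.4, 7.2]})
traffic_demo = pd.DataFrame({"zcta": ["75001", "73301", "10001"], "i_mean_traffic": [1200.0, 950.0, 3100.0]})
acs_demo = pd.DataFrame({"zcta": ["75001", "73301", "10001"], "median_income": [68000, 54000, 91000]})
crosswalk_demo = pd.DataFrame({"zcta": ["75001", "73301", "10001"], "STATE": ["TX", "TX", "NY"]})


def merge_all_sketch(pm25, traffic, acs, crosswalk):
    """Inner-join the four tables on the ZCTA key (hypothetical logic)."""
    merged = pm25.merge(traffic, on="zcta", how="inner", validate="one_to_one")
    merged = merged.merge(acs, on="zcta", how="inner", validate="one_to_one")
    # The crosswalk attaches a state label to each ZCTA.
    return merged.merge(crosswalk, on="zcta", how="inner", validate="one_to_one")


def filter_texas_sketch(df, state_code="TX"):
    """Keep only rows whose state label matches Texas (assumed coding)."""
    return df[df["STATE"] == state_code].reset_index(drop=True)


df_demo = merge_all_sketch(pm25_demo, traffic_demo, acs_demo, crosswalk_demo)
df_tx_demo = filter_texas_sketch(df_demo)
print(df_tx_demo.shape)
```

Passing `validate="one_to_one"` makes pandas raise if a ZCTA appears more than once in either table, which is a cheap guard against silently duplicated rows after a join. With multi-year data the key would need to include the year as well.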

In [None]:
TARGET = "pm25"

# Identifier and bookkeeping columns that should not be used as predictors
DROP_COLS = [
    "year", "zcta", "STATE",
    "Geo_FIPS", "Geo_GEOID"
]

X = df_tx.drop(columns=DROP_COLS + [TARGET])
y = df_tx[TARGET]

X.shape, y.shape


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)


In [None]:
traffic_cols = [
    'i_mean_traffic',
    'i_total_traffic',
    'i_mean_hw_traffic',
    'i_total_hw_traffic',
    'i_mean_nonhw_traffic',
    'i_total_nonhw_traffic'
]

corr = df_tx[traffic_cols].corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation of Traffic Variables")
plt.show()

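The heatmap shows the correlations visually. A complementary way to read the same matrix is to list the pairs above a cutoff, which makes it easier to decide which traffic metrics are redundant. The sketch below runs on synthetic data; the 0.9 threshold and the helper name `correlated_pairs` are illustrative assumptions, not values taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two nearly collinear columns and one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "i_mean_traffic": base,
    "i_total_traffic": base * 3.0 + rng.normal(scale=0.05, size=200),
    "i_mean_nonhw_traffic": rng.normal(size=200),
})


def correlated_pairs(frame, threshold=0.9):
    """Return variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


high_pairs = correlated_pairs(demo)
print(high_pairs)
```

Applied to the real data, the call would be `correlated_pairs(df_tx[traffic_cols])`; one variable from each flagged pair could then be dropped or kept as the representative metric.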

In [None]:
lin_model, lin_r2 = linear_regression(
    X_train, y_train, X_test, y_test
)

print(f"Linear Regression Test R²: {lin_r2:.3f}")

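The model helpers in `src.models` all take the train/test split and return a `(model, r2)` pair. Their bodies are not shown in this notebook; a minimal sketch of what such a helper might look like, assuming it wraps scikit-learn's `LinearRegression` and reports R² on the held-out set, is given below on synthetic data. The name `linear_regression_sketch` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def linear_regression_sketch(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares and return (fitted model, test-set R²)."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    # For regressors, .score() returns the coefficient of determination R².
    test_r2 = model.score(X_test, y_test)
    return model, test_r2


# Synthetic linear data with a small amount of noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)
demo_model, demo_r2 = linear_regression_sketch(Xtr, ytr, Xte, yte)
print(f"Demo test R²: {demo_r2:.3f}")
```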

In [None]:
lasso_model, lasso_r2 = lasso_regression(
    X_train, y_train, X_test, y_test
)

print(f"Lasso Regression Test R²: {lasso_r2:.3f}")

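Lasso is a natural companion to the correlation check above because its L1 penalty can shrink redundant coefficients to exactly zero. The sketch below illustrates that behavior on synthetic data where only two of five features matter. The scaling step, the `alpha=0.1` value, and the variable names are assumptions for the example and may differ from what `lasso_regression` does internally.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five features drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Standardize first so the penalty treats all features on the same scale.
lasso_pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso_pipe.fit(X_demo, y_demo)

coefs = lasso_pipe.named_steps["lasso"].coef_
n_zero = int(np.sum(coefs == 0.0))
print("Coefficients:", np.round(coefs, 3))
print("Features dropped by the L1 penalty:", n_zero)
```

In practice the penalty strength is usually chosen by cross-validation (for example with `LassoCV`) rather than fixed by hand.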

In [None]:
rf_model, rf_r2 = random_forest(X_train, y_train, X_test, y_test)
gb_model, gb_r2 = gradient_boosting(X_train, y_train, X_test, y_test)
xgb_model, xgb_r2 = xgboost_model(X_train, y_train, X_test, y_test)

results = pd.DataFrame({
    "Model": [
        "Random Forest",
        "Gradient Boosting",
        "XGBoost"
    ],
    "Test R2": [
        rf_r2,
        gb_r2,
        xgb_r2
    ]
})

results


In [None]:
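from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize so that no single feature dominates the components.
# This projection is for visualization only, so it is fit on all rows of X.
scaled_X = StandardScaler().fit_transform(X)

# Project onto the first two principal components
pca = PCA(n_components=2)
pca_proj = pca.fit_transform(scaled_X)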

plt.figure(figsize=(7,6))
plt.scatter(
    pca_proj[:, 0],
    pca_proj[:, 1],
    c=y,
    cmap="viridis",
    s=6
)
plt.colorbar(label="PM2.5")
plt.title("PCA Projection Colored by PM2.5")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

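A two-dimensional PCA scatter is only informative if the first two components capture a meaningful share of the variance. The sketch below shows how that share can be checked with `explained_variance_ratio_`, using synthetic data built from two latent factors; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six observed features that are noisy mixtures of two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + rng.normal(scale=0.1, size=(500, 6))

X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each retained component.
explained = pca_demo.explained_variance_ratio_
print("Per-component share:", np.round(explained, 3))
print("Cumulative share of PC1 + PC2:", round(float(explained.sum()), 3))
```

For the notebook's own projection, the equivalent check is `pca.explained_variance_ratio_` on the fitted `pca` object.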

In [None]:
rf_pipe = build_rf_pipeline(use_pca=False)
rf_pipe.fit(X_train, y_train)

pipe_r2 = rf_pipe.score(X_test, y_test)
print(f"Pipeline Random Forest Test R²: {pipe_r2:.3f}")

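`build_rf_pipeline` is defined in `src.pipeline` and its body is not shown here. One plausible shape for such a factory, assuming it chains imputation, scaling, an optional PCA step, and a random forest, is sketched below. The step names, the median imputation, and the default hyperparameters are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_rf_pipeline_sketch(use_pca=False, n_components=2, random_state=0):
    """Assemble preprocessing and a random forest into one estimator (hypothetical)."""
    steps = [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
    if use_pca:
        # Optional dimensionality reduction before the forest.
        steps.append(("pca", PCA(n_components=n_components)))
    steps.append(("rf", RandomForestRegressor(n_estimators=50, random_state=random_state)))
    return Pipeline(steps)


# Synthetic nonlinear data for a quick smoke test.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=120)

pipe_plain = build_rf_pipeline_sketch(use_pca=False)
pipe_pca = build_rf_pipeline_sketch(use_pca=True)

pipe_plain.fit(X_demo[:100], y_demo[:100])
demo_r2 = pipe_plain.score(X_demo[100:], y_demo[100:])
print(f"Demo pipeline test R²: {demo_r2:.3f}")
```

Wrapping preprocessing in the pipeline means the imputer and scaler are fit on the training data only, which avoids leaking test-set statistics into the model.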

In [None]:
grid = tune_random_forest(rf_pipe, X_train, y_train)

# Test-set R² of the best estimator selected by the search
print(f"Tuned Random Forest Test R²: {grid.score(X_test, y_test):.3f}")
grid.best_params_

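The object returned by `tune_random_forest` exposes `score` and `best_params_`, which is consistent with a scikit-learn search object such as `GridSearchCV`. A minimal sketch of how such a helper could be written is shown below; the parameter grid, the 3-fold cross-validation, and the `rf` step name are assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def tune_random_forest_sketch(pipe, X_train, y_train):
    """Grid-search forest hyperparameters with cross-validation (hypothetical grid)."""
    # Keys use the "<step name>__<parameter>" convention for pipeline steps.
    param_grid = {
        "rf__n_estimators": [20, 50],
        "rf__max_depth": [3, None],
    }
    search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
    search.fit(X_train, y_train)
    return search


# Synthetic nonlinear data, split into train and test by position.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

demo_pipe = Pipeline([("rf", RandomForestRegressor(random_state=0))])
grid_demo = tune_random_forest_sketch(demo_pipe, X_demo[:120], y_demo[:120])

# best_score_ is the mean cross-validated R²; score() evaluates the refit model on new data.
print("Best CV R²:", round(grid_demo.best_score_, 3))
print("Test R²:", round(grid_demo.score(X_demo[120:], y_demo[120:]), 3))
print(grid_demo.best_params_)
```

Only the training split is passed to the search, so the test set stays untouched until the final `score` call.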

## Conclusion

- The six traffic variables are highly correlated with one another, so representative metrics were selected.
- Tree-based models outperform both ordinary linear regression and Lasso regression.
- Random Forest regression achieved the best predictive performance (R² ≈ 0.80).
- Socioeconomic and demographic characteristics show differentiated associations
  with PM2.5 exposure across Texas ZCTAs.

This notebook demonstrates a reproducible, modular data science workflow
suitable for academic research and applied policy analysis.
