# Capstone Project: Environmental and Socioeconomic Factors Impacting Cardiovascular Disease (CVD) in California
This project investigates the relationship between environmental exposures, social vulnerabilities, and cardiovascular disease (CVD) prevalence across communities in California. Using the CalEnviroScreen 4.0 dataset, which compiles detailed metrics at the census-tract level, the analysis explores how factors such as air quality (e.g., PM2.5, ozone), pollution exposure (e.g., diesel particulate matter, toxic releases), and demographic pressures (e.g., poverty, housing burden, linguistic isolation) contribute to disparities in cardiovascular health outcomes.

## Data Source
Link: https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-40

The primary dataset for this analysis is CalEnviroScreen 4.0, published by the California Office of Environmental Health Hazard Assessment (OEHHA). The results sheet loaded below has 58 columns covering 8,035 census tracts in California. It contains environmental indicators (e.g., ozone, PM2.5, diesel particulate matter), population health data (e.g., asthma, low birth weight, cardiovascular disease), and demographic factors (e.g., poverty rate, educational attainment, housing burden).

## Goals
- Understand the spatial and statistical patterns of cardiovascular disease across the state
- Identify which environmental and socioeconomic factors are most strongly correlated with CVD rates
- Build a baseline regression model to predict CVD risk based on these features
- Inform public health and environmental justice efforts by identifying high-risk communities

### 1. Project Setup

#### Import Libraries

In [71]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

#### Load Dataset


In [73]:

# Define file path and target sheet name
file_path = "../data/calenviroscreen40resultsdatadictionary_F_2021.xlsx"
sheet_name = "CES4.0FINAL_results"

# Fail early with a clear message if the workbook is missing
if not os.path.exists(file_path):
    raise FileNotFoundError(f"File not found: '{file_path}'. Please ensure the file is in the '../data/' directory.")

# Open the workbook once so the sheet list can be checked before parsing.
# Errors are allowed to propagate: catching and printing them would leave
# `df` undefined and make later cells fail with a misleading NameError.
with pd.ExcelFile(file_path) as xls:
    if sheet_name not in xls.sheet_names:
        raise ValueError(f"Sheet '{sheet_name}' not found in '{file_path}'. Available sheets: {xls.sheet_names}")

    # Load the sheet into a DataFrame
    df = pd.read_excel(xls, sheet_name=sheet_name)

print("Dataset loaded successfully!")

Dataset loaded successfully!


#### Overview of dataframe

In [75]:
print(f"Dataset Shape: {df.shape}")
df.info()
df.head(3)


Dataset Shape: (8035, 58)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8035 entries, 0 to 8034
Data columns (total 58 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Census Tract                 8035 non-null   int64  
 1   Total Population             8035 non-null   int64  
 2   California County            8035 non-null   object 
 3   ZIP                          8035 non-null   int64  
 4   Approximate Location         8035 non-null   object 
 5   Longitude                    8035 non-null   float64
 6   Latitude                     8035 non-null   float64
 7   CES 4.0 Score                7932 non-null   float64
 8   CES 4.0 Percentile           7932 non-null   float64
 9   CES 4.0 Percentile Range     7932 non-null   object 
 10  Ozone                        8035 non-null   float64
 11  Ozone Pctl                   8035 non-null   float64
 12  PM2.5                        8035 non-null   float

| | Census Tract | Total Population | California County | ZIP | Approximate Location | Longitude | Latitude | CES 4.0 Score | CES 4.0 Percentile | CES 4.0 Percentile Range | ... | Linguistic Isolation Pctl | Poverty | Poverty Pctl | Unemployment | Unemployment Pctl | Housing Burden | Housing Burden Pctl | Pop. Char. | Pop. Char. Score | Pop. Char. Pctl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6019001100 | 2780 | Fresno | 93706 | Fresno | -119.781696 | 36.709695 | 93.18357 | 100.0 | 95-100% (highest scores) | ... | 79.374746 | 76.0 | 98.919598 | 12.8 | 93.831338 | 30.3 | 91.03929 | 93.155109 | 9.663213 | 99.722642 |
| 1 | 6077000700 | 4680 | San Joaquin | 95206 | Stockton | -121.287873 | 37.943173 | 86.65379 | 99.987393 | 95-100% (highest scores) | ... | 95.533902 | 73.2 | 98.39196 | 19.8 | 99.206143 | 31.2 | 92.281369 | 93.165408 | 9.664281 | 99.73525 |
| 2 | 6037204920 | 2751 | Los Angeles | 90023 | Los Angeles | -118.197497 | 34.0175 | 82.393909 | 99.974786 | 95-100% (highest scores) | ... | 81.553661 | 62.6 | 93.39196 | 6.4 | 61.530453 | 20.3 | 63.967047 | 83.751814 | 8.687785 | 95.789208 |


### 2. Data Preparation

#### Check for missing values

In [78]:
# Check total missing values per column
missing = df.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

Unemployment Pctl              335
Unemployment                   335
Linguistic Isolation Pctl      320
Linguistic Isolation           320
Low Birth Weight               227
Low Birth Weight Pctl          227
Housing Burden Pctl            145
Housing Burden                 145
CES 4.0 Score                  103
CES 4.0 Percentile             103
Pop. Char. Score               103
Pop. Char.                     103
Education Pctl                 103
Education                      103
Pop. Char. Pctl                103
CES 4.0 Percentile Range       103
Lead Pctl                       96
Lead                            96
Poverty                         75
Poverty Pctl                    75
Traffic Pctl                    35
Traffic                         35
Drinking Water Pctl             28
Drinking Water                  28
Cardiovascular Disease          11
Asthma Pctl                     11
Asthma                          11
Cardiovascular Disease Pctl     11
dtype: int64
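
Raw counts are easier to judge against the size of the dataset. For reference, 335 missing values out of 8,035 tracts is roughly 4% of rows. The sketch below shows one way to report counts and percentages together. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns with at least one gap."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `df`; the values are illustrative only
toy = pd.DataFrame({
    "Unemployment": [12.8, np.nan, 6.4, np.nan],
    "Poverty": [76.0, 73.2, np.nan, 65.7],
    "Ozone": [0.060, 0.046, 0.048, 0.060],
})

result = missing_summary(toy)
print(result)
```

In the notebook itself, the same helper could be called as `missing_summary(df)`.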

In [79]:
# Drop rows that have a missing value in any of the 58 columns
df_cleaned = df.dropna()

print(f"Cleaned dataset shape: {df_cleaned.shape}")


Cleaned dataset shape: (7355, 58)
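
Calling `dropna()` with no arguments removes a tract if any column is missing, including columns that are not used later in the analysis. One alternative, shown here only as a sketch, is to restrict the check with the `subset` parameter. The toy frame and the `columns_needed` list are assumptions for illustration. Applying this to the real data would change the row count reported above.

```python
import numpy as np
import pandas as pd

# Toy frame; "CES 4.0 Score" plays the role of a column not needed for modeling
toy = pd.DataFrame({
    "CES 4.0 Score": [93.2, np.nan, 82.4, np.nan],
    "PM2.5": [13.9, 11.9, np.nan, 13.5],
    "Cardiovascular Disease": [21.47, 20.26, 20.87, 22.68],
})

# Hypothetical list of columns the analysis actually depends on
columns_needed = ["PM2.5", "Cardiovascular Disease"]

dropped_any = toy.dropna()                            # drops on any missing column
dropped_subset = toy.dropna(subset=columns_needed)    # drops only on needed columns

print(f"Rows kept (any column): {len(dropped_any)}")
print(f"Rows kept (subset only): {len(dropped_subset)}")
```

The trade-off is between a simpler rule (drop everything incomplete) and retaining more tracts for modeling.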


In [80]:
# Count duplicates
duplicate_count = df_cleaned.duplicated().sum()
print(f"Duplicate rows: {duplicate_count}")

# Remove duplicates
df_cleaned = df_cleaned.drop_duplicates()

Duplicate rows: 0
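
`duplicated()` with no arguments only flags rows that match in every column. A tract listed twice with slightly different values would pass that check. As a complementary sketch, duplicates can also be counted on the identifier column alone. The toy frame is illustrative, and whether `Census Tract` is unique in the real data is not verified here.

```python
import pandas as pd

# Toy frame: the second and third rows share a tract ID but differ in population
toy = pd.DataFrame({
    "Census Tract": [6019001100, 6077000700, 6077000700],
    "Total Population": [2780, 4680, 4681],
})

# Whole-row comparison: no two rows are identical in every column
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: the repeated tract ID is flagged
key_dupes = int(toy.duplicated(subset=["Census Tract"]).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicate tract IDs: {key_dupes}")
```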


In [81]:
# Define relevant variables
environmental_factors = [
    "Ozone", "PM2.5", "Diesel PM", "Drinking Water", "Lead", "Pesticides",
    "Tox. Release", "Traffic", "Cleanup Sites", "Groundwater Threats",
    "Haz. Waste", "Imp. Water Bodies", "Solid Waste"
]

social_factors = [
    "Education", "Linguistic Isolation", "Poverty", "Unemployment", "Housing Burden"
]

target_variables = ["Cardiovascular Disease", "Asthma", "Low Birth Weight"]

# Combine into final list of columns of interest
selected_columns = environmental_factors + social_factors + target_variables

# Subset the dataset
df_selected = df_cleaned[selected_columns].copy()

In [82]:
# Display descriptive statistics for all selected variables
summary_stats = df_selected.describe().T
summary_stats = summary_stats[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
summary_stats

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Ozone | 7355.0 | 0.048655 | 0.010423 | 0.026554 | 0.041926 | 0.047165 | 0.0568 | 0.073132 |
| PM2.5 | 7355.0 | 10.241618 | 2.109606 | 3.115233 | 8.606971 | 10.335114 | 11.953324 | 16.394748 |
| Diesel PM | 7355.0 | 0.227852 | 0.258413 | 0.000214 | 0.072389 | 0.15077 | 0.292595 | 4.751602 |
| Drinking Water | 7355.0 | 479.928371 | 218.645377 | 32.568922 | 305.292089 | 433.100961 | 685.634214 | 1179.478774 |
| Lead | 7355.0 | 49.557652 | 23.127495 | 0.0 | 31.335271 | 49.558321 | 67.46557 | 99.352332 |
| Pesticides | 7355.0 | 273.021943 | 2391.26052 | 0.0 | 0.0 | 0.0 | 0.153506 | 80811.08945 |
| Tox. Release | 7355.0 | 1627.293615 | 3671.554116 | 0.0 | 122.445833 | 484.488345 | 1673.394786 | 96985.62996 |
| Traffic | 7355.0 | 1131.298055 | 995.684541 | 20.748148 | 571.318021 | 887.573098 | 1397.21309 | 45752.0 |
| Cleanup Sites | 7355.0 | 8.545942 | 16.166135 | 0.0 | 0.0 | 2.0 | 10.7 | 300.95 |
| Groundwater Threats | 7355.0 | 16.476621 | 32.769498 | 0.0 | 0.3 | 6.0 | 18.75 | 673.75 |


In [83]:
# Save cleaned subset for modeling/EDA use
output_path = "../data/cvd_cleaned_environmental_social.csv"
df_selected.to_csv(output_path, index=False)
print(f"Cleaned dataset saved to: {output_path}")

Cleaned dataset saved to: ../data/cvd_cleaned_environmental_social.csv
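
A quick round-trip check can confirm that the saved file reloads with the same shape and column names. The sketch below writes a toy frame to a temporary directory rather than to `../data/`, so it is safe to run anywhere. The file name `roundtrip.csv` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for `df_selected`
toy = pd.DataFrame({
    "PM2.5": [13.906348, 11.884085],
    "Cardiovascular Disease": [21.47, 20.26],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.csv")  # placeholder file name
    toy.to_csv(path, index=False)                  # same call pattern as above
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```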


In [93]:
# Print initial rows of cleaned data
df_selected.head(5)

| | Ozone | PM2.5 | Diesel PM | Drinking Water | Lead | Pesticides | Tox. Release | Traffic | Cleanup Sites | Groundwater Threats | ... | Imp. Water Bodies | Solid Waste | Education | Linguistic Isolation | Poverty | Unemployment | Housing Burden | Cardiovascular Disease | Asthma | Low Birth Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.060311 | 13.906348 | 1.122712 | 733.946935 | 89.600854 | 1.001925 | 4859.094604 | 1037.095744 | 70.5 | 54.25 | ... | 0 | 6.0 | 44.5 | 16.0 | 76.0 | 12.8 | 30.3 | 21.47 | 129.54 | 7.8 |
| 1 | 0.045884 | 11.884085 | 0.538105 | 389.846569 | 77.302272 | 63.132574 | 519.628001 | 856.395935 | 61.9 | 78.6 | ... | 13 | 9.25 | 46.4 | 29.7 | 73.2 | 19.8 | 31.2 | 20.26 | 105.88 | 6.88 |
| 2 | 0.04792 | 12.25164 | 0.780833 | 787.940335 | 92.56366 | 0.0 | 3682.693278 | 2522.622269 | 38.75 | 20.5 | ... | 7 | 4.85 | 52.2 | 17.1 | 62.6 | 6.4 | 20.3 | 20.87 | 76.1 | 7.11 |
| 3 | 0.060311 | 13.520939 | 0.173815 | 733.946935 | 68.385084 | 44.574874 | 1630.342707 | 690.502159 | 16.5 | 9.5 | ... | 0 | 5.75 | 41.4 | 15.7 | 65.7 | 15.7 | 35.4 | 22.68 | 139.45 | 10.65 |
| 4 | 0.060311 | 13.818959 | 1.389658 | 733.946935 | 75.414535 | 16.625496 | 1975.207988 | 909.650882 | 10.5 | 28.25 | ... | 0 | 0.0 | 43.6 | 20.0 | 72.7 | 13.7 | 32.7 | 21.64 | 139.08 | 10.25 |
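
With the cleaned subset in place, a natural next step toward the second goal is to rank features by their correlation with the CVD column. The sketch below is a minimal version of that idea. The helper `rank_by_correlation` is a hypothetical name, and the toy frame reuses three columns from the five rows printed above purely to make the example runnable. Five rows are far too few to support any conclusion.

```python
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str, method: str = "pearson") -> pd.Series:
    """Rank numeric columns by absolute correlation with `target` (signs preserved)."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr(method=method)[target].drop(labels=target)
    # Order by strength of association, strongest first
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Toy frame built from the five rows shown above (illustration only)
toy = pd.DataFrame({
    "PM2.5": [13.906348, 11.884085, 12.25164, 13.520939, 13.818959],
    "Poverty": [76.0, 73.2, 62.6, 65.7, 72.7],
    "Unemployment": [12.8, 19.8, 6.4, 15.7, 13.7],
    "Cardiovascular Disease": [21.47, 20.26, 20.87, 22.68, 22.64],
})

ranked = rank_by_correlation(toy, target="Cardiovascular Disease")
print(ranked)
```

On the full dataset the call would be `rank_by_correlation(df_selected, "Cardiovascular Disease")`. Given the skewed distributions in the summary table, passing `method="spearman"` may be worth comparing against the default.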
