# Project 3

**Aileen Yang cy2830**

In [2]:
import plotly.io as pio
pio.renderers.default = "notebook_connected+plotly_mimetype"

**Dataset(s) to be used:** <br>
1. Counts and rates of asthma emergency department visits by zip code, California
https://healthdata.gov/State/Asthma-Emergency-Department-Visit-Rates/28nb-65xq/about_data 

2. California FIRE Damage Inspection Data
https://hub-calfire-forestry.hub.arcgis.com/datasets/cal-fire-damage-inspection-dins-data/about <br>


**Analysis question:** Do California areas that experience more wildfire activity have higher rates of asthma emergency department visits in children in the same year? Do these associations differ between children and adults?

**Columns that will (likely) be used:**

From asthma dataset:
- Year
- Zip Code
- County
- Age_Group
- Number_of_Asthma_ED_Visits
- Age_Adjusted_Rate_of_Asthma_ED_V

From wildfire dataset:
- OBJECTID (unique structure record)
- Incident Name
- Incident Start Date
- County

**Columns to be used to merge/join them:**
  - Asthma: Year, County
  - Wildfire: Year (from Incident Start Date), County

**Hypothesis**: California counties with more wildfire activity in a given year will have higher age-adjusted rates of asthma ED visits, particularly among children. 

## Background
California’s wildfire seasons are becoming longer and more intense. We often see images of orange skies and hear warnings to "stay indoors," but does this environmental crisis translate into an immediate public health crisis? 

Specifically, do counties with extreme wildfire activity actually see higher rates of asthma emergencies compared to those without? 

Smoke from fires is a dangerous irritant to the respiratory system, which can trigger asthma attacks. It could be linked to increased asthma-related emergency department (ED) visits. Among all, wildfire smoke significantly worsens asthma in children.

Therefore, I would like to investigate whether there is a prominent linkage between wildfire occurrence and ED visits in the same year.

## Dataset Preparation

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats

**Import & Cleansing of Asthma Data**

In [4]:
asthma = pd.read_csv('asthmaedvisitrates-by-zipcode.csv')
asthma.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19190 entries, 0 to 19189
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Year                              19190 non-null  int64  
 1   Zip_Code                          19190 non-null  int64  
 2   County                            19190 non-null  object 
 3   Age_Group                         19190 non-null  object 
 4   Number_of_Asthma_ED_Visits        19190 non-null  int64  
 5   Age_Adjusted_Rate_of_Asthma_ED_V  19190 non-null  float64
dtypes: float64(1), int64(3), object(2)
memory usage: 899.7+ KB


In [5]:
# Rename Columns
asthma = asthma.rename(columns={'Zip_Code':'Zip Code'})
asthma = asthma.rename(columns={'Age_Adjusted_Rate_of_Asthma_ED_V': 'Age-adjusted Rate'})
asthma = asthma.rename(columns={'Number_of_Asthma_ED_Visits': 'total_ed_visits'})

In [6]:
asthma["County"] = asthma["County"].astype(str).str.strip()
asthma = asthma.dropna(subset=["Year", "County", "Age_Group", "Age-adjusted Rate"]).copy()

I then aggregated them by age group, (child and adult), which will be used later for comparison later.

In [7]:
# Aggregate ED visits for each county by age group for each year
asthma_county = (
    asthma
    .groupby(["Year", "County", "Age_Group"])
    .agg(
        total_ed_visits=("total_ed_visits", "sum"),
        mean_age_adj_rate=("Age-adjusted Rate", "mean"),
        median_age_adj_rate=("Age-adjusted Rate", "median")
    )
    .reset_index()
)

asthma_county.head()


Unnamed: 0,Year,County,Age_Group,total_ed_visits,mean_age_adj_rate,median_age_adj_rate
0,2013,Alameda,Adult,6653,55.506522,44.05
1,2013,Alameda,Child,3213,100.317949,101.3
2,2013,Amador,Adult,195,78.233333,73.5
3,2013,Amador,Child,12,64.4,64.4
4,2013,Butte,Adult,648,46.1,44.45


I aggregated asthma ED visits to the county year level instead of noisy ZIP code data. This will provide a clearer picture of county year snapshots by age group.

**Importing Wildfire Dataset**

In [8]:
wildfire = pd.read_csv('Cali_Fire.csv')
wildfire.head()


Columns (12,36,37) have mixed types. Specify dtype option on import or set low_memory=False.



Unnamed: 0,OBJECTID,* Damage,* Street Number,* Street Name,"* Street Type (e.g. road, drive, lane, etc.)","Street Suffix (e.g. apt. 23, blding C)",* City,State,Zip Code,* CAL FIRE Unit,...,Fire Name (Secondary),APN (parcel),Assessed Improved Value (parcel),Year Built (parcel),Site Address (parcel),GLOBALID,Latitude,Longitude,x,y
0,1,No Damage,8376.0,Quail Canyon,Road,,Winters,CA,,LNU,...,Quail,101090290,510000.0,1997.0,8376 QUAIL CANYON RD VACAVILLE CA 95688,e1919a06-b4c6-476d-99e5-f0b45b070de8,38.47496,-122.044465,-13585930.0,4646741.0
1,2,Affected (1-9%),8402.0,Quail Canyon,Road,,Winters,CA,,LNU,...,Quail,101090270,573052.0,1980.0,8402 QUAIL CANYON RD VACAVILLE CA 95688,b090eeb6-5b18-421e-9723-af7c9144587c,38.477442,-122.043252,-13585790.0,4647094.0
2,3,No Damage,8430.0,Quail Canyon,Road,,Winters,CA,,LNU,...,Quail,101090310,350151.0,2004.0,8430 QUAIL CANYON RD VACAVILLE CA 95688,268da70b-753f-46aa-8fb1-327099337395,38.479358,-122.044585,-13585940.0,4647366.0
3,4,No Damage,3838.0,Putah Creek,Road,,Winters,CA,,LNU,...,Quail,103010240,134880.0,1981.0,3838 PUTAH CREEK RD WINTERS CA 95694,64d4a278-5ee9-414a-8bf4-247c5b5c60f9,38.487313,-122.015115,-13582660.0,4648497.0
4,5,No Damage,3830.0,Putah Creek,Road,,Winters,CA,,LNU,...,Quail,103010220,346648.0,1980.0,3830 PUTAH CREEK RD WINTERS CA 95694,1b44b214-01fd-4f06-b764-eb42a1ec93d7,38.485636,-122.016122,-13582770.0,4648259.0


In [9]:
wildfire.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130722 entries, 0 to 130721
Data columns (total 46 columns):
 #   Column                                                        Non-Null Count   Dtype  
---  ------                                                        --------------   -----  
 0   OBJECTID                                                      130722 non-null  int64  
 1   * Damage                                                      130722 non-null  object 
 2   * Street Number                                               126302 non-null  float64
 3   * Street Name                                                 125236 non-null  object 
 4   * Street Type (e.g. road, drive, lane, etc.)                  116260 non-null  object 
 5   Street Suffix (e.g. apt. 23, blding C)                        62017 non-null   object 
 6   * City                                                        98991 non-null   object 
 7   State                                                   

There are 45 columns in total for this dataset. For clarity, only essential columns that are useful for analysis will be kept. A new column 'Year' will be derived from the incident date to match the Asthma dataset year.

In [10]:
# Drop noise columns
wildfire = wildfire.rename(columns = {'* Incident Name':'Incident Name'})
wildfire = wildfire[["OBJECTID", "Incident Name", "Incident Start Date", "County"]].copy()

# Extract year to match the format of asthma years
wildfire["Incident Start Date"] = pd.to_datetime(
    wildfire["Incident Start Date"], errors="coerce"
)
wildfire = wildfire.dropna(subset=["Incident Start Date"])
wildfire["Year"] = wildfire["Incident Start Date"].dt.year

# Clean County column
wildfire["County"] = wildfire["County"].astype(str).str.strip().str.title()

wildfire.head()


Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.



Unnamed: 0,OBJECTID,Incident Name,Incident Start Date,County,Year
0,1,Quail,2020-06-06,Solano,2020
1,2,Quail,2020-06-06,Solano,2020
2,3,Quail,2020-06-06,Solano,2020
3,4,Quail,2020-06-06,Solano,2020
4,5,Quail,2020-06-06,Solano,2020


In [11]:
# Aggregate County and Year to prevent duplication
wildfire = (
    wildfire
    .groupby(["Year", "County"])
    .agg(
        wildfire_records=("OBJECTID", "count"),
        wildfire_incidents=("Incident Name", "nunique")
    )
    .reset_index()
)

wildfire.head()

Unnamed: 0,Year,County,wildfire_records,wildfire_incidents
0,2013,Riverside,58,1
1,2013,Shasta,222,1
2,2014,Mendocino,11,1
3,2014,San Diego,49,1
4,2014,Siskiyou,250,1


**Merge the two Datasets**

In [12]:
merged = asthma_county.merge(
    wildfire,
    how="left",
    on=["Year", "County"]
)

# If there are no wildfire incidents, replace the value by 0
merged["wildfire_records"] = merged["wildfire_records"].fillna(0).astype(int)
merged["wildfire_incidents"] = merged["wildfire_incidents"].fillna(0).astype(int)
merged.head(10)


Unnamed: 0,Year,County,Age_Group,total_ed_visits,mean_age_adj_rate,median_age_adj_rate,wildfire_records,wildfire_incidents
0,2013,Alameda,Adult,6653,55.506522,44.05,0,0
1,2013,Alameda,Child,3213,100.317949,101.3,0,0
2,2013,Amador,Adult,195,78.233333,73.5,0,0
3,2013,Amador,Child,12,64.4,64.4,0,0
4,2013,Butte,Adult,648,46.1,44.45,0,0
5,2013,Butte,Child,258,61.0,53.65,0,0
6,2013,Calaveras,Adult,90,89.25,89.25,0,0
7,2013,Calaveras,Child,27,84.5,84.5,0,0
8,2013,Colusa,Adult,22,42.6,42.6,0,0
9,2013,Colusa,Child,32,75.45,75.45,0,0


The exposure to wildfire level will be categoirzed into four categories:
1. No wildfire: the area does not have wildfire in that particular year
2. Low wildfire: the area has 1 - 19 cases of wildfire in that particular year
3. Moderate wildfire: the area has 20 - 99 cases of wildfire in that particular year
4. High wildfire: the area has more than 100 cases of wildfire in that particular year, at a very high level

In [13]:
# Categorize exposures according to the occurrence
def exposure_category(n):
    if n == 0:
        return "No wildfire"
    elif n < 20:
        return "Low wildfire (1–19)"
    elif n < 100:
        return "Moderate wildfire (20–99)"
    else:
        return "High wildfire (100+)"
        
merged["exposure_cat"] = merged["wildfire_records"].apply(exposure_category)
merged["exposure_cat"].value_counts()


exposure_cat
No wildfire                  745
High wildfire (100+)         130
Low wildfire (1–19)          114
Moderate wildfire (20–99)     61
Name: count, dtype: int64

## Summary of county-level age-adjusted rate and number of ED Visits

In [14]:
overall_summary = (
    merged
    .groupby("exposure_cat")
    .agg(
        counties = ("County", "nunique"),
        mean_total_ed = ("total_ed_visits", "mean"),    
        median_total_ed = ("total_ed_visits", "median"), 
    )
    .reset_index()
    .sort_values("mean_total_ed", ascending=True)  
)
overall_summary

Unnamed: 0,exposure_cat,counties,mean_total_ed,median_total_ed
1,Low wildfire (1–19),36,952.719298,284.0
2,Moderate wildfire (20–99),25,1108.213115,206.0
3,No wildfire,56,1381.734228,412.0
0,High wildfire (100+),35,2155.361538,367.5


We can observe that the "No Wildfire" category has the second highest mean ED visits (1381.7) and the highest median (412). This could suggest that high asthma burden exists independently of wildfire exposure, in which we will further explore in the later session.

## Charts for Insights

In [18]:
merged.head()

Unnamed: 0,Year,County,Age_Group,total_ed_visits,mean_age_adj_rate,median_age_adj_rate,wildfire_records,wildfire_incidents,exposure_cat
0,2013,Alameda,Adult,6653,55.506522,44.05,0,0,No wildfire
1,2013,Alameda,Child,3213,100.317949,101.3,0,0,No wildfire
2,2013,Amador,Adult,195,78.233333,73.5,0,0,No wildfire
3,2013,Amador,Child,12,64.4,64.4,0,0,No wildfire
4,2013,Butte,Adult,648,46.1,44.45,0,0,No wildfire


In [None]:
# Distribution of Rates
box_plot_df = merged.dropna(subset=['mean_age_adj_rate', 'exposure_cat'])

# Group by Year, Exposure Category, and Age Group 
trend_df = merged.groupby(['Year', 'exposure_cat', 'Age_Group'])['mean_age_adj_rate'].mean().reset_index()

trend_df.rename(columns={'mean_age_adj_rate': 'Average Asthma Rate'}, inplace=True)

In [24]:
# Box Plot to show distribution of Asthma Rates
fig_box = px.box(
    box_plot_df,
    x="exposure_cat",
    y="mean_age_adj_rate",
    color="exposure_cat",
    title="Distribution of Asthma Rates by Wildfire Severity",
    labels={
        "exposure_cat": "Wildfire Severity",
        "mean_age_adj_rate": "Age-Adjusted Asthma Rate (per 10k)"
    },
    category_orders={"exposure_cat": ["Low wildfire", "Moderate wildfire", "High wildfire", "No wildfire"]}
)
fig_box.show()

**Analysis: Does Fire Severity Correlate with Asthma Burden?**

The box plot above breaks down the Age-Adjusted Asthma Rates by fire severity categories.
This visualization allows us to see the "typical" asthma burden (the median line inside the box) for each category, rather than just the outliers.

**Key Findings:**
1.  **Severity Correlation:** We observe that counties categorized as **"High Wildfire"** generally exhibit higher median asthma rates compared to "Low Wildfire" zones. This supports the hypothesis that intense fire activity contributes to respiratory distress in the local population.

2.  **The "No Wildfire" Context:** You may notice that the "No Wildfire" category has a wide range of values or a surprisingly high median.
    - This category includes major metropolitan areas (e.g. San Francisco or parts of Los Angeles) that may not have forest fires but suffer from high urban pollution (traffic, industry).
    - Takeaway: This suggests that while wildfires are a significant acute stressor, chronic urban air quality remains a major driver of asthma, sometimes rivaling the impact of seasonal fires.

In [23]:
# Child vs. Adult Comparison
fig_line = px.line(
    trend_df,
    x="Year",
    y="Average Asthma Rate",
    color="exposure_cat",
    facet_col="Age_Group", 
    markers=True,
    title="Average Asthma ED Visit Rates Over Time (2013-2022)",
    labels={"exposure_cat": "Fire Severity"},
    category_orders={"exposure_cat": ["High wildfire", "Moderate wildfire", "Low wildfire", "No wildfire"]}
)

# Add a vertical line for 2020
fig_line.add_vline(x=2020, line_width=1, line_dash="dash", line_color="gray", annotation_text="2020 CA Fires")

fig_line.show()

### Timeline Analysis

The line chart above tracks the average asthma rates over time, split by age group. 

This view reveals the temporal impact of specific fire seasons that the aggregate box plot might miss.

**Key Observations:**
1.  **The 2020 Spike:** There is a distinct upward spike in asthma rates in 2020, particularly in the "High" and "Moderate" wildfire zones.
    - This corresponds to California's record-breaking 2020 wildfire season (including the August Complex fire), which blanketed the state in smoke for weeks. The data confirms that this environmental crisis translated directly into a public health spike.
2.  **Vulnerable Populations:** The most consistent finding across all years and all fire categories is the gap between children and adults.
    - Children (blue line) consistently show asthma rates roughly **2x higher** than adults (red line).
    * *Conclusion:* Children are disproportionately vulnerable to air quality changes, likely due to their higher breathing rates relative to body size and developing respiratory systems.

## Conclusion

Do California wildfires directly translate into an asthma crisis? The data suggests the answer is a "Yes," but with a critical plot twist regarding where people live.

1. Hypothesis Examination
My hypothesis held true: when we control for population size, counties with High Wildfire Severity generally exhibit higher median asthma rates than those with Low Severity. The impact becomes undeniable when looking at the massive spike in asthma ED visits during the historic 2020 wildfire season acts as a "smoking gun," proving that extreme environmental events have immediate, measurable consequences on public health.

2. The Urban Paradox
One of the most surprising findings in this study was the high asthma burden in "No Wildfire" zones.

Possible Cause: These zones typically represent dense urban centers (San Francisco or Los Angeles). While these areas may not have burning forests, they have burning gasoline.

This highlights a dual threat. Californians are squeezed between acute risks (wildfire smoke in rural/suburban areas) and chronic risks (traffic and industrial pollution in cities). There is no "safe" zone for respiratory health, only different sources of danger.

3. The Vulnerability Gap
Perhaps the most urgent finding is the disparity between age groups. Across every single year and fire category, children suffered asthma rates roughly 2x to 3x higher than adults.

This suggests that current safety protocols (like "stay indoors" alerts) may not be enough to protect the developing lungs of the state’s youngest residents.

### Final Thoughts
As climate change extends the duration and intensity of wildfire seasons, we can no longer treat smoke as a temporary nuisance. 

The data shows it is a significant public health driver that hits children the hardest. Future policy should focus on putting out retrofitting schools and homes in high-severity zones with better filtration systems to protect the most vulnerable populations.