# Project 3

### Aileen Yang cy2830

**Dataset(s) to be used:** <br>
1. Counts and rates of asthma emergency department visits by zip code, California
https://healthdata.gov/State/Asthma-Emergency-Department-Visit-Rates/28nb-65xq/about_data 

2. California FIRE Damage Inspection Data
https://hub-calfire-forestry.hub.arcgis.com/datasets/cal-fire-damage-inspection-dins-data/about <br>


**Analysis question:** Do California areas that experience more wildfire activity have higher rates of asthma emergency department visits in children in the same year? Do these associations differ between children and adults?

**Columns that will (likely) be used:**

From asthma dataset:
- Year
- Zip_Code
- County
- Age_Group
- Number_of_Asthma_ED_Visits
- Age_Adjusted_Rate_of_Asthma_ED_V

From wildfire dataset:
- OBJECTID (unique structure record)
- Incident Name
- Incident Start Date
- County

**Columns to be used to merge/join them:**
  - Asthma: Year, County
  - Wildfire: Year (from Incident Start Date), County

**Hypothesis**: California counties with more wildfire activity in a given year will have higher age-adjusted rates of asthma ED visits, particularly among children. 

In [80]:
import pandas as pd
import numpy as np
import plotly.express as px

**Import & Cleansing of Asthma Data**

In [81]:
asthma = pd.read_csv('asthmaedvisitrates-by-zipcode.csv')
asthma.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19190 entries, 0 to 19189
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Year                              19190 non-null  int64  
 1   Zip_Code                          19190 non-null  int64  
 2   County                            19190 non-null  object 
 3   Age_Group                         19190 non-null  object 
 4   Number_of_Asthma_ED_Visits        19190 non-null  int64  
 5   Age_Adjusted_Rate_of_Asthma_ED_V  19190 non-null  float64
dtypes: float64(1), int64(3), object(2)
memory usage: 899.7+ KB


In [82]:
# Rename columns to readable labels in a single call
asthma = asthma.rename(columns={
    'Zip_Code': 'Zip Code',
    'Age_Adjusted_Rate_of_Asthma_ED_V': 'Age-adjusted Rate',
    'Number_of_Asthma_ED_Visits': 'Number of ED visits',
})

In [83]:
asthma["County"] = asthma["County"].astype(str).str.strip()
asthma = asthma.dropna(subset=["Year", "County", "Age_Group", "Age-adjusted Rate"]).copy()

I then aggregated the records by year, county, and age group (child and adult); this table will be used later for comparison.

In [84]:
# Aggregate ED visits for each county by age
asthma_by_age = (
    asthma
    .groupby(["Year", "County", "Age_Group"])
    .agg(
        total_ed_visits   = ("Number of ED visits", "sum"),
        mean_age_adj_rate = ("Age-adjusted Rate", "mean"),
    )
    .reset_index()
)

asthma_by_age.head()


Unnamed: 0,Year,County,Age_Group,total_ed_visits,mean_age_adj_rate
0,2013,Alameda,Adult,6653,55.506522
1,2013,Alameda,Child,3213,100.317949
2,2013,Amador,Adult,195,78.233333
3,2013,Amador,Child,12,64.4
4,2013,Butte,Adult,648,46.1


**Importing Wildfire Dataset**

In [85]:
wildfire = pd.read_csv('Cali_Fire.csv')
wildfire.head()


Columns (12,36,37) have mixed types. Specify dtype option on import or set low_memory=False.



Unnamed: 0,OBJECTID,* Damage,* Street Number,* Street Name,"* Street Type (e.g. road, drive, lane, etc.)","Street Suffix (e.g. apt. 23, blding C)",* City,State,Zip Code,* CAL FIRE Unit,...,Fire Name (Secondary),APN (parcel),Assessed Improved Value (parcel),Year Built (parcel),Site Address (parcel),GLOBALID,Latitude,Longitude,x,y
0,1,No Damage,8376.0,Quail Canyon,Road,,Winters,CA,,LNU,...,Quail,101090290,510000.0,1997.0,8376 QUAIL CANYON RD VACAVILLE CA 95688,e1919a06-b4c6-476d-99e5-f0b45b070de8,38.47496,-122.044465,-13585930.0,4646741.0
1,2,Affected (1-9%),8402.0,Quail Canyon,Road,,Winters,CA,,LNU,...,Quail,101090270,573052.0,1980.0,8402 QUAIL CANYON RD VACAVILLE CA 95688,b090eeb6-5b18-421e-9723-af7c9144587c,38.477442,-122.043252,-13585790.0,4647094.0
2,3,No Damage,8430.0,Quail Canyon,Road,,Winters,CA,,LNU,...,Quail,101090310,350151.0,2004.0,8430 QUAIL CANYON RD VACAVILLE CA 95688,268da70b-753f-46aa-8fb1-327099337395,38.479358,-122.044585,-13585940.0,4647366.0
3,4,No Damage,3838.0,Putah Creek,Road,,Winters,CA,,LNU,...,Quail,103010240,134880.0,1981.0,3838 PUTAH CREEK RD WINTERS CA 95694,64d4a278-5ee9-414a-8bf4-247c5b5c60f9,38.487313,-122.015115,-13582660.0,4648497.0
4,5,No Damage,3830.0,Putah Creek,Road,,Winters,CA,,LNU,...,Quail,103010220,346648.0,1980.0,3830 PUTAH CREEK RD WINTERS CA 95694,1b44b214-01fd-4f06-b764-eb42a1ec93d7,38.485636,-122.016122,-13582770.0,4648259.0
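The mixed-types warning above comes from pandas inferring column types chunk by chunk. A minimal sketch of the two usual ways to avoid it follows. The toy CSV and its column names are stand-ins, since the notebook does not show which columns sit at positions 12, 36, and 37.

```python
import io
import pandas as pd

# Toy CSV standing in for Cali_Fire.csv; these column names are placeholders,
# not necessarily the columns the warning refers to.
csv_text = "OBJECTID,Zip Code,APN (parcel)\n1,95688,101090290\n2,,A-17\n"

# Option 1: read the whole file before inferring each column's dtype.
df_full = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the problem columns as strings up front.
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Zip Code": "string", "APN (parcel)": "string"},
)
print(df_typed.dtypes)
```

Reading identifiers such as ZIP codes and parcel numbers as strings also avoids them being turned into floats when some rows are empty.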


In [86]:
wildfire.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130722 entries, 0 to 130721
Data columns (total 46 columns):
 #   Column                                                        Non-Null Count   Dtype  
---  ------                                                        --------------   -----  
 0   OBJECTID                                                      130722 non-null  int64  
 1   * Damage                                                      130722 non-null  object 
 2   * Street Number                                               126302 non-null  float64
 3   * Street Name                                                 125236 non-null  object 
 4   * Street Type (e.g. road, drive, lane, etc.)                  116260 non-null  object 
 5   Street Suffix (e.g. apt. 23, blding C)                        62017 non-null   object 
 6   * City                                                        98991 non-null   object 
 7   State                                                   

This dataset has 46 columns in total. For clarity, only the columns needed for the analysis will be kept. A new column, 'Year', will be derived from the incident start date to match the year in the asthma dataset.

In [87]:
# Drop noise columns
wildfire = wildfire.rename(columns = {'* Incident Name':'Incident Name'})
wildfire = wildfire[["OBJECTID", "Incident Name", "Incident Start Date", "County"]].copy()

# Extract year to match the format of asthma years
wildfire["Incident Start Date"] = pd.to_datetime(
    wildfire["Incident Start Date"], errors="coerce"
)
wildfire = wildfire.dropna(subset=["Incident Start Date"])
wildfire["Year"] = wildfire["Incident Start Date"].dt.year

# Clean County column
wildfire["County"] = wildfire["County"].astype(str).str.strip().str.title()

wildfire.head()


Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.



Unnamed: 0,OBJECTID,Incident Name,Incident Start Date,County,Year
0,1,Quail,2020-06-06,Solano,2020
1,2,Quail,2020-06-06,Solano,2020
2,3,Quail,2020-06-06,Solano,2020
3,4,Quail,2020-06-06,Solano,2020
4,5,Quail,2020-06-06,Solano,2020
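The format-inference warning above can be silenced by passing an explicit `format` to `pd.to_datetime`. The sketch below uses made-up raw values and an assumed format string, because the notebook does not show how the dates look in the original file; the format would need to be adjusted to match it.

```python
import pandas as pd

# Hypothetical raw values; the real layout of "Incident Start Date" is not shown.
raw = pd.Series(["06/06/2020 12:00:00 AM", "08/17/2020 12:00:00 AM", "not a date"])

# ASSUMED format string: change it to match the actual file.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Unparseable rows become NaT; a nullable integer dtype keeps them as <NA>.
years = parsed.dt.year.astype("Int64")
print(years.tolist())
```

Counting `parsed.isna().sum()` before dropping rows shows how many records are lost to unparseable dates.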


In [None]:
# Aggregate by Year and County so each county-year appears only once
wildfire = (
    wildfire
    .groupby(["Year", "County"])
    .agg(
        wildfire_records=("OBJECTID", "count"),
        wildfire_incidents=("Incident Name", "nunique")
    )
    .reset_index()
)

wildfire.head()

Unnamed: 0,Year,County,wildfire_records,wildfire_incidents
0,2013,Riverside,58,1
1,2013,Shasta,222,1
2,2014,Mendocino,11,1
3,2014,San Diego,49,1
4,2014,Siskiyou,250,1


**Merge the two Datasets**

In [89]:
merged = asthma.merge(
    wildfire,
    how="left",
    on=["Year", "County"]
)

# County-years with no wildfire records get a count of 0
merged["wildfire_records"] = merged["wildfire_records"].fillna(0).astype(int)
merged["wildfire_incidents"] = merged["wildfire_incidents"].fillna(0).astype(int)
merged.head(10)


Unnamed: 0,Year,Zip Code,County,Age_Group,Number of ED visits,Age-adjusted Rate,wildfire_records,wildfire_incidents
0,2013,90001,Los Angeles,Child,201,100.1,0,0
1,2013,90002,Los Angeles,Child,210,111.8,0,0
2,2013,90003,Los Angeles,Child,307,130.1,0,0
3,2013,90004,Los Angeles,Child,169,129.4,0,0
4,2013,90005,Los Angeles,Child,52,64.0,0,0
5,2013,90006,Los Angeles,Child,119,81.9,0,0
6,2013,90007,Los Angeles,Child,88,126.2,0,0
7,2013,90008,Los Angeles,Child,115,157.3,0,0
8,2013,90011,Los Angeles,Child,371,99.4,0,0
9,2013,90012,Los Angeles,Child,25,69.5,0,0
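A left join like the one above can silently duplicate rows if the right-hand table has repeated keys, and it hides how many rows found no match. The sketch below shows how the `validate` and `indicator` options of `DataFrame.merge` can guard against both. The toy tables and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the cleaned asthma and aggregated wildfire tables.
asthma_toy = pd.DataFrame({
    "Year": [2020, 2020, 2020],
    "County": ["Solano", "Solano", "Alameda"],
    "Age-adjusted Rate": [40.0, 55.0, 60.0],
})
wildfire_toy = pd.DataFrame({
    "Year": [2020],
    "County": ["Solano"],
    "wildfire_records": [5],
})

checked = asthma_toy.merge(
    wildfire_toy,
    how="left",
    on=["Year", "County"],
    validate="many_to_one",  # raises MergeError if wildfire has duplicate keys
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
checked["wildfire_records"] = checked["wildfire_records"].fillna(0).astype(int)
print(checked["_merge"].value_counts())
```

A large `left_only` count for counties known to have had fires would suggest a spelling or capitalization mismatch in the `County` values rather than a true absence of fire records.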


Wildfire exposure will be categorized into four levels, based on the number of wildfire structure records in the county for that year:
1. No wildfire: the county has no wildfire records in that year
2. Low wildfire: the county has 1–19 wildfire records in that year
3. Moderate wildfire: the county has 20–99 wildfire records in that year
4. High wildfire: the county has 100 or more wildfire records in that year

In [90]:
# Categorize exposures according to the occurrence
def exposure_category(n):
    if n == 0:
        return "No wildfire"
    elif n < 20:
        return "Low wildfire (1–19)"
    elif n < 100:
        return "Moderate wildfire (20–99)"
    else:
        return "High wildfire (100+)"
        
merged["exposure_cat"] = merged["wildfire_records"].apply(exposure_category)
merged["exposure_cat"].value_counts()


exposure_cat
No wildfire                  12755
High wildfire (100+)          3823
Low wildfire (1–19)           1620
Moderate wildfire (20–99)      992
Name: count, dtype: int64
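As an alternative to the row-wise `apply`, the same thresholds can be expressed with `pd.cut`, which is vectorized and returns an ordered categorical. This is a sketch on a few hand-picked boundary values, not a replacement the notebook itself uses.

```python
import numpy as np
import pandas as pd

labels = [
    "No wildfire",
    "Low wildfire (1–19)",
    "Moderate wildfire (20–99)",
    "High wildfire (100+)",
]

# Right-closed bins: (-1, 0], (0, 19], (19, 99], (99, inf]
bins = [-1, 0, 19, 99, np.inf]

# Boundary values chosen to exercise every edge of the bins.
records = pd.Series([0, 1, 19, 20, 99, 100, 2500], name="wildfire_records")
exposure = pd.cut(records, bins=bins, labels=labels)
print(exposure.tolist())
```

Because the result is an ordered categorical, sorting and plot legends follow the No, Low, Moderate, High order instead of alphabetical order.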

**Summary of age-adjusted rate and number of ED visits by wildfire exposure category**

In [96]:
overall_summary = (
    merged
    .groupby("exposure_cat")
    .agg(
        counties       = ("County", "nunique"),
        mean_rate      = ("Age-adjusted Rate", "mean"),
        median_rate    = ("Age-adjusted Rate", "median"),
        mean_total_ed  = ("Number of ED visits", "mean"),
    )
    .reset_index()
    .sort_values("mean_rate", ascending=True)
)

overall_summary

Unnamed: 0,exposure_cat,counties,mean_rate,median_rate,mean_total_ed
2,Moderate wildfire (20–99),25,41.287802,34.9,68.146169
0,High wildfire (100+),35,41.388909,33.2,73.29244
1,Low wildfire (1–19),36,46.787222,37.8,67.04321
3,No wildfire,56,52.655123,42.3,80.704978
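Note that `merged` was built from the ZIP-code-level `asthma` table, so the means above are taken over ZIP-code rows and counties with many ZIP codes weigh more. If a county-level comparison is what is intended (an assumption on my part), one option is to merge the aggregated `asthma_by_age` table instead. The sketch below uses toy tables with made-up values, and the `any_wildfire` flag is a hypothetical helper column.

```python
import pandas as pd

# Toy stand-ins shaped like asthma_by_age and the aggregated wildfire table.
asthma_by_age_toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "County": ["Solano", "Solano", "Alameda", "Alameda"],
    "Age_Group": ["Adult", "Child", "Adult", "Child"],
    "mean_age_adj_rate": [40.0, 62.0, 50.0, 90.0],
})
wildfire_toy = pd.DataFrame(
    {"Year": [2020], "County": ["Solano"], "wildfire_records": [120]}
)

county_level = asthma_by_age_toy.merge(wildfire_toy, how="left", on=["Year", "County"])
county_level["wildfire_records"] = county_level["wildfire_records"].fillna(0).astype(int)
county_level["any_wildfire"] = county_level["wildfire_records"] > 0

# One row per Year x County x Age_Group, so each county counts once per age group.
summary = (
    county_level
    .groupby("any_wildfire")
    .agg(counties=("County", "nunique"), mean_rate=("mean_age_adj_rate", "mean"))
    .reset_index()
)
print(summary)
```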


In [100]:
plot_df1 = merged[merged["exposure_cat"] != "No wildfire"]

fig = px.scatter(
    plot_df1,
    x="wildfire_records",
    y="Age-adjusted Rate",
    color="exposure_cat",
    hover_data=["County", "Year", "Number of ED visits"],
    title="County-Level Asthma ED Rate vs Wildfire Activity (Excluding No Wildfire Counties)",
    labels={
        "wildfire_records": "Wildfire structure records",
        "Age-adjusted Rate": "Age-adjusted asthma ED rate"
    }
)
fig.show()


In [102]:
trend = (
    merged
    .groupby(["Year", "exposure_cat"])["Age-adjusted Rate"].mean()                              
    .reset_index(name="mean_rate")   
)

trend = trend[trend["exposure_cat"] != "No wildfire"]

fig = px.line(
    trend,
    x="Year",
    y="mean_rate",
    color="exposure_cat",
    markers=True,
    title="Asthma ED Rates Over Time by Wildfire Exposure (County-level)",
)
fig.show()

## Child vs. Adults

Let's explore whether children are more affected by wildfire than adults.

In [103]:
# Keep only Child and Adult records
age_filtered = merged[merged["Age_Group"].isin(["Child", "Adult"])].copy()


In [107]:
plot_df = age_filtered[age_filtered["exposure_cat"] != "No wildfire"]
fig = px.scatter(
    plot_df,
    x="wildfire_records",
    y="Age-adjusted Rate",
    color="exposure_cat",
    facet_col="Age_Group",
    hover_data=["County", "Year", "Number of ED visits"],
    title="Asthma ED Rate vs Wildfire Activity, by Age Group (County × Year)",
    labels={
        "wildfire_records": "Wildfire structure records (CAL FIRE, County × Year)",
        "Age-adjusted Rate": "Mean age-adjusted asthma ED rate"
    }
)

fig.show()

In [109]:
trend_by_age = (
    age_filtered
    .groupby(["Year", "Age_Group", "exposure_cat"])
    .agg(mean_rate=("Age-adjusted Rate", "mean"))
    .reset_index()
)
trend_df = trend_by_age[trend_by_age["exposure_cat"] != "No wildfire"]

fig = px.line(
    trend_df,
    x="Year",
    y="mean_rate",
    color="exposure_cat",
    facet_row="Age_Group",
    markers=True,
    title="Asthma ED Rates Over Time by Wildfire Exposure and Age Group",
    labels={"mean_rate": "Mean age-adjusted asthma ED rate"}
)

fig.show()


Across the study period, children consistently have higher asthma ED visit rates than adults in every wildfire exposure category. Moreover, among children, the gap between ‘High wildfire’ and ‘No wildfire’ counties widens during years with particularly intense fire seasons in California.
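A statement like "children have higher rates than adults in every exposure category" can be checked directly with a small pivot rather than read off the plots. The sketch below runs on toy rows shaped like `age_filtered`; the rates are made up, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Toy rows shaped like age_filtered; values are illustrative only.
toy = pd.DataFrame({
    "Age_Group": ["Child", "Adult", "Child", "Adult", "Child", "Adult"],
    "exposure_cat": [
        "No wildfire", "No wildfire",
        "Low wildfire (1–19)", "Low wildfire (1–19)",
        "High wildfire (100+)", "High wildfire (100+)",
    ],
    "Age-adjusted Rate": [70.0, 45.0, 66.0, 41.0, 58.0, 39.0],
})

# Mean rate per exposure category, with one column per age group.
by_cat = (
    toy
    .groupby(["exposure_cat", "Age_Group"])["Age-adjusted Rate"]
    .mean()
    .unstack("Age_Group")
)
by_cat["child_minus_adult"] = by_cat["Child"] - by_cat["Adult"]
print(by_cat)

# True only if the child rate exceeds the adult rate in every category.
child_higher_everywhere = bool((by_cat["child_minus_adult"] > 0).all())
```

Adding `"Year"` to the `groupby` keys would give the same comparison per year, which is one way to locate the years in which the gap between exposure categories is widest.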