# Overview

## Context
California, distinguished by its contrast between thriving metropolitan areas and expansive natural beauty, has long contended with the challenge of wildfires. In recent years, these fires have grown not only in frequency but also in intensity and devastation. The impact is not only measured in acres burned or structures lost but now in the profound human toll. As of January 31, 2025, wildfires have killed at least 29 people, forced over 200k residents to evacuate, and destroyed more than 18k homes and structures, consuming 57k+ acres of land.

This project will explore the question, *“What patterns, clusters, and trends can be uncovered in California’s fire damage data when examining structural characteristics, geographic distributions, and incident details?”* Rather than focusing on predictive modeling, our exploratory analysis seeks to uncover underlying relationships and clusters within the data. By leveraging detailed records of structural damage, incident specifics, and geospatial coordinates, we aim to pinpoint the areas most heavily impacted by wildfire damage and to raise further questions about the contributing factors.

Our analysis is intended for California's community leaders, public safety advocates, and other stakeholders dedicated to strengthening local resilience. The insights we uncover are designed to guide strategic investments in fire safety and emergency preparedness, paving the way for proactive, data-driven measures to mitigate wildfire risks.


## Dataset
The California Wildfire Data dataset, provided by [Kaggle](https://www.kaggle.com/datasets/vijayveersingh/the-california-wildfire-data), comprises over 100k detailed records of wildfire incidents across the state. The dataset includes critical information on structural characteristics, incident specifics, damage assessments, and geospatial coordinates, offering a comprehensive view of wildfire impacts across diverse regions.


# Import Libraries and Data

In [65]:
#!pip install geopandas

In [66]:
import pandas as pd
import geopandas as gpd # type: ignore
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import DBSCAN
import numpy as np

In [67]:
# Read in the data
df1 = pd.read_csv('Wildfire Data.csv', dtype={'Battalion': str, 'Fire Name (Secondary)': str, 'APN (parcel)': str})
df2 = gpd.read_file('Postfire_Master_Data_Share.geojson')

# Data Joining and Pre-processing
Before starting with analysis, we prepared the dataset through the following steps:

**Cleaning DF1 (CSV Data):**  
  - Strip extraneous characters (things like asterisks and leading spaces) from column names  
  - Drop redundant ID columns and remove any text enclosed in parentheses
  - Rename key columns for clarity

**Standardizing DF2 (GeoJSON Data):**  
  - Remove the `OBJECTID` column
  - Apply a rename mapping to standardize column names so they align with DF1

**Merging Datasets:**  
  - Merge DF1 and DF2 on the common key `GLOBALID` using an inner join
  - For overlapping columns, fill missing DF1 values using corresponding DF2 data and drop duplicates 
  - Handle columns with similar but not identical names via a manual mapping, then remove duplicates

**Handling Missing Values & Data Transformation:**  
  - Identify and drop rows missing key visual attributes (features like `Vent Screen`, `Window Pane`, etc.) 
  - Drop non-informative columns
  - Fill high-missing categorical fields with 'Unknown'
  - Convert high-missing numerical count columns to numeric and impute missing values with 0
  - For distance measurements and additional numeric fields (such as `Assessed Improved Value` & `Year Built`), convert to numeric and fill missing values with the median 
  - Impute any remaining categorical fields
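
The cleaning and merging steps above can be sketched on a toy pair of frames. This is a minimal illustration, not the notebook's actual pipeline: the rows and values are made up, and `validate="one_to_one"` is an extra safeguard (an assumption on our part) that raises an error if `GLOBALID` turns out to be duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for DF1 (CSV) and DF2 (GeoJSON); all values are hypothetical.
toy_df1 = pd.DataFrame({
    "* Damage": ["Destroyed", None, "No Damage"],
    "* Vent Screen": [None, "Mesh Screen", None],
    "APN (parcel)": ["001", "002", None],
    "GLOBALID": ["a", "b", "c"],
})
toy_df2 = pd.DataFrame({
    "DAMAGE": ["Destroyed", "Minor", "No Damage"],
    "VENTSCREEN": ["Unscreened", None, "Mesh Screen"],
    "GLOBALID": ["a", "b", "d"],
})

# Strip leading asterisks and spaces, then drop parenthesised text from names.
toy_df1.columns = (
    toy_df1.columns.str.lstrip("*").str.strip()
    .str.replace(r"\(.*\)", "", regex=True).str.strip()
)
toy_df2 = toy_df2.rename(columns={"DAMAGE": "Damage", "VENTSCREEN": "Ventscreen"})

# Inner join on the shared key; only GLOBALIDs present in both frames survive.
toy_merged = toy_df1.merge(
    toy_df2, on="GLOBALID", how="inner", suffixes=("", "_df2"), validate="one_to_one"
)

# Coalesce: keep the DF1 value and fall back to DF2 where DF1 is missing.
pairs = {"Damage": "Damage_df2", "Vent Screen": "Ventscreen"}
for left, right in pairs.items():
    toy_merged[left] = toy_merged[left].fillna(toy_merged[right])
    toy_merged = toy_merged.drop(columns=[right])

print(toy_merged)
```

Rows `c` and `d` are dropped because each appears in only one frame, which is the behavior to expect from the inner join on the real data as well.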


In [68]:
# Preprocess Columns in DF1

# Strip * and then leading spaces from start of column names
df1.columns = df1.columns.str.lstrip('*').str.lstrip()

# Drop ID columns since Pandas will add its own
df1.drop(columns=['_id','OBJECTID'], inplace=True)

# Remove parentheses (and text inside) from column names
df1.columns = df1.columns.str.replace(r"\(.*\)", "", regex=True).str.strip()

# Rename certain columns
df1.rename(columns={'If Affected 1-9% - Where did fire start?':'Fire Start Location',
                    'If Affected 1-9% - What started fire?' : 'Fire Cause'}, inplace=True)

df1.columns.tolist(), df1.shape

(['Damage',
  'Street Number',
  'Street Name',
  'Street Type',
  'Street Suffix',
  'City',
  'State',
  'Zip Code',
  'CAL FIRE Unit',
  'County',
  'Community',
  'Battalion',
  'Incident Name',
  'Incident Number',
  'Incident Start Date',
  'Hazard Type',
  'Fire Start Location',
  'Fire Cause',
  'Structure Defense Actions Taken',
  'Structure Type',
  'Structure Category',
  '# Units in Structure',
  '# of Damaged Outbuildings < 120 SQFT',
  '# of Non Damaged Outbuildings < 120 SQFT',
  'Roof Construction',
  'Eaves',
  'Vent Screen',
  'Exterior Siding',
  'Window Pane',
  'Deck/Porch On Grade',
  'Deck/Porch Elevated',
  'Patio Cover/Carport Attached to Structure',
  'Fence Attached to Structure',
  'Distance - Propane Tank to Structure',
  'Distance - Residence to Utility/Misc Structure &gt; 120 SQFT',
  'Fire Name',
  'APN',
  'Assessed Improved Value',
  'Year Built',
  'Site Address',
  'GLOBALID',
  'Latitude',
  'Longitude',
  'x',
  'y'],
 (100230, 45))

In [69]:
# Preprocess Columns in DF2

df2.drop(columns=['OBJECTID'], inplace=True)

rename_mapping = {
    "DAMAGE": "Damage",
    "STREETNUMBER": "Street Number",
    "STREETNAME": "Street Name",
    "STREETTYPE": "Street Type",
    "STREETSUFFIX": "Street Suffix",
    "CITY": "City",
    "STATE": "State",
    "ZIPCODE": "Zip Code",
    "CALFIREUNIT": "Calfire Unit",
    "COUNTY": "County",
    "COMMUNITY": "Community",
    "BATTALION": "Battalion",
    "INCIDENTNAME": "Incident Name",
    "INCIDENTNUM": "Incident Num",
    "INCIDENTSTARTDATE": "Incident Start Date",
    "HAZARDTYPE": "Hazard Type",
    "WHEREFIRESTARTEDONSTRUCTURE": "Where Fire Started On Structure",
    "WHATDIDFIRESTARTFROM": "What Did Fire Start From",
    "DEFENSIVEACTIONS": "Defensive Actions",
    "STRUCTURETYPE": "Structure Type",
    "STRUCTURECATEGORY": "Structure Category",
    "NUMBEROFUNITPERSTRUCTURE": "Number Of Unit Per Structure",
    "NOOUTBUILDINGSDAMAGED": "No Out Buildings Damaged",
    "NOOUTBUILDINGSNOTDAMAGED": "No Out Buildings Not Damaged",
    "ROOFCONSTRUCTION": "Roof Construction",
    "EAVES": "Eaves",
    "VENTSCREEN": "Ventscreen",
    "EXTERIORSIDING": "Exterior Siding",
    "WINDOWPANE": "Window Pane",
    "DECKPORCHONGRADE": "Deck Porch On Grade",
    "DECKPORCHELEVATED": "Deck Porch Elevated",
    "PATIOCOVERCARPORT": "Patio Cover Carport",
    "FENCEATTACHEDTOSTRUCTURE": "Fence Attached To Structure",
    "PROPANETANKDISTANCE": "Propane Tank Distance",
    "UTILITYMISCSTRUCTUREDISTANCE": "Utility Misc Structure Distance",
    "FIRENAME": "Fire Name",
    "APN": "Apn",
    "ASSESSEDIMPROVEDVALUE": "Assessed Improved Value",
    "YEARBUILT": "Year Built",
    "SITEADDRESS": "Site Address",
    "GLOBALID": "GLOBALID",  
    "LATITUDE": "Latitude",
    "LONGITUDE": "Longitude",
    "GEOMETRY": "Geometry"
}

df2.rename(columns=rename_mapping, inplace=True)
df2.columns.tolist(), df2.shape

(['Damage',
  'Street Number',
  'Street Name',
  'Street Type',
  'Street Suffix',
  'City',
  'State',
  'Zip Code',
  'Calfire Unit',
  'County',
  'Community',
  'Battalion',
  'Incident Name',
  'Incident Num',
  'Incident Start Date',
  'Hazard Type',
  'Where Fire Started On Structure',
  'What Did Fire Start From',
  'Defensive Actions',
  'Structure Type',
  'Structure Category',
  'Number Of Unit Per Structure',
  'No Out Buildings Damaged',
  'No Out Buildings Not Damaged',
  'Roof Construction',
  'Eaves',
  'Ventscreen',
  'Exterior Siding',
  'Window Pane',
  'Deck Porch On Grade',
  'Deck Porch Elevated',
  'Patio Cover Carport',
  'Fence Attached To Structure',
  'Propane Tank Distance',
  'Utility Misc Structure Distance',
  'Fire Name',
  'Apn',
  'Assessed Improved Value',
  'Year Built',
  'Site Address',
  'GLOBALID',
  'Latitude',
  'Longitude',
  'geometry'],
 (100230, 44))

In [70]:
# Merge Datasets 

merged_df = df1.merge(df2, on="GLOBALID", how="inner", suffixes=('', '_df2'))

# First, for columns with identical names (except the join key and geometry),
# fill in missing values in df1 using the corresponding df2 column.
overlap_cols = set(df1.columns).intersection(df2.columns) - {"GLOBALID", "geometry"}
for col in overlap_cols:
    col_df2 = col + '_df2'
    if col_df2 in merged_df.columns:
        merged_df[col] = merged_df[col].fillna(merged_df[col_df2])
        # Drop the duplicate column after imputation
        merged_df.drop(columns=[col_df2], inplace=True)

# Next, handle columns that have similar but not identical names.
# Create a mapping where the key is the df1 column to fill and the value is the df2 column.
manual_mapping = {
    'Vent Screen': 'Ventscreen',
    'Deck/Porch On Grade': 'Deck Porch On Grade',
    'Deck/Porch Elevated': 'Deck Porch Elevated',
    'Patio Cover/Carport Attached to Structure': 'Patio Cover Carport',
    'Distance - Propane Tank to Structure': 'Propane Tank Distance',
    'Distance - Residence to Utility/Misc Structure &gt; 120 SQFT': 'Utility Misc Structure Distance',
    'APN': 'Apn'
}

for col_df1, col_df2 in manual_mapping.items():
    if col_df2 in merged_df.columns:
        merged_df[col_df1] = merged_df[col_df1].fillna(merged_df[col_df2])
        # Drop the df2 column after filling
        merged_df.drop(columns=[col_df2], inplace=True)

# Optionally, if there remain any duplicate columns that weren't handled by the fillna,
# list them explicitly and drop them.
columns_to_drop = [
    'Ventscreen', 
    'Deck Porch On Grade', 
    'Deck Porch Elevated', 
    'Patio Cover Carport', 
    'Propane Tank Distance', 
    'Utility Misc Structure Distance', 
    'Apn'
]
merged_df = merged_df.drop(columns=columns_to_drop, errors='ignore')

# Display the remaining columns and shape
merged_df.columns.tolist(), merged_df.shape

(['Damage',
  'Street Number',
  'Street Name',
  'Street Type',
  'Street Suffix',
  'City',
  'State',
  'Zip Code',
  'CAL FIRE Unit',
  'County',
  'Community',
  'Battalion',
  'Incident Name',
  'Incident Number',
  'Incident Start Date',
  'Hazard Type',
  'Fire Start Location',
  'Fire Cause',
  'Structure Defense Actions Taken',
  'Structure Type',
  'Structure Category',
  '# Units in Structure',
  '# of Damaged Outbuildings < 120 SQFT',
  '# of Non Damaged Outbuildings < 120 SQFT',
  'Roof Construction',
  'Eaves',
  'Vent Screen',
  'Exterior Siding',
  'Window Pane',
  'Deck/Porch On Grade',
  'Deck/Porch Elevated',
  'Patio Cover/Carport Attached to Structure',
  'Fence Attached to Structure',
  'Distance - Propane Tank to Structure',
  'Distance - Residence to Utility/Misc Structure &gt; 120 SQFT',
  'Fire Name',
  'APN',
  'Assessed Improved Value',
  'Year Built',
  'Site Address',
  'GLOBALID',
  'Latitude',
  'Longitude',
  'x',
  'y',
  'Calfire Unit',
  'Incident N

In [71]:
# Show the columns with null values in descending order
null_columns = merged_df.isnull().sum().sort_values(ascending=False)
null_columns[null_columns > 0]

Battalion                                                       93551
Fire Cause                                                      91214
What Did Fire Start From                                        90904
Fire Start Location                                             89490
Where Fire Started On Structure                                 89177
Fire Name                                                       76883
Structure Defense Actions Taken                                 75760
Defensive Actions                                               74826
Distance - Residence to Utility/Misc Structure &gt; 120 SQFT    74040
No Out Buildings Not Damaged                                    69157
# of Non Damaged Outbuildings < 120 SQFT                        69157
No Out Buildings Damaged                                        69145
# of Damaged Outbuildings < 120 SQFT                            69145
# Units in Structure                                            69046
Number Of Unit Per S

In [72]:
# Drop rows that are missing any of the key columns
# (the columns themselves are kept; only incomplete rows are removed)
required_columns = [
    'Vent Screen', 'Eaves', 'Window Pane', 'Exterior Siding', 
    'Roof Construction', 'APN', 'County', 'State'
]
merged_df = merged_df.dropna(subset=required_columns)

# Drop Street Number column since it is not useful for analysis
merged_df = merged_df.drop(columns=['Street Number'], errors='ignore')

# Fill high-missing categorical columns with 'Unknown'
categorical_columns = [
    'Battalion', 
    'Fire Cause', 
    'What Did Fire Start From',
    'Fire Start Location', 
    'Where Fire Started On Structure', 
    'Fire Name',
    'Structure Defense Actions Taken', 
    'Defensive Actions',
    'Street Suffix', 
    'Zip Code', 
    'Community', 
    'Street Type', 
    'Street Name', 
    'City', 
    'Site Address'
]
for col in categorical_columns:
    if col in merged_df.columns:
        merged_df[col] = merged_df[col].fillna('Unknown')

# For high-missing numerical count-like columns, convert to numeric and fill with 0 
count_columns = [
    'No Out Buildings Not Damaged', 
    '# of Non Damaged Outbuildings < 120 SQFT',
    'No Out Buildings Damaged', 
    '# of Damaged Outbuildings < 120 SQFT',
    '# Units in Structure', 
    'Number Of Unit Per Structure'
]
for col in count_columns:
    if col in merged_df.columns:
        merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce')
        merged_df[col] = merged_df[col].fillna(0)

# For distance columns, convert to numeric and fill missing values with the median
distance_columns = [
    'Distance - Residence to Utility/Misc Structure &gt; 120 SQFT',
    'Distance - Propane Tank to Structure'
]
for col in distance_columns:
    if col in merged_df.columns:
        merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce')
        median_val = merged_df[col].median()
        if pd.isna(median_val):  # fallback if median is NaN
            median_val = 0
        merged_df[col] = merged_df[col].fillna(median_val)

# Convert additional numeric columns and fill missing values with the median
numeric_cols = [
    'Assessed Improved Value', 
    'Year Built'
]
for col in numeric_cols:
    if col in merged_df.columns:
        merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce')
        median_val = merged_df[col].median()
        merged_df[col] = merged_df[col].fillna(median_val)

# Imputation for remaining categorical columns
if 'Fence Attached to Structure' in merged_df.columns:
    merged_df['Fence Attached to Structure'] = merged_df['Fence Attached to Structure'].fillna('None')

merged_df.isnull().sum()


Mean of empty slice



Damage                                                          0
Street Name                                                     0
Street Type                                                     0
Street Suffix                                                   0
City                                                            0
State                                                           0
Zip Code                                                        0
CAL FIRE Unit                                                   0
County                                                          0
Community                                                       0
Battalion                                                       0
Incident Name                                                   0
Incident Number                                                 0
Incident Start Date                                             0
Hazard Type                                                     0
Fire Start

The above pre-processing steps ensure that our merged dataset is clean, consistent, and ready for detailed exploratory analysis and visualization.
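
As a small sketch of the median imputation used for the distance and numeric fields, the helper below coerces a column to numeric and falls back to a constant when every value is missing. The function name `impute_median`, the toy columns, and the fallback of 0 are illustrative assumptions rather than part of the notebook.

```python
import pandas as pd

def impute_median(series, fallback=0):
    """Coerce to numeric and fill gaps with the median.

    If every value is missing, the median is NaN, so `fallback` is used instead.
    """
    numeric = pd.to_numeric(series, errors="coerce")
    median = numeric.median()
    if pd.isna(median):
        median = fallback
    return numeric.fillna(median)

# Hypothetical columns: one partly filled, one entirely empty.
toy = pd.DataFrame({
    "Year Built": ["1950", "1990", None, "unknown"],
    "Empty Distance": [None, None, None, None],
})
toy["Year Built"] = impute_median(toy["Year Built"])
toy["Empty Distance"] = impute_median(toy["Empty Distance"])
print(toy)
```

Using one helper for every numeric column keeps the all-missing case consistent, so a fully empty column cannot leave NaN values behind after imputation.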

# Proposed Responsibilities and Timeline

**Ethan:** Write the project proposal after discussing as a group what the research question/problem statement/suggestion is along with the intended audience. Also, once project is nearly complete, will write the summary of data cleaning and transformation steps, key insights from exploratory analysis and justification of any data exclusions or assumptions (if any). If new dataset is picked, process it if needed for Timmy and Andy to use. 

**Timmy and Andy:** Work on the primary visualizations in Python to effectively communicate insights. For each graph, just bullet point reasons for design choices (colors, typography, chart types, interactivity) for Michael to incorporate in the slide deck. Visuals should have actionable titles and follow other class principles. 

**Michael:** Work on presentation flow and slide deck formatting. Also be the main person giving feedback on visuals made by Timmy and Andy. First half is a presentation to the intended audience. Second half is justifying design and storytelling choices. Exact time of presentation still TBD but can cut/add based on any changes.

**By Saturday evening**: Everyone review the dataset before the evening. If we like it, decide on the research question/problem statement/suggestion and intended audience. If not, pick new one and review together before deciding main idea and intended audience. Responsibilities can stay the same regardless of dataset choice. 

**By Tuesday evening**: Finish Visualizations and Write Ups

**By Wednesday evening**: Finish Slide Deck and Practice Presentation

# Exploratory Data Analysis
Below is an investigation of our data via visualizations to better grasp the data's key distributions.

In [73]:
fire_df = merged_df.copy()

In [74]:
fire_df['Incident Start Date'] = pd.to_datetime(fire_df['Incident Start Date'])
fire_df.groupby('Incident Name')['Incident Start Date'].nunique().reset_index()


Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.



    Incident Name  Incident Start Date
0            46th                    1
1           Aborn                    1
2            Aero                    1
3            Agua                    1
4         Airport                    1
..            ...                  ...
257       Winding                    1
258         Windy                    1
259         Woods                    1
260       Woolsey                    1


In [75]:
# It looks like each incident name has exactly one incident start date.
# Unfortunately, this means we can't determine where the fire started, limiting the analysis related
# to the start of the fire.
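
The observation above can be checked directly rather than read off the table. The sketch below uses made-up records, and the date format string is an assumption: the real column's format has not been confirmed here, so it would need adjusting to match the actual data. Passing an explicit `format` also avoids the per-element `dateutil` fallback that pandas warns about.

```python
import pandas as pd

# Hypothetical records; the real column's date format is not confirmed.
toy = pd.DataFrame({
    "Incident Name": ["Camp", "Camp", "Tubbs", "Tubbs"],
    "Incident Start Date": [
        "11/08/2018 12:00:00 AM", "11/08/2018 12:00:00 AM",
        "10/08/2017 12:00:00 AM", "10/08/2017 12:00:00 AM",
    ],
})

# Parse with an explicit format; values that do not match become NaT.
toy["Incident Start Date"] = pd.to_datetime(
    toy["Incident Start Date"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Count distinct start dates per incident and confirm there is exactly one each.
dates_per_incident = toy.groupby("Incident Name")["Incident Start Date"].nunique()
one_date_each = bool((dates_per_incident == 1).all())
print(one_date_each)
```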

## Wildfire Impact: Incident Count Per County

In [76]:
# Aggregate the count of rows (incidents) per county and incident
county_incident_counts = fire_df.groupby(['County', 'Incident Name']).size().reset_index(name='Incident Count')
#looking only at top 10% of incidents per name, lowers amount we have to look at
threshold = fire_df.groupby(['County', 'Incident Name']).size().quantile(0.90)
county_incident_counts = county_incident_counts[county_incident_counts['Incident Count'] > threshold]

counties_geojson_url = "https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/california-counties.geojson"
counties_geojson = gpd.read_file(counties_geojson_url)
counties_geojson['name'] = counties_geojson['name'].str.strip().str.lower()
county_incident_counts['County'] = county_incident_counts['County'].str.strip().str.lower()

# Ensure all counties from both datasets are included
all_counties = pd.DataFrame({'County': pd.concat([
    pd.Series(counties_geojson['name'].unique()), 
    pd.Series(county_incident_counts['County'].unique())
]).unique()})

county_incident_counts = all_counties.merge(county_incident_counts, on='County', how='left').fillna(0)

# Aggregate total number of incidents per county for "All Incidents"
total_county_incidents = county_incident_counts.groupby('County', as_index=False)['Incident Count'].sum()
total_county_incidents['Incident Name'] = 'All Incidents'
county_incident_counts = pd.concat([total_county_incidents, county_incident_counts], ignore_index=True)

# Merge fire data with county boundaries, ensuring all counties stay visible
merged_data = counties_geojson.merge(county_incident_counts, left_on='name', right_on='County', how='left').fillna(0)
unique_incidents = list(map(str, county_incident_counts['Incident Name'].unique()))
unique_incidents = [inc for inc in unique_incidents if inc not in ["0", ""]]
unique_incidents = sorted(unique_incidents, key=lambda x: (x != "All Incidents", x))
dropdown_options = [{'label': inc, 'value': inc} for inc in unique_incidents]


def get_incident_data(selected_incident):
    """
    Returns a list of incident counts per county for a given incident
    and the maximum value for color scaling.
    """
    selected_data = merged_data[merged_data['Incident Name'] == selected_incident][['name', 'Incident Count']]
    # Map county name -> incident count
    county_map = dict(zip(selected_data['name'], selected_data['Incident Count']))
    incident_values = [county_map.get(county, 0) for county in counties_geojson['name']]
    max_value = max(incident_values)
    return incident_values, max_value

# Default data for "All Incidents"
all_incidents_values, all_incidents_max = get_incident_data('All Incidents')

# Create figure with "All Incidents" as the default view
fig = px.choropleth(
    merged_data[merged_data['Incident Name'] == 'All Incidents'],
    geojson=counties_geojson,
    locations='name',
    featureidkey="properties.name",
    color='Incident Count',
    color_continuous_scale="Reds",
    range_color=(0, all_incidents_max),  # Dynamic range based on selected incident
    title="Counties With Most Structural Damage From Incidents",
    labels={'Incident Count': 'Incident Count'}
)


buttons = []
for inc in unique_incidents:
    incident_values, max_value = get_incident_data(inc)
    buttons.append({
        'label': inc,
        'method': 'update',
        'args': [
            {'z': [incident_values]}, 
            {'coloraxis': {'colorscale': 'Reds', 'cmin': 0, 'cmax': max_value}} 
        ]
    })

fig.update_layout(
    updatemenus=[{
        'buttons': buttons,
        'direction': 'down',
        'showactive': True
    }],
    coloraxis_colorbar=dict(title="Incident Count") 
)

# Ensure all counties remain visible at all times
fig.update_geos(fitbounds="locations", visible=False)
fig.show()


The above interactive choropleth map visualizes wildfire incident counts across California’s counties. By using the dropdown menu, you can switch between 'All Incidents' and specific wildfires to see how different events impacted each county.

From the high-level view of all incidents, the map reveals that northern counties, particularly those in or near forested and mountainous regions, show notably higher incident counts. This pattern reflects the region’s known vulnerability to large-scale fires, usually driven by dense vegetation, varied terrain, and seasonal weather conditions. Meanwhile, counties with lighter shading, often in more urbanized or desert areas, indicate fewer recorded events (although they are not immune to wildfire risk).
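
When reading the map, it helps to remember that "Incident Count" is the number of structure records per county and incident, and that only county/incident pairs above the 90th percentile are kept. The toy example below (made-up rows) is a rough sketch of how that filter behaves. Counties whose fires all fall below the threshold end up with a count of 0 on the map.

```python
import pandas as pd

# Hypothetical structure records: 6 for Camp, 3 for Tubbs, 1 for Atlas.
toy = pd.DataFrame({
    "County": ["butte"] * 6 + ["sonoma"] * 3 + ["napa"] * 1,
    "Incident Name": ["Camp"] * 6 + ["Tubbs"] * 3 + ["Atlas"] * 1,
})

# One row per (county, incident) with the number of records in that pair.
counts = (
    toy.groupby(["County", "Incident Name"])
    .size()
    .reset_index(name="Incident Count")
)

# Keep only pairs strictly above the 90th percentile of those counts.
threshold = counts["Incident Count"].quantile(0.90)
kept = counts[counts["Incident Count"] > threshold]

# Per-county totals over the kept pairs feed the "All Incidents" view.
totals = kept.groupby("County", as_index=False)["Incident Count"].sum()
totals["Incident Name"] = "All Incidents"
print(kept)
print(totals)
```

In this toy case only the largest pair survives, so the other two counties would be shaded as zero even though they have records.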

## Wildfire Incident Density Map

In [77]:
fire_df['Latitude']

0         38.474960
1         38.477442
2         38.479358
3         38.487313
4         38.485636
            ...    
100225    34.033408
100226    34.033278
100227    34.033618
100228    34.032085
100229    34.031957
Name: Latitude, Length: 97897, dtype: float64

In [78]:
fire_df['Incident Name'].value_counts().head(20)

Incident Name
Camp                   23613
Tubbs                   6023
LNU Lightning Cmplx     5102
CZU Lightning Cmplx     4820
Glass                   4751
Caldor                  4444
Dixie                   3840
Creek                   3302
North Complex           3186
Valley                  2415
Park                    2086
Woolsey                 2007
Mountain                1905
Carr                    1890
Kincade                 1568
Nuns                    1554
Atlas                   1372
Thomas                  1365
Silverado               1216
SCU Lightning Cmplx      947
Name: count, dtype: int64

In [79]:
fire_df[fire_df['Incident Name'] == 'Butte']['County'].value_counts()

County
Calaveras    907
Name: count, dtype: int64

In [80]:
fig = px.density_mapbox(
    fire_df,
    lat='Latitude',
    lon='Longitude',
    radius=1,
    center=dict(lat=37.7783, lon=-119.4179),
    zoom=4,
    mapbox_style="carto-positron",
    title="Areas With Highest Density of Fires"
)

fig.show()



*density_mapbox* is deprecated! Use *density_map* instead. Learn more at: https://plotly.com/python/mapbox-to-maplibre/



This density map highlights where wildfire incidents are most concentrated across California, with yellow areas indicating the highest density of reported events.

Notable clusters appear near major urban regions (the SF Bay Area and Sacramento) as well as in mountainous terrain, underscoring the need for targeted prevention efforts in these hotspots.

In contrast, the southern part of the state shows fewer dense clusters, suggesting that this region experiences less frequent or less concentrated incidents.

## Locations of the Densest Fires

In [81]:
# Show only top 10 incidents
top_10_incidents = fire_df['Incident Name'].value_counts().head(10).index.tolist()
unique_incidents = ['All Incidents'] + top_10_incidents

#Runs each incident under DBSCAN
def cluster_data_for_incident(incident_name):
    if incident_name == 'All Incidents':
        eps_val = 0.0003
        min_samp = 3
        subset = fire_df.dropna(subset=['Latitude', 'Longitude']).copy()
    else:
        eps_val = 0.005
        min_samp = 7
        subset = fire_df.dropna(subset=['Latitude', 'Longitude'])
        subset = subset[subset['Incident Name'] == incident_name].copy()
    
    if subset.empty:
        return pd.DataFrame(columns=['Latitude', 'Longitude', 'Incident Count', 'marker_size', 'cluster'])
    
    coords = subset[['Latitude', 'Longitude']].values
    db = DBSCAN(eps=eps_val, min_samples=min_samp)
    cluster_labels = db.fit_predict(coords)
    subset['cluster'] = cluster_labels
    subset = subset[subset['cluster'] != -1].copy()
    if subset.empty:
        return pd.DataFrame(columns=['Latitude', 'Longitude', 'Incident Count', 'marker_size', 'cluster'])
    
    # Compute cluster centroids and counts.
    cluster_centroids = subset.groupby('cluster')[['Latitude', 'Longitude']].mean().reset_index()
    cluster_counts = subset.groupby('cluster').size().reset_index(name='Incident Count')
    cluster_data = pd.merge(cluster_centroids, cluster_counts, on='cluster')

    #Adjust for different cluster sizes
    scaling_factor = 5
    cluster_data['marker_size'] = np.log1p(cluster_data['Incident Count']) * scaling_factor
    return cluster_data

# Adjust for different map centers
def get_center_and_zoom(incident_name):
    if incident_name == 'All Incidents':
        subset = fire_df.dropna(subset=['Latitude', 'Longitude']).copy()
    else:
        subset = fire_df.dropna(subset=['Latitude', 'Longitude'])
        # Build the mask from subset itself so its index matches the filtered rows
        subset = subset[subset['Incident Name'] == incident_name].copy()
    if subset.empty:
        return dict(lat=37.0, lon=-120.0), 5
    center_lat = (subset['Latitude'].max() + subset['Latitude'].min()) / 2
    center_lon = (subset['Longitude'].max() + subset['Longitude'].min()) / 2
    lat_range = subset['Latitude'].max() - subset['Latitude'].min()
    lon_range = subset['Longitude'].max() - subset['Longitude'].min()
    extent = max(lat_range, lon_range)
    if extent < 1:
        zoom = 9
    elif extent < 2:
        zoom = 8
    elif extent < 7:
        zoom = 5
    else:
        zoom = 4
    return dict(lat=center_lat, lon=center_lon), zoom

# Scatterplots for each incident
fig = go.Figure()
incident_centers = {}
incident_zooms = {}

for i, inc in enumerate(unique_incidents):
    cdata = cluster_data_for_incident(inc)
    if inc == 'All Incidents':
        cdata = cdata.sort_values(by='Incident Count', ascending=True)
        current_opacity = 0.3
    else:
        current_opacity = 1
    center, zoom = get_center_and_zoom(inc)
    incident_centers[inc] = center
    incident_zooms[inc] = zoom

    if not cdata.empty:
        cmin = cdata['Incident Count'].min()
        cmax = cdata['Incident Count'].max()
    else:
        cmin, cmax = None, None

    trace = go.Scattermapbox(
        lat=cdata['Latitude'],
        lon=cdata['Longitude'],
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=cdata['marker_size'],
            color=cdata['Incident Count'],
            colorscale='sunsetdark',
            reversescale=True,
            cmin=cmin,
            cmax=cmax,
            showscale=True,
            opacity=current_opacity,
            colorbar=dict(title="Incident Count", x=1.0, xpad=10)
        ),
        text=cdata.apply(lambda row: f"Cluster {row['cluster']}: {row['Incident Count']} incidents", axis=1),
        hoverinfo='text',
        name=inc,
        visible=True if i == 0 else False
    )
    fig.add_trace(trace)

buttons = []
n = len(unique_incidents)
for j, inc in enumerate(unique_incidents):
    vis = [False] * n
    vis[j] = True
    center = incident_centers[inc]
    zoom = incident_zooms[inc]
    buttons.append(
        dict(
            label=inc,
            method='update',
            args=[
                {'visible': vis},
                {'title': f"Wildfire Incident Clusters (DBSCAN) - {inc}",
                 'mapbox.center': center,
                 'mapbox.zoom': zoom}
            ]
        )
    )
    
fig.update_layout(
    mapbox=dict(
        style="carto-positron",
        center=incident_centers['All Incidents'],
        zoom=incident_zooms['All Incidents']
    ),
    margin={"r": 0, "t": 50, "l": 0, "b": 0},
    title="Locations of Densest Fire Incidents",
    updatemenus=[{
        'buttons': buttons,
        'direction': 'down',
        'showactive': True,
        'x': 0.05,
        'xanchor': 'left',
        'y': 1.05,
        'yanchor': 'top'
    }],
    legend=dict(x=1.02, y=0.95),
    width = 800
)

fig.show()



*scattermapbox* is deprecated! Use *scattermap* instead. Learn more at: https://plotly.com/python/mapbox-to-maplibre/



The interactive map above uses DBSCAN to highlight dense clusters of wildfire incidents across California. Each circle represents a cluster, with its size and color reflecting the number of incidents in that cluster. As before, selecting 'All Incidents' or a specific fire in the dropdown menu shows how hotspots shift.

Across all incidents, there are several high-intensity wildfire hotspots, including notable clusters in Southern California. While our earlier analysis suggested less frequent incidents in the south, these dense clusters show that when fires do occur there, they can be highly concentrated and form significant local hotspots. Stakeholders should keep this in mind when discussing targeted, region-specific mitigation strategies.
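
Because DBSCAN is run here on raw latitude/longitude values, `eps` is measured in degrees rather than in metres. The sketch below converts the two `eps` values used above into approximate ground distances. It assumes a spherical Earth with a mean radius of 6371 km and a reference latitude of 37° (roughly central California). The helper name `eps_degrees_to_km` is ours, not from any library.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def eps_degrees_to_km(eps_deg, latitude_deg):
    """Approximate north-south and east-west distance spanned by eps degrees."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = eps_deg * km_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

for eps in (0.0003, 0.005):
    ns, ew = eps_degrees_to_km(eps, latitude_deg=37.0)
    print(f"eps={eps}: ~{ns * 1000:.0f} m north-south, ~{ew * 1000:.0f} m east-west")
```

Under these assumptions, the 'All Incidents' setting links points within a few tens of metres of each other, while the per-incident setting links points within roughly half a kilometre. The east-west reach is shorter than the north-south reach, so clusters found this way are slightly stretched relative to true ground distance.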

## Number of Fire Incidents in 2024

In [82]:
fire_df['Incident Start Date'] = pd.to_datetime(fire_df['Incident Start Date'])

years = sorted(fire_df['Incident Start Date'].dt.year.unique())
default_params = {'eps': 0.0005, 'min_samples': 4}
# Some years have particularly few data points, so their DBSCAN parameters are adjusted manually.
year_overrides = {
    2015: {'eps': 0.001, 'min_samples': 2},
    2016: {'eps': 0.001, 'min_samples': 2},
    2019: {'eps': 0.005, 'min_samples': 2},
    2023: {'eps': 0.0005, 'min_samples': 2},
}
year_params = {y: year_overrides.get(y, default_params) for y in years}

# Function to cluster data for a given year using the manual parameters.
def cluster_data_for_year(year):
    subset = fire_df[fire_df['Incident Start Date'].dt.year == year].copy()
    if subset.empty:
        return pd.DataFrame(columns=['Latitude', 'Longitude', 'Incident Count', 'marker_size', 'cluster'])
    
    params = year_params.get(year, default_params)
    eps_val = params['eps']
    min_samp = params['min_samples']
    
    coords = subset[['Latitude', 'Longitude']].values
    db = DBSCAN(eps=eps_val, min_samples=min_samp)
    cluster_labels = db.fit_predict(coords)
    subset['cluster'] = cluster_labels
    subset = subset[subset['cluster'] != -1].copy()
    if subset.empty:
        return pd.DataFrame(columns=['Latitude', 'Longitude', 'Incident Count', 'marker_size', 'cluster'])
    
    cluster_centroids = subset.groupby('cluster')[['Latitude', 'Longitude']].mean().reset_index()
    cluster_counts = subset.groupby('cluster').size().reset_index(name='Incident Count')
    cluster_data = pd.merge(cluster_centroids, cluster_counts, on='cluster')
    
    scaling_factor = 5
    cluster_data['marker_size'] = np.log1p(cluster_data['Incident Count']) * scaling_factor
    return cluster_data

# Function to compute the map center (midpoint of the min/max lat/lon) for a given year; the zoom level is fixed.
def get_center_and_zoom_for_year(year):
    subset = fire_df[fire_df['Incident Start Date'].dt.year == year].copy()
    if subset.empty:
        return dict(lat=37.0, lon=-120.0), 5
    center_lat = (subset['Latitude'].max() + subset['Latitude'].min()) / 2
    center_lon = (subset['Longitude'].max() + subset['Longitude'].min()) / 2
    zoom = 4  # fixed zoom; the lat/lon extent is not used to scale it
    return dict(lat=center_lat, lon=center_lon), zoom

all_cluster_data = {}
global_max = 0
for year in years:
    data = cluster_data_for_year(year)
    all_cluster_data[year] = data
    if not data.empty:
        current_max = data['Incident Count'].max()
        if current_max > global_max:
            global_max = current_max

traces = []
year_centers = {}
year_zooms = {}

for i, year in enumerate(years):
    cdata = all_cluster_data[year]
    if not cdata.empty:
        cdata = cdata.sort_values(by='Incident Count', ascending=True)
    center, zoom = get_center_and_zoom_for_year(year)
    year_centers[year] = center
    year_zooms[year] = zoom
    
    trace = go.Scattermapbox(
        lat=cdata['Latitude'],
        lon=cdata['Longitude'],
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=cdata['marker_size'],
            color=cdata['Incident Count'],
            colorscale='sunsetdark', 
            reversescale=True,
            cmin=0,
            cmax=global_max,
            showscale=True,
            opacity=0.8,
            colorbar=dict(title="Incident Count", x=1.0, xpad=10)
        ),
        # Row-wise apply upcasts integer columns to float, so cast back to int for clean hover text.
        text=cdata.apply(lambda row: f"Cluster {int(row['cluster'])}: {int(row['Incident Count'])} incidents", axis=1),
        hoverinfo='text',
        name=str(year),
        visible=(i == 0)
    )
    traces.append(trace)

# Slider that will adjust for the year
slider_steps = []
n = len(years)
for idx, year in enumerate(years):
    vis = [False] * n
    vis[idx] = True
    center = year_centers[year]
    zoom = year_zooms[year]
    slider_steps.append(dict(
        method="update",
        label=str(year),
        args=[{'visible': vis},
              {'title': f"Fire Incidents in Year {years[idx]}",
               'mapbox.center': center,
               'mapbox.zoom': zoom}]
    ))

sliders = [dict(
    active=0,
    currentvalue={"prefix": "Year: "},
    pad={"t": 50},
    steps=slider_steps
)]

fig = go.Figure(data=traces)
fig.update_layout(
    mapbox=dict(
        style="carto-positron",
        center=year_centers[years[0]],
        zoom=year_zooms[years[0]]
    ),
    margin={"r": 0, "t": 50, "l": 0, "b": 0},
    title=f"Fire Incidents in Year {years[0]}",
    sliders=sliders,
    width= 800
)

fig.show()






Here is another DBSCAN-based map, this time with an interactive slider that steps through wildfire incidents by year. Moving the slider shows how clusters shift from year to year. Each marker represents a cluster of incidents, with marker size and color reflecting the incident count.

The default view of this plot highlights 2024, revealing several large clusters in the north and moderate activity farther south. Not shown, however, is the spread of southern fires that have since broken out from the Palisades, reaching cities like Malibu and Santa Monica.
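
The `eps` values passed to DBSCAN in the yearly clustering cell are expressed in raw degrees, because the clustering runs on unprojected latitude/longitude pairs. As a rough aid for interpreting them, the sketch below converts a degree radius into approximate ground distance. It assumes a spherical Earth with a 6,371 km radius and a reference latitude of 37°N (roughly central California); `degrees_to_meters` is an illustrative helper, not part of the notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation (assumption)

def degrees_to_meters(eps_deg, ref_lat_deg=37.0):
    """Approximate north-south and east-west ground distance spanned by eps_deg."""
    meters_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360
    north_south = eps_deg * meters_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(ref_lat_deg))
    return north_south, east_west

for eps in (0.0005, 0.001, 0.005):
    ns, ew = degrees_to_meters(eps)
    print(f"eps={eps}: ~{ns:.0f} m north-south, ~{ew:.0f} m east-west")
```

Under these assumptions the default `eps` of 0.0005 spans roughly 55 m north-south and somewhat less east-west, so the neighborhoods are slightly elliptical on the ground. If that matters, one option would be to cluster on coordinates in radians with a haversine metric instead.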

## A Deeper Dive
Having examined the geographic distribution of California's wildfires over the years, we next look more closely at the resulting damage and the defensive efforts taken.

## Amount of Structural Damage by County

In [83]:
#histogram of structure damage (type) by county
fig = px.histogram(
    merged_df,
    x="County",
    color="Damage",
    barmode="group",
    height=600,
    title="Structure Damage by County"
)
fig.update_layout(xaxis_title="County", yaxis_title="Number of Structures")
fig.show()

Above is a grouped histogram illustrating the varying degrees of structural damage across California’s counties, from “No Damage” to “Destroyed (>50%).” 

In most counties, “No Damage” is the most frequently reported category, suggesting that many structures remain unscathed. However, a few counties stand out for having higher counts in the “Major (26-50%)” and “Destroyed (>50%)” categories, indicating a disproportionate share of severe fire losses in those areas. 

The damage disparity shows how wildfire impacts can be highly uneven, with some regions experiencing far more catastrophic structural damage than others. We'll take a more specific look next.
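
Raw counts favor counties with many recorded structures, so the "disproportionate share" reading is easier to check with per-county proportions. The sketch below shows one way to compute them with a row-normalized crosstab. The `toy` frame is a made-up stand-in for `merged_df`, and the county names and rows are purely illustrative.

```python
import pandas as pd

# Toy stand-in for merged_df; county names and rows are made up for illustration.
toy = pd.DataFrame({
    "County": ["A", "A", "A", "A", "B", "B"],
    "Damage": ["No Damage", "No Damage", "No Damage", "Destroyed (>50%)",
               "Destroyed (>50%)", "Destroyed (>50%)"],
})

# normalize="index" divides each row by its total, so every county sums to 1.
damage_share = pd.crosstab(toy["County"], toy["Damage"], normalize="index")
print(damage_share.round(2))
```

Applied to the real data, the same call on `merged_df["County"]` and `merged_df["Damage"]` would give each county's share of every damage category, which could then be sorted by the "Destroyed (>50%)" column.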

## Top Structurally-Damaged Counties

In [84]:
# Count structures per county and take top 10
county_counts = merged_df["County"].value_counts().head(10)

fig = px.bar(
    x=county_counts.index,
    y=county_counts.values,
    height=500,
    title="Top 10 Counties by Number of Structures Affected",
    labels={"x": "County", "y": "Number of Structures"}
)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

This bar chart ranks the top 10 counties by the total number of structures affected by wildfires. Butte County leads with over 28,600 structures, nearly 10 times as many as Lake County, which sits at just under 2,900. Many of these top-ranking counties lie in Northern California, where rugged terrain and the wildland-urban interface can exacerbate fire spread. Older infrastructure or limited use of fire-resistant building materials may also contribute to higher structural vulnerability in general.

The wide range in structures affected (from under 3,000 to well over 28,000) highlights how some regions bear a substantially heavier burden of fire-related damage, reflecting vulnerabilities that call for targeted mitigation strategies.

# Further Processing
Additional processing was carried out to enable more specific views of the data.

## Join Geographical Data with Income, Demographic, and Population Information

### Step 1: Geocoding Process: Fetch Center Latitude and Center Longitude Values Corresponding to Sub-counties

In [85]:
# import pandas as pd
# import numpy as np
# from sklearn.neighbors import BallTree
# from geopy.geocoders import Nominatim
# from geopy.exc import GeocoderTimedOut, GeocoderServiceError

# # Load the California locations-by-income table
# df = pd.read_csv('List_of_California_locations_by_income_3.csv')

# # Assuming merged_df is your wildfires dataset with 'Latitude', 'Longitude', and 'County' columns
# # Filter df to only include places whose counties are in merged_df['County'] *before* geocoding
# df_filtered = df[df['County'].isin(merged_df['County'])].copy()  # copy so new columns can be added safely
# print(f"Number of places to geocode (filtered by merged_df counties): {len(df_filtered)}")

# # Initialize geolocator with a custom user agent
# geolocator = Nominatim(user_agent="california_wildfire_study")

# def get_lat_lon(place_name, county_name, max_attempts=3):
#     """Geocodes a place name and county to latitude and longitude with retry logic."""
#     query = f"{place_name}, {county_name} County, California"
#     for attempt in range(max_attempts):
#         try:
#             location = geolocator.geocode(query, timeout=10)
#             if location:
#                 return location.latitude, location.longitude
#             return None, None  # Return None if no location found
#         except (GeocoderTimedOut, GeocoderServiceError) as e:
#             if attempt == max_attempts - 1:
#                 print(f"Failed to geocode {query} after {max_attempts} attempts: {e}")
#                 return None, None
#             continue

# # Apply geocoding to the filtered dataframe with progress tracking
# print("Starting geocoding process...")
# latitudes = []
# longitudes = []
# for i, (_, row) in enumerate(df_filtered.iterrows()):
#     lat, lon = get_lat_lon(row['Place'], row['County'])
#     latitudes.append(lat)
#     longitudes.append(lon)
#     if i % 50 == 0:  # Progress update every 50 rows (counted by position, not index label)
#         print(f"Processed {i} of {len(df_filtered)} locations")

# df_filtered['Center Latitude'] = latitudes
# df_filtered['Center Longitude'] = longitudes

# # Remove rows where geocoding failed
# df_clean = df_filtered.dropna(subset=['Center Latitude', 'Center Longitude'])
# print(f"Successfully geocoded {len(df_clean)} out of {len(df_filtered)} locations")
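
The geocoding loop above retries on errors but does not pause between requests, and public geocoding services such as Nominatim generally expect clients to throttle themselves (about one request per second, as far as we understand its usage policy). The sketch below shows one way to combine retries with a pause. `geocode_with_retry`, `pause_s`, and the stub `flaky_geocode` are hypothetical names introduced for illustration, and the stub replaces the real network call so the example runs offline.

```python
import time
from types import SimpleNamespace

def geocode_with_retry(geocode, query, max_attempts=3, pause_s=1.0, sleep=time.sleep):
    """Call geocode(query), retrying on errors and pausing after every request.

    `geocode` is any callable that returns an object with .latitude and
    .longitude attributes, or None when nothing is found.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            location = geocode(query)
        except Exception as exc:  # the notebook catches GeocoderTimedOut / GeocoderServiceError
            if attempt == max_attempts:
                print(f"Failed to geocode {query!r} after {max_attempts} attempts: {exc}")
                return None, None
            sleep(pause_s * attempt)  # wait a little longer before each retry
            continue
        sleep(pause_s)  # throttle successful requests as well
        if location is None:
            return None, None
        return location.latitude, location.longitude
    return None, None

# Offline stand-in for a geocoder: fails once, then returns made-up coordinates.
calls = {"n": 0}

def flaky_geocode(query):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout")
    return SimpleNamespace(latitude=39.76, longitude=-121.62)

print(geocode_with_retry(flaky_geocode, "Paradise, Butte County, California",
                         sleep=lambda s: None))
```

In the notebook, `geocode` would be something like `lambda q: geolocator.geocode(q, timeout=10)`. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that covers similar ground.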


### Step 2: Join merged_df with df_clean Using Closest-match County Information

In [86]:
# df_clean.to_csv('County_data_with_coords.csv')

In [87]:
# df_filtered['Center Latitude'] = latitudes
# df_filtered['Center Longitude'] = longitudes

# # Remove rows where geocoding failed
# df_clean = df_filtered.dropna(subset=['Center Latitude', 'Center Longitude'])
# print(f"Successfully geocoded {len(df_clean)} out of {len(df_filtered)} locations")

# # BallTree setup
# unique_counties = merged_df['County'].unique()
# print(f"Number of unique counties in merged_df: {len(unique_counties)}")

# county_places = {}
# for county in unique_counties:
#     places = df_clean[df_clean['County'] == county]
#     if places.empty:
#         county_places[county] = None
#         print(f"No places found in {county} County")
#         continue
#     coords_deg = places[['Center Latitude', 'Center Longitude']].values
#     coords_rad = np.deg2rad(coords_deg)
#     tree = BallTree(coords_rad, metric='haversine')
#     county_places[county] = {
#         'tree': tree,
#         'indices': places.index.values
#     }

# # Function to find nearest place
# def find_nearest_in_county(wildfire_row):
#     county = wildfire_row['County']
#     if county not in county_places or county_places[county] is None:
#         return None, None
#     wildfire_coord_deg = (wildfire_row['Latitude'], wildfire_row['Longitude'])
#     wildfire_coord_rad = np.deg2rad(wildfire_coord_deg)
#     tree = county_places[county]['tree']
#     indices = county_places[county]['indices']
#     distance_rad, idx = tree.query([wildfire_coord_rad], k=1)
#     distance_rad = distance_rad[0][0]
#     idx = idx[0][0]
#     place_index = indices[idx]
#     distance_km = distance_rad * 6371
#     return place_index, distance_km

# # Apply matching
# print("Finding nearest places for each wildfire...")
# results = merged_df.apply(find_nearest_in_county, axis=1, result_type='expand')
# merged_df['Nearest Place Index'] = results[0]
# merged_df['Distance to Nearest Place (km)'] = results[1]

# # Rejoin merged_df with df_clean
# columns_to_merge = ['Place', 'Population', 'Population\ndensity', 'Per capita income', 
#                     'Median household income', 'Median family income', 
#                     'Center Latitude', 'Center Longitude']
# result_df = merged_df.merge(
#     df_clean[columns_to_merge], 
#     left_on='Nearest Place Index', 
#     right_index=True, 
#     how='left'
# )

# # Clean up
# result_df = result_df.drop(columns=['Nearest Place Index'])

# # Diagnostics
# print("Join completed. Sample of result_df:")
# print(result_df[['Incident Name', 'County', 'Place', 'Median household income', 'Distance to Nearest Place (km)']].head())

# # Check for null Median Household Income
# null_income_df = result_df[result_df['Median household income'].isna()]
# print("\nNumber of rows with null Median Household Income:", len(null_income_df))
# print("Counties with null Median Household Income:", null_income_df['County'].unique().tolist())
# print("Sample of null Median Household Income rows:")
# print(null_income_df[['Incident Name', 'County', 'Place', 'Median household income']].head())
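
To make the matching step easier to follow, here is a brute-force sketch of what the per-county BallTree lookup computes: for each fire record, the great-circle distance to every geocoded place in the same county, keeping the closest. It uses only the standard library and made-up coordinates. `haversine_km` and `nearest_place_in_county` are illustrative helpers, not the notebook's functions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same radius the notebook uses to convert radians to km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_place_in_county(fire, places):
    """Return (place_name, distance_km) for the closest place in the fire's county."""
    candidates = [p for p in places if p["County"] == fire["County"]]
    if not candidates:
        return None, None  # mirrors the notebook's handling of counties with no places
    distances = [
        haversine_km(fire["Latitude"], fire["Longitude"],
                     p["Center Latitude"], p["Center Longitude"])
        for p in candidates
    ]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return candidates[best]["Place"], distances[best]

# Made-up records for illustration only.
places = [
    {"Place": "P1", "County": "X", "Center Latitude": 39.0, "Center Longitude": -121.1},
    {"Place": "P2", "County": "X", "Center Latitude": 39.5, "Center Longitude": -121.0},
    {"Place": "P3", "County": "Y", "Center Latitude": 39.0, "Center Longitude": -121.0},
]
fire = {"County": "X", "Latitude": 39.0, "Longitude": -121.0}
print(nearest_place_in_county(fire, places))
```

Note that `P3` is the closest point overall but is skipped because it lies in a different county, which is the same constraint the per-county trees impose. The BallTree version does the same search far more efficiently for 100k+ records.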

In [88]:
# result_df[['Latitude','Longitude','Center Latitude', 'Center Longitude']]

# result_df.dropna().to_csv('merged_df_with_add_info.csv')

result_df = pd.read_csv('merged_df_with_add_info.csv')

# Back to visual analysis

## Normalized Fire Incidents by Population Density and Damage Type

In [89]:
result_df = pd.read_csv('merged_df_with_add_info.csv')

# Convert Population density to numeric (coerce errors to NaN)
result_df["Population density"] = pd.to_numeric(result_df["Population density"], errors="coerce")

# Drop NaNs (optional: if necessary)
result_df = result_df.dropna(subset=["Population density"])

bins = [0, 200, 500, 750, float("inf")]
labels = ["Low Density (<200)", "Medium Density (200-500)", "High Density (500-750)", "Very High Density (>750)"]

result_df["Population Density Category"] = pd.cut(result_df["Population density"], bins=bins, labels=labels)

In [90]:
# Aggregate fire incidents by population density category and damage type
density_damage_counts = result_df.groupby(["Population Density Category", "Damage"]).size().reset_index(name="Count")

# Normalize counts within each Population Density Category
density_damage_counts["Total"] = density_damage_counts.groupby("Population Density Category")["Count"].transform("sum")
density_damage_counts["Percentage"] = (density_damage_counts["Count"] / density_damage_counts["Total"]) * 100

# Stacked bar chart (normalized percentages)
fig = px.bar(density_damage_counts, 
             x="Population Density Category", 
             y="Percentage", 
             color="Damage",
             title="Normalized Fire Incidents by Population Density and Damage Type",
             text_auto=".1f",  # Show percentages with 1 decimal place
             barmode="stack")

# Update y-axis to show percentages
fig.update_layout(yaxis_title="Percentage of Total Incidents (%)")

fig.show()








The stacked bar chart above normalizes wildfire incidents by population density category. Each bar represents one population density category, and the stacked segments within each bar show the percentage of wildfire incidents falling into each damage type.

The chart shows that lower-density areas tend to have a higher share of 'No Damage' reports, while very-high-density areas show a greater proportion of severe outcomes ('Major' or 'Destroyed'). This pattern suggests that densely populated areas may be more heavily impacted by wildfire damage, plausibly because a single fire can reach many closely spaced structures and residents.
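
One detail worth checking when reading these categories is how `pd.cut` treats values that fall exactly on a bin edge. By default its intervals are open on the left and closed on the right, so with the bins used above a density of exactly 0 falls outside every bin, and a density of exactly 200 lands in the "Low Density (<200)" category despite the label. The sketch below demonstrates this on made-up values. Whether such edge values occur in the real data is not something this notebook shows.

```python
import pandas as pd

densities = pd.Series([0, 150, 200, 500.5, 9000])  # made-up values for illustration
bins = [0, 200, 500, 750, float("inf")]
labels = ["Low Density (<200)", "Medium Density (200-500)",
          "High Density (500-750)", "Very High Density (>750)"]

# Default behavior: intervals are (0, 200], (200, 500], (500, 750], (750, inf].
default_cut = pd.cut(densities, bins=bins, labels=labels)

# include_lowest=True makes the first interval closed on the left, so 0 is kept.
inclusive_cut = pd.cut(densities, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"density": densities,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```

If zero-density rows exist and should be counted, `include_lowest=True` is a small change. Otherwise they drop out of the grouped percentages silently.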

## Proportion of Defensive Actions Taken Across Bins of Assessed Improved Value

In [93]:
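# NOTE: this cell expects `map_df` (with 'Latitude', 'Longitude', 'Damage', 'Per capita income',
# 'Income Group', and 'Place' columns) and `income_order` (the list of income-group labels)
# to already exist. Neither is defined in the cells shown here, so running it as-is raises a NameError.
from dash import Dash, dcc, html, Input, Output, callback
import plotly.express as px
import numpy as np
from sklearn.cluster import DBSCAN
import pandas as pd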

app = Dash(__name__)

# Define app styles with white background
app_style = {
    'backgroundColor': 'white',
    'padding': '20px',
    'fontFamily': 'Arial, sans-serif'
}

filter_style = {
    'backgroundColor': 'white',
    'padding': '10px',
    'marginBottom': '15px',
    'borderRadius': '5px',
    'boxShadow': '0px 0px 5px rgba(0,0,0,0.1)'
}

label_style = {
    'fontWeight': 'bold',
    'marginBottom': '5px'
}

# Function to cluster nearby points
def cluster_points(df, eps=0.05, min_samples=2):
    # Extract coordinates for clustering
    coords = df[['Latitude', 'Longitude']].values
    
    # Apply DBSCAN clustering
    clustering = DBSCAN(eps=eps, min_samples=min_samples).fit(coords)
    
    # Add cluster labels to the dataframe
    df_clustered = df.copy()
    df_clustered['cluster'] = clustering.labels_
    
    # Aggregate data by clusters
    aggregated = []
    
    # Handle noise points (cluster = -1)
    noise_points = df_clustered[df_clustered['cluster'] == -1]
    
    # Aggregate clustered points
    for cluster_id in np.unique(clustering.labels_):
        if cluster_id == -1:  # Skip noise points as we'll add them separately
            continue
            
        # Rows belonging to this cluster (named `members` to avoid shadowing the function name)
        members = df_clustered[df_clustered['cluster'] == cluster_id]
        
        # Calculate centroid and aggregate metrics
        centroid_lat = members['Latitude'].mean()
        centroid_lon = members['Longitude'].mean()
        total_damage = members['Damage'].sum()
        avg_income = members['Per capita income'].mean()
        
        # Get the most common income group
        income_group = members['Income Group'].mode()[0]
        
        # Get all places in this cluster
        places = ', '.join(members['Place'].unique())
        
        aggregated.append({
            'Latitude': centroid_lat,
            'Longitude': centroid_lon,
            'Damage': total_damage,
            'Per capita income': avg_income,
            'Income Group': income_group,
            'Place': f"Cluster: {places}",
            'Points': len(members)
        })
    
    # Add individual noise points
    for _, row in noise_points.iterrows():
        aggregated.append({
            'Latitude': row['Latitude'],
            'Longitude': row['Longitude'],
            'Damage': row['Damage'],
            'Per capita income': row['Per capita income'],
            'Income Group': row['Income Group'],
            'Place': row['Place'],
            'Points': 1
        })
    
    return pd.DataFrame(aggregated)

# Create the app layout
app.layout = html.Div([
    html.H1("Fire Incidents Map", style={'textAlign': 'center', 'color': '#333'}),
    
    # Income group filter
    html.Div([
        html.Label("Filter by Income Group:", style=label_style),
        dcc.Checklist(
            id='income-filter',
            options=[{'label': group, 'value': group} for group in income_order],
            value=income_order,  # Default: all selected
            inline=True,
            style={'backgroundColor': 'white'}
        )
    ], style=filter_style),
    
    # Clustering control
    html.Div([
        html.Label("Clustering Distance (degrees):", style=label_style),
        dcc.Slider(
            id='cluster-distance',
            min=0.01,
            max=0.2,
            step=0.01,
            value=0.05,
            marks={i/100: f'{i/100}' for i in range(1, 21, 2)}
        )
    ], style=filter_style),
    
    # Map
    html.Div([
        dcc.Graph(id='fire-map')
    ], style={'backgroundColor': 'white', 'padding': '10px', 'borderRadius': '5px'})
    
], style=app_style)  # Apply the white background to the entire app

@callback(
    Output('fire-map', 'figure'),
    [Input('income-filter', 'value'),
     Input('cluster-distance', 'value')]
)
def update_map(selected_income_groups, cluster_distance):
    # Filter the dataframe (only by income group now)
    filtered_df = map_df[map_df['Income Group'].isin(selected_income_groups)]
    
    # Apply clustering if there are enough points
    if len(filtered_df) > 1:
        clustered_df = cluster_points(filtered_df, eps=cluster_distance, min_samples=2)
    else:
        clustered_df = filtered_df.copy()
        clustered_df['Points'] = 1
    
    # Create the map
    fig = px.scatter_mapbox(clustered_df, 
                           lat="Latitude", 
                           lon="Longitude", 
                           size="Damage",
                           color="Per capita income",  # Changed to use continuous color scale
                           hover_name="Place",
                           hover_data={
                               "Per capita income": ":.2f", 
                               "Damage": True,
                               "Points": True  # Show how many points are in each cluster
                           },
                           title="Fire Incidents by Location and Income (Clustered)",
                           color_continuous_scale="Viridis",  # Use continuous color scale
                           size_max=60,
                           height = 600,
                           zoom=5)
    
    # Customize the color bar
    fig.update_layout(
        mapbox_style="carto-positron", 
        mapbox_center={"lat": clustered_df["Latitude"].mean(), "lon": clustered_df["Longitude"].mean()},
        paper_bgcolor='white',  # Set the figure background color
        plot_bgcolor='white',   # Set the plot area background color
        margin=dict(l=10, r=10, t=30, b=10),
        coloraxis_colorbar=dict(
            title="Per Capita Income",
            tickprefix="$",
            tickformat=","
        )
    )
    
    return fig

if __name__ == '__main__':
    # app.run replaces app.run_server, which is deprecated in recent Dash releases
    app.run(debug=True)

NameError: name 'income_order' is not defined

To explore the map, set the clustering distance to 0.12 and select the <30k, 30-50k, 90-120k, and >120k income groups. High-income brackets should show low incident rates along the coast, while inland groups show higher incident rates. A closer look at how many of those incidents were resolved would be useful, though it is unclear whether the dataset contains that information.
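
The Dash cell relies on `income_order` and `map_df`, which are not defined in the cells shown here. Purely as an illustration of the inputs the app expects, the sketch below builds both from a small place-level table. Everything in it is an assumption: the bin edges and the two middle labels (only `<30k`, `30-50k`, `90-120k`, and `>120k` are named in the text), the reading of the map's numeric `Damage` as a per-place count of damaged structures, and the toy rows themselves.

```python
import pandas as pd

# ASSUMPTION: bin edges and the "50-70k" / "70-90k" labels are hypothetical.
income_edges = [0, 30_000, 50_000, 70_000, 90_000, 120_000, float("inf")]
income_order = ["<30k", "30-50k", "50-70k", "70-90k", "90-120k", ">120k"]

# Toy stand-in for result_df (one row per structure); all values are made up.
toy = pd.DataFrame({
    "Place": ["P1", "P1", "P1", "P2"],
    "Latitude": [39.0, 39.1, 39.2, 34.0],
    "Longitude": [-121.0, -121.1, -121.2, -118.5],
    "Per capita income": [25_000, 25_000, 25_000, 130_000],
    "Damage": ["Destroyed (>50%)", "Major (26-50%)", "No Damage", "Destroyed (>50%)"],
})

# ASSUMPTION: the map's numeric 'Damage' is the number of damaged structures per place.
damaged = toy[toy["Damage"] != "No Damage"]
map_df = damaged.groupby("Place", as_index=False).agg(
    **{
        "Latitude": ("Latitude", "mean"),
        "Longitude": ("Longitude", "mean"),
        "Per capita income": ("Per capita income", "first"),
        "Damage": ("Damage", "count"),
    }
)

# Left-closed bins: [0, 30k), [30k, 50k), ..., [120k, inf).
map_df["Income Group"] = pd.cut(
    map_df["Per capita income"], bins=income_edges, labels=income_order, right=False
).astype(str)
print(map_df)
```

With inputs of this shape the checklist options, the income filter, and the `Damage`-sized markers in the callback all have what they need. The actual definitions used by the authors may differ.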

In [None]:
# Create bins for 'Assessed Improved Value' with increments of 300,000 up to 1.8M
bins = pd.cut(result_df['Assessed Improved Value'], 
              bins=range(0, 1800001, 300000), 
              labels=['0-300k', '300k-600k', '600k-900k', '900k-1.2M', '1.2M-1.5M', '1.5M-1.8M'])

result_df['AIV Bin'] = bins

# Calculate the proportion of defensive actions within each bin
proportion_df = result_df[result_df['Defensive Actions'] != 'Unknown'].groupby(['AIV Bin', 'Defensive Actions']).size().reset_index(name='Count')
proportion_df['Proportion'] = proportion_df.groupby('AIV Bin')['Count'].transform(lambda x: x / x.sum())

# Create the bar chart
fig = px.bar(proportion_df, 
             x='AIV Bin', 
             y='Proportion', 
             color='Defensive Actions', 
             title='Proportion of Defensive Actions Taken Across Bins of Assessed Improved Value',
             labels={'AIV Bin': 'Assessed Improved Value Bin', 'Proportion': 'Proportion of Defensive Actions'})

# Update layout for better readability
fig.update_layout(xaxis_title='Assessed Improved Value in Dollars, Binned', 
                  yaxis_title='Proportion of Defensive Actions')

fig.show()







The above stacked bar chart compares how frequently specific firefighting strategies (like civilian interventions or dozer fuel breaks) are employed across varying property values. Each bar corresponds to a bin of assessed improved value, with stacked segments showing the proportion of different defensive actions taken within that bin. Note that Defensive Actions = 'Unknown' were filtered out.

From the plot, we see that properties with higher assessed values appear to use more specialized or combined defensive measures, suggesting that wealthier communities may have better access to advanced firefighting resources. This disparity highlights how economic factors, in addition to geographic considerations, can shape the fire response strategies.

# Data Takeaways
Through this analysis, we’ve uncovered critical patterns in wildfire incidents and structural damage across California that highlight regional vulnerabilities and provide guidance for proactive fire management.

## Top Insights:
**There are clear geographic disparities in fire incidents.**  
   * Northern counties—especially those adjacent to forested and mountainous regions—consistently record high incident counts, while urban centers and desert areas generally show fewer events. However, when wildfires do occur in Southern California, they tend to form dense, high-impact clusters.

**Structural damage is unevenly distributed.**  
   * The analysis reveals that some counties experience disproportionately high structural damage, with certain regions bearing far more catastrophic losses. Factors such as older infrastructure, limited fire-resistant construction, and the wildland-urban interface likely contribute.

**Population density has a strong effect on outcomes.**  
   * Densely populated areas exhibit a greater share of severe fire outcomes, as the concentration of structures and inhabitants amplifies the effects of wildfires. This trend shows the need for tailored evacuation and fire prevention strategies in high-density regions.

## Call to Action for California Stakeholders:
* Focus on leveraging real-time analytics and historical fire data to improve predictive models and emergency response, particularly when seasonal conditions tend to elevate wildfire risks.
* Promote investment in resilient infrastructure by prioritizing updates to building codes, offering clear community guidelines, and retrofitting older structures, especially in counties showing high levels of structural damage.
* Implement targeted community programs to reduce disparities by developing focused public awareness and resilience initiatives for vulnerable populations in both urban and rural areas, ensuring that fire prevention and evacuation protocols are robust and inclusive.

By adopting these data-driven strategies, California can bolster its fire management efforts and mitigate the long-term, devastating impacts of wildfires—protecting lives, property, and economic stability. Failure to be proactive will only fuel an escalating crisis that could cripple the state.