# Wildfire Data Preparation and EDA

The California Wildfire Perimeters dataset is published by the California Department of Forestry and Fire Protection (CAL FIRE) and is available through the California Natural Resources Agency GIS Data Portal. It records wildfire occurrences in California from 1950 onward, including the geographic boundary of each fire, which makes it possible to study how wildfires are distributed across the state and over time. Analyzing these records helps characterize wildfire patterns, contributing factors, and impacts, which in turn supports fire management and efforts to reduce the effects of fires on communities and ecosystems. Beyond historical analysis, the dataset can also feed predictive modeling and risk assessment as wildfire activity increases with climate change and other environmental factors.

## Import and Clean Data

In [None]:
import pandas as pd

In [None]:
# Read CSV file
wildfire_data = pd.read_csv('data/California_Fire_Perimeters_(1950%2B).csv')

In [None]:
# Clean and prepare dataset

# Standardize column names
wildfire_data.columns = wildfire_data.columns.str.strip().str.lower().str.replace(' ', '_')

# Convert date columns to datetime
date_columns = ['alarm_date', 'cont_date']
for col in date_columns:
    wildfire_data[col] = pd.to_datetime(wildfire_data[col], errors='coerce')

# Keep only fires from 1980 onward
wildfire_data = wildfire_data[wildfire_data['year_'] >= 1980]

# Drop rows with invalid or missing dates
wildfire_data = wildfire_data.dropna(subset=date_columns)

# Fill missing values
wildfire_data['cause'] = wildfire_data['cause'].fillna(0) # Fill missing cause with 0 (unknown)
wildfire_data['comments'] = wildfire_data['comments'].fillna('No comments')

# Drop unnecessary columns
drop_columns = ['irwinid', 'complex_id', 'fire_num', 'decades']
wildfire_data = wildfire_data.drop(columns=drop_columns, errors='ignore')

# Remove duplicate rows
wildfire_data = wildfire_data.drop_duplicates()

# Reset index
wildfire_data = wildfire_data.reset_index(drop=True)
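
The date handling above relies on `errors='coerce'`, which converts unparseable or missing values to `NaT` so that `dropna` can remove them afterwards. The following is a minimal sketch of that behavior on a small, hypothetical DataFrame (`toy` and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "alarm_date": ["2020-08-16", "not a date", "2021-07-13"],
    "cont_date": ["2020-09-01", "2020-10-01", None],
})

for col in ["alarm_date", "cont_date"]:
    # Unparseable or missing values become NaT instead of raising an error
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Only rows where both dates parsed successfully survive
toy_clean = toy.dropna(subset=["alarm_date", "cont_date"]).reset_index(drop=True)
print(toy_clean)
```

Only the first row has two valid dates, so it is the only one kept. Comparing the row count before and after `dropna` on the real data is a quick way to see how many fires this step discards.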

Create an intensity variable that scores the severity of each wildfire on a scale from 1 to 100. The score combines the wildfire duration (`cont_date - alarm_date`) and the total affected acreage (`gis_acres`), with equal weight given to each.

In [None]:
# Create intensity variable

# Calculate wildfire duration in days (the date difference is a Timedelta, so convert it to a float)
wildfire_data['duration_days'] = (wildfire_data['cont_date'] - wildfire_data['alarm_date']) / pd.Timedelta(days=1)

# Normalize duration
wildfire_data['normalized_duration'] = wildfire_data['duration_days'] / wildfire_data['duration_days'].max()

# Normalize acreage
wildfire_data['normalized_acreage'] = wildfire_data['gis_acres'] / wildfire_data['gis_acres'].max()

# Calculate raw intensity measure (equal weights for both duration and acreage)
wildfire_data['intensity'] = 100 * (0.5 * wildfire_data['normalized_duration'] + 0.5 * wildfire_data['normalized_acreage'])

# Min-max rescale the intensity measure to the range 1 to 100
intensity_min = wildfire_data['intensity'].min()
intensity_max = wildfire_data['intensity'].max()
wildfire_data['rescaled_intensity'] = 1 + (wildfire_data['intensity'] - intensity_min) / (intensity_max - intensity_min) * 99
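
To make the arithmetic concrete, here is a small worked sketch of the same equal-weight score and min-max rescaling on three hypothetical fires. The DataFrame `toy` and its numbers are assumptions chosen so the result is easy to verify by hand; they are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical toy values chosen so the arithmetic is easy to follow
toy = pd.DataFrame({
    "duration_days": [1.0, 5.0, 10.0],
    "gis_acres": [100.0, 500.0, 1000.0],
})

# Scale each component by its maximum so the largest value becomes 1
norm_duration = toy["duration_days"] / toy["duration_days"].max()
norm_acreage = toy["gis_acres"] / toy["gis_acres"].max()

# Equal-weight combination, expressed on a 0 to 100 scale
toy["intensity"] = 100 * (0.5 * norm_duration + 0.5 * norm_acreage)

# Min-max rescale so the smallest score maps to 1 and the largest to 100
lo, hi = toy["intensity"].min(), toy["intensity"].max()
toy["rescaled_intensity"] = 1 + (toy["intensity"] - lo) / (hi - lo) * 99

print(toy.round(2))
```

The raw intensities here are 10, 50, and 100, which rescale to 1, 45, and 100. Because both normalization steps depend on the minimum and maximum present in the data, scores are only comparable within the same filtered dataset, and a single extreme fire compresses the scores of all the others.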

In [None]:
# Save cleaned data to new CSV file
wildfire_data.to_csv('data/California_Fire_Perimeters_Cleaned.csv', index=False)
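
CSV files do not store column types, so the date columns come back as plain strings unless they are parsed again on load. The sketch below illustrates this with a hypothetical one-row DataFrame and an in-memory buffer standing in for the saved file; the same `parse_dates` argument would apply when reading the cleaned CSV.

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for the cleaned data
sample = pd.DataFrame({
    "alarm_date": pd.to_datetime(["2020-08-16"]),
    "cont_date": pd.to_datetime(["2020-09-01"]),
    "gis_acres": [100.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype for the listed columns
reloaded = pd.read_csv(buffer, parse_dates=["alarm_date", "cont_date"])
print(reloaded.dtypes)
```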